On the swap_out() path, the radix-tree pagecache is allocating its
nodes with PF_MEMALLOC set, which allows it to completely exhaust the
free page lists(*). This is fairly easy to trigger with swap-intensive
loads.
It would be better to make those node allocations fail earlier. When
they do fail, the radix-tree can still obtain nodes from its mempool,
and we leave some memory available for the I/O layer (assuming that the
I/O is being performed under PF_MEMALLOC, which it is).
So the patch simply drops PF_MEMALLOC while adding nodes to the
swapcache's tree.
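
The effect can be sketched with a small userspace model. This is not
the patch itself: struct task, struct model, alloc_node_page(),
get_node() and add_to_swap_tree() are hypothetical names, the
PF_MEMALLOC value is illustrative, and the "reserve" floor is a
simplification of what the page allocator really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_MEMALLOC 0x1UL	/* illustrative value, not the kernel's */

/* Hypothetical stand-ins for kernel state. */
struct task { unsigned long flags; };

struct model {
	int free_pages;		/* pages on the free lists */
	int reserve;		/* floor that ordinary callers must not cross */
	int mempool_nodes;	/* preallocated radix-tree nodes */
};

/* Take one page for a node.  PF_MEMALLOC callers ignore the reserve. */
bool alloc_node_page(struct model *m, const struct task *t)
{
	int floor = (t->flags & PF_MEMALLOC) ? 0 : m->reserve;

	if (m->free_pages <= floor)
		return false;
	m->free_pages--;
	return true;
}

/* Get a node: page allocator first, then fall back to the mempool. */
bool get_node(struct model *m, const struct task *t)
{
	if (alloc_node_page(m, t))
		return true;
	if (m->mempool_nodes > 0) {
		m->mempool_nodes--;
		return true;
	}
	return false;
}

/* Add a node to the swapcache tree with PF_MEMALLOC temporarily dropped. */
bool add_to_swap_tree(struct model *m, struct task *t)
{
	unsigned long saved;
	bool ok;

	assert(m != NULL && t != NULL);
	saved = t->flags & PF_MEMALLOC;
	t->flags &= ~PF_MEMALLOC;	/* node allocation now fails earlier */
	ok = get_node(m, t);
	t->flags |= saved;		/* restore only the bit we cleared */
	return ok;
}
```

In this model a caller that enters add_to_swap_tree() with PF_MEMALLOC
set stops taking pages once free_pages reaches the reserve, falls back
to the mempool, and leaves the remaining pages for whoever still runs
with PF_MEMALLOC afterwards.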
We're still performing atomic allocations, so the rat is still biting
pretty deeply into the page reserves - under heavy load the amount of
free memory is less than half of what it was pre-rat.
It is unfortunate that the page allocator overloads !__GFP_WAIT to also
mean "try harder". It would be better to separate these concepts, and
to allow the radix-tree code (at least) to perform atomic allocations,
but to not go below pages_min. It seems that __GFP_TRY_HARDER will be
pretty straightforward to implement. Later.
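
A rough sketch of what separating the two concepts could look like,
again as a userspace toy and not as allocator code. The flag values are
illustrative, __GFP_TRY_HARDER does not exist yet, the function names
are hypothetical, and the "quarter of pages_min" floor is an assumption
made only to have a concrete number to show.

```c
#include <stdbool.h>

#define __GFP_WAIT		0x1u	/* illustrative value */
#define __GFP_TRY_HARDER	0x2u	/* hypothetical flag */

/*
 * Overloaded behaviour (simplified): a caller that cannot sleep is
 * automatically allowed to dig below pages_min.
 */
long min_free_overloaded(long pages_min, unsigned int gfp)
{
	return (gfp & __GFP_WAIT) ? pages_min : pages_min / 4;
}

/*
 * Separated behaviour: only an explicit __GFP_TRY_HARDER lowers the
 * floor.  Atomic callers without it stop at pages_min.
 */
long min_free_separated(long pages_min, unsigned int gfp)
{
	return (gfp & __GFP_TRY_HARDER) ? pages_min / 4 : pages_min;
}

/* An allocation may proceed only while free pages exceed the floor. */
bool may_allocate(long free_pages, long min_free)
{
	return free_pages > min_free;
}
```

With the separated form, the radix-tree code could pass a mask with
neither bit set: it would never sleep, and it would also never go below
pages_min.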
The patch also implements a workaround for the mempool list_head
problem, until that is sorted out.
(*) The usual result is that the SCSI layer dies at scsi_merge.c:82.
It would be nice to have a fix for that - it goes BUG if order-1
allocations fail at interrupt time, which happens pretty easily.