Andrew Morton [Tue, 13 Apr 2004 02:21:59 +0000 (19:21 -0700)]
[PATCH] kbuild: Create .tmp_versions when building external modules
From: Sam Ravnborg <sam@ravnborg.org>
When building external modules the $PWD/.tmp_versions directory is used.
The .tmp_versions directory in the kernel tree cannot be used because this
would clutter up the kernel tree especially when more than one external
module is being built for the same kernel tree.
This patch makes sure to create $PWD/.tmp_versions, and to delete it during
make clean. It also removes the warning about 'messed with SUBDIRS', which is
no longer relevant when .tmp_versions is made outside the kernel tree.
M68k TLB fixes from Roman Zippel:
- Check current->active_mm for currently active mm
- Set correct context to flush the right ATC entry
This is especially important for kswapd to correctly flush unmapped entries (it
caused random segfaults during large compiles)
Randy Dunlap [Mon, 12 Apr 2004 19:03:14 +0000 (20:03 +0100)]
[ARM] use errno #defines in assembly
Patch from: Randy Dunlap
From: Danilo Piazzalunga
Some assembly code (on various archs) either
1. uses hardcoded errno numbers instead of the canonical macro
names, or
2. defines them locally, instead of including the appropriate
header (while including other headers).
This patch "fixes" such usage in
- getuser.S for arm
- putuser.S for arm
Andrew Morton [Mon, 12 Apr 2004 08:07:11 +0000 (01:07 -0700)]
[PATCH] Oprofile: ARM/XScale PMU driver
From: Zwane Mwaikambo <zwane@linuxpower.ca>
The following patch adds support for the XScale performance monitoring unit
to OProfile. It uses not only the performance monitoring counters, but
also the clock cycle counter (CCNT), allowing for up to 5 usable counters.
The code has been developed and tested on an IOP331 (hardware courtesy of
Intel), so I haven't been able to test it on XScale PMU1 systems.
Testing on said systems would be appreciated, and if done, please uncomment
the #define DEBUG line at the top of op_model_xscale.c
OProfile userspace support has already been committed and should be
available via CVS.
Andrew Morton [Mon, 12 Apr 2004 08:06:06 +0000 (01:06 -0700)]
[PATCH] Add CONFIG_SYSFS
From: Patrick Mochel <mochel@digitalimplant.org>
Here is a patch to make sysfs optional. Note that with CONFIG_SYSFS=n you
must specify the boot device's major:minor on the kernel boot command line
with
root=03:01
For embedded systems, it will save a significant amount of memory during
runtime. And, it saves 4k from the built kernel image for me.
Andrew Morton [Mon, 12 Apr 2004 08:05:52 +0000 (01:05 -0700)]
[PATCH] parport: no procfs warning fix
drivers/parport/procfs.c: In function `parport_default_proc_unregister':
drivers/parport/procfs.c:529: warning: `return' with a value, in function returning void
Andrew Morton [Mon, 12 Apr 2004 08:05:40 +0000 (01:05 -0700)]
[PATCH] kbuild: external module support
From: Sam Ravnborg <sam@ravnborg.org>
Based on initial patch from Andreas Gruenbacher there is now better support
for building external modules with kbuild.
The preferred syntax is now:
make -C $KERNELSRC M=$PWD
but the old syntax:
make -C $KERNELSRC SUBDIRS=$PWD modules
will remain supported.
The major differences compared to before are that:
1) No attempt is made to check or update any files in $KERNELSRC
2) Module versions are now supported
During stage 2 of kernel compilation where the modules are built, a new file
Module.symvers is created. This file contains the version for all symbols
exported by the kernel and any module compiled within the kernel tree.
When an external module is built, the Module.symvers file is read and
symbol versions are taken from that file.
The purpose of avoiding any updates in the kernel src is that usually in a
distribution the kernel src will be read-only, and there is no need to try to
update it. And when building an external module the focus is on the module,
not the kernel.
I expect the distributions will start using something like this:
kernel src - with no generated files. Not even .config:
/usr/src/linux-<version>
Output from build:
/lib/modules/linux-<version>/build
where build is a real directory with relevant output files and the
appropriate .config.
I have some documentation in the pipeline, but want to see how this
approach is received before completing it.
This patch is made on top of the previously posted patch to divide
make clean in three steps.
And you may need to edit the following line in the patch to make it apply:
%docs: scripts_basic FORCE
to
%docs: scripts FORCE
Andrew Morton [Mon, 12 Apr 2004 08:05:26 +0000 (01:05 -0700)]
[PATCH] kbuild: cleaning in three steps
From: Sam Ravnborg <sam@ravnborg.org>
Previously 'make clean' deleted all automatically generated files. The
following patch reverts this behaviour, and now 'make clean' leaves enough
behind to allow external modules to be built.
The cleaning is now done in three steps:
make clean - delete everything not needed for building external modules
make mrproper - delete all generated files, including .config
make distclean - delete all temporary files such as *.orig, *~, *.rej etc.
This fixes reports about nvidia and vmware build issues.
Andrew Morton [Mon, 12 Apr 2004 08:05:14 +0000 (01:05 -0700)]
[PATCH] Make %docs depend on scripts_basic
From: Sam Ravnborg <sam@ravnborg.org>
From: Herbert Xu <herbert@gondor.apana.org.au>
It seems that the %docs targets only need scripts_basic. The following
patch does just that. This removes its dependency on the existence of a
.config file.
Andrew Morton [Mon, 12 Apr 2004 08:04:59 +0000 (01:04 -0700)]
[PATCH] fb_copy_cmap() fix
From: Arjan van de Ven <arjanv@redhat.com>
fb_copy_cmap() takes an argument about whether to do memcpy, copy_from_user or
copy_to_user. 0 is memcpy, 2 is copy_to_user. In the ioctl you want
copy_to_user for copying the colormap to userspace.
where it's clear that the & in the first copy_from_user is utterly bogus
since the destination is the content of the newly allocated buffer, and not
the pointer to it as the code does.
Andrew Morton [Mon, 12 Apr 2004 08:04:34 +0000 (01:04 -0700)]
[PATCH] BSD accounting oops fix
Oopses have been reported in do_acct_process(), with preemption enabled, when
threaded applications are exiting.
It appears that we're racing with another thread which is nulling out
current->tty. I think this race is still there after we moved current->tty
into current->signal->tty, so let's take the needed lock.
Andrew Morton [Mon, 12 Apr 2004 08:04:21 +0000 (01:04 -0700)]
[PATCH] tpqic02 warnings
drivers/char/tpqic02.c: In function `rdstatus':
drivers/char/tpqic02.c:700: warning: int format, different type arg (arg 2)
drivers/char/tpqic02.c:700: warning: int format, different type arg (arg 2)
Andrew Morton [Mon, 12 Apr 2004 08:04:08 +0000 (01:04 -0700)]
[PATCH] applicom warnings and usercopy-in-cli fix
drivers/char/applicom.c: In function `ac_write':
drivers/char/applicom.c:363: warning: int format, different type arg (arg 2)
drivers/char/applicom.c:363: warning: int format, different type arg (arg 3)
drivers/char/applicom.c:363: warning: int format, different type arg (arg 2)
drivers/char/applicom.c:363: warning: int format, different type arg (arg 3)
drivers/char/applicom.c:523:2: warning: #warning "Je suis stupide. DW. - copy*user in cli"
drivers/char/applicom.c: In function `ac_read':
drivers/char/applicom.c:546: warning: int format, different type arg (arg 2)
drivers/char/applicom.c:546: warning: int format, different type arg (arg 3)
drivers/char/applicom.c:546: warning: int format, different type arg (arg 2)
drivers/char/applicom.c:546: warning: int format, different type arg (arg 3)
Andrew Morton [Mon, 12 Apr 2004 08:03:56 +0000 (01:03 -0700)]
[PATCH] policydb printk warnings
security/selinux/ss/policydb.c:1160: warning: signed size_t format, different type arg (arg 3)
security/selinux/ss/policydb.c:1160: warning: signed size_t format, different type arg (arg 3)
Andrew Morton [Mon, 12 Apr 2004 08:03:42 +0000 (01:03 -0700)]
[PATCH] i2c-dev warning fixes
drivers/i2c/i2c-dev.c: In function `i2cdev_read':
drivers/i2c/i2c-dev.c:140: warning: int format, different type arg (arg 3)
drivers/i2c/i2c-dev.c: In function `i2cdev_write':
drivers/i2c/i2c-dev.c:168: warning: int format, different type arg (arg 3)
Andrew Morton [Mon, 12 Apr 2004 08:03:29 +0000 (01:03 -0700)]
[PATCH] Rename bitmap_clear to bitmap_zero, remove CLEAR_BITMAP
From: Rusty Russell <rusty@rustcorp.com.au>
clear_bit(n, addr) clears the nth bit.
test_and_clear_bit(n, addr) clears the nth bit.
cpu_clear(n, cpumask) clears the nth bit (vs. cpus_clear()).
bitmap_clear(bitmap, n) clears out all the bits up to n.
Moreover, there's a CLEAR_BITMAP() in linux/types.h which bitmap_clear() is
a wrapper for.
Rename bitmap_clear to bitmap_zero, which is harder to confuse (yes, it bit
me), and make everyone use it.
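To make the naming distinction concrete, here is a minimal user-space sketch
(not the kernel implementation). The sketch_ names are hypothetical, the
single-bit helper is non-atomic, and the whole-bitmap helper is assumed to
zero every long that backs the first nbits bits:

```c
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Clear one bit, in the spirit of clear_bit(n, addr) (non-atomic here). */
static void sketch_clear_bit(unsigned int n, unsigned long *addr)
{
	addr[n / BITS_PER_LONG] &= ~(1UL << (n % BITS_PER_LONG));
}

/* Zero a whole bitmap of nbits bits, in the spirit of bitmap_zero(). */
static void sketch_bitmap_zero(unsigned long *bitmap, unsigned int nbits)
{
	memset(bitmap, 0, BITS_TO_LONGS(nbits) * sizeof(unsigned long));
}
```

With the old name, a caller expecting the clear_bit() meaning of "clear bit n"
would instead wipe every bit below n, which is the confusion the rename avoids.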
Andrew Morton [Mon, 12 Apr 2004 08:03:15 +0000 (01:03 -0700)]
[PATCH] Fix More Problems Introduced By Module Structure Added in modpost.c
From: Rusty Russell <rusty@rustcorp.com.au>
Sam Ravnborg found these.
1) have_vmlinux is a global, and should not be reset every time.
2) We pretend every module needs cleanup_module so it gets versioned,
but that isn't defined for CONFIG_MODULE_UNLOAD=n.
3) The visible effect of this is that modpost will start complaining about
undefined symbols - previously this happened only when the module was
installed.
Andrew Morton [Mon, 12 Apr 2004 08:03:03 +0000 (01:03 -0700)]
[PATCH] do_fork() error path memory leak
From: <john.l.byrne@hp.com>
In do_fork(), if an error occurs after the mm_struct for the child has been
allocated, it is never freed. The exit_mm() meant to free it increments
the mm_count and this count is never decremented. (For a running process
that is exiting, schedule() takes care of this; however, the child process
being cleaned up is not running.) In the CLONE_VM case, the parent's
mm_struct will get an extra mm_count and so it will never be freed.
This patch should fix both the CLONE_VM and the not CLONE_VM case; the test
of p->active_mm prevents a panic in the case that a kernel-thread is being
cloned.
Andrew Morton [Mon, 12 Apr 2004 08:02:37 +0000 (01:02 -0700)]
[PATCH] fix for potential integer overflow in zoran driver
From: "Ronald S. Bultje" <R.S.Bultje@students.uu.nl>
Attached patch fixes a potential integer overflow in zoran_procs.c (part of
the zr36067 driver). Bug was detected by Ken Ashcraft with the Stanford
checker.
Andrew Morton [Mon, 12 Apr 2004 08:02:23 +0000 (01:02 -0700)]
[PATCH] ext3fs sb= mount option fix
From: <achurch@achurch.org> (Andrew Church)
The following patch fixes a bug in the processing of the sb= (alternate
superblock) mount option for ext3: when changing the device block size, the
given superblock is ignored and the code reverts to using block 1.
Andrew Morton [Mon, 12 Apr 2004 08:02:11 +0000 (01:02 -0700)]
[PATCH] ext2fs sb= mount option fix
From: <achurch@achurch.org> (Andrew Church)
The following patch fixes a bug in the processing of the sb= (alternate
superblock) mount option for ext2: when changing the device block size, the
given superblock is ignored and the code reverts to using block 1.
Andrew Morton [Mon, 12 Apr 2004 08:01:57 +0000 (01:01 -0700)]
[PATCH] fix test_and_change_bit comment
From: Paul Jackson <pj@sgi.com>
I've read over the code in each case, built and ran a test case for i386 in
particular, and studied the other uses and definitions of
test_and_change_bit(). Everything I see recommends this change.
- Fix test_and_change_bit() comment: returns old value, not new one.
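The corrected semantics can be illustrated with a non-atomic user-space
sketch. The function name is hypothetical and this is not the kernel's
per-architecture implementation; it only shows "toggle the bit, return what
it was before":

```c
#include <limits.h>

/* Non-atomic sketch: toggle bit nr and return its OLD value (0 or 1). */
static int sketch_test_and_change_bit(unsigned int nr, unsigned long *addr)
{
	const unsigned int bits = sizeof(unsigned long) * CHAR_BIT;
	unsigned long mask = 1UL << (nr % bits);
	unsigned long *word = addr + nr / bits;
	int old = (*word & mask) != 0;	/* sample before modifying */

	*word ^= mask;			/* flip the bit */
	return old;
}
```

Calling it twice on a zeroed word returns 0 and then 1, leaving the word zero
again.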
Andrew Morton [Mon, 12 Apr 2004 08:01:45 +0000 (01:01 -0700)]
[PATCH] make ibmasm driver uart support depend on SERIAL_8250
From: Max Asbock <masbock@us.ibm.com>
This patch makes serial line registration in the ibmasm service processor
driver depend on CONFIG_SERIAL_8250. Previously the driver wouldn't
compile when serial driver support wasn't enabled.
that the right side of the & does not get extended correctly when the
constant is promoted to the sector_t type. I have CONFIG_LBD turned on so
sector_t should be 64 bits wide. This fails to properly mask the value
4294967296 (2TB/512) to 4294967296; in my case it was coming out 0. This
causes the loop following this code to read from 0 to 4294967296 blocks so
it could write one character.
As you might imagine this makes a format of a 3.5TB filesystem take a very
long time.
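The class of bug described here can be reproduced in isolation. The sketch
below is not the affected driver code: the function names and the align
parameter are hypothetical, sector_t is assumed to be 64 bits (as with
CONFIG_LBD), and unsigned int is assumed to be 32 bits:

```c
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit, as with CONFIG_LBD=y */

/* Buggy: ~(align - 1) is computed in 32 bits, then zero-extended to 64. */
static sector_t mask_buggy(sector_t sector, unsigned int align)
{
	return sector & ~(align - 1);
}

/* Fixed: widen first, so the complement also sets the upper 32 bits. */
static sector_t mask_fixed(sector_t sector, unsigned int align)
{
	return sector & ~((sector_t)align - 1);
}
```

For sector 4294967296 and an alignment of 8, the buggy form yields 0 because
the mask becomes 0x00000000fffffff8, while the fixed form returns the sector
unchanged.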
Andrew Morton [Mon, 12 Apr 2004 07:59:59 +0000 (00:59 -0700)]
[PATCH] saa7134 - Add two inputs for Asus TV FM
From: Martin Hicks <mort@bork.org>
I just bought an ASUS TV FM capture card, based on the saa7134 chip. It only
had one input specified, coax. This patch adds the Composite and S-Video
inputs. It seems to work correctly for me.
Andrew Morton [Mon, 12 Apr 2004 07:59:45 +0000 (00:59 -0700)]
[PATCH] Fix parportbook build again
From: Herbert Xu <herbert@gondor.apana.org.au>
The previous fix causes a syntax error when building:
Working on: /home/gondolin/herbert/src/debian/work/kernel/build/2.6/kernel-source-2.6.5-2.6.5/Documentation/DocBook/parportbook.sgml
jade:/home/gondolin/herbert/src/debian/work/kernel/build/2.6/kernel-source-2.6.5-2.6.5/Documentation/DocBook/parportbook.sgml:4059:2:E: invalid comment declaration: found character "!" outside comment but inside comment declaration
jade:/home/gondolin/herbert/src/debian/work/kernel/build/2.6/kernel-source-2.6.5-2.6.5/Documentation/DocBook/parportbook.sgml:4058:0: comment declaration started here
jade:/home/gondolin/herbert/src/debian/work/kernel/build/2.6/kernel-source-2.6.5-2.6.5/Documentation/DocBook/parportbook.sgml:4059:4:E: character data is not allowed here
This patch removes the offending line completely since that file is probably
not coming back anyway.
Andrew Morton [Mon, 12 Apr 2004 07:59:33 +0000 (00:59 -0700)]
[PATCH] QD65xx I/O ports fix
From: Geert Uytterhoeven <geert@linux-m68k.org>
I/O port numbers can be larger than 8-bit on many platforms (this caused a
warning when {out,in}b() cast reg to a pointer on platforms with memory
mapped I/O)
Andrew Morton [Mon, 12 Apr 2004 07:58:14 +0000 (00:58 -0700)]
[PATCH] get_user_pages shortcut for anonymous pages
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
The patch avoids the instantiation of pagetables for not-present pages in
get_user_pages(). Without this, the coredump code can cause total memory
exhaustion in pagetables. Consider a store to current stack - 1TB. The
stack vma is extended to include this address because of VM_GROWSDOWN. If
such a process dies (which is likely for a defunct process) then the elf core
dumper will cause the system to hang because of too many page tables.
We specifically recognise this situation and simply return a ref to the zero
page.
Andrew Morton [Mon, 12 Apr 2004 07:57:48 +0000 (00:57 -0700)]
[PATCH] Swsusp should not wake up stopped processes
From: Pavel Machek <pavel@suse.cz>
If you stop a process with ^Z and then suspend, the process is awakened. That's
a bug. The solution is to simply leave already stopped processes alone. Plus we
no longer use TASK_STOPPED for processes in the refrigerator. Userland might
see us and get confused.
Bill Irwin did some work on this. It makes swsusp behave correctly w.r.t.
discontigmem, and adds highmem handling (very simple-minded, but should work
ok with 1GB). It now should behave correctly w.r.t. more than one swap
device, and fixes double restoring of console.
Andrew Morton [Mon, 12 Apr 2004 07:57:08 +0000 (00:57 -0700)]
[PATCH] i386 probe_roms(): fixes
From: Rene Herman <rene.herman@keyaccess.nl>
This patch tries to improve the i386/mach-default probe_roms(). This also
c99ifies the data, adds an IORESOURCE_IO flag for the I/O port resources,
an IORESOURCE_MEM flag for the VRAM resource, IORESOURCE_READONLY |
IORESOURCE_MEM for the ROM resources and adds two additional "adapter ROM
slots" (for a total of 6) since it now also scans the 0xe0000 segment.
Andrew Morton [Mon, 12 Apr 2004 07:56:55 +0000 (00:56 -0700)]
[PATCH] i386 probe_roms(): preparation
From: Rene Herman <rene.herman@keyaccess.nl>
The i386 probe_roms() function has a fair number of problems currently:
- When you actually have an adapter ROM in the machine, your video ROM
disappears. This is due to the pc9800 subarch merge that split it up in
probe_video_rom(int roms) and probe_extension_roms(int roms), but expects a
"roms++" in probe_video_rom() to have an effect outside of that function.
- The majority of VGA adapters these days host a ROM larger than 32K, yet
the current code hardcodes a 32K ROM. The VGA BIOS "length" byte is
normally valid (it in fact needs to be for a regular mainboard BIOS to
accept it) and I've verified on a few dozen very new to very old VGAs that
it is. However, assuming someone actually did not check for the length and
checksum there for a reason, the safe thing to do here is accept the length
byte when we also get a valid checksum.
- The current code scans 0xc0000 to 0xdffff for a video ROM while the
standard PC thing to do (that which the BIOS does) is only scan for a video
ROM starting between 0xc0000 and 0xc7fff. This means that on a headless-
(or BIOS-less monochrome adapter-) box, the first adapter ROM found
triggers the registration of a 32K "Video ROM" at hardcoded address
0xc0000, even when _nothing_ is present between 0xc0000 and 0xc7fff.
- The current adapter ROM scan stops at 0xdffff, whether or not an
extension ROM is present at 0xe0000. The PC thing to do is scan 0xc8000
up to 0xdffff if an extension ROM is present, and up to 0xeffff when it's not
(it's not/hardly ever).
- Adapter ROMs are called "Extension ROM", but the latter term is really
better reserved for a motherboard extension ROM.
- Currently, the code happily starts scanning through a ROM it just
registered looking for the next one (just does += 2048, even when that's
inside the previous ROM) which is at least silly.
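The "accept the length byte only with a valid checksum" rule from the list
above can be sketched as follows. This is a user-space illustration, not the
patch's code: the function name is hypothetical, and it relies on the
conventional PC option ROM layout (0x55 0xaa signature, a length byte at
offset 2 counting 512-byte blocks, and all bytes summing to zero modulo 256):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the ROM size in bytes if the image at rom looks valid, else 0.
 * avail is the number of readable bytes starting at rom.
 */
static size_t sketch_rom_length(const uint8_t *rom, size_t avail)
{
	size_t len, i;
	uint8_t sum = 0;

	if (avail < 3 || rom[0] != 0x55 || rom[1] != 0xaa)
		return 0;		/* no ROM signature */

	len = (size_t)rom[2] * 512;	/* length byte counts 512-byte blocks */
	if (len == 0 || len > avail)
		return 0;

	for (i = 0; i < len; i++)
		sum = (uint8_t)(sum + rom[i]);	/* wraps modulo 256 */

	return sum == 0 ? len : 0;
}
```

A caller could fall back to a fixed default size when this returns 0, and
otherwise continue scanning after the end of the ROM it just found rather
than inside it.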
Unfortunately, this code is "subarched" between mach-default and
mach-pc9800, meaning the patch got a bit involved. Currently all this
code, and gobs of data, is defined (not just declared) in the header:
which isn't nice. That .h really wants to be a .c. The first patch, in
the next message, does not change any code but only undoes the
probe_video_rom / probe_extension_roms split and moves the code to a new
file
arch/i386/mach-{default,pc9800}/std_resources.c
with a header
include/asm-i386/std_resources.h
for the prototypes only. The second patch overhauls the code itself for
mach-default. Please see comments on top of that patch for (yet more)
comments. It's tested on various machines, with and without adapter ROMs.
I haven't touched pc9800. Nothing should have changed though. The pc9800
author, as given in the code, is CCed.
Also, x86-64 inherits the probe_roms() code from 2.4, and while it
doesn't have the subarch specific problems, it has all others. I'll
convert it too if this i386 version is deemed desirable.
This patch doesn't change any code, just moves stuff from the
"mach_resources.h" header to a "std_resources.c" subarch specific file, and
introduces a "std_resources.h" header for the prototypes.
Andrew Morton [Mon, 12 Apr 2004 07:55:49 +0000 (00:55 -0700)]
[PATCH] jbd: b_transaction zeroing cleanup
Almost everywhere where JBD removes a buffer from the transaction lists the
caller then nulls out jh->b_transaction. Sometimes, the caller does that
without holding the locks which are defined to protect b_transaction. This
makes me queasy.
So change things so that __journal_unfile_buffer() nulls out b_transaction
inside both j_list_lock and jbd_lock_bh_state().
We're seeing heavy contention against j_list_lock on 8-way in
do_get_write_access().
We actually don't need j_list_lock in there except for one little case - the
per-bh jbd_lock_bh_state() is sufficient to protect this buffer's internal
state.
On some nice quick LVM array Ram Pai measured an overall 3x speedup from this
patch:
the script took the following time on 265mm1
real 0m57.504s
user 0m0.400s
sys 7m29.867s
and with the 2 patches it took
real 0m19.983s
user 0m0.438s
sys 1m55.896s
The cyclades.c driver was marked BROKEN_ON_SMP during early 2.6. It was
fixed later on but the tag was left in Kconfig.
The driver is not very smart wrt SMP locking, it can be improved. There is
only one spinlock per card which guarantees command block ordering and
protects different shared data, which can be held for long periods.
_But_ the locking works reliably, so remove the BROKEN_ON_SMP tag.
Andrew Morton [Mon, 12 Apr 2004 07:54:44 +0000 (00:54 -0700)]
[PATCH] rename page_to_nodenum()
From: "Martin J. Bligh" <mbligh@aracnet.com>
I'd prefer we renamed this to page_to_nid() before anyone starts using it.
This fits with the naming convention of everything else (pfn_to_nid, etc).
Nobody uses it right now - I grepped the whole tree.
Andrew Morton [Mon, 12 Apr 2004 07:54:31 +0000 (00:54 -0700)]
[PATCH] rmap 3 arches + mapping_mapped
From: Hugh Dickins <hugh@veritas.com>
Some arches refer to page->mapping for their dcache flushing: use
page_mapping(page) for safety, to avoid confusion on anon pages, which will
store a different pointer there - though in most cases flush_dcache_page is
being applied to pagecache pages.
arm has a useful mapping_mapped macro: move that to generic, and add
mapping_writably_mapped, to avoid explicit list_empty checks on i_mmap and
i_mmap_shared in several places.
Very tempted to add page_mapped(page) tests, perhaps along with the
mapping_writably_mapped tests in do_generic_mapping_read and
do_shmem_file_read, to cut down on wasted flush_dcache effort; but the
serialization is not obvious, too unsafe to do in a hurry.
Andrew Morton [Mon, 12 Apr 2004 07:54:17 +0000 (00:54 -0700)]
[PATCH] rw_swap_page_sync fixes
Fix up the rw_swap_page_sync() horrors by fully decoupling this function
from the VM - it is now just a helper function which reads a page from or
writes a page to swap.
Andrew Morton [Mon, 12 Apr 2004 07:54:03 +0000 (00:54 -0700)]
[PATCH] rmap 2 anon and swapcache
From: Hugh Dickins <hugh@veritas.com>
Tracking anonymous pages by anon_vma,pgoff or mm,address needs a
pointer,offset pair in struct page: mapping,index the natural choice. But
swapcache uses those for &swapper_space,swp_entry_t.
It's trivial to separate swapcache from pagecache with radix tree; most of
swapper_space is actually unused, just a fiction to pretend swap like file;
and page->private is a good place to keep swp_entry_t, now that swap never
uses bufferheads.
Define PG_anon bit, page_add_rmap SetPageAnon and put an oopsable address in
page->mapping to test that we're not confused by it. Define
page_mapping(page) macro to give NULL when PageAnon, whatever may be in
page->mapping. Define PG_swapcache bit, deduce swapper_space from that in
the few places we need it.
add_to_swap_cache now distinct from add_to_page_cache. Separating the caches
somewhat simplifies the tmpfs swizzling in swap_state.c, now the page can
briefly be in both caches.
The rmap method remains pte chains, no change to that yet. But one small
functional difference: the use of PageAnon implies that a page truncated
while still mapped will no longer be found and freed (swapped out) by
try_to_unmap, will only be freed by exit or munmap. But normally pages are
unmapped by vmtruncate: this should only affect nonlinear mappings, and a
later patch not in this batch will fix that.
Andrew Morton [Mon, 12 Apr 2004 07:53:50 +0000 (00:53 -0700)]
[PATCH] rmap 1 linux/rmap.h
From: Hugh Dickins <hugh@veritas.com>
First of a batch of three rmap patches: this initial batch of three paving
the way for a move to some form of object-based rmap (probably Andrea's, but
drawing from mine too), and making almost no functional change by itself. A
few days will intervene before the next batch, to give the struct page
changes in the second patch some exposure before proceeding.
rmap 1 create include/linux/rmap.h
Start small: linux/rmap-locking.h has already gathered some declarations
unrelated to locking, and the rest of the rmap declarations were over in
linux/swap.h: gather them all together in linux/rmap.h, and rename the
pte_chain_lock to rmap_lock.
Andrew Morton [Mon, 12 Apr 2004 07:16:32 +0000 (00:16 -0700)]
[PATCH] Correct unplugs on nr_queued
From: Jens Axboe <axboe@suse.de>
There's a small discrepancy in when we decide to unplug a queue based on
q->unplug_thresh. Basically it doesn't work for tagged queues, since
q->rq.count[READ] + q->rq.count[WRITE] is just the number of allocated
requests, not the number of requests stuck in the io scheduler. We could
just change the nr_queued == to a nr_queued >=, however that is still
suboptimal.
This patch adds accounting for requests that have been dequeued from the io
scheduler, but not freed yet. These are q->in_flight. allocated_requests
- q->in_flight == requests_in_scheduler. So the condition correctly
becomes
if (requests_in_scheduler == q->unplug_thresh)
instead. I did a quick round of testing, and for dbench on a SCSI disk the
number of timer induced unplugs was reduced from 13 to 5 :-). Not a huge
number, but there might be cases where it's more significant. Either way,
it gets ->unplug_thresh always right, which the old logic didn't.
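The accounting described above can be sketched with a simplified stand-in
for the request queue. The struct and function names are hypothetical; only
in_flight, unplug_thresh and the READ/WRITE counts come from the description:

```c
/* Hypothetical, simplified stand-in for the queue fields discussed above. */
struct sketch_queue {
	unsigned int count[2];		/* allocated requests: [0]=READ, [1]=WRITE */
	unsigned int in_flight;		/* dequeued from the scheduler, not yet freed */
	unsigned int unplug_thresh;
};

/* allocated_requests - in_flight == requests still in the io scheduler */
static unsigned int sketch_requests_in_scheduler(const struct sketch_queue *q)
{
	return q->count[0] + q->count[1] - q->in_flight;
}

/* New condition: compare the scheduler's backlog against the threshold. */
static int sketch_should_unplug(const struct sketch_queue *q)
{
	return sketch_requests_in_scheduler(q) == q->unplug_thresh;
}
```

With 6 allocated requests, 2 of them in flight and a threshold of 4, the new
test fires, whereas comparing the raw allocation count (6) against 4 would
not.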
Andrew Morton [Mon, 12 Apr 2004 07:16:17 +0000 (00:16 -0700)]
[PATCH] unplugging: md update
From: Neil Brown <neilb@cse.unsw.edu.au>
I've made a bunch of changes to the 'md' bits - largely moving the
unplugging into the individual personalities which know more about which
drives are actually in use.
Andrew Morton [Mon, 12 Apr 2004 07:16:04 +0000 (00:16 -0700)]
[PATCH] Use BIO_RW_SYNC in swap write page
From: Jens Axboe <axboe@suse.de>
Dog slow software suspend found this one. If WB_SYNC_ALL, then you need
to mark the bio as sync as well.
This is because swap_writepage() does a remove_exclusive_swap_page() (going
to __delete_from_swap_cache -> __remove_from_page_cache) which can kill
page->mapping, thus aops->sync_page() has nothing to work with for unplugging
the address space.
Andrew Morton [Mon, 12 Apr 2004 07:15:51 +0000 (00:15 -0700)]
[PATCH] per-backing dev unplugging
From: Jens Axboe <axboe@suse.de>,
Chris Mason,
me, others.
The global unplug list causes horrid spinlock contention on many-disk
many-CPU setups - throughput is worse than halved.
The other problem with the global unplugging is of course that it will cause
the unplugging of queues which are unrelated to the I/O upon which the caller
is about to wait.
So what we do to solve these problems is to remove the global unplug and set
up the infrastructure under which the VFS can tell the block layer to unplug
only those queues which are relevant to the page or buffer_head which is
about to be waited upon.
We do this via the very appropriate address_space->backing_dev_info structure.
Most of the complexity is in devicemapper, MD and swapper_space, because for
these backing devices, multiple queues may need to be unplugged to complete a
page/buffer I/O. In each case we ensure that data structures are in place to
permit us to identify all the lower-level queues which contribute to the
higher-level backing_dev_info. Each contributing queue is told to unplug in
response to a higher-level unplug.
To simplify things in various places we also introduce the concept of a
"synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will
perform an immediate unplug when it sees one of these go past.
Andrew Morton [Mon, 12 Apr 2004 07:15:25 +0000 (00:15 -0700)]
[PATCH] Implement queue congestion callout for device mapper
From: Miquel van Smoorenburg <miquels@cistron.nl>
Joe Thornber <thornber@redhat.com>
This implements the queue congestion callout for DM stacks, so that
bdi_read/write_congested() return correct information.
- md->lock protects all fields in md _except_ md->map
- md->map_lock protects md->map
- Anyone who wants to read md->map should use dm_get_table() which
increments the table's reference count.
This means the spin lock is now only held for the duration of a
reference count increment.
Update:
dm.c: protect md->map with a rw spin lock rather than the md->lock
semaphore. Also ensure that everyone accesses md->map through
dm_get_table(), rather than directly.
Andrew Morton [Mon, 12 Apr 2004 07:15:12 +0000 (00:15 -0700)]
[PATCH] Add queue congestion callout
From: Miquel van Smoorenburg <miquels@cistron.nl>
The VM and VFS use address_space->backing_dev_info to track the realtime
status of the device which backs the mapping. The read_congested and
write_congested fields are used to determine whether a read or write
against that device may block.
We use this infrastructure to
a) allow pdflush to service many queues in parallel (by not getting
stuck on any particular one),
b) avoid undesirable and uncontrolled latencies in places such as
page reclaim, and
c) avoid blocking in readahead operations.
The current code only supports simple disk queues (and I have a patch here
for NFS). Stacked queues (MD and DM) don't get this information right and
problems were expected. Efficiency problems have now been noted and it's
time to fix it.
This patch lays down the infrastructure which permits the queue
implementation to get control when someone at a higher level is querying
the queue's congestion state. So DM (for example) can run around and
examine all the queues which contribute to the higher-level queue.
It also adds bdi_rw_congested() for code in xfs and ext2 that calls both
bdi_read_congested() and bdi_write_congested() in a row, and it was "free"
anyway.
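One way to picture the callout is a function pointer that a stacked device
installs so it can answer congestion queries on behalf of its lower queues.
The sketch below is user-space C under stated assumptions: every struct,
field, constant and function name is hypothetical and chosen for illustration,
not taken from the patch:

```c
#include <stddef.h>

#define SKETCH_READ_CONGESTED  (1UL << 0)
#define SKETCH_WRITE_CONGESTED (1UL << 1)

struct sketch_bdi {
	unsigned long state;	/* congestion bits for a simple queue */
	/* optional callout for stacked queues; NULL for simple ones */
	int (*congested_fn)(void *data, unsigned long bits);
	void *congested_data;
};

/* Ask the callout if one is installed, else test the state bits. */
static int sketch_bdi_congested(const struct sketch_bdi *bdi, unsigned long bits)
{
	if (bdi->congested_fn)
		return bdi->congested_fn(bdi->congested_data, bits);
	return (bdi->state & bits) != 0;
}

/* Read-or-write query in a single call. */
static int sketch_bdi_rw_congested(const struct sketch_bdi *bdi)
{
	return sketch_bdi_congested(bdi,
			SKETCH_READ_CONGESTED | SKETCH_WRITE_CONGESTED);
}

/* Example callout for a stacked device: congested if any lower queue is. */
struct sketch_stack {
	const struct sketch_bdi **lower;
	size_t nr_lower;
};

static int sketch_stack_congested(void *data, unsigned long bits)
{
	const struct sketch_stack *s = data;
	size_t i;

	for (i = 0; i < s->nr_lower; i++)
		if (sketch_bdi_congested(s->lower[i], bits))
			return 1;
	return 0;
}
```

Simple queues leave the function pointer NULL and pay only for a bit test,
while a stacked device points it at a routine that walks its contributing
queues.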
Currently set_rte() changes the RTE without iosapic_lock held. I guess it
assumes that it is only called at boot time. But set_rte() can also be
called by a PCI driver after boot, so I think set_rte() should take
iosapic_lock.
[PATCH] ia64: Allow IO port space without EFI RT attribute
Some firmware does not require run-time mapping of the legacy IO port
space. (It may not need to perform any IO port operations, or it may
do them with translation disabled.)
(efi_get_iobase): Don't require that IO port space be marked RT, since
there's no reason the firmware should require mappings for it.
Thanks to Greg Albrecht for noticing this.
Also, allow attributes in addition to EFI_MEMORY_UC. I can't
think of another current attribute that makes sense, but the
kernel only depends on being able to use UC.
Andrew Morton [Mon, 12 Apr 2004 06:43:34 +0000 (23:43 -0700)]
[PATCH] s390: zfcp log messages part 2
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
zfcp host adapter log message cleanup part 2:
- Shorten log output.
- Increase log level for some messages.
- Always print leading zeroes for wwpn and fcp-lun.