Alexander Viro [Sat, 7 Sep 2002 10:05:09 +0000 (03:05 -0700)]
[PATCH] (24/25) disk capacity helpers
new helpers - get_capacity(gendisk)/set_capacity(gendisk, sectors).
Drivers switched to these; that eliminates most of the accesses to
disk->part[]... in the drivers (and makes code more readable, while
we are at it). That had caught several bugs when minor had been
used in place of minor>>minor_shift (acsi.c is especially nasty in
that respect; I don't know if it had ever been used with multiple
devices...)
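A minimal userspace sketch of what such helpers could look like, assuming the whole-disk size lives in disk->part[0].nr_sects. The struct layouts below are simplified stand-ins, and unit_from_minor() is a hypothetical helper added only to show the minor versus minor>>minor_shift distinction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Simplified stand-ins for the kernel structures (assumed layout). */
struct hd_struct {
    sector_t start_sect;
    sector_t nr_sects;
};

struct gendisk {
    int first_minor;
    int minor_shift;
    struct hd_struct *part;    /* part[0] describes the whole disk */
};

/* Read the size of the whole disk, in sectors. */
static inline sector_t get_capacity(const struct gendisk *disk)
{
    assert(disk != NULL && disk->part != NULL);
    return disk->part[0].nr_sects;
}

/* Set the size of the whole disk, in sectors. */
static inline void set_capacity(struct gendisk *disk, sector_t sectors)
{
    assert(disk != NULL && disk->part != NULL);
    disk->part[0].nr_sects = sectors;
}

/* Hypothetical helper: the unit number is minor >> minor_shift,
 * not the raw minor. */
static inline int unit_from_minor(const struct gendisk *disk, int minor)
{
    return minor >> disk->minor_shift;
}
```

With the helpers in place a driver never indexes disk->part[] itself, so there is no index to get wrong.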
Alexander Viro [Sat, 7 Sep 2002 10:05:04 +0000 (03:05 -0700)]
[PATCH] (23/25) move pointer to gendisk from hwif to drive
ide switched from hwif->gd[i] to hwif->drive[i]->disk - IOW, instead
of an array of two pointers to gendisks referred to from hwif, we keep these pointers
in relevant drives. Cleaned up.
Alexander Viro [Sat, 7 Sep 2002 10:04:42 +0000 (03:04 -0700)]
[PATCH] (18/25) pcd.c - cleanup, killed uses of cdi->dev
pcd.c cleaned up, uses of cdi->dev eliminated, abuse of macros killed
(it used to have
#define PCD pcd[unit]
#define PI PCD.pi
and expected 'unit' to be local variable in each function that used these
(== almost every function in there)).
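A rough illustration of the difference, not the actual pcd.c code: instead of macros that silently capture a local 'unit', each function receives the unit it works on. The struct fields and pcd_port() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PCD_UNITS 4

struct pi_adapter { int port; };    /* hypothetical stand-in */

struct pcd_unit {
    struct pi_adapter pi;
    int present;
};

static struct pcd_unit pcd[PCD_UNITS];

/*
 * Old style, as quoted above: the macros only compile inside functions
 * that happen to have a local variable called 'unit'.
 *
 *   #define PCD pcd[unit]
 *   #define PI  PCD.pi
 *
 * New style: the dependency is an explicit argument.
 */
static int pcd_port(const struct pcd_unit *cd)
{
    assert(cd != NULL);
    return cd->pi.port;
}
```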
Alexander Viro [Sat, 7 Sep 2002 10:04:20 +0000 (03:04 -0700)]
[PATCH] (13/25) sbpcd.c - beginning of cleanup
sbpcd.c - sigh... It used to have a global variable inventively called
'd'. Current disk number. Tons of uses, 99% of them being D_S[d].<blah>.
Added a new variable - current_drive. Said animal is equal to D_S + d -
it's reassigned at the same place as d.
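A small sketch of the invariant described above, assuming a simplified drive structure: both variables are updated in one place, so current_drive is always D_S + d. The fields and the body of switch_drive() here are illustrative, not the driver's own.

```c
#include <assert.h>

#define NR_SBPCD 4

struct sbpcd_drive { int drv_id; int open_count; };    /* fields hypothetical */

static struct sbpcd_drive D_S[NR_SBPCD];
static int d;                               /* legacy global: current disk number */
static struct sbpcd_drive *current_drive;   /* invariant: current_drive == D_S + d */

/* Reassign both in the same place so they cannot drift apart. */
static void switch_drive(int num)
{
    assert(num >= 0 && num < NR_SBPCD);
    d = num;
    current_drive = D_S + num;
}
```

Uses of D_S[d].<blah> can then be converted to current_drive-><blah> one at a time without changing behaviour.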
Alexander Viro [Sat, 7 Sep 2002 10:04:11 +0000 (03:04 -0700)]
[PATCH] (11/25) sr.c naming cleanup
Global search'n'replace job - 'SCp' (Scsi_CD pointer - I'm not kidding;
and yes, they spell it "Scsi") replaced with 'cd' (sr.c, sr_ioctl.c,
sr_vendor.c).
Alexander Viro [Sat, 7 Sep 2002 10:04:07 +0000 (03:04 -0700)]
[PATCH] (10/25) sr.c device name handling
sr.c: we set SCp->cdi.name from the very beginning, which allows
to kill passing minors in many cases (we can use "%s...", SCp->cdi.name instead
of "sr%d...", minor and that turns out to be the majority of places where
we use minors at all).
Alexander Viro [Sat, 7 Sep 2002 10:04:02 +0000 (03:04 -0700)]
[PATCH] (9/25) update_partition()
new helper - update_partition(disk, partition_number); does the
right thing wrt devfs and driverfs (un)registration of partition entries.
BLKPG ioctls fixed - now they call that beast rather than calling only
devfs side. New helper - rescan_partitions(disk, bdev); does all work
with wiping/rereading/etc. and fs/block_dev.c now uses it instead of
check_partition(). The latter became static.
Each hd_struct used to have int number; in it. It's used _only_
in disk->part[0] - disk->part[n].number is never assigned/checked for any
positive n. Moved from hd_struct to gendisk (disk->part[0].number to
disk->number).
disk->driverfs_dev_arr is either NULL or consists of exactly one
element. Same change as above (struct device ** -> struct device *); old
"is the pointer to array itself NULL or not?" replaced with a flag (in
disk->flags).
Alexander Viro [Sat, 7 Sep 2002 10:03:44 +0000 (03:03 -0700)]
[PATCH] (5/25) Removing bogus arrays - ->flags[]
Seeing that now disk->flags[] always consists of one element, we
replace char *flags with int flags, remove the junk from places that used
to allocate these "arrays" and do obvious updates of the code
(s/->flags[0]/->flags/).
Alexander Viro [Sat, 7 Sep 2002 10:03:36 +0000 (03:03 -0700)]
[PATCH] (3/25) Removing useless minor arguments
driverfs_remove_partitions(), devfs_register_partitions(),
driverfs_create_partitions(), devfs_register_partition(), devfs_register_disc(),
had lost 'minor' argument - it's always disk->first_minor these days.
disk_name() takes partition number instead of minor now. Callers of
wipe_partitions() in fs/block_dev.c expanded. Remaining caller passes
gendisk instead of kdev_t now.
Alexander Viro [Sat, 7 Sep 2002 10:03:31 +0000 (03:03 -0700)]
[PATCH] (2/25) Removing ->nr_real
Since ->nr_real is always 1 now, we can remove that field completely.
Removed the last remnants of switch in disk_name() (it could be killed
a long time ago, I just forgot to remove the last two cases when md and i2o
got converted). Collapsed several instances of
disk->part[minor - disk->first_minor] - in cases when we know that we deal
with disk->part[0].
Alexander Viro [Sat, 7 Sep 2002 10:03:26 +0000 (03:03 -0700)]
[PATCH] (1/25) Unexporting helper functions
wipe_partitions() and driverfs_register_partitions(..., 1) (i.e.
unregistering them) pulled into del_gendisk() and removed from callers.
grok_partitions() merged with register_disk(). devfs_register_partitions(),
grok_partitions() and wipe_partitions() not exported anymore.
Luca Barbieri [Sat, 31 Aug 2002 03:21:42 +0000 (20:21 -0700)]
[PATCH] Fix panic if pnpbios is enabled and speed up its check in
This fixes the pnpbios CS check to check for the correct values (it
wasn't up to date with the various GDT reshuffles), moves it inside the
kernel mode check, modifies it so that it takes fewer instructions and
marks it with unlikely().
Note that the 2.5.32 version of this check will cause the kernel to
always panic since it checks for the kernel segments and will thus
decide to jump to the pnpbios fault handler without being in pnpbios.
pnpbios_core.c instead seems to use the correct values.
Neil Brown [Fri, 30 Aug 2002 09:00:00 +0000 (02:00 -0700)]
[PATCH] PATCH - kNFSd - More small fixes for TCP nfsd
sk_inuse should be bigger than "char" as we can
have more than 255 server threads. Due to the way the count
is used, this is unlikely to actually cause a problem, but it
should nonetheless be fixed.
Also, two printks generate more noise than we would like,
so turn them into dprintk (debugging printk).
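To illustrate the counter width issue, a toy example with hypothetical struct names. It uses unsigned char so the wrap-around is well defined in C; the original field was a plain char.

```c
#include <assert.h>

/* Hypothetical miniature sockets that differ only in counter width. */
struct sock_narrow { unsigned char sk_inuse; };
struct sock_wide   { int sk_inuse; };

/* Take n references on a narrow counter and report what it holds. */
static int take_refs_narrow(int n)
{
    struct sock_narrow s = { 0 };
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        s.sk_inuse++;           /* wraps modulo 256 */
    return s.sk_inuse;
}

/* Same, with an int-sized counter. */
static int take_refs_wide(int n)
{
    struct sock_wide s = { 0 };
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        s.sk_inuse++;
    return s.sk_inuse;
}
```

With 300 threads each holding a reference, the narrow counter reads 44 while the wide one reads 300.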
Chuck Lever [Fri, 30 Aug 2002 08:58:33 +0000 (01:58 -0700)]
[PATCH] sock_writeable not appropriate for TCP sockets, for 2.5.32
sock_writeable determines whether there is space in a socket's output
buffer. socket write_space callbacks use it to determine whether to wake
up those that are waiting for more output buffer space.
however, sock_writeable is not appropriate for TCP sockets. because the
RPC layer's write_space callback uses it for TCP sockets, the RPC layer
hammers on sock_sendmsg with dozens of write requests that are only a few
hundred bytes long when it is trying to send a large write RPC request.
this patch adds logic to the RPC layer's write_space callback that
properly handles TCP sockets.
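A loose sketch of the two kinds of test, under assumed accounting: a generic check compares allocated write memory against half the send buffer, while a stream check compares free space against what is already queued. The struct, field names, and thresholds are assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified socket accounting; not the kernel's struct sock. */
struct sock_sketch {
    int sndbuf;         /* send buffer limit, in bytes */
    int wmem_alloc;     /* write memory currently charged to the socket */
    int wmem_queued;    /* bytes sitting in the stream's send queue */
};

/* Generic test: less than half of the send buffer is in use. */
static int generic_writeable(const struct sock_sketch *sk)
{
    assert(sk != NULL);
    return sk->wmem_alloc < sk->sndbuf / 2;
}

/* Stream test: free space must be at least half of what is queued. */
static int stream_writeable(const struct sock_sketch *sk)
{
    assert(sk != NULL);
    int free_space = sk->sndbuf - sk->wmem_queued;
    int min_space = sk->wmem_queued / 2;
    return free_space >= min_space;
}

/* A write_space callback would wake waiters only when this is true. */
static int should_wake_writer(const struct sock_sketch *sk, int is_stream)
{
    return is_stream ? stream_writeable(sk) : generic_writeable(sk);
}
```

In this model a nearly full stream queue can still pass the generic test, which is how a sender ends up being woken for a few hundred bytes at a time.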
Chuck Lever [Fri, 30 Aug 2002 08:58:29 +0000 (01:58 -0700)]
[PATCH] prevent oops in xprt_lock_write, against 2.5.32
when several RPC requests want to reconnect a TCP transport socket at
once, xprt_lock_write serializes the tasks to prevent multiple socket
connects. however, TCP connects are always done by a RPC child task that
has no request slot. xprt_lock_write can oops if there is no request slot
allocated to the invoking RPC task. reviewed and accepted by Trond.
the xprt_lock_write changes are not yet in 2.4, so this patch does not
apply to 2.4.
Ingo Molnar [Fri, 30 Aug 2002 08:56:01 +0000 (01:56 -0700)]
[PATCH] scheduler fixes, 2.5.32-BK
This adds two scheduler related fixes:
- changes the migration code to use struct completion. Andrew pointed out
that there might be a small window in which the up() touches the
semaphore while the waiting task goes on and frees its stack. And
completion is more suited for this kind of stuff anyway.
- removes two unneeded exports, pointed out by Andrew.
Ingo Molnar [Fri, 30 Aug 2002 08:55:56 +0000 (01:55 -0700)]
[PATCH] clone-cleanup 2.5.32-BK
This moves CLONE_SETTID and CLONE_CLEARTID handling into kernel/fork.c,
where it belongs. [the CLONE_SETTLS is x86-specific and thus remains in
the per-arch process.c] This makes support for these two new flags much
easier: architectures only have to pass in the user_tid pointer.
Andrew Morton [Fri, 30 Aug 2002 08:49:31 +0000 (01:49 -0700)]
[PATCH] O_DIRECT for ext3
O_DIRECT support for ext3.
It works OK in all journalling modes.
Updates to the file metadata and inode are journalled as usual.
If the system crashes during an appending O_DIRECT write then journal
recovery will truncate the written-to file back to the length which it
had on entry to that write.
If the system crashes during a file overwrite to existing blocks then
the file contents will be an unknown mixture of old and new.
If the system crashes during a file overwrite which instantiates new
blocks in the middle of the file then there is a possibility of
uninitialised disk blocks being present in the file post-recovery.
Andrew Morton [Fri, 30 Aug 2002 08:49:26 +0000 (01:49 -0700)]
[PATCH] fix an ext3 deadlock
mpage_writepages() does a lock_page() on pages to be written back, even
when it is being used for page reclaim writeback.
This is normally OK, because the page is unlocked quickly - pages are
unlocked during writeback and nobody should be performing __GFP_FS
allocations inside lock_page().
But it has introduced a ranking problem in ext3:
generic_file_write
->lock_page
->ext3_prepare_write
->journal_start (waits for a commit)
versus
ext3_create()
->journal_start()
->ext3_new_inode(GFP_KERNEL)
->page reclaim
->mpage_writepages
->lock_page (locks up, transaction is held open)
Maybe sometime, I'll have to turn mpage_writepages' lock_page into a
trylock if the caller is PF_MEMALLOC. But for now, let's make ext3's
inside-transaction allocations use GFP_NOFS. There is only one of them.
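The effect of GFP_NOFS can be sketched with a toy flag mask: page reclaim is allowed to call back into the filesystem only when the allocation carries __GFP_FS. The bit values below are illustrative and may_enter_fs() is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative bit values; only the names matter here. */
#define __GFP_WAIT  0x10u
#define __GFP_IO    0x40u
#define __GFP_FS    0x80u

#define GFP_NOFS    (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL  (__GFP_WAIT | __GFP_IO | __GFP_FS)

/* Hypothetical helper: may reclaim for this allocation enter the fs? */
static int may_enter_fs(unsigned int gfp_mask)
{
    /* Catch callers that pass bits this sketch does not model. */
    assert((gfp_mask & ~GFP_KERNEL) == 0);
    return (gfp_mask & __GFP_FS) != 0;
}
```

An allocation made with GFP_NOFS inside a transaction therefore never reaches mpage_writepages() through reclaim, which breaks the second chain above.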
Andrew Morton [Fri, 30 Aug 2002 08:49:22 +0000 (01:49 -0700)]
[PATCH] writeback correctness and efficiency changes
This is a performance and correctness fix against the writeback paths.
The writeback code has competing requirements. Sometimes it is used
for "memory cleansing": kupdate, bdflush, writer throttling, page
allocator writeback, etc. And sometimes this same code is used for
data integrity purposes: fsync, msync, fdatasync, sync, umount, various
other kernel-internal uses.
The problem is: how to handle a dirty buffer or page which is currently
under writeback.
For memory cleansing, we just want to skip that buffer/page and go onto
the next one. But for sync, we must wait on the old writeback and then
start new writeback.
mpage_writepages() is currently correct for cleansing, but incorrect for
sync. block_write_full_page() is currently correct for sync, but
inefficient for cleansing.
The fix is fairly simple.
- In mpage_writepages(), don't skip the page if it's a sync
operation.
- In block_write_full_page(), skip the buffer if it is not a sync
operation. And return -EAGAIN to tell the caller that the writeout
didn't work out. The caller must then set the page dirty again and
move it onto mapping->dirty_pages.
This is an extension of the writepage API: writepage can now return
-EAGAIN. There are only three callers, and they have been updated.
fail_writepage() and ext3_writepage() were actually doing this by
hand. They have been changed to return -EAGAIN. NTFS will want to
be able to return -EAGAIN from its writepage as well.
- A sticky question is: how to tell the writeout code which mode it
is operating in? Cleansing or sync?
It's such a tiny code change that I didn't have the heart to go and
propagate a `mode' argument down every instance of writepages() and
writepage() in the kernel. So I passed it in via current->flags.
Incidentally, the occurrence of a locked-and-dirty buffer in
block_write_full_page() is fairly rare: normally the collision avoidance
happens at the address_space level, via PageWriteback. But some
mappings (blockdevs, ext3 files, etc) have their dirty buffers written
out via submit_bh(). It is these buffers which can stall
block_write_full_page().
This wart will be pretty intrusive to fix. ext3 needs to become fully
page-based (ugh. It's a block-based journalling filesystem, and pages
are unnatural). blockdev mappings are still written out by buffers
because that's how filesystems use them. Putting _all_ metadata
(indirects, inodes, superblocks, etc) into standalone address_spaces
would fix that up.
- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the
kernel function which will start writeback against a mapping for
"data integrity" purposes, whereas the unexported, internal-only
do_writepages() is the writeback function which is used for memory
cleansing. This difference is the reason why I didn't consolidate
those functions ages ago...
- Lots of code paths had a bogus extra call to filemap_fdatawait(),
which I previously added in a moment of weak-headedness. They have
all been removed.
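A toy model of the mode switch, following the rule that cleansing skips a busy buffer while sync waits for it. PF_SYNC's value, the structs, and both functions are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PF_SYNC 0x1UL   /* value is an assumption; only the name is from the text */

struct task { unsigned long flags; };
struct buffer { int locked; int dirty; int writes_started; };

static struct task current_task;    /* stand-in for the kernel's 'current' */

/* Returns 0 if writeout was started, -EAGAIN if a busy buffer was skipped. */
static int write_buffer(struct buffer *bh)
{
    assert(bh != NULL);
    if (bh->locked) {
        if (!(current_task.flags & PF_SYNC))
            return -EAGAIN;     /* cleansing: skip, caller redirties the page */
        bh->locked = 0;         /* sync: model waiting for the old writeback */
    }
    if (bh->dirty) {
        bh->dirty = 0;
        bh->writes_started++;
    }
    return 0;
}

/* Data-integrity entry point: run the same code with the flag set. */
static int sync_write(struct buffer *bh)
{
    unsigned long saved = current_task.flags;
    int ret;

    current_task.flags |= PF_SYNC;
    ret = write_buffer(bh);
    current_task.flags = saved;
    return ret;
}
```

Passing the mode through a per-task flag keeps every writepage() signature unchanged, at the cost of hidden state that callers have to restore.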
Andrew Morton [Fri, 30 Aug 2002 08:49:17 +0000 (01:49 -0700)]
[PATCH] batched freeing of anon pages
A reworked version of the batched page freeing and lock amortisation
for VMA teardown.
It walks the existing 507-page list in the mmu_gather_t in 16-page
chunks, drops their refcounts in 16-page chunks, and de-LRUs and
frees any resulting zero-count pages in up-to-16 page chunks.
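The chunked walk might look roughly like the following userspace sketch. The page struct and the function name are hypothetical; in the kernel the second pass would run under a single LRU lock hold per chunk, which is where the amortisation comes from.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 16

struct page { int count; int freed; };   /* hypothetical stand-in */

/* Drop one reference on each page, CHUNK pages at a time.
 * Returns the number of pages whose count reached zero. */
static size_t release_pages_batched(struct page **pages, size_t nr)
{
    size_t freed = 0;

    assert(pages != NULL || nr == 0);
    for (size_t start = 0; start < nr; start += CHUNK) {
        size_t end = (start + CHUNK < nr) ? start + CHUNK : nr;
        struct page *zero[CHUNK];
        size_t nzero = 0;

        /* Pass 1: drop the refcounts for this chunk. */
        for (size_t i = start; i < end; i++)
            if (--pages[i]->count == 0)
                zero[nzero++] = pages[i];

        /* Pass 2: de-LRU and free the zero-count pages (at most CHUNK). */
        for (size_t i = 0; i < nzero; i++) {
            zero[i]->freed = 1;
            freed++;
        }
    }
    return freed;
}
```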
Ingo Molnar [Fri, 30 Aug 2002 08:45:57 +0000 (01:45 -0700)]
[PATCH] MAINTAINERS patch
please apply this patch (Robert ACK-ed it). While there is a preemptible
kernel entry already, I think listing this at the scheduler entry is
justified, preemption has a number of scheduler interactions.
Ingo Molnar [Fri, 30 Aug 2002 08:45:53 +0000 (01:45 -0700)]
[PATCH] ldt-fix-2.5.32-A3
this is an updated version of the LDT fixes. It fixes the following kinds
of problems:
- fix a possible gcc optimization causing a race causing the loading of a
corrupt LDT descriptor upon context switch. [this fix got simplified
over previous versions.]
- remove an unconditional OOM printk, and there's no need to set ->size
in the OOM path.
- fix preemption bugs, load_LDT()/clear_LDT() was not preemption-safe,
when it was used outside of spinlocks.
the context-switch race is the following. 'LDT modification' is the
following operation: the seg->ldt pointer is modified, then seg->size is
modified. In theory gcc is free to reschedule the two modifications, and
first modify ->size, then ->ldt. Thus if this modification is not
synchronized with context-switches, another thread might see a temporary
state of the new ->size [which was increased], but still the old pointer.
Ie.:
Vojtech Pavlik [Fri, 30 Aug 2002 13:22:23 +0000 (15:22 +0200)]
Ignore error 0xff - 'general error' in AUX wire test in i8042.c,
some mainboards (Andrew Morton's Dell) report that even when everything
is okay with AUX. Also remove a check for very old AMI i8042's, which
could generate false positives on modern buggy mainboards.
Dave Kleikamp [Fri, 30 Aug 2002 08:21:28 +0000 (03:21 -0500)]
Proper implementation of jfs_get_blocks
jfs_get_blocks should return up to the number of blocks in the
extent rather than limiting itself to one block, as the initial,
trivial implementation did. This greatly reduces the overhead of
O_DIRECT reads and writes.
Submitted by Badari Pulavarty (pbadari@us.ibm.com)
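The idea can be sketched with a single extent: map as many blocks as the caller asked for, capped by what remains in the extent. The struct and map_blocks() are hypothetical simplifications, not the JFS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent: len logical blocks starting at lstart,
 * stored contiguously on disk starting at pstart. */
struct extent {
    unsigned long lstart;
    unsigned long len;
    unsigned long pstart;
};

/* Map up to max_blocks starting at logical block lblock.
 * Returns the number of blocks mapped, or 0 if lblock is outside the extent. */
static unsigned long map_blocks(const struct extent *xad, unsigned long lblock,
                                unsigned long max_blocks, unsigned long *pblock)
{
    assert(xad != NULL && pblock != NULL);
    if (lblock < xad->lstart || lblock >= xad->lstart + xad->len)
        return 0;

    unsigned long offset = lblock - xad->lstart;
    unsigned long avail = xad->len - offset;

    *pblock = xad->pstart + offset;
    return avail < max_blocks ? avail : max_blocks;
}
```

A caller doing a 32-block O_DIRECT transfer then needs one lookup per extent instead of one per block.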
David Brownell [Fri, 30 Aug 2002 07:52:18 +0000 (00:52 -0700)]
[PATCH] show pci_pool stats in driverfs
This patch exposes basic allocation statistics for pci pools,
very much like /proc/slabinfo but applying to DMA-consistent
memory. A file "pools" is created in the driverfs directory
for the relevant pci device when the first pool is created, and
removed when the last pool is destroyed.
Please merge to 2.5.latest. If it matters, DaveM said it
looks fine. It produces sane output for all the 2.5.30
USB host controller drivers.
Matthew Dobson [Fri, 30 Aug 2002 07:27:12 +0000 (00:27 -0700)]
[PATCH] PCI Cleanup
The patch removes the pci_confN_(read|write)_config_(byte|word|dword) mess and
pares it down to pci_confN_(read|write). This change is reflected in the
pci_ops structure, which only has read and write function pointers rather than
the byte, word, and dword versions. These changes happen in the pci_conf(1|2)
and pci_bios read and write calls.
This patch also removes the pci_config_(read|write) function pointers. People
shouldn't be using these (I don't think) and should be using the pci_ops
structure linked through the pci_dev structure. These end up calling the same
functions that the pci_config_(read|write) pointers refer to anyway.
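A sketch of the resulting shape: one read and one write operation that take an access size, with byte/word/dword accessors layered on top. The signatures, the fake configuration space, and all names ending in _sketch or starting with fake_ are assumptions for illustration; the real pci_ops members take kernel device arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SIZE 256

/* Hypothetical reduced ops table: size is 1, 2 or 4 bytes. */
struct pci_ops_sketch {
    int (*read)(int where, int size, uint32_t *val);
    int (*write)(int where, int size, uint32_t val);
};

static uint8_t cfg_space[CFG_SIZE];    /* fake configuration space */

static int size_ok(int where, int size)
{
    return (size == 1 || size == 2 || size == 4) &&
           where >= 0 && where + size <= CFG_SIZE;
}

static int fake_read(int where, int size, uint32_t *val)
{
    assert(val != NULL);
    if (!size_ok(where, size))
        return -1;
    *val = 0;
    for (int i = 0; i < size; i++)      /* little-endian assembly */
        *val |= (uint32_t)cfg_space[where + i] << (8 * i);
    return 0;
}

static int fake_write(int where, int size, uint32_t val)
{
    if (!size_ok(where, size))
        return -1;
    for (int i = 0; i < size; i++)
        cfg_space[where + i] = (uint8_t)(val >> (8 * i));
    return 0;
}

static const struct pci_ops_sketch fake_ops = { fake_read, fake_write };

/* A sized accessor becomes a thin wrapper over the single read op. */
static int read_config_word(const struct pci_ops_sketch *ops, int where,
                            uint16_t *val)
{
    uint32_t tmp = 0;
    int ret = ops->read(where, 2, &tmp);

    if (ret == 0)
        *val = (uint16_t)tmp;
    return ret;
}
```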