Russell King [Wed, 17 Apr 2002 19:44:06 +0000 (20:44 +0100)]
This changeset adds a new feature to ARM - the ability to load the
kernel zImage almost anywhere in RAM and call it directly, without
having to copy it to a specific address. This removes a dependency
between the boot loader and the kernel.
- Expand configure help options a bit
- Fix xconfig bug
- Decrease queue depth if a command takes too long to complete
- Test master/slave stuff. It works, but one device can heavily starve
  another. This is the simple approach right now: one device waits
  until the other is completely idle before starting any commands.
  This is not necessary, since we can have queued commands on both
  devices at the same time. TODO.
- Add proc output for oldest command, just for testing.
- pci_dev compile fixes.
- Make sure ide-disk doesn't BUG if TCQ is not used, basically this was
fixed by off-loading the using_tcq setting to ide-tcq.
- Remove warning about 'queued feature set not supported'
- Abstract ide_tcq_wait_dataphase() into a function
Martin Dalecki [Mon, 15 Apr 2002 03:21:46 +0000 (20:21 -0700)]
[PATCH] 2.5.8 IDE 34
- Synchronize with 2.5.8.
- Eliminate the cdrom_log_sense() function.
- Pass a struct request to cdrom_analyze_sense_data() since this is the entity
this function is working on. This shows nicely that this function is broken.
- Use CDROM_PACKET_SIZE where appropriate.
- Kill the obfuscating cmd_buf and cmd_len local variables from
cdrom_transfer_packet_command(). This made it obvious that the parameters of
this function were not adequate - to say the least. Fix this.
- Pass a packed command array directly to cdrom_queue_packed_command(). This
is reducing the number of places where we have to deal with the c member of
struct packet_command.
- Never pass NULL as sense to cdrom_lockdoor().
- Eliminate cdrom_do_block_pc().
- Eliminate the c member of struct packet_command. Pass them through struct
request cmd member.
- Don't enable TCQ unconditionally if there is a TCQ queue depth defined.
- Fix small thinko in the ide_cmd_ioctl() rewrite. (My apologies to everyone
  who has to use hdparm to set up his system...)
the IRQ balancing feature is based on the following requirements:
- irq handlers should be cache-affine to a large degree, without the
explicit use of /proc/irq/*/smp_affinity.
- idle CPUs should be preferred over busy CPUs when directing IRQs towards
them.
- the distribution of IRQs should be random, to avoid all IRQs going to
the same CPU, and to avoid 'heavy' IRQs from loading certain CPUs
unfairly over CPUs that handle 'light' IRQs. The IRQ system has no
knowledge about how 'heavy' an IRQ handler is in terms of CPU cycles.
here is the design and implementation:
- we make per-irq decisions about where the IRQ will go to next. Right
  now there is a fastpath and a slowpath; the real work happens in the
  slow path. The fastpath is very lightweight.
- [ I decided not to measure IRQ handler overhead via RDTSC - it ends up
being very messy, and if we want to be 100% fair then we also need to
measure softirq overhead, and since there is no 1:1 relationship
between softirq load and hardirq load, it's impossible to do
correctly. So the IRQ balancer achieves fairness via randomness. ]
- we stay affine in the micro timescale, and we are loading the CPUs
fairly in the macro timescale. The IO-APIC's lowest priority
distribution method rotated IRQs between CPUs once per IRQ, which was
the worst possible solution for good cache-affinity.
- to achieve fairness and to avoid lock-step situations some real
randomness is needed. The IRQs will wander in the allowed CPU group
randomly, in a Brownian-motion fashion. This is what the 'move()'
function accomplishes. The IRQ moves one step forward or one step
backwards in the allowed CPU mask. [ Note that this achieves a level of
NUMA affinity as well, nearby CPUs are more likely to be NUMA-affine. ]
- the irq balancer has some knowledge about 'how idle' a single CPU is.
The idle task updates the idle_timestamp. Since this update is in the
idle-to-be codepath, it does not increase the latency of idle-wakeup;
the overhead should be zero in all cases that matter. The idle-balancing
happens the following way: when searching for the next target CPU after
an 'IRQ tick' has expired, we first search for 'idle enough' CPUs in the
allowed set. If this does not succeed then we search all CPUs.
- the patch is fully compatible with the /proc/irq/*/smp_affinity
interface as well, everything works as expected.
note that the current implementation can be expressed equivalently in
terms of timer-interrupt-driven IRQ redirection. But I wanted to get some
real feedback before removing the possibility to do finer grained
decisions - and the per-IRQ overhead is very small anyway.
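The Brownian-motion step described above can be sketched as a small userspace model. Only the name 'move()' comes from the text; the mask representation, CPU numbering and wrap-around handling here are illustrative assumptions, not the kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the 'move()' random walk: the IRQ's target CPU steps one
 * forward or one backward inside the allowed CPU mask, wrapping around
 * at the edges. Assumes allowed_mask has at least one bit set and
 * num_cpus <= BITS_PER_LONG. */
static int move(int cpu, unsigned long allowed_mask, int num_cpus)
{
    int step = (rand() & 1) ? 1 : -1;   /* random direction */
    int next = cpu;

    /* keep stepping until we land on a CPU present in the allowed mask */
    do {
        next = (next + step + num_cpus) % num_cpus;
    } while (!(allowed_mask & (1UL << next)));

    return next;
}
```

Because each step only moves to an adjacent allowed CPU, the IRQ stays cache-affine over short timescales while the randomness spreads load over long ones, which is exactly the micro/macro split described above.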
Alexander Schulz [Sun, 14 Apr 2002 18:41:48 +0000 (19:41 +0100)]
[PATCH] 1107/1: Shark: defconfig and updates
This patch updates the defconfig for the Shark and adds an
extern and a define so that the kernel compiles for the Shark.
[PATCH] 1101/1: Make armksyms.c compile again with gcc 3.0.2
Make arch/arm/kernel/armksyms.c compile again with gcc 3.0.2, which it failed to do because of the new EXPORT_SYMBOL_NOVERS(abort); in patch-2_4_18-rmk3.gz. See my mail "EXPORT_SYMBOL_NOVERS(abort) in armksyms.c?" on the linux-arm-kernel list from Mon, 25 Mar 2002.
For !CONFIG_SMP we want the empty inline setup_per_cpu_areas().
If CONFIG_SMP is set, we never want the empty inline. If we use the
generic implementation, we have it here; if not, the arch has it somewhere
else (hopefully).
This makes the cpqfc driver recognize the HP Tachyon. I moved the
device list to an __initdata structure so the driver doesn't build it at
runtime and changed it to use the proper PCI_DEVICE_ID_* names.
With this patch applied, the driver happily detects the disks attached
to my HP Tachyon.
Martin Dalecki [Sun, 14 Apr 2002 04:18:50 +0000 (21:18 -0700)]
[PATCH] 2.5.8-pre3 IDE 33
- Kill unneeded parameters to ide_cmd_ioctl() and ide_task_ioctl().
- Apply Petr Vendrovec's fix for 32-bit vs 16-bit transfers.
- Make CD-ROM usable again by guarding the generic routines against the
  request field abuse found there. We will try to convert this driver to the
  soon-to-be-finished struct ata_request after the generic changes stabilize
  a bit. More precisely, struct ata_taskfile and struct ata_request will merge.
Hans Reiser [Wed, 10 Apr 2002 06:45:38 +0000 (23:45 -0700)]
[PATCH] ReiserFS inode cleanup
This patch fixes a problem that was created during inode structure
cleanup/ private parts separation. This fix was made by Chris Mason.
This is a very critical bugfix. Without it, filesystem corruption
happens during savelink processing and possibly in some other cases.
Hans Reiser [Wed, 10 Apr 2002 06:44:39 +0000 (23:44 -0700)]
[PATCH] ReiserFS journal replay
This patch fixes a journal replay bug where the old code would replay
transactions whose mount_id did not match the mount_id recorded in the
journal header.
Fixed by Chris Mason.
Hans Reiser [Wed, 10 Apr 2002 06:43:52 +0000 (23:43 -0700)]
[PATCH] ReiserFS get_block fix
This patch converts the pap14030 panic into a warning. While doing this,
a bug was uncovered: when get_block() returns a failure, the buffer is
still marked as mapped, and on subsequent access to this buffer
get_block() was not called anymore. This is also fixed.
Martin Dalecki [Wed, 10 Apr 2002 06:36:56 +0000 (23:36 -0700)]
[PATCH] 2.5.8-pre3 IDE 31
- Integrate the TCQ stuff from Jens Axboe. Deal with the conflicts, apply some
cosmetic changes. We are still not at a stage where we could immediately
integrate ata_request and ata_taskfile but we are no longer far away.
- Clean up the data transfer function in ide-disk to use ata_request structures
directly.
- Kill useless leading version information in ide-disk.c
- Replace the ATA_AR_INIT macro with inline ata_ar_init() function.
- Replace IDE_CLEAR_TAG with ata_clear_tag().
- Replace IDE_SET_TAG with ata_set_tag().
- Kill gorgeous ide_dmafunc_verbose().
- Fix typo in ide_enable_queued() (ide-tcq.c!)
Apparently there are still problems with a TCQ-enabled device and a
non-enabled device on the same channel, but let's first synchronize up with Jens.
Martin Dalecki [Wed, 10 Apr 2002 06:36:44 +0000 (23:36 -0700)]
[PATCH] 2.5.8-pre3 IDE 30
- Eliminate ide_task_t and rename struct ide_task_s to struct ata_taskfile.
This should become the entity which is holding all data for a request in the
future. If this turns out to be the case, we will just rename it to
ata_request.
- Reduce the number of arguments for the ata_taskfile() function. This helps to
wipe quite a lot of code out as well.
This stage is not sensitive, so let's make a patch before we start to integrate
the last work of Jens Axboe.
Paul Mackerras [Thu, 11 Apr 2002 08:31:10 +0000 (18:31 +1000)]
PPP updates and fixes. This fixes the various SMP races, deadlocks
and scheduling-in-interrupt problems we had, and also makes it
much faster when handling large numbers (100s or more) of PPP units.
Andy Grover [Wed, 10 Apr 2002 06:20:28 +0000 (23:20 -0700)]
[PATCH] redo patch clobbered by ACPI
The latest ACPI merge accidentally clobbered another change in pci-irq.c.
Here's the original patch again (applies fine except for an offset).
Thanks -- Andy
Alexander Viro [Wed, 10 Apr 2002 04:32:22 +0000 (21:32 -0700)]
[PATCH] jffs2_get_sb() fixes
Fixes races in jffs2_get_sb() - current code has a window when two
mounts of the same mtd device can miss each other, resulting in two
active instances of jffs2 fighting over the same device.
Alexander Viro [Wed, 10 Apr 2002 04:32:08 +0000 (21:32 -0700)]
[PATCH] cramfs cleanup
All places where we do blkdev_size_in_bytes(sb->s_dev) are bogus - we
can get the same information from ->s_bdev without messing with kdev_t,
major/minor, etc.
There will be more patches of that kind - in the long run I'd expect
only one caller of blkdev_size_in_bytes() to survive - the one in
fs/block_dev.c, that is, called when we open the device.
Andrew Morton [Wed, 10 Apr 2002 04:30:06 +0000 (21:30 -0700)]
[PATCH] replace kupdate and bdflush with pdflush
Pretty simple.
- use a timer to kick off a pdflush thread every five seconds
to run the kupdate code.
- wakeup_bdflush() kicks off a pdflush thread to run the current
bdflush function.
There's some loss of functionality here - the ability to tune
the writeback periods. The numbers are hardwired at present.
But the intent is that buffer-based writeback disappears
altogether. New mechanisms for tuning the writeback will
need to be introduced.
Andrew Morton [Wed, 10 Apr 2002 04:29:59 +0000 (21:29 -0700)]
[PATCH] use pdflush for unused inode writeback
This is pdflush's first application! The writeback of
the unused inodes list by keventd is removed, and a
pdflush thread is dispatched instead.
There is a need for exclusion - to prevent all the
pdflush threads from working against the same request
queue. This is implemented locally. And this is a
problem, because other pdflush threads can be dispatched
to writeback other filesystem objects, and they don't
know that there's already a pdflush thread working that
request queue.
So moving the exclusion into the request queue itself
is on my things-to-do-list. But the code as-is works
OK - under a `dbench 100' load the number of pdflush
instances can grow as high as four or five. Some fine
tuning is needed...
Andrew Morton [Wed, 10 Apr 2002 04:29:47 +0000 (21:29 -0700)]
[PATCH] writeback daemons
This patch implements a gang-of-threads which are designed to
be used for dirty data writeback. "pdflush" -> dirty page
flush, or something.
The number of threads is dynamically managed by a simple
demand-driven algorithm.
"Oh no, more kernel threads". Don't worry, kupdate and
bdflush disappear later.
The intent is that no two pdflush threads are ever performing
writeback against the same request queue at the same time.
It would be wasteful to do that. My current patches don't
quite achieve this; I need to move the state into the request
queue itself...
The driver for implementing the thread pool was to avoid the
possibility where bdflush gets stuck on one device's get_request_wait()
queue while lots of other disks sit idle. Also generality,
abstraction, and the need to have something in place to perform
the address_space-based writeback when the buffer_head-based
writeback disappears.
There is no provision inside the pdflush code itself to prevent
many threads from working against the same device. That's
the responsibility of the caller.
The main API function, `pdflush_operation()' attempts to find
a thread to do some work for you. It is not reliable - it may
return -1 and say "sorry, I didn't do that". This happens if
all threads are busy.
One _could_ extend pdflush_operation() to queue the work so that
it is guaranteed to happen. If there's a need, that additional
minor complexity can be added.
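The dispatch semantics of pdflush_operation() described above can be modelled in a few lines of userspace C. The struct pdflush_pool and the fixed-size slot array are hypothetical stand-ins for this sketch; the real kernel manages a dynamic pool of kernel threads:

```c
#define MAX_PDFLUSH_THREADS 8

/* Toy model: each slot represents a pdflush thread. pdflush_operation()
 * hands the work to a free thread, or fails with -1 when all threads
 * are busy - the caller must cope with that, as described above. */
struct pdflush_pool {
    void (*work[MAX_PDFLUSH_THREADS])(unsigned long);
    unsigned long arg[MAX_PDFLUSH_THREADS];
    int busy[MAX_PDFLUSH_THREADS];
};

static int pdflush_operation(struct pdflush_pool *pool,
                             void (*fn)(unsigned long), unsigned long arg0)
{
    for (int i = 0; i < MAX_PDFLUSH_THREADS; i++) {
        if (!pool->busy[i]) {
            pool->busy[i] = 1;      /* claim this thread */
            pool->work[i] = fn;
            pool->arg[i] = arg0;
            return 0;               /* work accepted */
        }
    }
    return -1;  /* all threads busy - not an error, just "didn't do it" */
}
```

The "unreliable" return value is the interesting design choice: callers treat a -1 as a missed optimization rather than a failure, which keeps the pool simple and avoids unbounded work queues.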
Andrew Morton [Wed, 10 Apr 2002 04:29:40 +0000 (21:29 -0700)]
[PATCH] page->buffers abstraction
page->buffers is a bit of a layering violation. Not all address_spaces
have pages which are backed by buffers.
The exclusive use of page->buffers for buffers means that a piece of
prime real estate in struct page is unavailable to other forms of
address_space.
This patch turns page->buffers into `unsigned long page->private' and
sets in place all the infrastructure which is needed to allow other
address_spaces to use this storage.
This change allows the multipage-bio-writeout patches to use
page->private to cache the results of an earlier get_block(), so
repeated calls into the filesystem are not needed in the case of file
overwriting.
Developers should think carefully before calling try_to_free_buffers()
or block_flushpage() or writeout_one_page() or waitfor_one_page()
against a page. It's only legal to do this if you *know* that the page
is buffer-backed. And only the address_space knows that.
Arguably, we need new a_ops for writeout_one_page() and
waitfor_one_page(). But I have more patches on the boil which
obsolete these functions in favour of ->writepage() and wait_on_page().
The new PG_private page bit is used to indicate that there
is something at page->private. The core kernel does not
know what that object actually is, just that it's there.
The kernel must call a_ops->releasepage() to try to make
page->private go away. And a_ops->flushpage() at truncate
time.
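The opaque-pointer scheme can be illustrated with a toy userspace model. The helper names demo_releasepage() and try_to_release_page() and the PG_private bit value are made up for this sketch; the real kernel types and flag values differ:

```c
#define PG_private 1UL  /* hypothetical flag bit; real value differs */

/* Minimal model: page->private is an opaque word, PG_private says
 * "something is attached", and only the address_space's releasepage()
 * knows how to take that something apart. */
struct page {
    unsigned long flags;
    unsigned long private;  /* was page->buffers; now opaque storage */
};

/* stands in for the a_ops->releasepage() an address_space supplies */
static int demo_releasepage(struct page *page)
{
    page->private = 0;          /* owner frees whatever it cached here */
    page->flags &= ~PG_private;
    return 1;                   /* released */
}

/* the core kernel side: it never interprets page->private itself */
static int try_to_release_page(struct page *page,
                               int (*releasepage)(struct page *))
{
    if (!(page->flags & PG_private))
        return 1;               /* nothing attached */
    return releasepage(page);   /* defer to the owning address_space */
}
```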
Andrew Morton [Wed, 10 Apr 2002 04:29:32 +0000 (21:29 -0700)]
[PATCH] readahead
I'd like to be able to claim amazing speedups, but
the best benchmark I could find was diffing two
256 megabyte files, which is about 10% quicker. And
that is probably due to the window size being effectively
50% larger.
Fact is, any disk worth owning nowadays has a segmented
2-megabyte cache, and OS-level readahead mainly seems
to save on CPU cycles rather than overall throughput.
Once you start reading more streams than there are segments
in the disk cache, we start to win.
Still. The main motivation for this work is to
clean the code up, and to create a central point at
which many pages are marshalled together so that
they can all be encapsulated into the smallest possible
number of BIOs, and injected into the request layer.
A number of filesystems were poking around inside the
readahead state variables. I'm not really sure what they
were up to, but I took all that out. The readahead
code manages its own state autonomously and should not
need any hints.
- Unifies the current three readahead functions (mmap reads, read(2)
and sys_readhead) into a single implementation.
- More aggressive in building up the readahead windows.
- More conservative in tearing them down.
- Special start-of-file heuristics.
- Preallocates the readahead pages, to avoid the (never demonstrated,
but potentially catastrophic) scenario where allocation of readahead
pages causes the allocator to perform VM writeout.
- Gets all the readahead pages gathered together in
one spot, so they can be marshalled into big BIOs.
- reinstates the readahead ioctls, so hdparm(8) and blockdev(8)
are working again. The readahead settings are now per-request-queue,
and the drivers never have to know about it. I use blockdev(8).
It works in units of 512 bytes.
- Identifies readahead thrashing.
Also attempts to handle it. Certainly the changes here
delay the onset of catastrophic readahead thrashing by
quite a lot, and decrease its seriousness as we get more
deeply into it, but it's still pretty bad.
Andrew Morton [Wed, 10 Apr 2002 04:29:24 +0000 (21:29 -0700)]
[PATCH] Velikov/Hellwig radix-tree pagecache
Before the mempool was added, the VM was getting many, many
0-order allocation failures due to the atomic ratnode
allocations inside swap_out. That monster mempool is
doing its job - drove a 256meg machine a gigabyte into
swap with no ratnode allocation failures at all.
So we do need to trim that pool a bit, and also handle
the case where swap_out fails, and not just keep
pointlessly calling it.
Rusty Russell [Wed, 10 Apr 2002 04:25:36 +0000 (21:25 -0700)]
[PATCH] 2.5.8-pre3 set_bit cleanup IV
This changes everything arch-specific in PPC and i386 which should have
been unsigned long (it doesn't *matter*, but bad habits get copied to
where it does matter).
Rusty Russell [Wed, 10 Apr 2002 04:25:21 +0000 (21:25 -0700)]
[PATCH] 2.5.8-pre3 set_bit cleanup II
This changes over some bogus casts, and converts the ext2, hfs and
minix set-bit macros. Also changes pte and open_fds to hand in the
actual bitfield rather than the whole structure.
Robert Love [Tue, 9 Apr 2002 11:02:50 +0000 (04:02 -0700)]
[PATCH] cpu affinity syscalls
This patch implements the following calls to set and retrieve a task's
CPU affinity:
int sched_setaffinity(pid_t pid, unsigned int len,
                      unsigned long *new_mask_ptr)
int sched_getaffinity(pid_t pid, unsigned int len,
                      unsigned long *user_mask_ptr)
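For illustration, here is a userspace sketch using the glibc cpu_set_t wrappers that later grew around these raw syscalls (the syscalls themselves take a plain bitmask plus its length, as shown above). The helper names pin_to_cpu() and allowed_cpu_count() are made up for this example; Linux-only:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single CPU. pid 0 means "self".
 * Returns 0 on success, -1 on failure. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* How many CPUs is the calling process currently allowed to run on? */
static int allowed_cpu_count(void)
{
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}
```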
a) the part of open_namei() done after we'd found the vfsmount/dentry of
the object we want to open has been split into a helper - may_open().
b) do_open() in fs/nfsctl.c didn't do any permission checks on
the nfsd file it was opening - sudden idiocy attack on my part (I missed
the fact that dentry_open() doesn't do permission checks - open_namei()
does). Fixed by adding obvious may_open() calls.
Cosmetic change: x86_capability. Makes it an unsigned long, and
removes the gratuitous & operators (it is already an array). These
produce warnings when set_bit() etc. takes an unsigned long * instead
of a void *.
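A userspace model of why the bitops should take unsigned long * rather than void *: callers can then pass an unsigned long array such as x86_capability directly, with no casts or gratuitous & operators. These set_bit()/test_bit() definitions are simplified, non-atomic stand-ins for the arch-specific kernel versions:

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Simplified bitops over an array of unsigned long. Taking
 * 'unsigned long *' lets the compiler type-check callers, which a
 * 'void *' parameter silently cannot. */
static void set_bit(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int test_bit(int nr, const unsigned long *addr)
{
    return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}
```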
i810_rng: add support for other i8xx chipsets to the Random Number Generator module.
This is done by adding detection of the 82801BA(M) and 82801CA(M) I/O Controller Hubs.
Martin Dalecki [Tue, 9 Apr 2002 04:08:04 +0000 (21:08 -0700)]
[PATCH] 2.5.8-pre2 IDE 29b
- Eliminate the mate member of the ata_channel structure. The information
provided by it is already present. This patch may have undesirable
  effects on the ns87415.c and trm290.c host chip drivers, but it's
  worthwhile for structural reasons.
- Kill unused code which was "fixing" interrupt routing in ide-pci.c. Don't
  pass any "mate" between the functions there.
- Don't define SUPPORT_VLB_SYNC unconditionally in ide-taskfile.c
- Apply Vojtech Pavlik's fix for piix host-chip driver crashes.
- Add linux/types.h to ide-pnp.c.
- Apply the latest sis5513 host chip driver patch by Lionel Bouton by hand.
- Apply a patch by Paul Mackerras for power-mac.
- Try to make the ns87415 driver a bit more reentrant.
David Brownell [Mon, 8 Apr 2002 08:01:34 +0000 (01:01 -0700)]
This patch is a more complete fix for the device refcount
sanity checking and cleanup on device disconnect.
- Splits apart usb_dec_dev_use(), for driver use, and
usb_free_dev(), for hub/hcd use. Both now have
kerneldoc, and will BUG() if the refcount and the
device tree get out of sync. (Except for cleanup of
root hub init errors, refcount must go to zero only
at the instant disconnect processing completes.)
- More usbcore-internal function declarations are
now moved out of <linux/usb.h> into hcd.h
- Driver-accessible refcounting is now inlined; minor
code shrinkage, it's using atomic inc/dec instructions
not function calls.
<note from greg k-h, there is still some work to be done with USB device
reference counting, but this patch is a step in the right direction.>
Richard Gooch [Mon, 8 Apr 2002 05:24:38 +0000 (22:24 -0700)]
[PATCH] devfs patch for 2.5.8-pre2
- Documentation updates
- BKL removal (devfs doesn't need the BKL)
- Changed <devfs_rmdir> to allow later additions if not yet empty
- Added calls to <devfs_register_partitions> in drivers/block/blkpc.c
<add_partition> and <del_partition>
- Bug fixes in unique number and devnum allocators.
Brian Gerst [Mon, 8 Apr 2002 05:22:40 +0000 (22:22 -0700)]
[PATCH] Clean up x86 interrupt entry code
This patch moves the generation of the asm interrupt entry stubs from
i8259.c to entry.S. This allows it to be done with less code and
without needing duplicate definitions of SAVE_ALL, GET_CURRENT, etc.
It is a step on the road to removal of the arrays.
It also solves other things, like the fact that Linux
is unable to read the last sector of a disk or partition
with an odd number of sectors.
Anton Blanchard [Mon, 8 Apr 2002 04:31:16 +0000 (21:31 -0700)]
[PATCH] increase dynamic proc entries for ppc64
Unfortunately the proc filesystem has a limit on the number of dynamic
proc entries it can create. On large systems we can exhaust the default
(4096) very quickly. The following patch increases the default to
something more reasonable.