Scott Feldman [Mon, 7 Jul 2003 21:57:00 +0000 (17:57 -0400)]
[e1000] s/int/unsigned int/ for descriptor ring indexes
* Perf cleanup: s/int/unsigned int/ for descriptor ring indexes
[suggestion by Jeff Garzik].
* Perf cleanup: cache references to ring elements using local pointer
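The two cleanups can be sketched in a user-space toy model (the struct and field names here are invented for illustration, not the real e1000 structures): the ring index is an unsigned int so wraparound arithmetic is well defined, and each descriptor is grabbed once into a local pointer instead of re-indexing the array.

```c
#include <assert.h>

/* Hypothetical model of a descriptor ring; names are illustrative,
 * not the actual e1000 driver structures. */
#define RING_SIZE 256u

struct tx_desc { unsigned long long addr; unsigned int len; };

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned int next_to_use;   /* unsigned: wraparound is well defined */
};

/* Advance the index with an unsigned wrap, and hand back a cached
 * local pointer so callers don't re-index the array on every access. */
static struct tx_desc *ring_next(struct tx_ring *ring)
{
    struct tx_desc *d = &ring->desc[ring->next_to_use];
    ring->next_to_use = (ring->next_to_use + 1) % RING_SIZE;
    return d;
}
```

The real driver uses a power-of-two mask rather than a modulo, but the index-typing point is the same.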
Scott Feldman [Mon, 7 Jul 2003 21:54:51 +0000 (17:54 -0400)]
[e1000] h/w workaround for mis-fused parts
* h/w workaround: several tens of thousands of 82547 controllers were
mis-fused during manufacturing, causing the PHY Tx amplitude to be
too high and out of spec. This workaround detects those parts and
compensates the Tx amplitude by subtracting ~80mV.
Andrew Morton [Sat, 5 Jul 2003 02:38:20 +0000 (19:38 -0700)]
[PATCH] fix rfcomm oops
From: ilmari@ilmari.org (Dagfinn Ilmari Mannsaker)
It turns out that net/bluetooth/rfcomm/sock.c (and
net/bluetooth/hci_sock.c) had been left out when net_proto_family gained an
owner field, here's a patch that fixes them both.
Andrew Morton [Sat, 5 Jul 2003 02:38:06 +0000 (19:38 -0700)]
[PATCH] fix current->user->__count leak
From: Arvind Kandhare <arvind.kan@wipro.com>
When switch_uid is called, the reference count of the new user is
incremented twice. I think the increment in switch_uid is done because
of the reparent_to_init() function, which does not increase the __count
for the root user.
But if switch_uid is called from any other function, the reference count is
already incremented by the caller by calling alloc_uid for the new user.
Hence the count is incremented twice. The user struct will not be deleted
even when there are no processes holding a reference count for it. This
does not cause any problem currently because nothing is dependent on timely
deletion of the user struct.
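The fix amounts to making switch_uid consume the reference the caller already took via alloc_uid, rather than taking a second one. A minimal user-space model of that ownership transfer (the struct here is a stand-in, not the kernel's user_struct):

```c
#include <assert.h>

/* Toy model of the leak: alloc_uid() hands the caller a reference, so
 * switch_uid() must transfer it, not increment again.  Names mirror
 * the kernel but the structures are simplified for illustration. */
struct user_struct { int count; };

static void get_uid(struct user_struct *u)  { u->count++; }
static void free_uid(struct user_struct *u) { u->count--; }

/* The caller already holds a reference from alloc_uid(); the fixed
 * switch_uid() consumes it instead of taking another. */
static void switch_uid(struct user_struct **cur, struct user_struct *new_user)
{
    struct user_struct *old = *cur;
    *cur = new_user;        /* consume the caller's reference */
    free_uid(old);          /* drop the reference on the old user */
}
```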
Andrew Morton [Sat, 5 Jul 2003 02:37:46 +0000 (19:37 -0700)]
[PATCH] after exec_mmap(), exec cannot fail
If de_thread() fails in flush_old_exec() then we try to fail the execve().
That is a bad move, because exec_mmap() has already switched the current
process over to the new mm. The new process is not yet sufficiently set up
to handle the error and the kernel doublefaults and dies. exec_mmap() is the
point of no return.
Change flush_old_exec() to call de_thread() before running exec_mmap() so the
execing program sees the error. I added fault injection to both de_thread()
and exec_mmap() - everything now survives OK.
Andrew Morton [Sat, 5 Jul 2003 02:37:26 +0000 (19:37 -0700)]
[PATCH] block request batching
From: Nick Piggin <piggin@cyberone.com.au>
The following patch gets batching working as it should.
After a process is woken up, it is allowed to allocate up to 32 requests
for 20ms. It does not stop other processes from submitting requests
while it isn't submitting, though. This should allow fewer context
switches, and allow batches of requests from each process to be sent to
the io scheduler instead of 1 request from each process.
tiobench sequential writes are more than tripled, random writes are nearly
doubled over mm1. In earlier tests I generally saw better CPU efficiency,
but it doesn't show here. There is still debug to be taken out. It's also
only on UP.
Andrew Morton [Sat, 5 Jul 2003 02:37:12 +0000 (19:37 -0700)]
[PATCH] block batching fairness
From: Nick Piggin <piggin@cyberone.com.au>
This patch fixes the request batching fairness/starvation issue. It's not
clear what is going on with 2.4, but it seems that there is a problem
around this area.
Anyway, previously:
* request queue fills up
* process 1 calls get_request, sleeps
* a couple of requests are freed
* process 2 calls get_request, proceeds
* a couple of requests are freed
* process 2 calls get_request...
Now as unlikely as it seems, it could be a problem. It's a fairness problem
that process 2 can skip ahead of process 1 anyway.
With the patch:
* request queue fills up
* any process calling get_request will sleep
* once the queue gets below the batch watermark, processes
start being woken, and may allocate.
This patch includes Chris Mason's fix to only clear queue_full when all tasks
have been woken. Previously I think starvation and unfairness could still
occur.
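The before/after behaviour can be modelled in a few lines of user-space C. All of the constants and field names below are invented; in particular, the real code sleeps on a waitqueue where this model just counts would-be sleepers:

```c
#include <assert.h>

/* Simplified model of the batching fix: while queue_full is set, every
 * allocator sleeps; sleepers are only released once the request count
 * drops below the batch watermark, and queue_full is cleared only after
 * all of them have been woken (Chris Mason's fix). */
#define Q_MAX   8
#define Q_BATCH 4

struct req_queue {
    int in_flight;
    int queue_full;
    int sleepers;
    int woken;      /* stand-in for waking tasks in FIFO order */
};

static int get_request(struct req_queue *q)
{
    if (q->in_flight >= Q_MAX || q->queue_full) {
        q->queue_full = 1;      /* nobody skips ahead once this is set */
        q->sleepers++;
        return 0;               /* would sleep */
    }
    q->in_flight++;
    return 1;
}

static void put_request(struct req_queue *q)
{
    q->in_flight--;
    if (q->queue_full && q->in_flight < Q_BATCH) {
        q->woken += q->sleepers;    /* wake everyone, in queue order */
        q->sleepers = 0;
        q->queue_full = 0;          /* cleared only when all are woken */
    }
}
```

Because queue_full stays set until the count falls below the batch watermark, a later arrival can no longer steal the couple of requests freed in between, which is exactly the process 1 / process 2 scenario above.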
With this change to the blk-fair-batches patch, Chris is showing some much
improved numbers for 2.4 - 170 ms max wait vs 2700ms without blk-fair-batches
for a dbench 90 run. He didn't indicate how much difference his patch alone
made, but it is an important fix I think.
Andrew Morton [Sat, 5 Jul 2003 02:36:59 +0000 (19:36 -0700)]
[PATCH] allow the IO scheduler to pass an allocation hint to
From: Nick Piggin <piggin@cyberone.com.au>
This patch implements a hint so that AS can tell the request allocator to
allocate a request even if there are none left (the accounting is quite
flexible and easily handles overallocations).
elv_may_queue semantics have changed from "the elevator does _not_ want
another request allocated" to "the elevator _insists_ that another request is
allocated". I couldn't see any harm ;)
Now in practice, AS will only allow _1_ request over the limit, because as
soon as the request is sent to AS, it stops anticipating.
Andrew Morton [Sat, 5 Jul 2003 02:36:51 +0000 (19:36 -0700)]
[PATCH] blk_congestion_wait threshold cleanup
From: Nick Piggin <piggin@cyberone.com.au>
Now that we are counting requests (not requests free), this patch changes
the congested & batch watermarks to be more logical. Also a minor fix to
the sysfs code.
Andrew Morton [Sat, 5 Jul 2003 02:36:37 +0000 (19:36 -0700)]
[PATCH] Use kblockd for running request queues
Using keventd for running request_fns is risky because keventd itself can
block on disk I/O. Use the new kblockd kernel threads for the generic
unplugging.
Andrew Morton [Sat, 5 Jul 2003 02:36:30 +0000 (19:36 -0700)]
[PATCH] anticipatory I/O scheduler
From: Nick Piggin <piggin@cyberone.com.au>
This is the core anticipatory IO scheduler. There are nearly 100 changesets
in this and five months' work. I really cannot describe it fully here.
Major points:
- It works by recognising that reads are dependent: we don't know where the
next read will occur, but it's probably close by the previous one. So once
a read has completed we leave the disk idle, anticipating that a request
for a nearby read will come in.
- There is read batching and write batching logic.
- when we're servicing a batch of writes we will refuse to seek away
for a read for some tens of milliseconds. Then the write stream is
preempted.
- when we're servicing a batch of reads (via anticipation) we'll do
that for some tens of milliseconds, then preempt.
- There are request deadlines, for latency and fairness.
The oldest outstanding request is examined at regular intervals. If
this request is older than a specific deadline, it will be the next
one dispatched. This gives a good fairness heuristic while being simple
because processes tend to have localised IO.
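The deadline heuristic in that last point can be sketched as follows. This is a user-space illustration only: the field names, the linear scan, and the seek-distance metric are all invented stand-ins for the scheduler's real data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the dispatch heuristic: normally pick the request closest
 * to the disk head, but if the oldest request has waited past its
 * deadline, dispatch it instead.  Names are illustrative. */
struct rq { long sector; long submit_time; };

static int pick_next(const struct rq *rqs, int n,
                     long head, long now, long deadline)
{
    int oldest = 0, closest = 0, i;

    for (i = 1; i < n; i++) {
        if (rqs[i].submit_time < rqs[oldest].submit_time)
            oldest = i;
        if (labs(rqs[i].sector - head) < labs(rqs[closest].sector - head))
            closest = i;
    }
    if (now - rqs[oldest].submit_time > deadline)
        return oldest;          /* expired: fairness wins */
    return closest;             /* otherwise: seek locality wins */
}
```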
Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying
not to leave the disk head pointlessly idle. Some of these algorithms are:
- Process tracking. If the process whose read we are anticipating submits
a write, abandon anticipation.
- Process exit tracking. If the process whose read we are anticipating
exits, abandon anticipation.
- Process IO history. We accumulate statistical info on the process's
recent IO patterns to aid in making decisions about how long to anticipate
new reads.
Currently thinktime and seek distance are tracked. Thinktime is the
time between when a process's last request has completed and when it
submits another one. Seek distance is simply the number of sectors
between each read request. If either statistic becomes too high, then
it isn't anticipated that the process will submit another read.
The above all means that we need a per-process "io context". This is a fully
refcounted structure. In this patch it is AS-only. Later we generalise it
a little so that other IO schedulers can use the same framework.
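The thinktime/seek-distance tracking above can be modelled with a cheap exponentially-weighted mean. The 7/8 weighting and both thresholds below are invented for illustration; the real scheduler's statistics and cutoffs differ.

```c
#include <assert.h>

/* Toy model of the per-process IO history: EWMA of thinktime and seek
 * distance, and the decision to stop anticipating once either grows
 * too large.  Thresholds are invented for illustration. */
#define MAX_THINKTIME 30        /* ms */
#define MAX_SEEKDIST  8192      /* sectors */

struct io_hist { long mean_thinktime; long mean_seekdist; };

static void io_hist_update(struct io_hist *h, long thinktime, long seekdist)
{
    /* 7/8 old + 1/8 new: the usual cheap exponential decay */
    h->mean_thinktime = (7 * h->mean_thinktime + thinktime) / 8;
    h->mean_seekdist  = (7 * h->mean_seekdist  + seekdist)  / 8;
}

static int should_anticipate(const struct io_hist *h)
{
    return h->mean_thinktime <= MAX_THINKTIME &&
           h->mean_seekdist  <= MAX_SEEKDIST;
}
```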
- Requests are grouped as synchronous and asynchronous whereas deadline
scheduler groups requests as reads and writes. This can provide better
sync write performance, and may give better responsiveness with journalling
filesystems (although we haven't done that yet).
We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
current->flags. The plan is to remove this later, and to propagate the
sync hint from writeback_control.sync_mode into bio->bi_flags thence into
request->flags. Once that is done, direct-io needs to set the BIO sync
hint as well.
- There is also quite a bit of complexity gone into bashing TCQ into
submission. Timing for a read batch is not started until the first read
request actually completes. A read batch also does not start until all
outstanding writes have completed.
AS is the default IO scheduler. deadline may be chosen by booting with
"elevator=deadline".
There are a few reasons for retaining deadline:
- AS is often slower than deadline in random IO loads with large TCQ
windows. The usual real world task here is OLTP database loads.
- deadline is presumably more stable.
- deadline is much simpler.
The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:
* read_expire
Controls how long until a request becomes "expired".
It also controls the interval at which expired requests are served,
so if set to 50, a request might take anywhere up to 100ms to be serviced
_if_ it is next on the expired list.
Obviously it can't make the disk go faster. The result is basically the
timeslice a reader gets in the presence of other IO. 100 / ((seek time /
read_expire) + 1) is very roughly the % streaming read efficiency your disk
should get in the presence of multiple readers.
* read_batch_expire
Controls how much time a batch of reads is given before pending writes
are served. Higher value is more efficient. Shouldn't really be below
read_expire.
* write_ versions of the above
* antic_expire
Controls the maximum amount of time we can anticipate a good read before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for big seek time devices, though not a linear correspondence - most
processes have only a few ms thinktime.
Andrew Morton [Sat, 5 Jul 2003 02:36:09 +0000 (19:36 -0700)]
[PATCH] Create `kblockd' workqueue
keventd is inappropriate for running block request queues because keventd
itself can get blocked on disk I/O. Via call_usermodehelper()'s vfork and,
presumably, GFP_KERNEL allocations.
So create a new gang of kernel threads whose mandate is running low-level
disk operations. It must never block on disk IO, so any memory allocations
should be GFP_NOIO.
We mainly use it for running unplug operations from interrupt context.
Andrew Morton [Sat, 5 Jul 2003 02:36:03 +0000 (19:36 -0700)]
[PATCH] bring back the batch_requests function
From: Nick Piggin <piggin@cyberone.com.au>
The batch_requests function got lost during the merge of the dynamic request
allocation patch.
We need it for the anticipatory scheduler - when the number of threads
exceeds the number of requests, the anticipated-upon task will undesirably
sleep in get_request_wait().
And apparently some block devices which use small requests need it so they
can string a decent number together.
This patch proposes a performance fix for the current IPC semaphore
implementation.
There are two shortcomings in the current implementation:
try_atomic_semop() was called twice to wake up a blocked process:
once from update_queue() (executed by the process that wakes up
the sleeping process) and once in the retry part of the blocked process
(executed by the blocked process once it is woken up).
A second issue is that when several sleeping processes are eligible
for wake-up, they wake up in daisy-chain formation, each one in turn
waking up the next process in line. However, every time a process
wakes up, it starts scanning the wait queue from the beginning, not from
where it was last scanned. This causes a large amount of unnecessary
scanning of the wait queue when the wait queue is deep. Blocked
processes come and go, but chances are there are still quite a few
blocked processes sitting at the beginning of that queue.
What we are proposing here is to merge the portion of the code in the
bottom part of sys_semtimedop() (code that gets executed when a sleeping
process gets woken up) into the update_queue() function. The benefit is
twofold: (1) it reduces redundant calls to try_atomic_semop() and (2) it
increases the efficiency of finding eligible processes to wake up and
allows higher concurrency for multiple wake-ups.
We have measured that this patch improves throughput for a large
application significantly on an industry standard benchmark.
This patch is relative to 2.5.72. Any feedback is very much
appreciated.
The number of function calls to both try_atomic_semop() and update_queue()
is reduced by 50% as a result of the merge. Execution time of
sys_semtimedop() is reduced because of the reduction in the low level
functions.
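The merged wake-up path boils down to a single pass over the wait queue, applying each blocked operation where it stands instead of waking sleepers one at a time and letting each rescan from the head. A user-space sketch (the semaphore is reduced to one counter; all names are stand-ins for the real sysv ipc code):

```c
#include <assert.h>

/* Toy model of the merged wake-up: one pass over the wait queue. */
#define NWAIT 4

struct waiter { int need; int done; };

static int try_op_calls;

/* try_atomic_semop() stand-in: succeed if enough units are available */
static int try_atomic_semop(int *semval, int need)
{
    try_op_calls++;
    if (*semval >= need) { *semval -= need; return 1; }
    return 0;
}

/* One scan, in queue order: wake every waiter whose op now succeeds,
 * rather than waking one waiter and restarting from the head. */
static void update_queue(int *semval, struct waiter *q, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (!q[i].done && try_atomic_semop(semval, q[i].need))
            q[i].done = 1;
}
```

In the daisy-chain scheme each wake-up would retry try_atomic_semop() again in the woken process and rescan from the start; here each waiter is tested exactly once per pass.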
Andrew Morton [Sat, 5 Jul 2003 02:35:49 +0000 (19:35 -0700)]
[PATCH] PCI domain scanning fix
From: Matthew Wilcox <willy@debian.org>
ppc64 oopses on boot because pci_scan_bus_parented() is unexpectedly
returning NULL. Change pci_scan_bus_parented() to correctly handle
overlapping PCI bus numbers on different domains.
If a signal is sent via kill() or tkill() the kernel fills in the wrong
PID value in the siginfo_t structure (obviously only if the handler has
SA_SIGINFO set).
POSIX specifies that the si_pid field is filled with the process ID, and
in Linux parlance that's the "thread group" ID, not the thread ID.
When forcing through a signal for some thread-synchronous
event (ie SIGSEGV, SIGFPE etc that happens as a result of a
trap as opposed to an external event), if the signal is
blocked we will not invoke a signal handler, we will just
kill the thread with the signal.
This is equivalent to what we do in the SIG_IGN case: you
cannot ignore or block synchronous signals, and if you try,
we'll just have to kill you.
We don't want to handle endless recursive faults, which the
old behaviour easily led to if the stack was bad, for example.
Marc Zyngier [Fri, 4 Jul 2003 10:00:47 +0000 (03:00 -0700)]
[PATCH] EISA: avoid unnecessary probing
- By default, do not try to probe the bus if the mainboard does not
seem to support EISA (allow this behaviour to be changed through a
command-line option).
Marc Zyngier [Fri, 4 Jul 2003 10:00:12 +0000 (03:00 -0700)]
[PATCH] EISA: core changes
- Now reserves I/O ranges according to EISA specs (four 256 bytes
regions instead of a single 4KB region).
- By default, do not try to probe the bus if the mainboard does not
seem to support EISA (allow this behaviour to be changed through a
command-line option).
- Use parent bridge device dma_mask as default for each discovered
device.
- Allow devices to be enabled or disabled from the kernel command line
(useful for non-x86 platforms where the firmware simply disables
devices it doesn't know about...).
Carl-Daniel Hailfinger suggested adding a paranoid incoming
trigger as per the "bk help triggers" suggestion, so that
we'll see any new triggers showing up in the tree.
- Make the VFS pass the struct nameidata as an optional argument
to the create inode operation.
- Patch vfs_create() to take a struct nameidata as an optional
argument.
[PATCH] Add open intent information to the 'struct nameidata'
- Add open intent information to the 'struct nameidata'.
- Pass the struct nameidata as an optional parameter to the
lookup() inode operation.
- Pass the struct nameidata as an optional parameter to the
d_revalidate() dentry operation.
- Make link_path_walk() set the LOOKUP_CONTINUE flag in nd->flags instead
of passing it as an extra parameter to d_revalidate().
- Make open_namei(), and sys_uselib() set the open()/create() intent
data.
Jeff Garzik [Thu, 3 Jul 2003 13:23:39 +0000 (06:23 -0700)]
[PATCH] fix via irq routing
Via irq routing has a funky PIRQD location. I checked my datasheets
and, yep, this is correct all the way back to via686a.
This bug existed for _ages_. I wonder if I created it, even...
Re-organize "ext3_get_inode_loc()" and make it easier to
follow by splitting it into two functions: one that calculates
the position, and the other that actually reads the inode
block off the disk.
Add an asynchronous buffer read-ahead facility. Nobody
uses it for now, but I needed it for some tuning tests,
and it is potentially useful for others.
John Stultz [Thu, 3 Jul 2003 09:39:18 +0000 (02:39 -0700)]
[PATCH] jiffies include fix
This patch fixes a bad declaration of jiffies in timer_tsc.c and
timer_cyclone.c, replacing it with the proper usage of jiffies.h.
Caught by gregkh.
Adam Belay [Thu, 3 Jul 2003 15:39:09 +0000 (15:39 +0000)]
[PNP] Handle Disabled Resources Properly
Some devices will allow for individual resources to be disabled,
even when the device as a whole is active. The current PnP
resource manager is not handling this situation properly. This
patch corrects the issue by detecting disabled resources and then
flagging them. The pnp layer will now skip over any disabled
resources. Interface updates have also been included so that we
can properly display resource tables when a resource is disabled.
Also note that a new flag "IORESOURCE_DISABLED" has been added to
linux/ioports.h.
Matthew Wilcox [Thu, 3 Jul 2003 08:52:14 +0000 (01:52 -0700)]
[PATCH] PCI config space in sysfs
- Fix a couple of bugs in sysfs's handling of binary files (my fault).
- Implement pci config space reads and writes in sysfs
Matthew Wilcox [Thu, 3 Jul 2003 08:51:30 +0000 (01:51 -0700)]
[PATCH] PCI: Remove pci_bus_exists
Convert all callers of pci_bus_exists() to call pci_find_bus() instead.
Since all callers of pci_find_bus() are __init or __devinit, mark it as
__devinit too.
Matthew Wilcox [Thu, 3 Jul 2003 08:50:59 +0000 (01:50 -0700)]
[PATCH] PCI: arch/i386/pci/direct.c can use __init, not __devinit
pci_sanity_check() is only called from functions marked __init, so it
can be __init too.
Matthew Wilcox [Thu, 3 Jul 2003 08:50:39 +0000 (01:50 -0700)]
[PATCH] PCI: Improve documentation
Fix some grammar problems
Add a note about Fast Back to Back support
Change the slot_name recommendation to pci_name().
Rusty Russell [Wed, 2 Jul 2003 17:38:29 +0000 (10:38 -0700)]
[PATCH] Make ksoftirqd a normal per-cpu variable.
This moves the ksoftirqd pointers out of the irq_stat struct, and uses a
normal per-cpu variable. It's not that time critical, nor referenced in
assembler. This moves us closer to making irq_stat a per-cpu variable.
Because some archs have hardcoded asm references to offsets in this
structure, I haven't touched non-x86. The __ksoftirqd_task field is
unused in other archs, too.
Rusty Russell [Wed, 2 Jul 2003 17:32:49 +0000 (10:32 -0700)]
[PATCH] Remove cpu arg from cpu_raise_irq
The function cpu_raise_softirq() takes a softirq number, and a cpu number,
but cannot be used with cpu != smp_processor_id(), because there's no
locking around the pending softirq lists. Since no one does this, remove
that arg.
As per Linus' suggestion, names changed:
  raise_softirq(int nr)
  cpu_raise_softirq(int cpu, int nr)   -> raise_softirq_irqoff(int nr)
  __cpu_raise_softirq(int cpu, int nr) -> __raise_softirq_irqoff(int nr)
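The relationship between the renamed functions can be sketched in user space. Here a plain variable stands in for local_irq_save()/local_irq_restore(), and the pending mask is a single word; this is a model of the layering, not the kernel code:

```c
#include <assert.h>

/* raise_softirq() is the irq-safe wrapper; the _irqoff variants assume
 * interrupts are already off and may only target the local CPU. */
static int irqs_enabled = 1;        /* stand-in for the CPU's irq flag */
static unsigned int pending;        /* stand-in for the local pending mask */

static void __raise_softirq_irqoff(int nr)
{
    pending |= 1u << nr;            /* mark pending, no wakeup */
}

static void raise_softirq_irqoff(int nr)
{
    __raise_softirq_irqoff(nr);
    /* ...the real one would wake ksoftirqd when not in interrupt... */
}

static void raise_softirq(int nr)
{
    int saved = irqs_enabled;       /* local_irq_save() */
    irqs_enabled = 0;
    raise_softirq_irqoff(nr);
    irqs_enabled = saved;           /* local_irq_restore() */
}
```

Dropping the cpu argument makes it impossible to race against another CPU's pending list, which is what the missing locking would have required.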
Andrew Morton [Wed, 2 Jul 2003 15:50:11 +0000 (08:50 -0700)]
[PATCH] Set limits on CONFIG_LOG_BUF_SHIFT
From: bert hubert <ahu@ds9a.nl>
Attached patch adds a range check to LOG_BUF_SHIFT and clarifies the
configuration somewhat. I managed to build a non-booting kernel because I
thought 64 was a nice power of two, which led to the kernel blocking when
it tried to actually use or allocate a 2^64 buffer.
Andrew Morton [Wed, 2 Jul 2003 15:50:04 +0000 (08:50 -0700)]
[PATCH] ext3: fix journal_release_buffer() race
CPU0                                    CPU1
journal_get_write_access(bh)
 (Add buffer to t_reserved_list)
                                        journal_get_write_access(bh)
                                         (It's already on t_reserved_list:
                                          nothing to do)
(We decide we don't want to
 journal the buffer after all)
journal_release_buffer()
 (It gets pulled off the transaction)
                                        journal_dirty_metadata()
                                         (The buffer isn't on the reserved
                                          list! The kernel explodes)
Simple fix: just leave the buffer on t_reserved_list in
journal_release_buffer(). If nobody ends up claiming the buffer then it will
get thrown away at start of transaction commit.
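A minimal model of the fix, with the buffer state reduced to two flags (the real jbd buffers and transaction lists are far richer; names here are illustrative only):

```c
#include <assert.h>

/* journal_release_buffer() leaves the buffer on t_reserved_list;
 * unclaimed buffers are only discarded at commit, so a concurrent
 * journal_get_write_access() can still find it there. */
struct jbuf { int on_reserved; int claimed; };

static void journal_get_write_access(struct jbuf *b)
{
    b->on_reserved = 1;         /* idempotent if already listed */
    b->claimed = 1;
}

static void journal_release_buffer(struct jbuf *b)
{
    b->claimed = 0;             /* but stay on t_reserved_list */
}

static void commit_transaction(struct jbuf *bufs, int n)
{
    for (int i = 0; i < n; i++)
        if (bufs[i].on_reserved && !bufs[i].claimed)
            bufs[i].on_reserved = 0;    /* throw away unclaimed buffers */
}
```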
Andrew Morton [Wed, 2 Jul 2003 15:49:50 +0000 (08:49 -0700)]
[PATCH] fix double mmdrop() on exec path
If load_elf_binary() (and the other binary handlers) fail after
flush_old_exec() (for example, in setup_arg_pages()) then do_execve() will go
through and do mmdrop(bprm.mm).
But bprm.mm is now current->mm. We've just freed the current process's mm.
The kernel dies in a most ghastly manner.
Fix that up by nulling out bprm.mm in flush_old_exec(), at the point where we
consumed the mm. Handle the null pointer in the do_execve() error path.
Also: don't open-code free_arg_pages() in do_execve(): call it instead.
Andrew Morton [Wed, 2 Jul 2003 15:49:43 +0000 (08:49 -0700)]
[PATCH] ext2: inode allocation race fix
ext2's inode allocator will call find_group_orlov(), which will return a
suitable blockgroup in which the inode should be allocated. But by the time
we actually try to allocate an inode in the blockgroup, other CPUs could have
used them all up.
ext2 will bogusly fail with "ext2_new_inode: Free inodes count corrupted in
group NN".
To fix this we just advance onto the next blockgroup if the rare race
happens. If we've scanned all blockgroups then return -ENOSPC.
(This is a bit inaccurate: after we've scanned all blockgroups, there may
still be available inodes due to inode freeing activity in other blockgroups.
This cannot be fixed without fs-wide locking. The effect is a slightly
early ENOSPC in a nearly-full filesystem).
Andrew Morton [Wed, 2 Jul 2003 15:49:35 +0000 (08:49 -0700)]
[PATCH] Security hook for vm_enough_memory
From: Stephen Smalley <sds@epoch.ncsc.mil>
This patch against 2.5.73 replaces vm_enough_memory with a security hook
per Alan Cox's suggestion so that security modules can completely replace
the logic if desired.
Note that the patch changes the interface to follow the convention of the
other security hooks, i.e. return 0 if ok or -errno on failure (-ENOMEM in
this case) rather than returning a boolean. It also exports various
variables and functions required for the vm_enough_memory logic.
Andrew Morton [Wed, 2 Jul 2003 15:49:26 +0000 (08:49 -0700)]
[PATCH] cleanup and generalise lowmem_page_address
From: William Lee Irwin III <wli@holomorphy.com>
This patch allows architectures to micro-optimize lowmem_page_address() at
their whims. Roman Zippel originally wrote and/or suggested this back when
dependencies on page->virtual existing were being shaken out. That's
long-settled, so it's fine to do this now.
Andrew Morton [Wed, 2 Jul 2003 15:49:14 +0000 (08:49 -0700)]
[PATCH] fix lost-tick compensation corner-case
From: john stultz <johnstul@us.ibm.com>
This patch catches a corner case in the lost-tick compensation code.
There is a check to see if we overflowed between reads of the two time
sources; however, should the high res time source be slightly slower than
what we calibrated, it's possible to trigger this code when no ticks have
been lost.
This patch adds an extra check to ensure we have seen more than one tick
before we check for this overflow. This seems to resolve the remaining
"time doubling" issues that I've seen reported.
Andrew Morton [Wed, 2 Jul 2003 15:49:07 +0000 (08:49 -0700)]
[PATCH] fix lost_tick detector for speedstep
From: john stultz <johnstul@us.ibm.com>
The patch tries to resolve issues caused by running the TSC based lost
tick compensation code on CPUs that change frequency (speedstep, etc).
Should the CPU be in slow mode when calibrate_tsc() executes, the kernel
will assume we have so many cycles per tick. Later when the cpu speeds up,
the kernel will start noting that too many cycles have passed since the last
interrupt. Since this can occasionally happen, the lost tick compensation
code then tries to fix this by incrementing jiffies. Thus every tick we
end up incrementing jiffies many times, causing timers to expire too
quickly and time to rush ahead.
This patch detects when there have been 100 consecutive interrupts where we
had to compensate for lost ticks. If this occurs, we spit out a warning
and fall back to using the PIT as a time source.
I've tested this on my speedstep enabled laptop with success, and other
laptop users seeing this problem have reported it works for them. Also, to
ensure we don't fall back to the slower PIT too quickly, I tested the code
on a system I have that loses ~30 ticks about every second, and it can
still manage to use the TSC as a good time source.
This solves most of the "time doubling" problems seen on laptops.
Additionally this revision has been modified to use the cleanups made in
rename-timer_A1.
Andrew Morton [Wed, 2 Jul 2003 15:48:52 +0000 (08:48 -0700)]
[PATCH] Report detached thread exit to the debugger
From: Daniel Jacobowitz <dan@debian.org>
Right now, CLONE_DETACHED threads silently vanish from GDB's sight when
they exit. This patch lets the thread report its exit to the debugger, and
then be auto-reaped as soon as it is collected, instead of being reaped as
soon as it exits and not reported at all.
GDB works either way, but this is more correct and will be useful for some
later GDB patches.
Andrew Morton [Wed, 2 Jul 2003 15:48:26 +0000 (08:48 -0700)]
[PATCH] remove lock_kernel() from file_ops.flush()
Rework the file_ops.flush() API so that it is no longer called under
lock_kernel(). Push lock_kernel() down to all implementations except CIFS,
which doesn't want it.
Andrew Morton [Wed, 2 Jul 2003 15:48:18 +0000 (08:48 -0700)]
[PATCH] procfs: remove some unneeded lock_kernel()s
From: William Lee Irwin III <wli@holomorphy.com>
Remove spurious BKL acquisitions in /proc/. The BKL is not required to
access nr_threads for reporting, and get_locks_status() takes it
internally, wrapping all operations with it.