David Mosberger [Tue, 15 Apr 2003 15:51:27 +0000 (08:51 -0700)]
[PATCH] module symbol fix
Fix for trivial typo. Without it, you can't insert anything on top of
agpgart.ko because the agp_register_driver() will erroneously pick up
the symbol version from agp_backend_acquire().
Now that the kernel provides code user programs are executing directly
(I mean the vsyscall code on x86) it is necessary to add unwind
information for that code as well. The unwind information is used not
only in C++ code.
This patch adds an AT_SYSINFO_EH_FRAME ELF aux-table value that points to
the unwinding block description for the sysinfo frame, and makes sure
the AT_* value is passed to applications. It defines the static data
for the unwind blocks (two, one for int80 and the other for sysenter),
and finally adds code to copy the data in place.
Ivan Kokshaysky [Tue, 15 Apr 2003 03:58:21 +0000 (20:58 -0700)]
[PATCH] alpha: move_initrd fix (from Jeff Wiedemeier)
While testing our upcoming kernel update for 7.2 alpha, I've encountered
a problem with move_initrd. It allocates a page-aligned chunk to move
the initrd into, but it doesn't allocate the entire last
page. Subsequent bootmem allocations can then be filled from the last
page used by the initrd. This then becomes a problem when the initrd
memory is released.
Ivan Kokshaysky [Tue, 15 Apr 2003 03:56:48 +0000 (20:56 -0700)]
[PATCH] alpha: execve() fix
The 2.5 kernels may hang on execve(). Most easily this can be reproduced
by submitting forms in mozilla, apparently because it does execve with
very long argument strings.
That's what happens in do_execve, I suppose:
    bprm.mm = mm_alloc();
    ...
    init_new_context(current, bprm.mm);    here we update current ptbr
                                           with new mm->pgd
    ...
    copy_strings;
        interrupt -> do_softirq -> switch to ksoftirqd
        ...
        switch back to do_execve;
    copy_strings - immediate page fault in copy_user that we can't
                   handle because the new ptbr has been activated
                   after context switch and current->mm is not
                   valid anymore.
The fix is to not update ptbr for current task in init_new_context(),
as we do it later in activate_mm() anyway.
The patch fixes some problems with NFS under heavy writeout.
NFS pages can be in a clean but unreclaimable state. They are unreclaimable
because the server has not yet acked the write - we may need to "redirty"
them if the server crashes.
These are referred to as "unstable" pages. We need to count them alongside
dirty and writeback pages when making flushing and throttling decisions.
Otherwise the machine can be flooded with these pages and the VM has
problems.
Andrew Morton [Mon, 14 Apr 2003 13:10:43 +0000 (06:10 -0700)]
[PATCH] Posix timer hang fix
From: george anzinger <george@mvista.com>
The MAJOR problem was a hang in the kernel if a user tried to delete a
repeating timer that had a signal delivery pending. I was putting the
task in a loop waiting for that same task to pick up the signal. OUCH!
A minor issue relates to the need by the glibc folks, to specify a
particular thread to get the signal. I had this code in all along,
but somewhere in 2.5 the signal code was made POSIX compliant, i.e.
deliver to the first thread that doesn't have it masked out.
This now uses the code from the above mentioned clean up. Most
signals go to the group delivery signal code, however, those
specifying THREAD_ID (an extension to the POSIX standard) are sent to
the specified thread. That thread MUST be in the same thread group as
the thread that creates the timer.
Andrew Morton [Mon, 14 Apr 2003 13:09:45 +0000 (06:09 -0700)]
[PATCH] flush_work_queue() fixes
The workqueue code currently has a notion of a per-cpu queue being "busy".
flush_scheduled_work()'s responsibility is to wait for a queue to be not busy.
Problem is, flush_scheduled_work() can easily hang up.
- The workqueue is deemed "busy" when there are pending delayed
(timer-based) works. But if someone repeatedly schedules new delayed work
in the callback, the queue will never fall idle, and flush_scheduled_work()
will not terminate.
- If someone reschedules work (not delayed work) in the work function, that
too will cause the queue to never go idle, and flush_scheduled_work() will
not terminate.
So what this patch does is:
- Create a new "cancel_delayed_work()" which will try to kill off any
timer-based delayed works.
- Change flush_scheduled_work() so that it is immune to people re-adding
work in the work callout handler.
We can do this by recognising that the caller does *not* want to wait
until the workqueue is "empty". The caller merely wants to wait until all
works which were pending at the time flush_scheduled_work() was called have
completed.
The patch uses a couple of sequence numbers for that.
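The sequence-number idea can be illustrated with a minimal single-threaded sketch. It is not the kernel implementation: struct toy_queue, toy_queue_work(), toy_run_one() and toy_flush() are hypothetical names, and the worker thread is simulated by calling toy_run_one() from the flush loop. The point is only that the flush snapshots the insert sequence on entry and stops once the remove sequence catches up with that snapshot, so a work function that re-queues itself cannot keep the flush waiting forever.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_MAX 64

typedef void (*work_fn)(void *queue);

/* Hypothetical single-threaded stand-in for a per-cpu workqueue. */
struct toy_queue {
    work_fn items[QUEUE_MAX];   /* ring buffer of pending work */
    long insert_seq;            /* number of works ever queued */
    long remove_seq;            /* number of works ever completed */
};

/* Queue one work item; returns false if the ring is full. */
static bool toy_queue_work(struct toy_queue *q, work_fn fn)
{
    if (q->insert_seq - q->remove_seq >= QUEUE_MAX)
        return false;
    q->items[q->insert_seq % QUEUE_MAX] = fn;
    q->insert_seq++;
    return true;
}

/* Run the oldest pending work item, as the worker thread would. */
static void toy_run_one(struct toy_queue *q)
{
    work_fn fn = q->items[q->remove_seq % QUEUE_MAX];

    fn(q);
    q->remove_seq++;
}

/*
 * Wait only for the work that was pending when the flush began.
 * Returns the number of work items that were run.
 */
static long toy_flush(struct toy_queue *q)
{
    long target = q->insert_seq;    /* snapshot taken at call time */
    long ran = 0;

    while (q->remove_seq - target < 0) {
        toy_run_one(q);
        ran++;
    }
    return ran;
}

/* A badly behaved work function that always re-queues itself. */
static void rearming_work(void *queue)
{
    toy_queue_work(queue, rearming_work);
}
```

With two rearming_work items queued, toy_flush() runs exactly two items and returns, even though two new items are pending again afterwards.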
So now, if someone wants to reliably remove delayed work they should do:
    /*
     * Make sure that my work-callback will no longer schedule new work
     */
    my_driver_is_shutting_down = 1;

    /*
     * Kill off any pending delayed work
     */
    cancel_delayed_work(&my_work);

    /*
     * OK, there will be no new works scheduled. But there may be one
     * currently queued or in progress. So wait for that to complete.
     */
    flush_scheduled_work();
The patch also changes the flush_workqueue() sleep to be uninterruptible.
We cannot legally bale out if a signal is delivered anyway.
Andrew Morton [Mon, 14 Apr 2003 13:09:35 +0000 (06:09 -0700)]
[PATCH] Fix oprofile on hyperthreaded P4's
From: Philippe Elie <phil.el@wanadoo.fr>
- oprofile is currently only profiling one sibling. Fix that with
appropriate register settings.
- fix an oops which could occur if the userspace driver were to request a
non-existent resource.
- in NMI handler counter_config[i].event is accessible from user space so
user can change the event during profiling by echo xxx >
/dev/oprofile/event
- event mask was wrong: the bit field is 6 bits long, not 5, so events
SSE_INPUT_ASSIST and X87_SIMD_MOVES_UOP were affected by masking the high
bit of the event number.
Amiga keyboard: fix default keyboard mappings:
- Map the parentheses keys on the numeric keypad to KPLEFTPAREN and
KPRIGHTPAREN (was: NUMLOCK and SCROLLLOCK)
- Map the Help key to HELP (was: F11)
- Map the Amiga keys to LEFTMETA and RIGHTMETA (was: RESERVED)
s390 dasd driver fixes:
- Take request queue lock in dasd_end_request.
- Make it work with CONFIG_DEVFS_FS=y.
- Properly wait for the root device.
- Cope with requests killed due to failed channel path.
- Improve reference counting.
- Remove devno from struct dasd_device.
- Remove unnecessary bdget/bdput calls.
Common i/o layer fixes:
- Fix for path not operational condition in cio_start.
- Fix handling of user interruption parameter.
- Add code to wait for devices in init_ccw_bus_type.
- Move qdio states out of main cio state machine.
- Reworked chsc data structures.
- Add ccw_device_start_timeout.
- Handle path verification required flag.
s390 fixes:
- Initialize timing related variables first and then enable the timer interrupt.
- Normalize nanoseconds to microseconds in do_gettimeofday.
- Add types for __kernel_timer_t and __kernel_clockid_t.
- Fix ugly bug in switch_to: set prev to the return value of resume, otherwise
prev still contains the previous process at the time resume was called and
not the previous process at the time resume returned. They differ...
- Add missing include to get the kernel compiled.
- Get a closer match with the i386 termios.h file.
- Cope with INITIAL_JIFFIES.
- Define cpu_relax to do a cpu yield on VM and LPAR.
- Don't reenable interrupts in program check handler.
- Add pte_file definitions.
- Fix PT_IEEE_IP special case in ptrace.
- Use compare and swap to release the lock in _raw_spin_unlock.
- Introduce invoke_softirq to switch to async. interrupt stack.
Russell King [Mon, 14 Apr 2003 04:05:37 +0000 (21:05 -0700)]
[PATCH] flush_cache_mm in zap_page_range
unmap_vmas() eventually calls tlb_start_vma(), where most architectures
flush caches as necessary. The flush here seems to make the
flush_cache_range() in zap_page_range() redundant, and therefore can be
removed.
Kai Mäkisara [Mon, 14 Apr 2003 03:33:35 +0000 (20:33 -0700)]
[PATCH] SCSI tape EOT write fixes
This contains the following changes:
- EOT detection fixed when writing in fixed block mode
- asynchronous writes in fixed block mode and write threshold removed
to enable the EOT fixes (the parameter accepted for compatibility)
Kai Mäkisara [Mon, 14 Apr 2003 03:33:25 +0000 (20:33 -0700)]
[PATCH] SCSI tape ILI and timeout fixes
This contains the following changes:
- ILI fixed to work with really old drives
- message printed in case block larger than read()
- long timeout used when creating a tape partition
I managed to add a bug to the local APIC NMI watchdog's
resume procedure in the driver model conversion for 2.5.67.
The problem is that the resume procedure simply calls the
enable procedure. If the NMI watchdog has been disabled by
another driver (like oprofile or perfctr), then the NMI
watchdog will incorrectly be re-enabled.
I discovered this when updating the perfctr driver for 2.5.67
and seeing unexpected NMIs after a resume from apm --suspend.
We can fix this by unregistering the NMI watchdog from the
driver model when disabling it (like the code did before the
driver model changes), or by remembering the previous state
at suspend and checking it at resume. The patch below uses
the second, simpler, approach. Tested, please apply.
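The second approach can be sketched as a tiny state machine. This is an illustration only, not the patch itself: struct toy_watchdog and the toy_* functions are hypothetical names, and the real driver-model callbacks have different signatures.

```c
#include <stdbool.h>

/* Hypothetical model of the watchdog state tracked across suspend. */
struct toy_watchdog {
    bool active;              /* is the NMI watchdog currently enabled? */
    bool active_at_suspend;   /* state remembered by the suspend hook */
};

static void toy_enable(struct toy_watchdog *w)
{
    w->active = true;
}

/* Also what another driver (oprofile, perfctr) would call to take over. */
static void toy_disable(struct toy_watchdog *w)
{
    w->active = false;
}

/* Remember whether the watchdog was running, then switch it off. */
static void toy_suspend(struct toy_watchdog *w)
{
    w->active_at_suspend = w->active;
    toy_disable(w);
}

/* Re-enable only if it was running before the suspend. */
static void toy_resume(struct toy_watchdog *w)
{
    if (w->active_at_suspend)
        toy_enable(w);
}
```

If another driver disabled the watchdog before the suspend, toy_resume() leaves it disabled, which is the behaviour the unconditional enable-on-resume code got wrong.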
James Bottomley [Mon, 14 Apr 2003 04:18:41 +0000 (23:18 -0500)]
fix scsi queue plugging behaviour
Following recent changes removing blk_queue_empty(), we were
incorrectly plugging the queue sometimes (most often as part of
the SCSI scan process). This was causing a non-deterministic panic
in the scan code because a destroyed queue was sometimes being
unplugged and run.
Cam Mayor [Sun, 13 Apr 2003 19:22:24 +0000 (20:22 +0100)]
[ARM PATCH] 1453/1: fix clps711x framebuffer "use SRAM?" range
Patch from cam mayor
When setting up the framebuffer on the clps711x platform, the code checks
whether the allocated memory area is less than 38400 bytes. If it is, a
message is sent to the kernel output suggesting it could be placed into
SRAM. This patch modifies the check so that the suggestion is made if the
allocated memory area is less than OR EQUAL TO 38400 bytes. This value is
important as 38400 bytes is exactly the size of a 320 x 240 x 4bpp screen.
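The arithmetic behind the boundary case can be checked with a short sketch. The helper names are hypothetical and do not come from the clps711x driver; only the 38400-byte figure and the 320 x 240 x 4bpp geometry are taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SRAM_LIMIT 38400u   /* bytes, figure from the patch description */

/* Size in bytes of a packed framebuffer with the given geometry. */
static size_t toy_fb_bytes(unsigned int width, unsigned int height,
                           unsigned int bits_per_pixel)
{
    return (size_t)width * height * bits_per_pixel / 8;
}

/* Old check: strictly less than the limit. */
static bool toy_suggest_sram_old(size_t bytes)
{
    return bytes < TOY_SRAM_LIMIT;
}

/* New check: less than or equal to the limit. */
static bool toy_suggest_sram_new(size_t bytes)
{
    return bytes <= TOY_SRAM_LIMIT;
}
```

A 320 x 240 screen at 4 bits per pixel needs 320 * 240 * 4 / 8 = 38400 bytes, so only the new comparison makes the SRAM suggestion for it.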
Please see mail thread '[patch] Cleanup of head.S?' from 25 Feb 2003. Let us remove the third part now. The mapping set by this code is done already.
The comment of rmk was
'I suspect we can kill (3) without hurting stuff that's merged into the
-rmk tree, although I'm sure there's a reason it existed. I'll have
to check my mail archives, but I think there was a machine that required,
but it appears not to be merged.'
Neil Brown [Sat, 12 Apr 2003 20:04:42 +0000 (13:04 -0700)]
[PATCH] md: Fix raid1 oops
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
When the last device in a raid1 array is failed (or missing) the r1bio
structure can be released (especially on very fast devices) before
make_request has finished using it.
This patch gets and puts an extra reference to the r1_bio around the
submission loop, and uses the status in r1_bio to maintain the request status
if the last reference is held by make_request.
This is also more correct for write requests, as a write should succeed
if any write succeeded, not only if the last write succeeded.
Neil Brown [Sat, 12 Apr 2003 20:04:22 +0000 (13:04 -0700)]
[PATCH] kNFSd: NFSD binary compatibility breakage
The removal of "struct nfsctl_uidmap" from "nfsctl_fdparm" broke
binary compatibility on 64-bit platforms (strictly speaking: on all
platforms with alignof(void *) > alignof(int)). The problem is that
nfsctl_uidmap contained a "char *", which forced the alignment of the
entire union to be 64 bits. With the removal of the uidmap, the
required alignment drops to 32 bits. Since the first member is only
32 bits in size, this breaks compatibility with user-space. Patch
below fixes the problem.
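The alignment effect can be reproduced with a small sketch. The structures below are hypothetical stand-ins, not the real nfsctl definitions: they assume, as one reading of the description, a 32-bit leading field followed by a union whose alignment depends on whether it contains a pointer.

```c
#include <stddef.h>

/* Hypothetical union that still contains a pointer member. */
union toy_arg_with_pointer {
    int   value;
    char *name;     /* forces pointer alignment on the whole union */
};

/* Hypothetical union after the pointer-bearing member is removed. */
union toy_arg_without_pointer {
    int value;
    int other;
};

struct toy_call_with_pointer {
    int version;                        /* 32-bit leading field */
    union toy_arg_with_pointer u;       /* padded up to pointer alignment */
};

struct toy_call_without_pointer {
    int version;
    union toy_arg_without_pointer u;    /* follows version directly */
};

/* Offset of the union when it contains a pointer. */
static size_t toy_offset_with_pointer(void)
{
    return offsetof(struct toy_call_with_pointer, u);
}

/* Offset of the union once the pointer is gone. */
static size_t toy_offset_without_pointer(void)
{
    return offsetof(struct toy_call_without_pointer, u);
}
```

On a typical 64-bit ABI the first offset is 8 and the second is 4, so user space built against the old layout would place the union in the wrong position. Where pointers and ints share the same alignment, both offsets coincide and nothing breaks.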
Neil Brown [Sat, 12 Apr 2003 20:04:11 +0000 (13:04 -0700)]
[PATCH] kNFSd: Return correct result for ACCESS(READ) on eXecute-only file.
Currently, an NFSv3 ACCESS check for READ permission on an
eXecute-only file will succeed where it should fail.
This is because nfsd_permission allows READ access to eXecute only
files so that mode 711 executables can be loaded and run, and
nfsd_access simply uses nfsd_permission.
This patch changes nfsd_permission to only map eXecute permission to
read permission if MAY_OWNER_OVERRIDE was set. This is only set
when trying to read from a file, so ACCESS will no longer be tricked.
This change will only affect callers of nfsd_permission that specify
MAY_READ and not MAY_OWNER_OVERRIDE, and nfsd_access is the only
routine that calls nfsd_permission (via fh_verify) that way.
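The rule can be sketched in a much simplified form. This is not the nfsd code: the TOY_* flag values and the toy_permission() helper are hypothetical, and only the read and execute bits of a single permission class are modelled.

```c
#include <errno.h>

/* Hypothetical access flags; the real MAY_* values may differ. */
#define TOY_MAY_EXEC            0x01
#define TOY_MAY_READ            0x04
#define TOY_MAY_OWNER_OVERRIDE  0x40

/* Mode bits for the caller's permission class. */
#define TOY_MODE_X  0x1
#define TOY_MODE_R  0x4

/*
 * Returns 0 if access is allowed, -EACCES otherwise.
 * Execute permission stands in for read permission only when the
 * caller is actually reading the file (TOY_MAY_OWNER_OVERRIDE set).
 */
static int toy_permission(unsigned int mode_bits, unsigned int acc)
{
    if (!(acc & TOY_MAY_READ))
        return 0;                       /* only READ is modelled here */
    if (mode_bits & TOY_MODE_R)
        return 0;                       /* plain read permission */
    if ((acc & TOY_MAY_OWNER_OVERRIDE) && (mode_bits & TOY_MODE_X))
        return 0;                       /* loading an execute-only file */
    return -EACCES;
}
```

An ACCESS-style query passes TOY_MAY_READ alone and is refused for an execute-only file, while a real read passes both flags and is still allowed.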
Neil Brown [Sat, 12 Apr 2003 20:04:02 +0000 (13:04 -0700)]
[PATCH] kNFSd: nfsd/export.c tidyup and add missing exp_put
There was a missing exp_put in export.c so that after a client
mounts an exported filesystem, the server would never be able to
unmount, even after trying to unexport. This is fixed by the last
chunk of this patch.
Also assorted cleanups to the code found while hunting.
Andrew Morton [Sat, 12 Apr 2003 20:00:40 +0000 (13:00 -0700)]
[PATCH] use spinlocking in the ext2 block allocator
From Alex Tomas and myself
ext2 currently uses lock_super() to protect the filesystem's in-core block
allocation bitmaps.
On big SMP machines the contention on that semaphore is causing high context
switch rates, large amounts of idle time and reduced throughput.
The context switch rate can also worsen block allocation: if several tasks
are trying to allocate blocks inside the same blockgroup for different files,
madly rotating between those tasks will cause the files' blocks to be
intermingled.
On SDET and dbench-style workloads (lots of tasks doing lots of allocation)
this patch (and a similar one for the inode allocator) improve throughput on
an 8-way by ~15%. On 16-way NUMAQ the speedup is 150%.
What we do is to remove the lock altogether and just rely on the atomic
semantics of test_and_set_bit(): if the allocator sees a block was free it
runs test_and_set_bit(). If that fails, then we raced and the allocator will
go and look for another block.
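A user-space sketch of this lock-free claim, using C11 atomics in place of the kernel bitops. The bitmap layout and the toy_* names are hypothetical; the real allocator has goal blocks, reservations and group descriptors that are ignored here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_BITS_PER_WORD 32
#define TOY_NR_WORDS 4

/* Hypothetical in-core bitmap: one bit per block, set means in use. */
static _Atomic unsigned int toy_bitmap[TOY_NR_WORDS];

/* Atomically set a bit and report whether it was already set. */
static bool toy_test_and_set_bit(unsigned int nr)
{
    unsigned int mask = 1u << (nr % TOY_BITS_PER_WORD);
    unsigned int old =
        atomic_fetch_or(&toy_bitmap[nr / TOY_BITS_PER_WORD], mask);

    return (old & mask) != 0;
}

/* Unlocked peek; a stale answer is caught by the atomic set below. */
static bool toy_test_bit(unsigned int nr)
{
    unsigned int word = atomic_load(&toy_bitmap[nr / TOY_BITS_PER_WORD]);

    return (word >> (nr % TOY_BITS_PER_WORD)) & 1u;
}

/* Returns the allocated block number, or -1 if every block is in use. */
static int toy_alloc_block(void)
{
    unsigned int nr;

    for (nr = 0; nr < TOY_NR_WORDS * TOY_BITS_PER_WORD; nr++) {
        if (toy_test_bit(nr))
            continue;               /* looks busy, try the next block */
        if (!toy_test_and_set_bit(nr))
            return (int)nr;         /* we won the race for this block */
        /* Somebody else claimed it first: keep searching. */
    }
    return -1;
}
```

If another task sets bit 2 between the peek and the atomic set, the allocator simply moves on to block 3.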
Of course, we don't really use test_and_set_bit() because that
isn't endian-independent. New atomic endian-independent functions are
introduced: ext2_set_bit_atomic() and ext2_clear_bit_atomic(). We do not
need ext2_test_bit_atomic(), since even if ext2_test_bit() returns the wrong
result, that error will be detected and naturally handled in the subsequent
ext2_set_bit_atomic().
For little-endian machines the new atomic ops map directly onto the
test_and_set_bit(), etc.
For big-endian machines we provide the architecture's implementation with the
address of a spinlock which can be taken around the nonatomic ext2_set_bit().
The spinlocks are hashed, and the hash is scaled according to the machine
size. Architectures are free to implement optimised versions of
ext2_set_bit_atomic() and ext2_clear_bit_atomic().
Andrew Morton [Sat, 12 Apr 2003 20:00:09 +0000 (13:00 -0700)]
[PATCH] blockgroup_lock: hashed spinlocks for ext2 and ext3
ext2 and ext3 per-blockgroup metadata needs locking. An fs-wide lock is
expensive, and a per-blockgroup lock consumes too much storage (up to 32768
blockgroups per filesystem). We need something in-between.
blockgroup_locks are very simple hashed spinlocks which provide this
compromise. The size of the lock is scaled by NR_CPUS to implement an
additional speed/space tradeoff.
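A minimal sketch of a hashed lock table, written for user space with C11 atomic_flag as the spinlock. The names and the table size are hypothetical; in the patch the size is scaled by NR_CPUS, which is replaced here by a fixed constant.

```c
#include <stdatomic.h>

#define TOY_NR_BG_LOCKS 8   /* hypothetical size; must be a power of two */

/* A small fixed table of spinlocks shared by all blockgroups. */
struct toy_blockgroup_lock {
    atomic_flag locks[TOY_NR_BG_LOCKS];
};

static void toy_bgl_init(struct toy_blockgroup_lock *bgl)
{
    int i;

    for (i = 0; i < TOY_NR_BG_LOCKS; i++)
        atomic_flag_clear(&bgl->locks[i]);
}

/* Hash a blockgroup number onto one of the locks. */
static unsigned int toy_bgl_index(unsigned int block_group)
{
    return block_group & (TOY_NR_BG_LOCKS - 1);
}

static void toy_bgl_lock(struct toy_blockgroup_lock *bgl,
                         unsigned int block_group)
{
    while (atomic_flag_test_and_set(&bgl->locks[toy_bgl_index(block_group)]))
        ;   /* spin until the holder releases the lock */
}

static void toy_bgl_unlock(struct toy_blockgroup_lock *bgl,
                           unsigned int block_group)
{
    atomic_flag_clear(&bgl->locks[toy_bgl_index(block_group)]);
}
```

Blockgroups 3 and 11 share a lock in this sketch, so they occasionally contend with each other, but storage stays constant no matter how many blockgroups the filesystem has.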
These locks are actually fairly generic. However I presented it as something
which is specific to ext2 and ext3 so that people wouldn't go using them all
over the place. They consume a lot of storage.
Andrew Morton [Sat, 12 Apr 2003 19:59:51 +0000 (12:59 -0700)]
[PATCH] percpu_counters: approximate but scalable counters
Several places in ext2 and ext3 are using filesystem-wide counters which use
global locking. Mainly for the orlov allocator's heuristics.
To solve the contention which this causes we can trade off accuracy against
speed.
This patch introduces a "percpu_counter" library type in which the counts are
per-cpu and are periodically spilled into a global counter. Readers only
read the global counter.
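The accuracy-for-speed trade can be sketched as follows. This is a single-threaded illustration with hypothetical names and a hypothetical spill threshold; a concurrent version would need a lock around the global update and real per-cpu storage.

```c
#define TOY_NR_CPUS 4
#define TOY_BATCH   8   /* hypothetical spill threshold */

struct toy_percpu_counter {
    long global;                /* the value that readers see */
    long local[TOY_NR_CPUS];    /* per-cpu deltas not yet spilled */
};

/* Adjust the counter on behalf of one cpu; spill when the delta is large. */
static void toy_counter_mod(struct toy_percpu_counter *c, int cpu, long amount)
{
    long delta = c->local[cpu] + amount;

    if (delta >= TOY_BATCH || delta <= -TOY_BATCH) {
        c->global += delta;     /* a concurrent version would lock here */
        delta = 0;
    }
    c->local[cpu] = delta;
}

/* Cheap, approximate read: only the global part. */
static long toy_counter_read(const struct toy_percpu_counter *c)
{
    return c->global;
}

/* Exact value, shown only to make the error visible. */
static long toy_counter_exact(const struct toy_percpu_counter *c)
{
    long sum = c->global;
    int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += c->local[cpu];
    return sum;
}
```

In this sketch the approximate read can lag the exact value by just under TOY_BATCH per cpu, which is acceptable for allocator heuristics.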
These objects are *large*. On a 32 CPU P4, they are 4 kbytes. On a 4 way
p3, 128 bytes.
Andrew Morton [Sat, 12 Apr 2003 19:59:04 +0000 (12:59 -0700)]
[PATCH] vmalloc stats in /proc/meminfo
From: Matt Porter <porter@cox.net>
There was a thread a while back on lkml where Dave Hansen proposed this
simple vmalloc usage reporting patch. The thread pretty much died out as
most people seemed focused on what VM loading type bugs it could solve. I
had posted that this type of information was really valuable in debugging
embedded Linux board ports. A common example is where people do arch
specific setup that limits their vmalloc space and then they find modules
won't load. ;) Having the Vmalloc* info readily available is real useful in
helping folks to fix their kernel ports.
Andrew Morton [Sat, 12 Apr 2003 19:58:42 +0000 (12:58 -0700)]
[PATCH] /proc/interrupts allocates too much memory
From: David Mosberger <davidm@napali.hpl.hp.com>
interrupts_open() can easily try to kmalloc() more memory than
supported by kmalloc. E.g., with 16KB page size and NR_CPUS==64, it
would try to allocate 147456 bytes.
The workaround below is to allocate 4KB per 8 CPUs. Not really a
solution, but the fundamental problem is that /proc/interrupts
shouldn't use a fixed buffer size in the first place. I suppose
another solution would be to use vmalloc() instead. It all feels like
bandaids though.
Andrew Morton [Sat, 12 Apr 2003 19:57:41 +0000 (12:57 -0700)]
[PATCH] architecture hooks for mem_map initialization
From: Christoph Hellwig <hch@lst.de>
This patch is from the IA64 tree, with minor cleanups from me.
Split out initialization of pgdat->node_mem_map into a separate function
and allow architectures to override it. This is needed for HP IA64
machines that have a virtually mapped memory map to support big
memory holes without having to use discontigmem.
(memmap_init_zone is non-static to allow the IA64 code to use it -
I did that instead of passing its address into the arch hook as
it is done currently in the IA64 tree)
Andrew Morton [Sat, 12 Apr 2003 19:57:13 +0000 (12:57 -0700)]
[PATCH] bootmem speedup from the IA64 tree
From: Christoph Hellwig <hch@lst.de>
This patch is from the IA64 tree, with some minor cleanups by me.
David described it as:
This is a performance speed up and some minor indentation fixups.
The problem is that the bootmem code is (a) hugely slow and (b) has
execution time that grows quadratically with the size of the bootmap bitmap.
This causes noticeable slowdowns, especially on machines with (relatively)
large holes in the physical memory map. Issue (b) is addressed by
maintaining the "last_success" cache, so that we start the next search
from the place where we last found some memory (this part of the patch
could stand additional reviewing/testing). Issue (a) is addressed by
using find_next_zero_bit() instead of the slow bit-by-bit testing.
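Both ideas can be sketched in user space. The code below is an illustration with hypothetical names, not the bootmem allocator: it allocates single pages only, and toy_find_next_zero_bit() is a simple stand-in that skips fully allocated words rather than the kernel's optimised routine.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Index of the first zero bit at or after start, or nbits if none. */
static size_t toy_find_next_zero_bit(const unsigned long *map, size_t nbits,
                                     size_t start)
{
    size_t i = start;

    while (i < nbits) {
        unsigned long word = map[i / TOY_BITS_PER_WORD];

        if (i % TOY_BITS_PER_WORD == 0 && word == ~0UL) {
            i += TOY_BITS_PER_WORD;     /* whole word in use: skip it */
            continue;
        }
        if (!(word & (1UL << (i % TOY_BITS_PER_WORD))))
            return i;
        i++;
    }
    return nbits;
}

struct toy_bootmem {
    unsigned long *map;     /* one bit per page, set means allocated */
    size_t nbits;
    size_t last_success;    /* where the previous search succeeded */
};

/* Allocate one page; returns its index, or nbits on failure. */
static size_t toy_alloc_page(struct toy_bootmem *b)
{
    size_t i = toy_find_next_zero_bit(b->map, b->nbits, b->last_success);

    if (i == b->nbits)      /* nothing past the hint: retry from the start */
        i = toy_find_next_zero_bit(b->map, b->nbits, 0);
    if (i == b->nbits)
        return b->nbits;

    b->map[i / TOY_BITS_PER_WORD] |= 1UL << (i % TOY_BITS_PER_WORD);
    b->last_success = i;
    return i;
}
```

Starting each search at last_success avoids rescanning the already allocated prefix on every call, which is where the quadratic behaviour comes from.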
Andrew Morton [Sat, 12 Apr 2003 19:55:43 +0000 (12:55 -0700)]
[PATCH] don't clear PG_uptodate on ENOSPC
If get_block() returns -ENOSPC __block_write_full_page() is currently
clearing PG_uptodate.
That doesn't make any sense - failure to allocate space (or an IO error) does
not make the page not uptodate. It will create pages which are dirty, mapped
into pagetables and not uptodate, which is a nonsensical state.
Andrew Morton [Sat, 12 Apr 2003 19:55:21 +0000 (12:55 -0700)]
[PATCH] Fix deadlock with ext3+quota
From: Jan Kara <jack@ucw.cz>
Fixes a deadlock-causing lock-ranking bug between dqio_sem and
journal_start().
It sets up the needed infrastructure so that the quota code's sync_dquot()
operation can call into ext3 and arrange for the transaction start to be
nested outside the taking of dqio_sem.
Andrew Morton [Sat, 12 Apr 2003 19:55:02 +0000 (12:55 -0700)]
[PATCH] Remove flush_page_to_ram()
From: Hugh Dickins <hugh@veritas.com>
This patch removes the long deprecated flush_page_to_ram. We have
two different schemes for doing this cache flushing stuff, the old
flush_page_to_ram way and the not so old flush_dcache_page etc. way:
see DaveM's Documentation/cachetlb.txt. Keeping flush_page_to_ram
around is confusing, and makes it harder to get this done right.
All architectures are updated, but the only ones where it amounts
to more than deleting a line or two are m68k, mips, mips64 and v850.
I followed a prescription from DaveM (though not to the letter), that
those arches with non-nop flush_page_to_ram need to do what it did
in their clear_user_page and copy_user_page and flush_dcache_page.
Dave is concerned that, in the v850 nb85e case, this patch leaves its
flush_dcache_page as was, uses it in clear_user_page and copy_user_page,
instead of making them all flush icache as well. That may be wrong:
I'm just hesitant to add cruft blindly, changing a flush_dcache macro
to flush icache too; and naively hope that the necessary flush_icache
calls are already in place. Miles, please let us know which way is
right for v850 nb85e - thanks.
Andrew Morton [Sat, 12 Apr 2003 19:54:20 +0000 (12:54 -0700)]
[PATCH] radix_tree_delete API improvement
radix_tree_delete() currently returns 0 on success, -ENOENT if there was
nothing to delete.
But it is more useful to return the address of the deleted item on success
and NULL if there was no matching item. It can potentially save a
lookup+delete operation.
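The shape of the new return convention can be shown with a flat slot array standing in for the tree. struct toy_tree and toy_delete() are hypothetical; only the "return the removed item, or NULL" contract is taken from the description above.

```c
#include <stddef.h>

#define TOY_SLOTS 16

/* Hypothetical stand-in for a radix tree: a flat array of item pointers. */
struct toy_tree {
    void *slots[TOY_SLOTS];
};

/*
 * Remove the item at index and hand it back to the caller.
 * Returns NULL if there was nothing stored at that index.
 */
static void *toy_delete(struct toy_tree *tree, unsigned long index)
{
    void *item;

    if (index >= TOY_SLOTS)
        return NULL;
    item = tree->slots[index];
    tree->slots[index] = NULL;
    return item;
}
```

A caller that needs the removed object, for example to free it, gets it from the single toy_delete() call instead of doing a lookup first and a delete afterwards.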