Greg Ungerer [Sun, 25 May 2003 13:34:37 +0000 (06:34 -0700)]
[PATCH] m68knommu/5206e check timer irq pending
Add a function to allow checking if the timer interrupt is pending for
the m68knommu/5206e CPU. This is used by the architecture timer code
for microsecond accurate time calculations.
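A minimal sketch of why the pending check matters for microsecond-accurate time, assuming an up-counting timer. The struct, its fields and timer_offset_usec() are hypothetical names for illustration; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the timer state; field names are illustrative. */
struct timer_snapshot {
    uint32_t count;       /* current counter value, counting up from 0 */
    uint32_t reload;      /* counter value at which the tick interrupt fires */
    bool     irq_pending; /* result of the "is the timer irq pending" check */
};

/* Microseconds elapsed since the last tick the kernel has accounted for. */
static uint32_t timer_offset_usec(const struct timer_snapshot *t,
                                  uint32_t usec_per_tick)
{
    uint32_t offset =
        (uint32_t)(((uint64_t)t->count * usec_per_tick) / t->reload);

    /* The counter has already wrapped but the tick handler has not run
     * yet, so one whole tick is missing from the time base. */
    if (t->irq_pending)
        offset += usec_per_tick;

    return offset;
}
```

Without the pending check, a reading taken just after the counter wraps would appear to jump backwards by almost a full tick.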
Jens Axboe [Sun, 25 May 2003 12:28:57 +0000 (05:28 -0700)]
[PATCH] bio splitting
So here it is, easy split support for md and dm. Neil, the changes over
your version are merely:
- Make a global bio split pool instead of requiring device setup of one.
Will waste 8 * sizeof(struct bio_pair) of RAM, but... For 2.6 at least
it has to be a core functionality.
- Various style changes to follow the kernel guidelines.
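The idea of splitting one request into a pair can be sketched roughly as follows. This is a simplified stand-in working on plain sector ranges, not the real struct bio or struct bio_pair; the chunk-boundary use case and all names here are assumptions for illustration.

```c
#include <stdint.h>

/* Simplified stand-in for an I/O request: a run of sectors. */
struct io_range {
    uint64_t sector;      /* first sector */
    uint32_t nr_sectors;  /* length in sectors */
};

/* Simplified stand-in for the pair produced by a split. */
struct io_range_pair {
    struct io_range first;
    struct io_range second;
};

/* Number of sectors from 'sector' up to the next chunk boundary. */
static uint32_t sectors_to_boundary(uint64_t sector, uint32_t chunk_sectors)
{
    return chunk_sectors - (uint32_t)(sector % chunk_sectors);
}

/* Split 'r' so that the first half holds 'first_sectors' sectors. */
static struct io_range_pair split_range(struct io_range r,
                                        uint32_t first_sectors)
{
    struct io_range_pair p;

    if (first_sectors > r.nr_sectors)
        first_sectors = r.nr_sectors;

    p.first.sector = r.sector;
    p.first.nr_sectors = first_sectors;
    p.second.sector = r.sector + first_sectors;
    p.second.nr_sectors = r.nr_sectors - first_sectors;
    return p;
}
```

A striped device would submit the two halves to different underlying devices and complete the original request once both halves finish.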
Greg Ungerer [Sun, 25 May 2003 11:41:45 +0000 (04:41 -0700)]
[PATCH] m68knommu/5407 check timer irq pending
Add a function to allow checking if the timer interrupt is pending for
the m68knommu/5407 CPU. This is used by the architecture timer code for
microsecond accurate time calculations.
Greg Ungerer [Sun, 25 May 2003 11:41:38 +0000 (04:41 -0700)]
[PATCH] m68knommu/5272 check timer irq pending
Add a function to allow checking if the timer interrupt is pending for
the m68knommu/5272 CPU. This is used by the architecture timer code for
microsecond accurate time calculations.
Greg Ungerer [Sun, 25 May 2003 11:41:29 +0000 (04:41 -0700)]
[PATCH] m68knommu/5307 check timer irq pending
Add a function to allow checking if the timer interrupt is pending for
the m68knommu/5307 CPU. This is used by the architecture timer code for
microsecond accurate time calculations.
Greg Ungerer [Sun, 25 May 2003 11:20:56 +0000 (04:20 -0700)]
[PATCH] update m68knommu link script with 5282 support
This patch does a couple of things to the m68knommu common linker
script:
- adds support for the 5282 ColdFire CPU
- fixes broken setup for the Dragon Engine board
(i) The prototypes for free_vfsmnt(), alloc_vfsmnt(), do_kern_mount()
so far occurred in several individual c files. Now they are in
<linux/mount.h>.
(ii) do_kern_mount() has a third argument name that is typically a
constant. It is called with "rootfs", "nfsd", type->name,
"capifs", "usbdevfs", "binfmt_misc" etc. So, it should have a
prototype that expresses this:
do_kern_mount(const char *fstype, int flags, const char *name, void *data);
go away. Now do_kern_mount() calls type->get_sb(), so also get_sb()
must have a const third argument. That is what the patch below does.
If I am not mistaken, precisely two filesystems do not treat this
argument as a constant, namely afs and cifs. A separate patch
gives some cleanup there.
Greg Ungerer [Sun, 25 May 2003 10:33:22 +0000 (03:33 -0700)]
[PATCH] fix ColdFire 5407 cache flushing
This fixes some ColdFire 5407 cache bogosity. Previous code was pushing
all cache lines and then invalidating all of the cache. The push should
be enough, and now with underlying fixes to the cache setup registers
it is. Removed the whole invalidate cycle.
Paul Mackerras [Sun, 25 May 2003 16:00:29 +0000 (12:00 -0400)]
[PATCH] module refcounts for airport driver
This patch takes out the MOD_INC/DEC_USE_COUNT in the airport (Apple
wireless ethernet) driver. The driver already does SET_MODULE_OWNER
on the netdevice, so the MOD_INC/DEC_USE_COUNT are unnecessary and
just cause warnings.
Paul Mackerras [Sun, 25 May 2003 16:00:05 +0000 (12:00 -0400)]
[PATCH] module owner for ppp_synctty.c
This patch fixes ppp_synctty.c (used for doing PPP over some synchronous
serial HDLC links) so that it sets the owner field of the tty line
discipline it exports, rather than using MOD_INC/DEC_USE_COUNT. This
is more or less from Stephen Hemminger.
Ben Collins [Sun, 25 May 2003 09:19:58 +0000 (02:19 -0700)]
[PATCH] fs/* conversions for strlcpy
I only converted the cases where it was obvious that the intent was to
truncate on overflow. Lots of places for maxpath/readlink type stuff I
left alone.
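For reference, the truncate-on-overflow behaviour these conversions rely on can be sketched as below. This is an illustrative reimplementation named demo_strlcpy() (a made-up name, chosen to avoid clashing with any libc or kernel symbol), not the kernel's own code.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (of 'size' bytes), always NUL-terminating when
 * size > 0. Returns strlen(src), so a return value >= size tells the
 * caller that the copy was truncated. */
static size_t demo_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size != 0) {
        size_t n = (len >= size) ? size - 1 : len;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

Unlike strncpy(), the destination is always terminated and is not padded with zeroes, which is why only the "intent was to truncate" cases are straightforward to convert.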
Ben Collins [Sun, 25 May 2003 09:19:48 +0000 (02:19 -0700)]
[PATCH] sound/* strncpy conversion
This does a lot of cleanup for strncpy->strlcpy, replaces some
sprintf/snprintf's as well. There were only two places where things
weren't straightforward. All-in-all a good cleanup though.
Andrew Morton [Sun, 25 May 2003 08:14:58 +0000 (01:14 -0700)]
[PATCH] Change mmu_gathers into per-cpu data
From: Martin Hicks <mort@wildopensource.com>
Here is a patch that changes mmu_gathers into a per-cpu resource. It
includes the changes for all arches except ia64. I've sent a separate patch
to David Mosberger for ia64.
Andrew Morton [Sun, 25 May 2003 08:14:18 +0000 (01:14 -0700)]
[PATCH] Better fix for ia32 subarch circular dependencies
From: john stultz <johnstul@us.ibm.com>
This is a rework of John's recent change which resolved a circular include
dependency: a function in mach_apic.h requires hard_smp_processor_id() and
hard_smp_processor_id() requires macros from mach_apic.h
So this patch (against bk-current) reverts the previous, and fixes the same
circular dependency in a much cleaner way, by moving a piece of the circular
chain into its own .h file, rather than removing hard_smp_processor_id() and
accessing the apic by hand.
Andrew Morton [Sun, 25 May 2003 08:14:07 +0000 (01:14 -0700)]
[PATCH] rd.c: separate queue per disk
From: Maneesh Soni <maneesh@in.ibm.com>
Provides a separate request queue for each ramdisk instance. Without this,
the kernel oopses when the block layer tries to unregister the same set of
kobject things multiple times. This makes rd.c consistent with all other
"disk" devices.
Gerd Knorr noticed a small use of floating point math in the cpia driver
updates for 2.5.x I sent you a while ago, and this is not allowed in the
kernel.
This was in some code taken essentially verbatim from the Windows cpia
driver released under the GPL by STM inc. that had been incorporated in
the later versions of the cpia driver at sourceforge.
It turns out that the use of floating point was quite inessential, and I've
reimplemented the couple of lines of code involved in integer arithmetic.
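As an illustration of the general technique (not the actual cpia code, whose expressions are not shown here), a fractional scale factor can be applied with integer arithmetic only. The function name and the num/den parameters are hypothetical.

```c
#include <stdint.h>

/* Scale 'value' by the fraction num/den, rounding to nearest, without
 * touching the FPU. The 64-bit intermediate avoids overflow of the
 * product. 'den' must be non-zero. */
static uint32_t scale_rounded(uint32_t value, uint32_t num, uint32_t den)
{
    return (uint32_t)(((uint64_t)value * num + den / 2) / den);
}
```

For example, a floating point expression such as value * 0.75 becomes scale_rounded(value, 3, 4).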
Andrew Morton [Sun, 25 May 2003 08:13:17 +0000 (01:13 -0700)]
[PATCH] add notify_count for de_thread
From: Manfred Spraul <manfred@colorfullife.com>
de_thread is called by exec to kill all threads in the thread group except
the threads required for exec.
The waiting is implemented by waiting for a wakeup from __exit_signal: If
the reference count is less than or equal to 2, then the waiter is woken up. If
exec is called by a non-leader thread, then two threads are required for
exec.
But if a thread group leader calls exec, then only one thread is required
for exec. Thus the hardcoded "2" leads to a superfluous wakeup. The patch
fixes that by adding a "notify_count" field to the signal structure.
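The effect of replacing the hardcoded "2" can be sketched with a small model. The struct and function names are hypothetical simplifications; the real signal structure and __exit_signal are more involved.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the relevant signal structure state. */
struct demo_signal {
    int count;         /* threads still alive in the group */
    int notify_count;  /* wake the exec waiter at or below this count */
};

/* exec side: record how many threads must survive. */
static void demo_de_thread_prepare(struct demo_signal *sig,
                                   bool caller_is_leader)
{
    /* A leader calling exec needs only itself; a non-leader needs
     * itself plus the leader. */
    sig->notify_count = caller_is_leader ? 1 : 2;
}

/* exit side: returns true if the exec waiter should be woken. */
static bool demo_exit_signal(struct demo_signal *sig)
{
    sig->count--;
    return sig->count <= sig->notify_count;
}
```

With three threads and the leader calling exec, the first exit (count 3 to 2) no longer wakes the waiter; only the exit that leaves a single thread does.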
Andrew Morton [Sun, 25 May 2003 08:12:57 +0000 (01:12 -0700)]
[PATCH] overcommit root margin
From: Dave Hansen <haveblue@us.ibm.com>
This patch makes vm_enough_memory() more likely to return failure when
overcommit_memory==0 and !CAP_SYS_ADMIN. I'm not sure it's worth having
another tunable just for this.
I also reworked the documentation a bit. It should be a lot clearer to
read now.
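A rough sketch of the idea of a root margin follows. The function name, the page-based interface and especially the 1/32 fraction are assumptions made for illustration; the message above does not state the size of the margin.

```c
#include <stdbool.h>

/* Heuristic overcommit check (the overcommit_memory == 0 case), much
 * simplified. 'free_pages' stands for whatever estimate of available
 * memory the caller has computed. */
static bool demo_enough_memory(unsigned long free_pages,
                               unsigned long requested_pages,
                               bool cap_sys_admin)
{
    /* Assumed margin: hold back a small slice of memory so that only
     * privileged processes can use the last of it. */
    if (!cap_sys_admin)
        free_pages -= free_pages / 32;

    return free_pages > requested_pages;
}
```

The same request can therefore succeed for a CAP_SYS_ADMIN process and fail for an ordinary one when memory is nearly exhausted.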
Andrew Morton [Sun, 25 May 2003 08:12:47 +0000 (01:12 -0700)]
[PATCH] devpts xattr handler for security labels
From: Stephen Smalley <sds@epoch.ncsc.mil>
This patch against 2.5.69-bk adds an xattr handler for security labels
to devpts and corresponding hooks to the LSM API to support conversion
between xattr values and the security labels stored in the inode
security field by the security module.
This allows userspace to get and set the security labels on devpts
nodes, e.g. so that sshd can set the security label for the pty using
setxattr, just as sshd already sets the ownership using chown.
SELinux uses this support to protect the pty in accordance with the user
process' security label. The changes to the LSM API are general and
should be re-useable by xattr handlers in other pseudo filesystems to
support similar security labeling. The xattr handler for devpts
includes the same generic framework as in ext[23], so handlers for other
kinds of attributes can be added easily in the future.
Andrew Morton [Sun, 25 May 2003 08:12:17 +0000 (01:12 -0700)]
[PATCH] /proc/pid inode security labels
From: Stephen Smalley <sds@epoch.ncsc.mil>
This patch against 2.5.69-bk adds a hook to proc_pid_make_inode to allow
security modules to set the security attributes on /proc/pid inodes based on
the security attributes of the associated task. This is required by SELinux
in order to control access to the process state accessible via /proc/pid
inodes in accordance with the task's security label.
An alternative approach that was considered was to implement an xattr handler
for /proc/pid inodes. That approach would still require a hook call from the
xattr handler to the security module to obtain an xattr value based on the
task security attributes, so it would add a further level of
indirection/translation. The only benefit of implementing an xattr handler
for the /proc/pid inodes would be that the /proc/pid inode security labels
could then be exported to userspace. However, the /proc/pid inode security
labels are only used internally by the security module for access control
purposes, and userspace access to the full range of process attributes is
already provided via the /proc/pid/attr interface. Consequently, a simple
hook in proc_pid_make_inode seemed preferable.
Andrew Morton [Sun, 25 May 2003 08:12:07 +0000 (01:12 -0700)]
[PATCH] Process Attribute API for Security Modules (fixlet)
From: Stephen Smalley <sds@epoch.ncsc.mil>
This patch, relative to the /proc/pid/attr patch against 2.5.69, fixes the
mode values of the /proc/pid/attr nodes to avoid interference by the normal
Linux access checks for these nodes (and also fixes the /proc/pid/attr/prev
mode to reflect its read-only nature).
Otherwise, when the dumpable flag is cleared by a set[ug]id or unreadable
executable, a process will lose the ability to set its own attributes via
writes to /proc/pid/attr due to a DAC failure (/proc/pid inodes are
assigned the root uid/gid if the task is not dumpable, and the original
mode only permitted the owner to write).
The security module should implement appropriate permission checking in its
[gs]etprocattr hook functions. In the case of SELinux, the setprocattr
hook function only allows a process to write to its own /proc/pid/attr
nodes as well as imposing other policy-based restrictions, and the
getprocattr hook function performs a permission check between the security
labels of the current process and target process to determine whether the
operation is permitted.
Andrew Morton [Sun, 25 May 2003 08:11:57 +0000 (01:11 -0700)]
[PATCH] Process Attribute API for Security Modules
From: Stephen Smalley <sds@epoch.ncsc.mil>
This updated patch against 2.5.69 merges the readdir and lookup routines
for proc_base and proc_attr, fixes the copy_to_user call in proc_attr_read
and proc_info_read, moves the new data and code within CONFIG_SECURITY, and
uses ARRAY_SIZE, per the comments from Al Viro and Andrew Morton. As
before, this patch implements a process attribute API for security modules
via a set of nodes in a /proc/pid/attr directory. Credit for the idea of
implementing this API via /proc/pid/attr nodes goes to Al Viro. Jan Harkes
provided a nice cleanup of the implementation to reduce the code bloat.
Andrew Morton [Sun, 25 May 2003 08:11:35 +0000 (01:11 -0700)]
[PATCH] slab: account for reclaimable caches
We have a problem at present in vm_enough_memory(): it uses smoke-n-mirrors
to try to work out how much memory can be reclaimed from dcache and icache.
It sometimes gets it quite wrong, especially if the slab has internal
fragmentation. And it often does.
So here we take a new approach. Rather than trying to work out how many
pages are reclaimable by counting up the number of inodes and dentries, we
change the slab allocator to keep count of how many pages are currently used
by slabs which can be shrunk by the VM.
The creator of the slab marks the slab as being reclaimable at
kmem_cache_create()-time. Slab keeps a global counter of pages which are
currently in use by thus-tagged slabs.
Of course, we now slightly overestimate the amount of reclaimable memory,
because not _all_ of the icache, dcache, mbcache and quota caches are
reclaimable.
But I think it's better to be a bit permissive rather than bogusly failing
brk() calls as we do at present.
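The accounting scheme described above can be sketched as follows. The flag name, struct and counter are hypothetical stand-ins, not the identifiers used in the slab allocator.

```c
#include <stddef.h>

#define DEMO_RECLAIMABLE 0x1u  /* hypothetical "VM can shrink this" flag */

struct demo_cache {
    unsigned int flags;  /* set once at cache creation time */
    size_t pages;        /* pages currently backing this cache */
};

/* Global count of pages in use by caches tagged as reclaimable. */
static size_t demo_reclaim_pages;

static void demo_cache_init(struct demo_cache *c, unsigned int flags)
{
    c->flags = flags;
    c->pages = 0;
}

/* The cache takes 'pages' more pages from the page allocator. */
static void demo_cache_grow(struct demo_cache *c, size_t pages)
{
    c->pages += pages;
    if (c->flags & DEMO_RECLAIMABLE)
        demo_reclaim_pages += pages;
}

/* The cache gives 'pages' pages back to the page allocator. */
static void demo_cache_shrink(struct demo_cache *c, size_t pages)
{
    if (pages > c->pages)
        pages = c->pages;
    c->pages -= pages;
    if (c->flags & DEMO_RECLAIMABLE)
        demo_reclaim_pages -= pages;
}
```

Because the counter tracks whole pages rather than object counts, internal fragmentation no longer skews the estimate, at the price of the slight overestimate mentioned above.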
Andrew Morton [Sun, 25 May 2003 08:11:24 +0000 (01:11 -0700)]
[PATCH] Don't remove inode from hash until filesystem has
From: Neil Brown <neilb@cse.unsw.edu.au>
When an NFS request arrives, it contains a filehandle which needs to be
converted to a dentry. Many filesystems use find_exported_dentry in
fs/exportfs/expfs.c. A key part of this on filesystems where a 32bit inode
number uniquely locates a file is export_iget which calls iget(sb, inum).
iget will either:
1/ find the inode in the inode cache and return it
or
2/ create a new inode and call ->read_inode to load it from the
storage device.
export_iget then verifies the inode is really a good inode (->read_inode
didn't detect any problems) and the right inode (based on the generation number
from the file handle).
For this to work reliably, it is important that whenever an inode is *not* in
the cache, the on-device version is up-to-date. Otherwise, when read_inode
loads the inode it will get bad data.
For a file that has not been deleted, this condition always holds: a dirty
inode is always flushed to disc before the inode is unhashed.
However for a file that is being deleted this condition doesn't (didn't)
hold. When iput -> iput_final -> generic_drop_inode -> generic_delete_inode
is called we would unhash the inode before calling into the filesystem through
->delete_inode.
So there is a small window between when generic_delete_inode unhashes the
inode, and when ->delete_inode writes something to disc, where a call to
->read_inode (for export_iget) might discover what it thinks is a valid
inode, but is really one that is in the process of being destroyed.
It is this window that I want to close by moving the unhashing to the end of
generic_delete_inode.
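The window can be illustrated with a toy model of what a concurrent lookup sees halfway through deletion, under each ordering. All names are hypothetical and the model ignores locking entirely.

```c
#include <stdbool.h>

enum demo_lookup {
    DEMO_FOUND_CACHED,  /* found the in-core inode in the hash */
    DEMO_READ_VALID,    /* read from disk, looks like a live inode */
    DEMO_READ_FREED     /* read from disk, clearly a deleted inode */
};

struct demo_inode {
    bool hashed;     /* still reachable through the inode hash */
    bool disk_live;  /* on-disk copy still looks like a live inode */
};

/* Toy version of the iget() decision: cache first, then the disk. */
static enum demo_lookup demo_iget(const struct demo_inode *inode)
{
    if (inode->hashed)
        return DEMO_FOUND_CACHED;
    return inode->disk_live ? DEMO_READ_VALID : DEMO_READ_FREED;
}

/* What a lookup sees after the first of the two deletion steps. */
static enum demo_lookup demo_delete_midpoint(bool unhash_first)
{
    struct demo_inode inode = { true, true };

    if (unhash_first)
        inode.hashed = false;     /* old order: unhash, then delete */
    else
        inode.disk_live = false;  /* new order: delete, then unhash */

    return demo_iget(&inode);
}
```

With the old order the midpoint lookup returns DEMO_READ_VALID, which is the bogus "valid" inode described above; with the new order the lookup still finds the in-core inode and never trusts stale on-disk data.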
Andrew Morton [Sun, 25 May 2003 08:11:05 +0000 (01:11 -0700)]
[PATCH] xirc2ps_cs irq return fix
From zwane
We shutdown the MAC part of the card and have interrupts disabled, interrupt
gets queued, we reenable interrupts after shutting down device, service the
interrupt, check status and get 0xff from powered down device.
No idea what he's talking about here, but apparently the irq return handling
isn't working out. Just return IRQ_HANDLED all the time.
Andrew Morton [Sun, 25 May 2003 08:10:55 +0000 (01:10 -0700)]
[PATCH] reiserfs: inode attributes support.
From: Oleg Drokin <green@namesys.com>
This is a forward port of 2.4's inode attributes support for reiserfs.
Original implementation for 2.4 was performed by Nikita Danilov.
In order to enable this support, one must use the "attrs" mount option, e.g.:
mount /dev/hda1 /mount/point -t reiserfs -o attrs
Also either the filesystem must have been created with a recent mkreiserfs
or must have been modified by a recent version of reiserfsck with its
"--clean-attributes" option.
If that is not done, attributes support will not be enabled and a kernel
message will be printed. This is necessary because old kernels left random
garbage in the place where these attributes now live.
These attributes are totally compatible with ext2's ones. You can
manipulate them with chattr/lsattr etc.
Additionally the chattr 'd' option may be used to disable tail packing on a
specific file or a directory tree. (The 'd' option normally means "don't
dump". reiserfs has overloaded it).
Andrew Morton [Sun, 25 May 2003 08:10:47 +0000 (01:10 -0700)]
[PATCH] APM does unsafe conditional set_cpus_allowed
From: Zwane Mwaikambo <zwane@linuxpower.ca>
kapmd does a conditional check in order to decide whether to set the task's
cpu affinity mask. This can change during runtime, therefore we
unconditionally set it. There is an early exit in set_cpus_allowed if the
current processor is in the allowed mask anyway.
Andrew Morton [Sun, 25 May 2003 08:10:37 +0000 (01:10 -0700)]
[PATCH] Fix dcache_lock/tasklist_lock ranking bug
__unhash_process acquires the dcache_lock while holding the
tasklist_lock for writing. This can deadlock. Additionally,
fs/proc/base.c incorrectly assumed that p->pid would be set to 0 during
release_task.
The patch fixes that by adding a new spinlock to the task structure and
fixing all references to (!p->pid).
The alternative to the new spinlock would be to hold dcache_lock around
__unhash_process.
- fs/proc/base.c assumed that p->pid is reset to 0 during exit. This is
not the case anymore. I now look at the count of the pid structure for
PIDTYPE_PID.
- de_thread now tested - as broken as it was before: open handles to
/proc/<pid> are either stale or invalid after an exec of a nptl process,
if the exec was called from a secondary thread.
- a few lock_kernels removed - that part of /proc doesn't need it.
- additional instances of 'if(current->pid)' replaced with pid_alive.
Andrew Morton [Sun, 25 May 2003 08:10:00 +0000 (01:10 -0700)]
[PATCH] ppc64: more warning fixes
arch/ppc64/kernel/htab.c:105: warning: implicit declaration of function `pSeries_lpar_hpte_insert'
arch/ppc64/kernel/htab.c:109: warning: implicit declaration of function `pSeries_hpte_insert'
Linus Torvalds [Sun, 25 May 2003 07:52:57 +0000 (00:52 -0700)]
Make cdev infrastructure initialize early
Very early initialization (core_initcall) needs to have the cdev
initialization done. So make it part of the pre-initcall sequence, the
same way the bdev caches were done.
Ingo Molnar [Sun, 25 May 2003 04:50:32 +0000 (21:50 -0700)]
[PATCH] support "requeueing" futexes
This addresses a futex related SMP scalability problem of
glibc. A number of regressions have been reported to the NPTL mailing list
when going to many CPUs, for applications that use condition variables and
the pthread_cond_broadcast() API call. Using this functionality, testcode
shows a slowdown from 0.12 seconds runtime to over 237 seconds (!)
runtime, on 4-CPU systems.
pthread condition variables use two futex-backed mutex-alike locks: an
internal one for the glibc CV state itself, and a user-supplied mutex
which the API guarantees to take in certain codepaths. (Unfortunately the
user-supplied mutex cannot be used to protect the CV state, so we've got
to deal with two locks.)
The cause of the slowdown is a 'swarm effect': if lots of threads are
blocked on a condition variable, and pthread_cond_broadcast() is done,
then glibc first does a FUTEX_WAKE on the cv-internal mutex, then does a
mutex_down() on the user-supplied mutex. Ie. a swarm of threads is created
which all race to serialize on the user-supplied mutex. The more threads
are used, the more likely it becomes that the scheduler will balance them
over to other CPUs - where they just schedule, try to lock the mutex, and
go to sleep. This 'swarm effect' is purely technical, a side-effect of
glibc's use of futexes, and the imperfect coupling of the two locks.
The solution to this problem is to not wake up the swarm of threads, but
'requeue' them from the CV-internal mutex to the user-supplied mutex. The
attached patch adds the FUTEX_REQUEUE feature. FUTEX_REQUEUE requeues N
threads from futex address A to futex address B.
This way glibc can wake up a single thread (which will take the
user-mutex), and can requeue the rest, with a single system-call.
Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
number of people have tested it over the past couple of weeks. Here are
the measurements done by Saurabh Desai:
./pp -v -n 128 -i 1000 -S 32768:
Default NPTL: 128 games in 1.111s 1.270s 16.894s
requeue NPTL: 128 games in 1.111s 1.959s 2.426s
./pp -v -n 1024 -i 10 -S 32768:
Default NPTL: 1024 games in 0.181s 0.394s incompleted 2m+
requeue NPTL: 1024 games in 0.166s 0.254s 0.341s
The speedup with increasing number of threads is quite significant; in the
128-thread case it's more than 8 times. In the cond-perf test, on 4 CPUs
it's almost infinitely faster than the 'swarm of threads' catastrophe
triggered by the old code.
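The wake-one, requeue-the-rest behaviour can be modelled in plain C with two wait queues. This is a user-space toy, not the kernel implementation or the futex system call interface; the queue type and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_WAITERS 64

/* Toy wait queue: thread ids in FIFO order. */
struct demo_queue {
    int tid[DEMO_MAX_WAITERS];
    size_t len;
};

/* Remove and return the waiter at the head of a non-empty queue. */
static int demo_pop(struct demo_queue *q)
{
    int tid = q->tid[0];

    q->len--;
    memmove(q->tid, q->tid + 1, q->len * sizeof(q->tid[0]));
    return tid;
}

/* Wake up to nr_wake waiters on src, then move up to nr_requeue of the
 * remaining waiters onto dst without waking them. Returns the number of
 * waiters actually woken. */
static size_t demo_wake_and_requeue(struct demo_queue *src,
                                    struct demo_queue *dst,
                                    size_t nr_wake, size_t nr_requeue)
{
    size_t woken = 0;

    while (woken < nr_wake && src->len > 0) {
        (void)demo_pop(src);  /* this thread becomes runnable */
        woken++;
    }
    while (nr_requeue > 0 && src->len > 0 && dst->len < DEMO_MAX_WAITERS) {
        dst->tid[dst->len++] = demo_pop(src);
        nr_requeue--;
    }
    return woken;
}
```

Here src plays the role of the CV-internal futex and dst the user-supplied mutex: one thread runs and takes the mutex, while the others stay asleep and are handed over one at a time as the mutex is released.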
Alexander Viro [Sun, 25 May 2003 04:42:29 +0000 (21:42 -0700)]
[PATCH] i_cdev/i_cindex
New fields in struct inode - i_cdev and i_cindex. When we do open() on
a character device we cache result of cdev lookup in inode and put the
inode on a cyclic list anchored in cdev. If we already have that done,
we don't bother with any lookups. When inode disappears it's removed
from the list. When cdev gets unregistered we remove all cached
references to it (and remove such inodes from the list). cdev is held
until final fput() now.
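A simplified sketch of the caching scheme follows. The types are hypothetical stand-ins for struct inode and struct cdev, the lookup is a linear scan over a table, and a singly linked list is used instead of the cyclic list mentioned above, purely to keep the example short.

```c
#include <stddef.h>

struct demo_cdev;

/* Stand-in for the character-device parts of struct inode. */
struct demo_chr_inode {
    unsigned int rdev;            /* device number of this inode */
    struct demo_cdev *i_cdev;     /* cached lookup result, NULL if none */
    unsigned int i_cindex;        /* index of rdev within the cdev's range */
    struct demo_chr_inode *next;  /* link in the cdev's list of inodes */
};

/* Stand-in for struct cdev: a device-number range plus its inode list. */
struct demo_cdev {
    unsigned int first, count;
    struct demo_chr_inode *inodes;
};

static int demo_lookups;  /* counts how often the slow lookup runs */

static struct demo_cdev *demo_lookup(struct demo_cdev *table, size_t n,
                                     unsigned int dev)
{
    demo_lookups++;
    for (size_t i = 0; i < n; i++)
        if (dev >= table[i].first && dev - table[i].first < table[i].count)
            return &table[i];
    return NULL;
}

/* open(): use the cached cdev if present, otherwise look up and cache. */
static struct demo_cdev *demo_chrdev_open(struct demo_chr_inode *inode,
                                          struct demo_cdev *table, size_t n)
{
    if (!inode->i_cdev) {
        struct demo_cdev *c = demo_lookup(table, n, inode->rdev);

        if (!c)
            return NULL;
        inode->i_cdev = c;
        inode->i_cindex = inode->rdev - c->first;
        inode->next = c->inodes;
        c->inodes = inode;
    }
    return inode->i_cdev;
}

/* Unregister: drop every cached reference and empty the list. */
static void demo_cdev_unregister(struct demo_cdev *c)
{
    while (c->inodes) {
        struct demo_chr_inode *inode = c->inodes;

        c->inodes = inode->next;
        inode->i_cdev = NULL;
        inode->next = NULL;
    }
}
```

The second open() of the same inode skips the lookup entirely, and unregistering the cdev leaves no inode pointing at it.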
Alexander Viro [Sun, 25 May 2003 04:42:20 +0000 (21:42 -0700)]
[PATCH] cdev-cidr, part 1
New object: struct cdev. It contains a kobject, a pointer to
file_operations and a pointer to owner module. These guys have a search
structure of the same sort as gendisks and chrdev_open() picks
file_operations from them.
Intended use: embed such animal in driver-owned structure (e.g.
tty_driver) and register it as associated with given range of device
numbers. Generic code will do lookup for such object and use it for the
rest.
The behaviour of register_chrdev() is _not_ changed - it allocates
struct cdev and registers it; any old driver will work as if nothing had
changed.
At this stage we only use it during chrdev_open() to find
file_operations. Later it will be cached in inode->i_cdev (and index in
range - in inode->i_cindex) so that ->open() could get whatever objects
it wants directly without any special-cased lookups, etc.
Alexander Viro [Sun, 25 May 2003 04:42:11 +0000 (21:42 -0700)]
[PATCH] kobj_map
Code responsible for gendisk lookups taken out into drivers/base and
generalized - now it allows a range-based mapping from numbers
to kobjects for a given struct subsystem.
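A range-based mapping from numbers to objects can be sketched as below. This is an illustrative fixed-size table with first-match lookup; the names are hypothetical and the policy for overlapping ranges in the real code may differ.

```c
#include <stddef.h>

#define DEMO_MAX_RANGES 8

/* One registered range: numbers base .. base + len - 1 map to obj. */
struct demo_range {
    unsigned int base, len;
    void *obj;
};

struct demo_map {
    struct demo_range r[DEMO_MAX_RANGES];
    size_t n;
};

/* Register obj for 'len' consecutive numbers starting at 'base'.
 * Returns 0 on success, -1 if the table is full or len is zero. */
static int demo_map_add(struct demo_map *m, unsigned int base,
                        unsigned int len, void *obj)
{
    if (m->n == DEMO_MAX_RANGES || len == 0)
        return -1;
    m->r[m->n].base = base;
    m->r[m->n].len = len;
    m->r[m->n].obj = obj;
    m->n++;
    return 0;
}

/* Find the object covering 'num'. If 'index' is non-NULL it receives
 * the offset of 'num' within the matching range. */
static void *demo_map_lookup(const struct demo_map *m, unsigned int num,
                             unsigned int *index)
{
    for (size_t i = 0; i < m->n; i++) {
        const struct demo_range *r = &m->r[i];

        if (num >= r->base && num - r->base < r->len) {
            if (index)
                *index = num - r->base;
            return r->obj;
        }
    }
    return NULL;
}
```

The same structure serves both block and character devices: the number is a device number and the object is whatever the subsystem registered for that range.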
Alexander Viro [Sun, 25 May 2003 04:40:53 +0000 (21:40 -0700)]
[PATCH] cpqarray fixes
This restores the special-case behaviour of open() on the minor 0;
cpqarray allows opening that guy for ioctls even if nothing is
configured. That got broken when gendisk patches went in. Patch
restores the old behaviour by keeping gendisk for the first disk on
controller always registered; instead of unregistering it we set size to
0.