Andrew Morton [Wed, 30 Oct 2002 07:31:37 +0000 (23:31 -0800)]
[PATCH] percpu: convert timers
Patch from Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the per-CPU data in timer management (tvec_bases)
to use per_cpu data area and makes it safe for cpu_possible allocation
by using CPU notifiers. End result - saving space.
Andrew Morton [Wed, 30 Oct 2002 07:25:03 +0000 (23:25 -0800)]
[PATCH] slab: reap timers
- add a reap timer that returns stale objects from the cpu arrays
- use list_for_each instead of while loops
- /proc/slabinfo layout change, for a new field about reaping.
Implementation:
slab contains 2 caches that contain objects that might be usable to the
system:
- the cpu arrays contain objects that other cpus could use
- the slabs_free list contains freeable slabs, i.e. pages that someone
else might want.
The patch now keeps track of accesses to the cpu arrays and to the free
list. If there were no recent activities in one of the caches, part of
the cache is flushed.
Unlike <2.5.39, only a small part (~20%) is flushed each time:
The older kernel would refill/drain bounce heavily under memory pressure:
- kmem_cache_alloc: notices that there are no objects in the cpu
cache, loads 120 objects from the slab lists, returns 1.
[assuming batchcount=120]
- kmem_cache_reap is called due to memory pressure, finds 119
objects in the cpu array and returns them to the slab lists.
- repeat.
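The bounce described above, and the effect of flushing only part of the cpu array, can be modelled in user space. This is a toy sketch, not the slab code: struct cpu_array_model, alloc_one, flush_all, flush_part and simulate are hypothetical names, and only the batchcount of 120 and the ~20% fraction come from the text.

```c
#include <assert.h>

/* Toy model of one per-cpu object array; all names are hypothetical. */
struct cpu_array_model {
	unsigned int avail;      /* objects currently cached for this cpu */
	unsigned int batchcount; /* objects loaded per refill */
};

/* Hand out one object, refilling from the slab lists if the array is empty. */
static void alloc_one(struct cpu_array_model *ac, unsigned int *refills)
{
	assert(ac->batchcount > 0);
	if (ac->avail == 0) {
		ac->avail = ac->batchcount;
		(*refills)++;            /* one trip to the slab lists */
	}
	ac->avail--;
}

/* Old behaviour: memory pressure drains the whole array. */
static unsigned int flush_all(struct cpu_array_model *ac)
{
	unsigned int n = ac->avail;

	ac->avail = 0;
	return n;
}

/* New behaviour: flush only about 20% of the cached objects (rounded up). */
static unsigned int flush_part(struct cpu_array_model *ac)
{
	unsigned int n = (ac->avail + 4) / 5;

	ac->avail -= n;
	return n;
}

/* Count slab-list refills over `rounds` cycles of one alloc plus one reap. */
static unsigned int simulate(unsigned int rounds, int partial)
{
	struct cpu_array_model ac = { 0, 120 };
	unsigned int refills = 0, i;

	for (i = 0; i < rounds; i++) {
		alloc_one(&ac, &refills);
		if (partial)
			flush_part(&ac);
		else
			flush_all(&ac);
	}
	return refills;
}
```

With a full drain, simulate(10, 0) refills on every round. With the partial flush, simulate(10, 1) needs a single refill for the same ten rounds.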
In addition, the length of the free list is limited based on the free
list accesses: a fixed "1" limit hurts the large object caches.
That's the last part for now, next is: [not yet written]
- cleanup: BUG_ON instead of if() BUG
- OOM handling for enable_cpucaches
- remove the unconditional might_sleep() from
cache_alloc_debugcheck_before, and make that DEBUG dependent.
- initial NUMA support, just to collect some stats:
Which percentage of the objects are freed on the wrong
node? 0.1% or 20%?
Andrew Morton [Wed, 30 Oct 2002 07:24:43 +0000 (23:24 -0800)]
[PATCH] slab: cleanups and speedups
- enable the cpu array for all caches
- remove the optimized implementations for quick list access - with
cpu arrays in all caches, the list access is now rare.
- make the cpu arrays mandatory, this removes 50% of the conditional
branches from the hot path of kmem_cache_alloc [1]
- poisoning for objects with constructors
Patch got a bit longer...
I forgot to mention this: head arrays mean that some pages can be
blocked due to objects in the head arrays, and not returned to
page_alloc.c. The current kernel never flushes the head arrays, this
might worsen the behaviour of low memory systems. The hunk that
flushes the arrays regularly comes next.
Detailed changelog: [to be read side by side with the patch]
* docu update
* "growing" is not really needed: races between grow and shrink are
handled by retrying. [additionally, the current kernel never
shrinks]
* move the batchcount into the cpu array:
the old code contained a race during cpu cache tuning:
update batchcount [in cachep] before or after the IPI?
And NUMA will need it anyway.
* bootstrap support: the cpu arrays are really mandatory, nothing
works without them. Thus a statically allocated cpu array is needed
for starting the allocators.
* move the full, partial & free lists into a separate structure, as a
preparation for NUMA
* structure reorganization: now the cpu arrays are the most important
part, not the lists.
* dead code elimination: remove "failures", nowhere read.
* dead code elimination: remove "OPTIMIZE": not implemented. The
idea is to skip the virt_to_page lookup for caches with on-slab slab
structures, and use (ptr&PAGE_MASK) instead. The details are in
Bonwick's paper. Not fully implemented.
* remove GROWN: kernel never shrinks a cache, thus grown is
meaningless.
* bootstrap: starting the slab allocator is now a 3 stage process:
- nothing works, use the statically allocated cpu arrays.
- the smallest kmalloc allocator works, use it to allocate
cpu arrays.
- all kmalloc allocators work, use the default cpu array size
* register a cpu notifier callback, and allocate the needed head
arrays if a new cpu arrives
* always enable head arrays, even for DEBUG builds. Poisoning and
red-zoning now happens before an object is added to the arrays.
Insert enable_all_cpucaches into cpucache_init, there is no need for
a separate function.
* modifications to the debug checks due to the earlier calls of the
dtor for caches with poisoning enabled
* poison+ctor is now supported
* squeezing 3 objects into a cacheline is hopeless, the FIXME is not
solvable and can be removed.
* move do_ccupdate_local nearer to do_tune_cpucache. Should have
been part of -04-drain.
* additional objects checks. red-zoning is tricky: it's implemented
by increasing the object size by 2*BYTES_PER_WORD. Thus
BYTES_PER_WORD must be added to objp before calling the destructor,
constructor or before returning the object from alloc. The poison
functions add BYTES_PER_WORD internally.
* create a flagcheck function, right now the tests are duplicated in
cache_grow [always] and alloc_debugcheck_before [DEBUG only]
* modify slab list updates: all allocs are now bulk allocs that try
to get multiple objects at once, update the list pointers only at the
end of a bulk alloc, not once per alloc.
* might_sleep was moved into kmem_flagcheck.
* major hotpath change:
- cc always exists, no fallback
- cache_alloc_refill is called with disabled interrupts,
and does everything to recover from an empty cpu array.
Far shorter & simpler __cache_alloc [inlined in both
kmalloc and kmem_cache_alloc]
* __free_block, free_block, cache_flusharray: main implementation of
returning objects to the lists. no big changes, diff lost track.
* new debug check: too early kmalloc or kmem_cache_alloc
* slightly reduce the sizes of the cpu arrays: keep the size < a
power of 2, including batchcount, avail and now limit, for optimal
kmalloc memory efficiency.
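The bulk-alloc and hot-path items above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the patch itself: struct cpu_array_sketch, struct slab_lists_sketch, refill_sketch and alloc_sketch are hypothetical names, the slab lists are reduced to a plain stack of free objects, and interrupt disabling is left out.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LIMIT 8 /* hypothetical capacity of the cpu array */

/* Per-cpu array that always exists, so the hot path has no fallback. */
struct cpu_array_sketch {
	unsigned int avail;
	unsigned int batchcount;
	void *entry[ARRAY_LIMIT];
};

/* Stand-in for the full/partial/free slab lists. */
struct slab_lists_sketch {
	void **objs;               /* stack of free objects */
	unsigned int nr_free;
	unsigned int list_updates; /* how often the list state was touched */
};

/* Slow path: move up to batchcount objects, update the lists once. */
static int refill_sketch(struct cpu_array_sketch *ac,
			 struct slab_lists_sketch *l)
{
	unsigned int n = ac->batchcount;

	assert(ac->avail == 0);
	if (n > l->nr_free)
		n = l->nr_free;
	if (n > ARRAY_LIMIT)
		n = ARRAY_LIMIT;
	if (n == 0)
		return 0;
	while (n--)
		ac->entry[ac->avail++] = l->objs[--l->nr_free];
	l->list_updates++;         /* once per bulk alloc, not per object */
	return 1;
}

/* Hot path: pop from the cpu array, refill only when it is empty. */
static void *alloc_sketch(struct cpu_array_sketch *ac,
			  struct slab_lists_sketch *l)
{
	if (ac->avail == 0 && !refill_sketch(ac, l))
		return NULL;
	return ac->entry[--ac->avail];
}
```

With a batchcount of 4 and ten free objects, ten allocations touch the list state three times (4 + 4 + 2 objects) instead of ten times.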
That's it. I even found 2 bugs while reading: dtors and ctors for
verify were called with wrong parameters, with RED_ZONE enabled, and
some checks still assumed that POISON and ctor are incompatible.
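The pointer arithmetic behind red-zoning, which is where the wrong-parameter bugs came from, can be illustrated with a small user-space sketch. It assumes one guard word on each side of the payload, as the changelog describes. The guard byte RED_BYTE and the redzone_* helper names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BYTES_PER_WORD sizeof(void *)
#define RED_BYTE 0x5A /* hypothetical guard pattern, not the kernel's magic */

/* Red-zoning grows the object by one guard word on each side. */
static size_t redzone_objsize(size_t payload)
{
	return payload + 2 * BYTES_PER_WORD;
}

/* Write both guard words around the payload area. */
static void redzone_init(unsigned char *raw, size_t payload)
{
	assert(raw != NULL);
	memset(raw, RED_BYTE, BYTES_PER_WORD);
	memset(raw + BYTES_PER_WORD + payload, RED_BYTE, BYTES_PER_WORD);
}

/*
 * The pointer handed to the caller, the constructor or the destructor
 * is the raw object pointer plus BYTES_PER_WORD.
 */
static unsigned char *redzone_user_ptr(unsigned char *raw)
{
	return raw + BYTES_PER_WORD;
}

/* Return 1 if both guard words are untouched, 0 otherwise. */
static int redzone_intact(const unsigned char *raw, size_t payload)
{
	size_t i;

	for (i = 0; i < BYTES_PER_WORD; i++) {
		if (raw[i] != RED_BYTE)
			return 0;
		if (raw[BYTES_PER_WORD + payload + i] != RED_BYTE)
			return 0;
	}
	return 1;
}
```

Passing the raw pointer instead of redzone_user_ptr(raw) to a constructor would make it write over the leading guard word, which is the class of bug mentioned above.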
Andrew Morton [Wed, 30 Oct 2002 07:24:33 +0000 (23:24 -0800)]
[PATCH] slab: remove spaces from /proc identifiers
From Manfred Spraul
remove the space from the name of the DMA caches: they make it
impossible to tune the caches through /proc/slabinfo, and make parsing
/proc/slabinfo difficult
Andrew Morton [Wed, 30 Oct 2002 07:24:12 +0000 (23:24 -0800)]
[PATCH] slab: reduce internal fragmentation
From Manfred Spraul
If an object is freed from a slab, then move the slab to the tail of
the partial list - this should increase the probability that the other
objects from the same page are freed, too, and that a page can be
returned to gfp later.
In other words: if we just freed an object from this page then make
this page be the *last* page which is eligible for new allocations.
Under the assumption that other objects in that same page are about to
be freed up as well.
The cpu arrays are now always in front of the list, i.e. cache hit
rates should not matter.
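The list policy can be sketched with a minimal circular list in user space. This is an illustration only: struct list_node, struct slab_sketch and the helpers are hypothetical stand-ins for the kernel's list and slab structures, and the full and free lists are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list; a stand-in for the kernel's. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static void list_del(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* `list` must stay the first member so a node pointer is a slab pointer. */
struct slab_sketch {
	struct list_node list;
	unsigned int inuse;
};

/* Free one object: the slab becomes the last candidate for new allocs. */
static void slab_free_one(struct slab_sketch *s, struct list_node *partial)
{
	assert(s->inuse > 0);
	s->inuse--;
	list_del(&s->list);
	list_add_tail(&s->list, partial);
}

/* New allocations are served from the head of the partial list. */
static struct slab_sketch *next_alloc_slab(struct list_node *partial)
{
	if (partial->next == partial)
		return NULL;
	return (struct slab_sketch *)partial->next;
}
```

After a free from the head slab, the next allocation is served from a different slab, so the slab that just lost an object gets a chance to drain completely.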
Andrew Morton [Wed, 30 Oct 2002 07:23:34 +0000 (23:23 -0800)]
[PATCH] slab: extended cpu notifiers
Patch from Dipankar Sarma <dipankar@in.ibm.com>
This is Manfred's patch which provides a CPU_UP_PREPARE cpu notifier to
allow initialization of per_cpu data just before the cpu becomes fully
functional.
It also provides a facility for the CPU_UP_PREPARE handler to return
NOTIFY_BAD to signify that the CPU is not permitted to come up. If
that happens, a CPU_UP_CANCELLED message is passed to all the handlers.
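The prepare/veto/cancel flow can be modelled in user space as below. This is a sketch of the idea, not the kernel notifier API: the EV_* and RET_* names, cpu_notifier_fn, call_chain, cpu_up_sketch and the two example handlers are all hypothetical. It assumes the chain stops at the first handler that refuses, and that the cancel message then goes to every handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the notifier events and return codes. */
enum cpu_event { EV_UP_PREPARE, EV_ONLINE, EV_UP_CANCELLED };
enum notify_ret { RET_OK, RET_BAD };

typedef enum notify_ret (*cpu_notifier_fn)(enum cpu_event ev,
					   unsigned int cpu);

/* Call handlers in order; stop at the first one that refuses. */
static int call_chain(const cpu_notifier_fn *chain, size_t n,
		      enum cpu_event ev, unsigned int cpu)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (chain[i](ev, cpu) == RET_BAD)
			return -1;
	return 0;
}

/* Bring a cpu up: prepare first, cancel everywhere if anyone vetoes. */
static int cpu_up_sketch(const cpu_notifier_fn *chain, size_t n,
			 unsigned int cpu)
{
	size_t i;

	if (call_chain(chain, n, EV_UP_PREPARE, cpu) != 0) {
		for (i = 0; i < n; i++)
			chain[i](EV_UP_CANCELLED, cpu);
		return -1;
	}
	/* A real implementation would start the cpu here. */
	call_chain(chain, n, EV_ONLINE, cpu);
	return 0;
}

/* Example handler: counts events, like a subsystem doing per-cpu setup. */
static unsigned int prepared, cancelled, online;

static enum notify_ret counting_handler(enum cpu_event ev, unsigned int cpu)
{
	(void)cpu;
	if (ev == EV_UP_PREPARE)
		prepared++;
	else if (ev == EV_UP_CANCELLED)
		cancelled++;
	else
		online++;
	return RET_OK;
}

/* Example handler: refuses to let (hypothetical) cpu 3 come up. */
static enum notify_ret veto_handler(enum cpu_event ev, unsigned int cpu)
{
	return (ev == EV_UP_PREPARE && cpu == 3) ? RET_BAD : RET_OK;
}
```

A handler that allocated per-cpu data on the prepare event would free it again on the cancel event, which is why the cancel message is delivered to all handlers in this sketch.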
The patch also fixes a bogus NOTIFY_BAD return from the softirq setup
code.
Patch has been acked by Rusty.
We need this mechanism in slab for starting per-cpu timers and for
allocating the per-cpu slab head arrays *before* the CPU has come up
and started using slab.
Linus Torvalds [Wed, 30 Oct 2002 06:54:51 +0000 (22:54 -0800)]
Merge penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/epoll-0.15
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
Patrick Mochel [Wed, 30 Oct 2002 04:41:43 +0000 (20:41 -0800)]
kobjects: add array of default attributes to subsystems, and create on registration.
struct subsystem may now contain a pointer to a NULL-terminated array of
default attributes to be exported when an object is registered with the subsystem.
kobject registration will check the return values of the directory creation and
the creation of each file, and handle it appropriately.
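Walking a NULL-terminated array of default attributes with error checking might look roughly like this in user space. The sketch makes assumptions the changelog does not state: "handle it appropriately" is taken to mean removing the files created so far, and struct attr_sketch, fake_create_file, fake_remove_file and populate_dir_sketch are hypothetical names, not the sysfs interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a default attribute. */
struct attr_sketch {
	const char *name;
};

/* Fake filesystem state, used only to make the sketch observable. */
static unsigned int files_present;
static const char *fail_on; /* name the fake create refuses, or NULL */

static int fake_create_file(const struct attr_sketch *a)
{
	if (fail_on && strcmp(a->name, fail_on) == 0)
		return -1;
	files_present++;
	return 0;
}

static void fake_remove_file(const struct attr_sketch *a)
{
	(void)a;
	assert(files_present > 0);
	files_present--;
}

/* Create one file per attribute; undo everything if one creation fails. */
static int populate_dir_sketch(const struct attr_sketch *const *attrs)
{
	size_t i;

	if (!attrs)
		return 0; /* a subsystem may have no default attributes */
	for (i = 0; attrs[i]; i++) {
		if (fake_create_file(attrs[i]) != 0) {
			while (i--)
				fake_remove_file(attrs[i]);
			return -1;
		}
	}
	return 0;
}
```

The NULL terminator lets a subsystem declare its attribute list as a static array without carrying a separate length.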
Patrick Mochel [Wed, 30 Oct 2002 04:27:36 +0000 (20:27 -0800)]
sysfs: kill struct sysfs_dir.
Previously, sysfs read() and write() calls looked for sysfs_ops in the struct
sysfs_dir, in the kobject. Since an object belongs to a subsystem, and is a member
of a group of like devices, the sysfs_ops have been moved to struct subsystem,
and are referenced from there.
The only remaining member of struct sysfs_dir is the dentry of the object's
directory. That is moved out of the dir struct and directly into struct kobject.
That saves us 4 bytes/object.
All of the sysfs functions that referenced the struct have been changed to just
reference the dentry.
Linus Torvalds [Wed, 30 Oct 2002 04:24:40 +0000 (20:24 -0800)]
Merge penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/kconfig
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
Patrick Mochel [Wed, 30 Oct 2002 03:47:41 +0000 (19:47 -0800)]
Introduce struct subsystem.
A struct subsystem is basically a collection of objects of a certain type,
and some callbacks to operate on objects of that type.
subsystems contain embedded kobjects themselves, and have a similar set of
library routines that kobjects do, which are mostly just wrappers for the
correlating kobject routines.
kobjects are inserted in depth-first order into their subsystem's list of
objects. Orphan kobjects are also given foster parents that point to their
subsystem. This provides a bit more rigidity in the hierarchy, and disallows
any orphan kobjects.
When an object is unregistered, it is removed from its subsystem's list. When
the object's refcount hits 0, the subsystem's ->release() callback is called.
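The refcount-to-release relationship can be shown with a small user-space sketch. The names struct kobj_sketch, struct subsys_sketch, kobj_get_sketch and kobj_put_sketch are hypothetical, the counter is a plain int rather than an atomic, and list handling is left out.

```c
#include <assert.h>
#include <stddef.h>

struct kobj_sketch;

/* Hypothetical subsystem: owns the release callback for its objects. */
struct subsys_sketch {
	void (*release)(struct kobj_sketch *kobj);
};

/* Hypothetical object: a reference count plus a pointer to its subsystem. */
struct kobj_sketch {
	int refcount;
	struct subsys_sketch *subsys;
};

/* Example release callback that only counts how often it ran. */
static unsigned int released_count;

static void count_release(struct kobj_sketch *kobj)
{
	(void)kobj;
	released_count++;
}

static void kobj_init_sketch(struct kobj_sketch *k, struct subsys_sketch *s)
{
	k->refcount = 1; /* the creator holds the first reference */
	k->subsys = s;
}

static struct kobj_sketch *kobj_get_sketch(struct kobj_sketch *k)
{
	assert(k->refcount > 0);
	k->refcount++;
	return k;
}

/* Drop a reference; the last put hands the object to ->release(). */
static void kobj_put_sketch(struct kobj_sketch *k)
{
	assert(k->refcount > 0);
	if (--k->refcount == 0 && k->subsys && k->subsys->release)
		k->subsys->release(k);
}
```

Because the callback lives in the subsystem, every object of that type is torn down the same way without carrying its own function pointer.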
Documentation describing the objects and the interfaces has also been added.
David Brownell [Tue, 29 Oct 2002 16:12:42 +0000 (08:12 -0800)]
[PATCH] ohci td error cleanup
This is a version of a patch I sent out last Friday to help
address some of the "bad entry" errors that some folk
were seeing, seemingly only with control requests. The fix
is just to not try being clever: remove one TD at a time and
patch the ED as if that TD had completed normally, then do
the next ... don't try to patch just once in this fault case.
(And it nukes some debug info I accidentally submitted.)
David Brownell [Tue, 29 Oct 2002 15:43:32 +0000 (07:43 -0800)]
[PATCH] USB: clean up usb structures some more
This patch splits up the usb structures to have two structs,
"usb_XXX_descriptor" with just the descriptor, and "usb_host_XXX" (or
something similar) to wrap it and add the "extra" pointers plus the
array of related descriptors that the host parsed during enumeration.
(2 or 3 words extra in each "usb_host_XXX".) This further matches the
"on the wire" data and enables the gadget drivers to share the same
header file.
Covers all the linux/drivers/usb/* and linux/sound/usb/* stuff, but
not a handful of other drivers (bluetooth, iforce, hisax, irda) that
are out of the usb tree and will likely be affected.
We used to lock (inc mod use count) all drivers just in case, but
it makes more sense to only lock the one we're just using, in
particular since the old scheme was rather broken when insmod'ing
a new driver later.
Again, use a per ttyI timer handler to feed arrived data into the
ttyI. Really, there shouldn't be the need for any timer at all,
rather working flow control, but that'll take a bit to fix.
ISDN: New timer handling for "+++" escape sequence
Instead of having one common timer and walking the list of
all ISDN channels, which might be possibly associated with a
ttyI and even more possibly so waiting for the silence period
after "+++", just use a per ttyI timer, which only gets activated
when necessary.
The common way in the kernel is to pass around the struct (e.g.
struct net_device), and leave the user the possibility to add
its private data using ::priv, so do it the same way when accessing
an ISDN channel.
ISDN: stat_callback() and recv_callback() -> event_callback()
Merge the two different types of callbacks into just one, there's no
good reason for the receive callback to be different, in particular since
we pass things through the same state machine later anyway.
For some reason, isdnloop didn't support the transparent encoding,
which is necessary for testing V.110. Testing also found a typo
causing an oops in isdn_common.c. Fixed.
It'd probably make more sense to provide it in library form
to the hardware drivers which don't support V.110 natively, but for
now it's at least collected in one place.
ISDN: Move the tty receive queue out of generic code
Moving the tty receive queue into the tty-specific data in fact
simplifies the common code (which doesn't need to know it at all, now),
and the tty code, which can access the queue more directly.
ISDN: Route all driver callbacks through the driver state machine
We used to intercept status callbacks which were for specific channels
instead of the driver before passing them to the driver and short-cutting
them to the per-channel state machine. Do it correctly for now,
i.e. callback -> driver -> channel, even though that might have a small
performance hit. Correctness first.
ISDN: Remove ttyI specific from global "dev" variable
ISDN still has a huge global struct called "dev", which is a mess
of parts which should be private to their respective subsystem.
It's supposed to die, this is another step in making that happen.
Change the incoming call logic: Incoming calls are signalled to
the net interface code first, then the tty code. It's the lower level's
responsibility to claim the call by issuing ISDN_CMD_ACCEPTD now.
Remove some crud which is handled by isdn_common state machines now.
They were never used except for passing the state to userspace,
but not used in any application I know of. If necessary, the information
can easily be recovered by looking at fi.state == ST_ACTIVE
Since we unfortunately cannot rely on the hardware drivers to get
their states always correct, have the common layer keep track of the
states and sanitize them before passing them on to applications
as network interfaces / ttyIs.
It's useless information, we need to iterate over all potential
drivers anyway, since possibly the first one has unregistered before
the second, leaving a hole.
ISDN: Make array of drivers private to isdn_common.c
Currently, we need to provide a couple of helper functions to avoid
breaking isdn_tty with this change, as that gets cleaned up, the need
for those helpers should vanish as well.
read() should be safe against missed wake-ups now. These devices
should actually be implemented by the hardware drivers directly, would
make for much cleaner code. Unfortunately, isdnctrl is
using /dev/isdnctrl for the common ioctls, which are handled by the link
layer, so that's not easily possible. Too bad.
ISDN: Make raw-IP, CISCO HDLC, ... support optional
They'll still get compiled all into one module, but now you can choose
what you need - it's not hard to go from here to individual modules,
but most protocol-specific code is so small that it's probably not worth
it.
Though I've been mostly moving stuff out of include/linux and into
drivers/isdn/i4l, the finite state machine definitions actually need
to be more widely accessible, so they go the opposite way.