From: Andi Kleen
Date: Tue, 11 Feb 2003 13:20:56 +0000 (-0800)
Subject: [PATCH] x86-64 merge
X-Git-Tag: v2.5.61~66
X-Git-Url: http://git.neil.brown.name/?a=commitdiff_plain;h=d8f19f2cac70f3dd0d2e631af063f9dd5c05f4b3;p=history.git

[PATCH] x86-64 merge

This brings the x86-64 port up to date in 2.5.60. Unfortunately I cannot
test too much because I constantly get deadlocks in exit/wait in initscripts
on SMP bootup. The kernel still seems to lose a lot of SIGCHLD; 2.5.59/SMP
had the same problem. Uniprocessor kernels, and SMP kernels on UP, seem to
work.

This patch only touches x86-64 specific files. It requires a few simple
changes to arch independent files that I will send separately.

- Fixed a lot of obsolete/misleading configure help texts.
- Remove the old boot-block disk loader and support the fdimage target for
  syslinux instead (H. Peter Anvin)
- Fix a potential FPU signal-restore problem in 32-bit emulation.
- Merge with 2.5.60 i386 (hugetlbfs, ACPI etc.)
- Some fixes for local-APIC-disabled mode.
- Beginnings of S3 ACPI wakeup from real mode (not working yet, don't use)
- Beginnings of NUMA/CONFIG_DISCONTIGMEM support for AMD K8 (work in
  progress, port from 2.4): clean up memory mapping at bootup, generalize
  bootmem etc.
- Fix 64-bit GS base reload problem and re-enable it (Karsten Keil)
- Fix a race with vmalloc accesses from interrupt handlers disturbing the
  page fault handler, and a similar race for the debug handler (thanks to
  Andrew Morton)
- Merge CPU access primitives with i386
- Revert to a private module list for now, because putting modules into
  vmlist triggered too many problems.
- Some cleanups, removal of unneeded code.
- Let early __get_free_pages see a consistent PDA
- Preempt disabled for now because it is still too broken
- Signal handler fixes
- Fix do_gettimeofday to be completely lockless and re-enable vsyscalls
- Optimize the context switch path a bit (should be ported to i386)
- Get thread_info via the stack for better code
- Don't leak pmd pages
- Clean up hardcoded task stack sizes.
--- diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig index e7d194d086c1..f47ae16317c5 100644 --- a/arch/x86_64/Kconfig +++ b/arch/x86_64/Kconfig @@ -19,11 +19,6 @@ config X86_64 config X86 bool default y - help - This is Linux's home port. Linux was originally native to the Intel - 386, and runs on all the later x86 processors including the Intel - 486, 586, Pentiums, and various instruction-set-compatible chips by - AMD, Cyrix, and others. config MMU bool @@ -35,20 +30,10 @@ config SWAP config ISA bool - help - Find out whether you have ISA slots on your motherboard. ISA is the - name of a bus system, i.e. the way the CPU talks to the other stuff - inside your box. Other bus systems are PCI, EISA, MicroChannel - (MCA) or VESA. ISA is an older system, now being displaced by PCI; - newer boards don't support it. If you have ISA, say Y, otherwise N. config SBUS bool -config UID16 - bool - default y - config RWSEM_GENERIC_SPINLOCK bool default y @@ -86,14 +71,14 @@ choice default MK8 config MK8 - bool "AMD-Hammer" + bool "AMD-Opteron/Athlon64" help - Support for AMD Clawhammer/Sledgehammer CPUs. Only choice for x86-64 - currently so you should choose this if you want a x86-64 kernel. In fact - you will have no other choice than to choose this. + Optimize for AMD Opteron/Athlon64/Hammer/K8 CPUs. config GENERIC_CPU bool "Generic-x86-64" + help + Generic x86-64 CPU. endchoice @@ -196,25 +181,12 @@ config SMP singleprocessor machines. On a singleprocessor machine, the kernel will run faster if you say N here. - Note that if you say Y here and choose architecture "586" or - "Pentium" under "Processor family", the kernel will not work on 486 - architectures. Similarly, multiprocessor kernels for the "PPro" - architecture may not work on all Pentium based boards. - - People using multiprocessor machines who say Y here should also say - Y to "Enhanced Real Time Clock Support", below. The "Advanced Power - Management" code will be disabled if you say Y here. 
- - See also the , - , , - and the SMP-HOWTO available at - . - If you don't know what to do here, say N. +# broken currently config PREEMPT + depends on NOT_WORKING bool "Preemptible Kernel" - depends on !SMP ---help--- This option reduces the latency of the kernel when reacting to real-time or interactive events by allowing a low priority process to @@ -229,6 +201,28 @@ config PREEMPT Say Y here if you are feeling brave and building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. +# someone write a better help text please. +config K8_NUMA + bool "K8 NUMA support" + depends on SMP + help + Enable NUMA (Non Unified Memory Architecture) support for + AMD Opteron Multiprocessor systems. The kernel will try to allocate + memory used by a CPU on the local memory controller of the CPU + and in the future do more optimizations. This may improve performance + or it may not. Code is still experimental. + Say N if unsure. + +config DISCONTIGMEM + bool + depends on K8_NUMA + default y + +config NUMA + bool + depends on K8_NUMA + default y + config HAVE_DEC_LOCK bool depends on SMP @@ -245,15 +239,17 @@ config NR_CPUS kernel will support. The maximum supported value is 32 and the minimum value which makes sense is 2. - This is purely to save memory - each supported CPU adds - approximately eight kilobytes to the kernel image. + This is purely to save memory - each supported CPU requires + memory in the static kernel configuration. config GART_IOMMU bool "IOMMU support" help Support the K8 IOMMU. Needed to run systems with more than 4GB of memory - properly with 32-bit devices. You should probably turn this on. - The iommu can be turned off at runtime with the iommu=off parameter. + properly with 32-bit PCI devices that do not support DAC (Double Address + Cycle). The IOMMU can be turned off at runtime with the iommu=off parameter. + Normally the kernel will take the right choice by itself. 
+ If unsure say Y config DUMMY_IOMMU bool @@ -291,7 +287,8 @@ config PM Note that, even if you say N here, Linux on the x86 architecture will issue the hlt instruction if nothing is to be done, thereby - sending the processor to sleep and saving power. + sending the processor to limited sleep and saving power. However + using ACPI will likely save more power. config SOFTWARE_SUSPEND bool "Software Suspend (EXPERIMENTAL)" @@ -331,16 +328,6 @@ menu "Bus options (PCI etc.)" config PCI bool "PCI support" - help - Find out whether you have a PCI motherboard. PCI is the name of a - bus system, i.e. the way the CPU talks to the other stuff inside - your box. Other bus systems are ISA, EISA, MicroChannel (MCA) or - VESA. If you have PCI, say Y, otherwise N. - - The PCI-HOWTO, available from - , contains valuable - information about which PCI hardware does work under Linux and which - doesn't. # x86-64 doesn't support PCI BIOS access from long mode so always go direct. config PCI_DIRECT @@ -381,54 +368,10 @@ config KCORE_ELF bool depends on PROC_FS default y - ---help--- - If you enabled support for /proc file system then the file - /proc/kcore will contain the kernel core image. This can be used - in gdb: - - $ cd /usr/src/linux ; gdb vmlinux /proc/kcore - You have two choices here: ELF and A.OUT. Selecting ELF will make - /proc/kcore appear in ELF core format as defined by the Executable - and Linkable Format specification. Selecting A.OUT will choose the - old "a.out" format which may be necessary for some old versions - of binutils or on some architectures. - - This is especially useful if you have compiled the kernel with the - "-g" option to preserve debugging information. It is mainly used - for examining kernel data structures on the live kernel so if you - don't understand what this means or are not a kernel hacker, just - leave it at its default value ELF. 
- -#tristate 'Kernel support for a.out binaries' CONFIG_BINFMT_AOUT config BINFMT_ELF - tristate "Kernel support for ELF binaries" - ---help--- - ELF (Executable and Linkable Format) is a format for libraries and - executables used across different architectures and operating - systems. Saying Y here will enable your kernel to run ELF binaries - and enlarge it by about 13 KB. ELF support under Linux has now all - but replaced the traditional Linux a.out formats (QMAGIC and ZMAGIC) - because it is portable (this does *not* mean that you will be able - to run executables from different architectures or operating systems - however) and makes building run-time libraries very easy. Many new - executables are distributed solely in ELF format. You definitely - want to say Y here. - - Information about ELF is contained in the ELF HOWTO available from - . - - If you find that after upgrading from Linux kernel 1.2 and saying Y - here, you still can't run any ELF binaries (they just crash), then - you'll have to install the newest ELF runtime libraries, including - ld.so (check the file for location and - latest version). - - If you want to compile this as a module ( = code which can be - inserted in and removed from the running kernel whenever you want), - say M here and read . The module - will be called binfmt_elf. Saying M or N here is dangerous because - some crucial programs on your system might be in ELF format. + bool + default y config BINFMT_MISC tristate "Kernel support for MISC binaries" @@ -436,12 +379,9 @@ config BINFMT_MISC If you say Y here, it will be possible to plug wrapper-driven binary formats into the kernel. You will like this especially when you use programs that need an interpreter to run like Java, Python or - Emacs-Lisp. It's also useful if you often run DOS executables under - the Linux DOS emulator DOSEMU (read the DOSEMU-HOWTO, available from - ). 
Once you have - registered such a binary class with the kernel, you can start one of - those programs simply by typing in its name at a shell prompt; Linux - will automatically feed it to the correct interpreter. + Emacs-Lisp. Once you have registered such a binary class with the kernel, + you can start one of those programs simply by typing in its name at a shell + prompt; Linux will automatically feed it to the correct interpreter. You can do other nice things, too. Read the file to learn how to use this @@ -467,6 +407,12 @@ config COMPAT depends on IA32_EMULATION default y + +config UID16 + bool + depends on IA32_EMULATION + default y + endmenu source "drivers/mtd/Kconfig" @@ -672,9 +618,11 @@ config DEBUG_SPINLOCK best used in conjunction with the NMI watchdog so that spinlock deadlocks are also debuggable. +# !SMP for now because the context switch early causes GPF in segment reloading +# and the GS base checking does the wrong thing then, causing a hang. config CHECKING bool "Additional run-time checks" - depends on DEBUG_KERNEL + depends on DEBUG_KERNEL && !SMP help Enables some internal consistency checks for kernel debugging. You should normally say N. @@ -683,7 +631,8 @@ config INIT_DEBUG bool "Debug __init statements" depends on DEBUG_KERNEL help - Fill __init and __initdata at the end of boot. This is only for debugging. + Fill __init and __initdata at the end of boot. This helps debugging + illegal uses of __init and __initdata after initialization. config KALLSYMS bool "Load all symbols for debugging/kksymoops" @@ -696,11 +645,11 @@ config FRAME_POINTER bool "Compile the kernel with frame pointers" depends on DEBUG_KERNEL help - If you say Y here the resulting kernel image will be slightly larger - and slower, but it will give very useful debugging information. - If you don't debug the kernel, you can say N, but we may not be able - to solve problems without frame pointers. - Note this is normally not needed on x86-64. 
+ Compile the kernel with frame pointers. This may help for some + debugging with external debuggers. Note the standard oops backtracer + doesn't make use of it and the x86-64 kernel doesn't ensure an consistent + frame pointer through inline assembly (semaphores etc.) + Normally you should say N. endmenu diff --git a/arch/x86_64/Makefile b/arch/x86_64/Makefile index 34100cd32426..2e37ee42a98b 100644 --- a/arch/x86_64/Makefile +++ b/arch/x86_64/Makefile @@ -58,7 +58,8 @@ drivers-$(CONFIG_OPROFILE) += arch/x86_64/oprofile/ boot := arch/x86_64/boot -.PHONY: bzImage bzlilo bzdisk install archmrproper +.PHONY: bzImage bzlilo install archmrproper \ + fdimage fdimage144 fdimage288 archclean #Default target when executing "make" all: bzImage @@ -74,7 +75,7 @@ bzlilo: vmlinux bzdisk: vmlinux $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) zdisk -install: vmlinux +install fdimage fdimage144 fdimage288: vmlinux $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) $@ archclean: @@ -103,3 +104,6 @@ define archhelp echo ' install to $$(INSTALL_PATH) and run lilo' endef +CLEAN_FILES += arch/$(ARCH)/boot/fdimage arch/$(ARCH)/boot/mtools.conf + + diff --git a/arch/x86_64/boot/Makefile b/arch/x86_64/boot/Makefile index 922c75f760d4..2f57327ba4e5 100644 --- a/arch/x86_64/boot/Makefile +++ b/arch/x86_64/boot/Makefile @@ -59,8 +59,36 @@ $(obj)/setup $(obj)/bootsect: %: %.o FORCE $(obj)/compressed/vmlinux: FORCE $(Q)$(MAKE) $(build)=$(obj)/compressed IMAGE_OFFSET=$(IMAGE_OFFSET) $@ -zdisk: $(BOOTIMAGE) - dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0 +# Set this if you want to pass append arguments to the zdisk/fdimage kernel +FDARGS = + +$(obj)/mtools.conf: $(obj)/mtools.conf.in + sed -e 's|@OBJ@|$(obj)|g' < $< > $@ + +# This requires write access to /dev/fd0 +zdisk: $(BOOTIMAGE) $(obj)/mtools.conf + MTOOLSRC=$(src)/mtools.conf mformat a: ; sync + syslinux /dev/fd0 ; sync + echo 'default linux $(FDARGS)' | \ + MTOOLSRC=$(src)/mtools.conf mcopy - a:syslinux.cfg + MTOOLSRC=$(src)/mtools.conf 
mcopy $(BOOTIMAGE) a:linux ; sync + +# These require being root or having syslinux run setuid +fdimage fdimage144: $(BOOTIMAGE) $(src)/mtools.conf + dd if=/dev/zero of=$(obj)/fdimage bs=1024 count=1440 + MTOOLSRC=$(src)/mtools.conf mformat v: ; sync + syslinux $(obj)/fdimage ; sync + echo 'default linux $(FDARGS)' | \ + MTOOLSRC=$(src)/mtools.conf mcopy - v:syslinux.cfg + MTOOLSRC=$(src)/mtools.conf mcopy $(BOOTIMAGE) v:linux ; sync + +fdimage288: $(BOOTIMAGE) $(src)/mtools.conf + dd if=/dev/zero of=$(obj)/fdimage bs=1024 count=2880 + MTOOLSRC=$(src)/mtools.conf mformat w: ; sync + syslinux $(obj)/fdimage ; sync + echo 'default linux $(FDARGS)' | \ + MTOOLSRC=$(src)/mtools.conf mcopy - w:syslinux.cfg + MTOOLSRC=$(src)/mtools.conf mcopy $(BOOTIMAGE) w:linux ; sync zlilo: $(BOOTIMAGE) if [ -f $(INSTALL_PATH)/vmlinuz ]; then mv $(INSTALL_PATH)/vmlinuz $(INSTALL_PATH)/vmlinuz.old; fi diff --git a/arch/x86_64/boot/bootsect.S b/arch/x86_64/boot/bootsect.S index c17f4bae61c5..bb15d406ee95 100644 --- a/arch/x86_64/boot/bootsect.S +++ b/arch/x86_64/boot/bootsect.S @@ -4,29 +4,13 @@ * modified by Drew Eckhardt * modified by Bruce Evans (bde) * modified by Chris Noe (May 1999) (as86 -> gas) - * - * 360k/720k disk support: Andrzej Krzysztofowicz + * gutted by H. Peter Anvin (Jan 2003) * * BIG FAT NOTE: We're in real mode using 64k segments. Therefore segment * addresses must be multiplied by 16 to obtain their respective linear * addresses. To avoid confusion, linear addresses are written using leading * hex while segment addresses are written as segment:offset. * - * bde - should not jump blindly, there may be systems with only 512K low - * memory. Use int 0x12 to get the top of memory, etc. - * - * It then loads 'setup' directly after itself (0x90200), and the system - * at 0x10000, using BIOS interrupts. - * - * NOTE! currently system is at most (8*65536-4096) bytes long. This should - * be no problem, even in the future. I want to keep it simple. 
This 508 kB - * kernel size should be enough, especially as this doesn't contain the - * buffer cache as in minix (and especially now that the kernel is - * compressed :-) - * - * The loader has been made as simple as possible, and continuous - * read errors will result in a unbreakable loop. Reboot by hand. It - * loads pretty fast by getting whole tracks at a time whenever possible. */ #include @@ -59,353 +43,51 @@ SWAP_DEV = 0 /* SWAP_DEV is now written by "build" */ .global _start _start: -# First things first. Move ourself from 0x7C00 -> 0x90000 and jump there. - - movw $BOOTSEG, %ax - movw %ax, %ds # %ds = BOOTSEG - movw $INITSEG, %ax - movw %ax, %es # %ax = %es = INITSEG - movw $256, %cx - subw %si, %si - subw %di, %di - cld - rep - movsw - ljmp $INITSEG, $go - -# bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We -# wouldn't have to worry about this if we checked the top of memory. Also -# my BIOS can be configured to put the wini drive tables in high memory -# instead of in the vector table. The old stack might have clobbered the -# drive table. + # Normalize the start address + jmpl $BOOTSEG, $start2 -go: movw $0x4000-12, %di # 0x4000 is an arbitrary value >= - # length of bootsect + length of - # setup + room for stack; - # 12 is disk parm size. - movw %ax, %ds # %ax and %es already contain INITSEG +start2: + movw %cs, %ax + movw %ax, %ds + movw %ax, %es movw %ax, %ss - movw %di, %sp # put stack at INITSEG:0x4000-12. - -# Many BIOS's default disk parameter tables will not recognize -# multi-sector reads beyond the maximum sector number specified -# in the default diskette parameter tables - this may mean 7 -# sectors in some cases. -# -# Since single sector reads are slow and out of the question, -# we must take care of this by creating new parameter tables -# (for the first disk) in RAM. We will set the maximum sector -# count to 36 - the most we will encounter on an ED 2.88. -# -# High doesn't hurt. Low does. 
-# -# Segments are as follows: %cs = %ds = %es = %ss = INITSEG, %fs = 0, -# and %gs is unused. - - movw %cx, %fs # %fs = 0 - movw $0x78, %bx # %fs:%bx is parameter table address - pushw %ds - ldsw %fs:(%bx), %si # %ds:%si is source - movb $6, %cl # copy 12 bytes - pushw %di # %di = 0x4000-12. - rep # don't worry about cld - movsw # already done above - popw %di - popw %ds - movb $36, 0x4(%di) # patch sector count - movw %di, %fs:(%bx) - movw %es, %fs:2(%bx) - -# Get disk drive parameters, specifically number of sectors/track. + movw $0x7c00, %sp + sti + cld -# It seems that there is no BIOS call to get the number of sectors. -# Guess 36 sectors if sector 36 can be read, 18 sectors if sector 18 -# can be read, 15 if sector 15 can be read. Otherwise guess 9. -# Note that %cx = 0 from rep movsw above. + movw $bugger_off_msg, %si - movw $disksizes, %si # table of sizes to try -probe_loop: +msg_loop: lodsb - cbtw # extend to word - movw %ax, sectors - cmpw $disksizes+4, %si - jae got_sectors # If all else fails, try 9 - - xchgw %cx, %ax # %cx = track and sector - xorw %dx, %dx # drive 0, head 0 - movw $0x0200, %bx # address = 512, in INITSEG (%es = %cs) - movw $0x0201, %ax # service 2, 1 sector - int $0x13 - jc probe_loop # try next value - -got_sectors: - movb $0x03, %ah # read cursor pos - xorb %bh, %bh - int $0x10 - movw $9, %cx - movb $0x07, %bl # page 0, attribute 7 (normal) - # %bh is set above; int10 doesn't - # modify it - movw $msg1, %bp - movw $0x1301, %ax # write string, move cursor - int $0x10 # tell the user we're loading.. - -# Load the setup-sectors directly after the moved bootblock (at 0x90200). -# We should know the drive geometry to do it, as setup may exceed first -# cylinder (for 9-sector 360K and 720K floppies). 
- - movw $0x0001, %ax # set sread (sector-to-read) to 1 as - movw $sread, %si # the boot sector has already been read - movw %ax, (%si) - - call kill_motor # reset FDC - movw $0x0200, %bx # address = 512, in INITSEG -next_step: - movb setup_sects, %al - movw sectors, %cx - subw (%si), %cx # (%si) = sread - cmpb %cl, %al - jbe no_cyl_crossing - movw sectors, %ax - subw (%si), %ax # (%si) = sread -no_cyl_crossing: - call read_track - pushw %ax # save it - call set_next # set %bx properly; it uses %ax,%cx,%dx - popw %ax # restore - subb %al, setup_sects # rest - for next step - jnz next_step - - pushw $SYSSEG - popw %es # %es = SYSSEG - call read_it - call kill_motor - call print_nl - -# After that we check which root-device to use. If the device is -# defined (!= 0), nothing is done and the given device is used. -# Otherwise, one of /dev/fd0H2880 (2,32) or /dev/PS0 (2,28) or /dev/at0 (2,8) -# depending on the number of sectors we pretend to know we have. - -# Segments are as follows: %cs = %ds = %ss = INITSEG, -# %es = SYSSEG, %fs = 0, %gs is unused. - - movw root_dev, %ax - orw %ax, %ax - jne root_defined - - movw sectors, %bx - movw $0x0208, %ax # /dev/ps0 - 1.2Mb - cmpw $15, %bx - je root_defined - - movb $0x1c, %al # /dev/PS0 - 1.44Mb - cmpw $18, %bx - je root_defined - - movb $0x20, %al # /dev/fd0H2880 - 2.88Mb - cmpw $36, %bx - je root_defined - - movb $0, %al # /dev/fd0 - autodetect -root_defined: - movw %ax, root_dev - -# After that (everything loaded), we jump to the setup-routine -# loaded directly after the bootblock: - - ljmp $SETUPSEG, $0 - -# These variables are addressed via %si register as it gives shorter code. - -sread: .word 0 # sectors read of current track -head: .word 0 # current head -track: .word 0 # current track - -# This routine loads the system at address SYSSEG, making sure -# no 64kB boundaries are crossed. We try to load it as fast as -# possible, loading whole tracks whenever we can. 
- -read_it: - movw %es, %ax # %es = SYSSEG when called - testw $0x0fff, %ax -die: jne die # %es must be at 64kB boundary - xorw %bx, %bx # %bx is starting address within segment -rp_read: -#ifdef __BIG_KERNEL__ # look in setup.S for bootsect_kludge - bootsect_kludge = 0x220 # 0x200 + 0x20 which is the size of the - lcall *bootsect_kludge # bootsector + bootsect_kludge offset -#else - movw %es, %ax - subw $SYSSEG, %ax - movw %bx, %cx - shr $4, %cx - add %cx, %ax # check offset -#endif - cmpw syssize, %ax # have we loaded everything yet? - jbe ok1_read - - ret - -ok1_read: - movw sectors, %ax - subw (%si), %ax # (%si) = sread - movw %ax, %cx - shlw $9, %cx - addw %bx, %cx - jnc ok2_read - - je ok2_read - - xorw %ax, %ax - subw %bx, %ax - shrw $9, %ax -ok2_read: - call read_track - call set_next - jmp rp_read - -read_track: - pusha - pusha - movw $0xe2e, %ax # loading... message 2e = . + andb %al, %al + jz die + movb $0xe, %ah movw $7, %bx int $0x10 - popa - -# Accessing head, track, sread via %si gives shorter code. 
+ jmp msg_loop - movw 4(%si), %dx # 4(%si) = track - movw (%si), %cx # (%si) = sread - incw %cx - movb %dl, %ch - movw 2(%si), %dx # 2(%si) = head - movb %dl, %dh - andw $0x0100, %dx - movb $2, %ah - pushw %dx # save for error dump - pushw %cx - pushw %bx - pushw %ax - int $0x13 - jc bad_rt - - addw $8, %sp - popa - ret - -set_next: - movw %ax, %cx - addw (%si), %ax # (%si) = sread - cmp sectors, %ax - jne ok3_set - movw $0x0001, %ax - xorw %ax, 2(%si) # change head - jne ok4_set - incw 4(%si) # next track -ok4_set: +die: + # Allow the user to press a key, then reboot xorw %ax, %ax -ok3_set: - movw %ax, (%si) # set sread - shlw $9, %cx - addw %cx, %bx - jnc set_next_fin - movw %es, %ax - addb $0x10, %ah - movw %ax, %es - xorw %bx, %bx -set_next_fin: - ret - -bad_rt: - pushw %ax # save error code - call print_all # %ah = error, %al = read - xorb %ah, %ah - xorb %dl, %dl - int $0x13 - addw $10, %sp - popa - jmp read_track - -# print_all is for debugging purposes. -# -# it will print out all of the registers. The assumption is that this is -# called from a routine, with a stack frame like -# -# %dx -# %cx -# %bx -# %ax -# (error) -# ret <- %sp - -print_all: - movw $5, %cx # error code + 4 registers - movw %sp, %bp -print_loop: - pushw %cx # save count remaining - call print_nl # <-- for readability - cmpb $5, %cl - jae no_reg # see if register name is needed + int $0x16 + int $0x19 - movw $0xe05 + 'A' - 1, %ax - subb %cl, %al - int $0x10 - movb $'X', %al - int $0x10 - movb $':', %al - int $0x10 -no_reg: - addw $2, %bp # next register - call print_hex # print it - popw %cx - loop print_loop - ret - -print_nl: - movw $0xe0d, %ax # CR - int $0x10 - movb $0xa, %al # LF - int $0x10 - ret - -# print_hex is for debugging purposes, and prints the word -# pointed to by %ss:%bp in hexadecimal. 
- -print_hex: - movw $4, %cx # 4 hex digits - movw (%bp), %dx # load word into %dx -print_digit: - rolw $4, %dx # rotate to use low 4 bits - movw $0xe0f, %ax # %ah = request - andb %dl, %al # %al = mask for nybble - addb $0x90, %al # convert %al to ascii hex - daa # in only four instructions! - adc $0x40, %al - daa - int $0x10 - loop print_digit - ret + # int 0x19 should never return. In case it does anyway, + # invoke the BIOS reset code... + ljmp $0xf000,$0xfff0 -# This procedure turns off the floppy drive motor, so -# that we enter the kernel in a known state, and -# don't have to worry about it later. -# NOTE: Doesn't save %ax or %dx; do it yourself if you need to. -kill_motor: - movw $0x3f2, %dx - xorb %al, %al - outb %al, %dx - ret +bugger_off_msg: + .ascii "Direct booting from floppy is no longer supported.\r\n" + .ascii "Please use a boot loader program instead.\r\n" + .ascii "\n" + .ascii "Remove disk and press any key to reboot . . .\r\n" + .byte 0 -sectors: .word 0 -disksizes: .byte 36, 18, 15, 9 -msg1: .byte 13, 10 - .ascii "Loading" -# XXX: This is a fairly snug fit. 
+ # Kernel attributes; used by setup -.org 497 + .org 497 setup_sects: .byte SETUPSECTS root_flags: .word ROOT_RDONLY syssize: .word SYSSIZE diff --git a/arch/x86_64/boot/mtools.conf.in b/arch/x86_64/boot/mtools.conf.in new file mode 100644 index 000000000000..efd6d2490c1d --- /dev/null +++ b/arch/x86_64/boot/mtools.conf.in @@ -0,0 +1,17 @@ +# +# mtools configuration file for "make (b)zdisk" +# + +# Actual floppy drive +drive a: + file="/dev/fd0" + +# 1.44 MB floppy disk image +drive v: + file="@OBJ@/fdimage" cylinders=80 heads=2 sectors=18 filter + +# 2.88 MB floppy disk image (mostly for virtual uses) +drive w: + file="@OBJ@/fdimage" cylinders=80 heads=2 sectors=36 filter + + diff --git a/arch/x86_64/boot/tools/build.c b/arch/x86_64/boot/tools/build.c index 2c231bdd2910..c2fa66313170 100644 --- a/arch/x86_64/boot/tools/build.c +++ b/arch/x86_64/boot/tools/build.c @@ -150,13 +150,10 @@ int main(int argc, char ** argv) sz = sb.st_size; fprintf (stderr, "System is %d kB\n", sz/1024); sys_size = (sz + 15) / 16; - /* 0x28000*16 = 2.5 MB, conservative estimate for the current maximum */ - if (sys_size > (is_big_kernel ? 0x28000 : DEF_SYSSIZE)) + /* 0x40000*16 = 4.0 MB, reasonable estimate for the current maximum */ + if (sys_size > (is_big_kernel ? 0x40000 : DEF_SYSSIZE)) die("System is too big. Try using %smodules.", is_big_kernel ? 
"" : "bzImage or "); - if (sys_size > 0xefff) - fprintf(stderr,"warning: kernel is too big for standalone boot " - "from floppy\n"); while (sz > 0) { int l, n; diff --git a/arch/x86_64/defconfig b/arch/x86_64/defconfig index c0fbb1bb960c..4e3820457f54 100644 --- a/arch/x86_64/defconfig +++ b/arch/x86_64/defconfig @@ -5,7 +5,6 @@ CONFIG_X86_64=y CONFIG_X86=y CONFIG_MMU=y CONFIG_SWAP=y -CONFIG_UID16=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y @@ -22,12 +21,6 @@ CONFIG_EXPERIMENTAL=y CONFIG_SYSVIPC=y # CONFIG_BSD_PROCESS_ACCT is not set CONFIG_SYSCTL=y -# CONFIG_LOG_BUF_SHIFT_17 is not set -CONFIG_LOG_BUF_SHIFT_16=y -# CONFIG_LOG_BUF_SHIFT_15 is not set -# CONFIG_LOG_BUF_SHIFT_14 is not set -# CONFIG_LOG_BUF_SHIFT_13 is not set -# CONFIG_LOG_BUF_SHIFT_12 is not set CONFIG_LOG_BUF_SHIFT=16 # @@ -37,6 +30,7 @@ CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_OBSOLETE_MODPARM=y +# CONFIG_MODVERSIONS is not set # CONFIG_KMOD is not set # @@ -103,6 +97,7 @@ CONFIG_BINFMT_ELF=y # CONFIG_BINFMT_MISC is not set CONFIG_IA32_EMULATION=y CONFIG_COMPAT=y +CONFIG_UID16=y # # Memory Technology Devices (MTD) @@ -290,6 +285,7 @@ CONFIG_NETDEVICES=y # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=y +# CONFIG_MII is not set # CONFIG_HAPPYMEAL is not set # CONFIG_SUNGEM is not set # CONFIG_NET_VENDOR_3COM is not set @@ -490,6 +486,7 @@ CONFIG_RTC=y # CONFIG_DRM is not set # CONFIG_MWAVE is not set CONFIG_RAW_DRIVER=y +# CONFIG_HANGCHECK_TIMER is not set # # Misc devices @@ -615,7 +612,6 @@ CONFIG_DEBUG_KERNEL=y # CONFIG_DEBUG_SLAB is not set CONFIG_MAGIC_SYSRQ=y # CONFIG_DEBUG_SPINLOCK is not set -CONFIG_CHECKING=y # CONFIG_INIT_DEBUG is not set CONFIG_KALLSYMS=y # CONFIG_FRAME_POINTER is not set diff --git a/arch/x86_64/ia32/fpu32.c b/arch/x86_64/ia32/fpu32.c index c1de60a31782..09878eab6571 100644 --- a/arch/x86_64/ia32/fpu32.c +++ b/arch/x86_64/ia32/fpu32.c @@ -146,6 +146,7 @@ int restore_i387_ia32(struct task_struct *tsk, 
struct _fpstate_ia32 *buf, int fs return -1; } tsk->thread.i387.fxsave.mxcsr &= 0xffbf; + current->used_math = 1; return convert_fxsr_from_user(&tsk->thread.i387.fxsave, buf); } diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 0356e91dcd9f..f6b55b53a7a1 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -450,7 +450,7 @@ ia32_sys_call_table: .quad sys32_io_getevents .quad sys32_io_submit .quad sys_io_cancel - .quad sys_ni_syscall /* 250 alloc_huge_pages */ + .quad sys_fadvise64 .quad sys_ni_syscall /* free_huge_pages */ .quad sys_exit_group /* exit_group */ .quad sys_lookup_dcookie diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile index c31251ba7c00..b9adc70bb423 100644 --- a/arch/x86_64/kernel/Makefile +++ b/arch/x86_64/kernel/Makefile @@ -17,7 +17,7 @@ obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o obj-$(CONFIG_X86_IO_APIC) += io_apic.o mpparse.o obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o obj-$(CONFIG_ACPI) += acpi.o -#obj-$(CONFIG_ACPI_SLEEP) += acpi_wakeup.o +obj-$(CONFIG_ACPI_SLEEP) += wakeup.o obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_GART_IOMMU) += pci-gart.o aperture.o obj-$(CONFIG_DUMMY_IOMMU) += pci-nommu.o diff --git a/arch/x86_64/kernel/acpi.c b/arch/x86_64/kernel/acpi.c index fd366de53ff2..e3fc15bf008d 100644 --- a/arch/x86_64/kernel/acpi.c +++ b/arch/x86_64/kernel/acpi.c @@ -44,6 +44,9 @@ #include #include #include +#include +#include +#include extern int acpi_disabled; @@ -70,7 +73,6 @@ __acpi_map_table ( if (phys_addr < (end_pfn_map << PAGE_SHIFT)) return __va(phys_addr); - printk("acpi mapping beyond end_pfn: %lx > %lx\n", phys_addr, end_pfn< -#endif +extern void acpi_prepare_wakeup(void); +extern unsigned char acpi_wakeup[], acpi_wakeup_end[], s3_prot16[]; /* address in low memory of the wakeup routine. 
*/ -unsigned long acpi_wakeup_address = 0; - -/* new page directory that we will be using */ -static pmd_t *pmd; - -/* saved page directory */ -static pmd_t saved_pmd; - -/* page which we'll use for the new page directory */ -static pte_t *ptep; - -extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long)); - -/* - * acpi_create_identity_pmd - * - * Create a new, identity mapped pmd. - * - * Do this by creating new page directory, and marking all the pages as R/W - * Then set it as the new Page Middle Directory. - * And, of course, flush the TLB so it takes effect. - * - * We save the address of the old one, for later restoration. - */ -static void acpi_create_identity_pmd (void) -{ - pgd_t *pgd; - int i; - - ptep = (pte_t*)__get_free_page(GFP_KERNEL); - - /* fill page with low mapping */ - for (i = 0; i < PTRS_PER_PTE; i++) - set_pte(ptep + i, mk_pte_phys(i << PAGE_SHIFT, PAGE_SHARED)); - - pgd = pgd_offset(current->active_mm, 0); - pmd = pmd_alloc(current->mm,pgd, 0); - - /* save the old pmd */ - saved_pmd = *pmd; - - /* set the new one */ - set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(ptep))); - - /* flush the TLB */ - local_flush_tlb(); -} - -/* - * acpi_restore_pmd - * - * Restore the old pmd saved by acpi_create_identity_pmd and - * free the page that said function alloc'd - */ -static void acpi_restore_pmd (void) -{ - set_pmd(pmd, saved_pmd); - local_flush_tlb(); - free_page((unsigned long)ptep); -} +unsigned long acpi_wakeup_address; /** * acpi_save_state_mem - save kernel state - * - * Create an identity mapped page table and copy the wakeup routine to - * low memory. 
*/ int acpi_save_state_mem (void) { - acpi_create_identity_pmd(); - acpi_copy_wakeup_routine(acpi_wakeup_address); + if (!acpi_wakeup_address) + return -1; + memcpy((void*)acpi_wakeup_address, acpi_wakeup, acpi_wakeup_end - acpi_wakeup); return 0; } /** * acpi_save_state_disk - save kernel state to disk * + * Assume preemption/interrupts are already turned off and that we're running + * on the BP (note this doesn't imply SMP is handled correctly) */ int acpi_save_state_disk (void) { + unsigned long pbase = read_cr3() & PAGE_MASK; + if (pbase >= 0xffffffffUL) { + printk(KERN_ERR "ACPI: High page table. Suspend disabled.\n"); return 1; + } + set_seg_base(smp_processor_id(), GDT_ENTRY_KERNELCS16, s3_prot16); + swap_low_mappings(); + acpi_prepare_wakeup(); + return 0; } /* @@ -537,13 +484,13 @@ int acpi_save_state_disk (void) */ void acpi_restore_state_mem (void) { - acpi_restore_pmd(); + swap_low_mappings(); } /** * acpi_reserve_bootmem - do _very_ early ACPI initialisation * - * We allocate a page in low memory for the wakeup + * We allocate a page in 1MB low memory for the real-mode wakeup * routine for when we come back from a sleep state. The * runtime allocator allows specification of <16M pages, but not * <1M pages. @@ -551,7 +498,10 @@ void acpi_restore_state_mem (void) void __init acpi_reserve_bootmem(void) { acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE); - printk(KERN_DEBUG "ACPI: have wakeup address 0x%8.8lx\n", acpi_wakeup_address); + if (!acpi_wakeup_address) { + printk(KERN_ERR "ACPI: Cannot allocate lowmem. 
S3 disabled.\n"); + return; + } } #endif /*CONFIG_ACPI_SLEEP*/ diff --git a/arch/x86_64/kernel/aperture.c b/arch/x86_64/kernel/aperture.c index d306f7b93b5c..24d72222e5cc 100644 --- a/arch/x86_64/kernel/aperture.c +++ b/arch/x86_64/kernel/aperture.c @@ -57,7 +57,7 @@ static u32 __init allocate_aperture(void) printk("Cannot allocate aperture memory hole (%p,%uK)\n", p, aper_size>>10); if (p) - free_bootmem((unsigned long)p, aper_size); + free_bootmem_node(nd0, (unsigned long)p, aper_size); return 0; } printk("Mapping aperture over %d KB of RAM @ %lx\n", diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c index 9241f4043fbc..409f7e98f439 100644 --- a/arch/x86_64/kernel/apic.c +++ b/arch/x86_64/kernel/apic.c @@ -1026,7 +1026,7 @@ asmlinkage void smp_error_interrupt(void) irq_exit(); } -int disable_apic __initdata; +int disable_apic; /* * This initializes the IO-APIC and APIC hardware if this is @@ -1038,8 +1038,10 @@ int __init APIC_init_uniprocessor (void) printk(KERN_INFO "Apic disabled\n"); return -1; } - if (!smp_found_config && !cpu_has_apic) + if (!smp_found_config && !cpu_has_apic) { + disable_apic = 1; return -1; + } /* * Complain if the BIOS pretends there is one. @@ -1047,6 +1049,7 @@ int __init APIC_init_uniprocessor (void) if (!cpu_has_apic && APIC_INTEGRATED(apic_version[boot_cpu_id])) { printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n", boot_cpu_id); + disable_apic = 1; return -1; } diff --git a/arch/x86_64/kernel/bluesmoke.c b/arch/x86_64/kernel/bluesmoke.c index f63144d6a14a..12c339655be6 100644 --- a/arch/x86_64/kernel/bluesmoke.c +++ b/arch/x86_64/kernel/bluesmoke.c @@ -1,5 +1,9 @@ /* * arch/x86_64/kernel/bluesmoke.c - x86-64 Machine Check Exception Reporting + * + +RED-PEN: need to add power management to restore after S3 wakeup. 
+ */ #include diff --git a/arch/x86_64/kernel/e820.c b/arch/x86_64/kernel/e820.c index e6dd05b2bc85..c82c2332b73b 100644 --- a/arch/x86_64/kernel/e820.c +++ b/arch/x86_64/kernel/e820.c @@ -19,13 +19,17 @@ #include #include -extern unsigned long table_start, table_end; extern char _end[]; +/* + * PFN of last memory page. + */ +unsigned long end_pfn; + /* * end_pfn only includes RAM, while end_pfn_map includes all e820 entries. * The direct mapping extends to end_pfn_map, so that we can directly access - * ACPI and other tables without having to play with fixmaps. + * apertures, ACPI and other tables without having to play with fixmaps. */ unsigned long end_pfn_map; @@ -42,18 +46,16 @@ static inline int bad_addr(unsigned long *addrp, unsigned long size) unsigned long addr = *addrp, last = addr + size; /* various gunk below that needed for SMP startup */ - if (addr < 7*PAGE_SIZE) { - *addrp = 7*PAGE_SIZE; + if (addr < 0x8000) { + *addrp = 0x8000; return 1; } -#if 0 /* direct mapping tables of the kernel */ if (last >= table_start< end_pfn_map) end_pfn = end_pfn_map; + + return end_pfn; } /* diff --git a/arch/x86_64/kernel/early_printk.c b/arch/x86_64/kernel/early_printk.c index a79f52979570..ca9bd6087217 100644 --- a/arch/x86_64/kernel/early_printk.c +++ b/arch/x86_64/kernel/early_printk.c @@ -3,6 +3,7 @@ #include #include #include +#include /* Simple VGA output */ @@ -104,9 +105,9 @@ static __init void early_serial_init(char *opt) s = strsep(&opt, ","); if (s != NULL) { unsigned port; - if (!strncmp(s,"0x",2)) + if (!strncmp(s,"0x",2)) { early_serial_base = simple_strtoul(s, &e, 16); - else { + } else { static int bases[] = { 0x3f8, 0x2f8 }; if (!strncmp(s,"ttyS",4)) s+=4; diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S index 44e20af8ba97..68b192a517c1 100644 --- a/arch/x86_64/kernel/entry.S +++ b/arch/x86_64/kernel/entry.S @@ -512,8 +512,7 @@ ENTRY(spurious_interrupt) * Exception entry point. 
This expects an error code/orig_rax on the stack * and the exception handler in %rax. */ - ALIGN -error_entry: +ENTRY(error_entry) /* rdi slot contains rax, oldrax contains error code */ pushq %rsi movq 8(%rsp),%rsi /* load rax */ @@ -532,10 +531,7 @@ error_swapgs: xorl %ebx,%ebx swapgs error_sti: - bt $9,EFLAGS(%rsp) - jnc 1f - sti -1: movq %rdi,RDI(%rsp) + movq %rdi,RDI(%rsp) movq %rsp,%rdi movq ORIG_RAX(%rsp),%rsi /* get error code */ movq $-1,ORIG_RAX(%rsp) @@ -573,7 +569,8 @@ ENTRY(load_gs_index) swapgs gs_change: movl %edi,%gs -2: swapgs +2: sfence /* workaround */ + swapgs popf ret diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S index d9d254e033d4..14feb08d2f9d 100644 --- a/arch/x86_64/kernel/head.S +++ b/arch/x86_64/kernel/head.S @@ -72,8 +72,7 @@ startup_32: /* Setup EFER (Extended Feature Enable Register) */ movl $MSR_EFER, %ecx rdmsr - /* Fool rdmsr and reset %eax to avoid dependences */ - xorl %eax, %eax + /* Enable Long Mode */ btsl $_EFER_LME, %eax /* Enable System Call */ @@ -112,7 +111,6 @@ reach_compatibility_mode: jnz second /* Load new GDT with the 64bit segment using 32bit descriptor */ - /* to avoid 32bit relocations we use fixed adresses here */ movl $(pGDT32 - __START_KERNEL_map), %eax lgdt (%eax) @@ -349,17 +347,14 @@ ENTRY(cpu_gdt_table) .quad 0x00cffe000000ffff /* __USER32_CS */ .quad 0x00cff2000000ffff /* __USER_DS, __USER32_DS */ .quad 0x00affa000000ffff /* __USER_CS */ - .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb) - .word 0 # base address = 0 - .word 0x9A00 # code read/exec - .word 0x00CF # granularity = 4096, 386 - # (+5th nibble of limit) - /* __KERNEL32_CS */ + .quad 0x00cf9a000000ffff /* __KERNEL32_CS */ .quad 0,0 /* TSS */ .quad 0 /* LDT */ .quad 0,0,0 /* three TLS descriptors */ - .quad 0x00cff2000000ffff /* dummy descriptor for long base */ - .quad 0 /* pad to cache line boundary */ + .quad 0 /* unused now */ + .quad 0x00009a000000ffff /* __KERNEL16_CS - 16bit PM for S3 wakeup. 
*/ + /* base must be patched for real base address. */ + /* This should be a multiple of the cache line size */ gdt_end: .globl gdt_end diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c index a9a0842fbe41..f0bbb46a7e85 100644 --- a/arch/x86_64/kernel/head64.c +++ b/arch/x86_64/kernel/head64.c @@ -13,6 +13,8 @@ #include #include +#include +#include /* Don't add a printk in there. printk relies on the PDA which is not initialized yet. */ @@ -70,9 +72,6 @@ static void __init setup_boot_cpu_data(void) boot_cpu_data.x86_mask = eax & 0xf; } -extern void start_kernel(void), pda_init(int), setup_early_printk(char *); -extern int disable_apic; - void __init x86_64_start_kernel(char * real_mode_data) { char *s; @@ -83,6 +82,11 @@ void __init x86_64_start_kernel(char * real_mode_data) s = strstr(saved_command_line, "earlyprintk="); if (s != NULL) setup_early_printk(s+12); +#ifdef CONFIG_DISCONTIGMEM + s = strstr(saved_command_line, "numa="); + if (s != NULL) + numa_setup(s+5); +#endif #ifdef CONFIG_X86_IO_APIC if (strstr(saved_command_line, "disableapic")) disable_apic = 1; diff --git a/arch/x86_64/kernel/init_task.c b/arch/x86_64/kernel/init_task.c index f207c449c297..a2ee7e861281 100644 --- a/arch/x86_64/kernel/init_task.c +++ b/arch/x86_64/kernel/init_task.c @@ -11,6 +11,7 @@ static struct fs_struct init_fs = INIT_FS; static struct files_struct init_files = INIT_FILES; static struct signal_struct init_signals = INIT_SIGNALS(init_signals); +static struct sighand_struct init_sighand = INIT_SIGHAND(init_sighand); struct mm_struct init_mm = INIT_MM(init_mm); /* diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c index 430e903af5e1..ed43656b524e 100644 --- a/arch/x86_64/kernel/irq.c +++ b/arch/x86_64/kernel/irq.c @@ -137,7 +137,8 @@ int show_interrupts(struct seq_file *p, void *v) struct irqaction * action; seq_printf(p, " "); - for_each_cpu(j) + for (j=0; jnext) { + for (prevp = &mod_vmlist ; (map = *prevp) ; prevp = &map->next) { if 
((unsigned long)map->addr == addr) { *prevp = map->next; write_unlock(&vmlist_lock); @@ -81,7 +86,7 @@ void *module_alloc(unsigned long size) write_lock(&vmlist_lock); addr = (void *) MODULES_VADDR; - for (p = &vmlist; (tmp = *p); p = &tmp->next) { + for (p = &mod_vmlist; (tmp = *p); p = &tmp->next) { void *next; DEBUGP("vmlist %p %lu addr %p\n", tmp->addr, tmp->size, addr); if (size + (unsigned long) addr + PAGE_SIZE < (unsigned long) tmp->addr) diff --git a/arch/x86_64/kernel/mpparse.c b/arch/x86_64/kernel/mpparse.c index 92ca3b015fc8..c9c5c4470b47 100644 --- a/arch/x86_64/kernel/mpparse.c +++ b/arch/x86_64/kernel/mpparse.c @@ -29,6 +29,7 @@ #include #include #include +#include /* Have we found an MP table */ int smp_found_config; @@ -83,7 +84,6 @@ extern int acpi_parse_ioapic (acpi_table_entry_header *header); * Intel MP BIOS table parsing routines: */ -#ifndef CONFIG_X86_VISWS_APIC /* * Checksum an MP configuration block. */ @@ -582,9 +582,9 @@ static int __init smp_scan_config (unsigned long base, unsigned long length) smp_found_config = 1; printk("found SMP MP-table at %08lx\n", virt_to_phys(mpf)); - reserve_bootmem(virt_to_phys(mpf), PAGE_SIZE); + reserve_bootmem_generic(virt_to_phys(mpf), PAGE_SIZE); if (mpf->mpf_physptr) - reserve_bootmem(mpf->mpf_physptr, PAGE_SIZE); + reserve_bootmem_generic(mpf->mpf_physptr, PAGE_SIZE); mpf_found = mpf; return 1; } @@ -632,38 +632,14 @@ void __init find_intel_smp (void) printk(KERN_WARNING "WARNING: MP table in the EBDA can be UNSAFE, contact linux-smp@vger.kernel.org if you experience SMP problems!\n"); } -#else - -/* - * The Visual Workstation is Intel MP compliant in the hardware - * sense, but it doesnt have a BIOS(-configuration table). - * No problem for Linux. 
- */ -void __init find_visws_smp(void) -{ - smp_found_config = 1; - - phys_cpu_present_map |= 2; /* or in id 1 */ - apic_version[1] |= 0x10; /* integrated APIC */ - apic_version[0] |= 0x10; - - mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; -} - -#endif - /* * - Intel MP Configuration Table - * - or SGI Visual Workstation configuration */ void __init find_smp_config (void) { #ifdef CONFIG_X86_LOCAL_APIC find_intel_smp(); #endif -#ifdef CONFIG_VISWS - find_visws_smp(); -#endif } diff --git a/arch/x86_64/kernel/msr.c b/arch/x86_64/kernel/msr.c index 095e86a729ea..73e7d3b4de8b 100644 --- a/arch/x86_64/kernel/msr.c +++ b/arch/x86_64/kernel/msr.c @@ -22,6 +22,9 @@ * * This driver uses /dev/cpu/%d/msr where %d is the minor number, and on * an SMP box will direct the access to CPU %d. + +RED-PEN: need to get power management for S3 restore + */ #include diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c index 73e3adacefc1..3b9772754bd2 100644 --- a/arch/x86_64/kernel/nmi.c +++ b/arch/x86_64/kernel/nmi.c @@ -24,6 +24,7 @@ #include #include #include +#include extern void default_do_nmi(struct pt_regs *); @@ -71,13 +72,14 @@ int __init check_nmi_watchdog (void) printk(KERN_INFO "testing NMI watchdog ... "); - for_each_cpu(cpu) { + for (cpu = 0; cpu < NR_CPUS; cpu++) counts[cpu] = cpu_pda[cpu].__nmi_count; - } local_irq_enable(); mdelay((10*1000)/nmi_hz); // wait 10 ticks - for_each_cpu(cpu) { + for (cpu = 0; cpu < NR_CPUS; cpu++) { + if (!cpu_online(cpu)) + continue; if (cpu_pda[cpu].__nmi_count - counts[cpu] <= 5) { printk("CPU#%d: NMI appears to be stuck (%d)!\n", cpu, @@ -173,7 +175,7 @@ static inline void nmi_pm_init(void) { } * Original code written by Keith Owens. 
*/ -static void __pminit setup_k7_watchdog(void) +static void setup_k7_watchdog(void) { int i; unsigned int evntsel; @@ -183,8 +185,10 @@ static void __pminit setup_k7_watchdog(void) nmi_perfctr_msr = MSR_K7_PERFCTR0; for(i = 0; i < 4; ++i) { - wrmsr(MSR_K7_EVNTSEL0+i, 0, 0); - wrmsr(MSR_K7_PERFCTR0+i, 0, 0); + /* Simulator may not support it */ + if (checking_wrmsrl(MSR_K7_EVNTSEL0+i, 0UL)) + return; + wrmsrl(MSR_K7_PERFCTR0+i, 0UL); } evntsel = K7_EVNTSEL_INT @@ -200,16 +204,12 @@ static void __pminit setup_k7_watchdog(void) wrmsr(MSR_K7_EVNTSEL0, evntsel, 0); } -void __pminit setup_apic_nmi_watchdog (void) +void setup_apic_nmi_watchdog (void) { switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_AMD: if (boot_cpu_data.x86 < 6) return; - /* Simics masquerades as AMD, but does not support - performance counters */ - if (strstr(boot_cpu_data.x86_model_id, "Screwdriver")) - return; setup_k7_watchdog(); break; default: diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c index 57a527b71d90..26fc4dbbe4be 100644 --- a/arch/x86_64/kernel/process.c +++ b/arch/x86_64/kernel/process.c @@ -366,12 +366,15 @@ void __switch_to(struct task_struct *prev_p, struct task_struct *next_p) also reload when it has changed. when prev process used 64bit base always reload to avoid an information leak. */ - if (unlikely((fsindex | next->fsindex) || prev->fs)) + if (unlikely(fsindex | next->fsindex | prev->fs)) { loadsegment(fs, next->fsindex); - /* check if the user changed the selector - if yes clear 64bit base. 
*/ - if (unlikely(fsindex != prev->fsindex)) + /* check if the user used a selector != 0 + * if yes clear 64bit base, since overloaded base + * is always mapped to the Null selector + */ + if (fsindex) prev->fs = 0; + } /* when next process has a 64bit base use it */ if (next->fs) wrmsrl(MSR_FS_BASE, next->fs); @@ -380,10 +383,11 @@ void __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { unsigned gsindex; asm volatile("movl %%gs,%0" : "=g" (gsindex)); - if (unlikely((gsindex | next->gsindex) || prev->gs)) + if (unlikely(gsindex | next->gsindex | prev->gs)) { load_gs_index(next->gsindex); - if (unlikely(gsindex != prev->gsindex)) + if (gsindex) prev->gs = 0; + } if (next->gs) wrmsrl(MSR_KERNEL_GS_BASE, next->gs); prev->gsindex = gsindex; @@ -537,29 +541,23 @@ int sys_arch_prctl(int code, unsigned long addr) switch (code) { case ARCH_SET_GS: -#if 1 - /* For now. We still have one unsolved bug in long gs base context - switch handling. */ - return -EINVAL; -#else if (addr >= TASK_SIZE) return -EPERM; get_cpu(); - load_gs_index(__USER_LONGBASE); - current->thread.gsindex = __USER_LONGBASE; + load_gs_index(0); + current->thread.gsindex = 0; current->thread.gs = addr; ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr); put_cpu(); break; -#endif case ARCH_SET_FS: /* Not strictly needed for fs, but do it for symmetry with gs */ if (addr >= TASK_SIZE) return -EPERM; get_cpu(); - asm volatile("movl %0,%%fs" :: "r" (__USER_LONGBASE)); - current->thread.fsindex = __USER_LONGBASE; + asm volatile("movl %0,%%fs" :: "r" (0)); + current->thread.fsindex = 0; current->thread.fs = addr; ret = checking_wrmsrl(MSR_FS_BASE, addr); put_cpu(); diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c index 255d7a1a6952..f25c15de2bd5 100644 --- a/arch/x86_64/kernel/setup.c +++ b/arch/x86_64/kernel/setup.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -53,6 +54,8 @@ #include #include +#define Dprintk(x...) 
printk(x) + /* * Machine setup.. */ @@ -236,6 +239,8 @@ void __init setup_arch(char **cmdline_p) { int i; + Dprintk("setup_arch\n"); + ROOT_DEV = ORIG_ROOT_DEV; drive_info = DRIVE_INFO; screen_info = SCREEN_INFO; @@ -263,34 +268,34 @@ void __init setup_arch(char **cmdline_p) parse_cmdline_early(cmdline_p); -#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT) -#define PFN_DOWN(x) ((x) >> PAGE_SHIFT) -#define PFN_PHYS(x) ((x) << PAGE_SHIFT) - -#define MAXMEM (120UL * 1024 * 1024 * 1024 * 1024) /* 120TB */ -#define MAXMEM_PFN PFN_DOWN(MAXMEM) -#define MAX_NONPAE_PFN (1 << 20) - /* * partially used pages are not usable - thus * we are rounding upwards: */ - start_pfn = PFN_UP(__pa_symbol(&_end)); - - e820_end_of_ram(); + end_pfn = e820_end_of_ram(); init_memory_mapping(); +#ifdef CONFIG_DISCONTIGMEM + numa_initmem_init(0, end_pfn); +#else contig_initmem_init(); +#endif + + /* Reserve direct mapping */ + reserve_bootmem_generic(table_start << PAGE_SHIFT, + (table_end - table_start) << PAGE_SHIFT); /* reserve kernel */ - reserve_bootmem(HIGH_MEMORY, PFN_PHYS(start_pfn) - HIGH_MEMORY); + unsigned long kernel_end; + kernel_end = round_up(__pa_symbol(&_end),PAGE_SIZE); + reserve_bootmem_generic(HIGH_MEMORY, kernel_end - HIGH_MEMORY); /* * reserve physical page 0 - it's a special BIOS page on many boxes, * enabling clean reboots, SMP operation, laptop functions. */ - reserve_bootmem(0, PAGE_SIZE); + reserve_bootmem_generic(0, PAGE_SIZE); #ifdef CONFIG_SMP /* @@ -298,8 +303,12 @@ void __init setup_arch(char **cmdline_p) * FIXME: Don't need the extra page at 4K, but need to fix * trampoline before removing it. (see the GDT stuff) */ - reserve_bootmem(PAGE_SIZE, PAGE_SIZE); + reserve_bootmem_generic(PAGE_SIZE, PAGE_SIZE); + + /* Reserve SMP trampoline */ + reserve_bootmem_generic(SMP_TRAMPOLINE_BASE, PAGE_SIZE); #endif + #ifdef CONFIG_ACPI_SLEEP /* * Reserve low memory region for sleep support. 
@@ -315,7 +324,7 @@ void __init setup_arch(char **cmdline_p) #ifdef CONFIG_BLK_DEV_INITRD if (LOADER_TYPE && INITRD_START) { if (INITRD_START + INITRD_SIZE <= (end_pfn << PAGE_SHIFT)) { - reserve_bootmem(INITRD_START, INITRD_SIZE); + reserve_bootmem_generic(INITRD_START, INITRD_SIZE); initrd_start = INITRD_START ? INITRD_START + PAGE_OFFSET : 0; initrd_end = initrd_start+INITRD_SIZE; @@ -330,14 +339,6 @@ void __init setup_arch(char **cmdline_p) } #endif - /* - * NOTE: before this point _nobody_ is allowed to allocate - * any memory using the bootmem allocator. - */ - -#ifdef CONFIG_SMP - smp_alloc_memory(); /* AP processor realmode stacks in low memory*/ -#endif paging_init(); #ifdef CONFIG_ACPI_BOOT /* @@ -347,7 +348,7 @@ void __init setup_arch(char **cmdline_p) * of MADT). */ if (!acpi_disabled) - acpi_boot_init(*cmdline_p); + acpi_boot_init(); #endif #ifdef CONFIG_X86_LOCAL_APIC /* diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c index 3893f05bd86e..0e0e37edacf1 100644 --- a/arch/x86_64/kernel/setup64.c +++ b/arch/x86_64/kernel/setup64.c @@ -1,7 +1,7 @@ /* * X86-64 specific CPU setup. * Copyright (C) 1995 Linus Torvalds - * Copyright 2001, 2002 SuSE Labs / Andi Kleen. + * Copyright 2001, 2002, 2003 SuSE Labs / Andi Kleen. * See setup.c for older changelog. 
* $Id: setup64.c,v 1.12 2002/03/21 10:09:17 ak Exp $ */ @@ -90,6 +90,17 @@ void pda_init(int cpu) pml4_t *level4; struct x8664_pda *pda = &cpu_pda[cpu]; + /* Set up data that may be needed in __get_free_pages early */ + asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); + wrmsrl(MSR_GS_BASE, cpu_pda + cpu); + + pda->me = pda; + pda->cpunumber = cpu; + pda->irqcount = -1; + pda->cpudata_offset = 0; + pda->kernelstack = + (unsigned long)current_thread_info() - PDA_STACKOFFSET + THREAD_SIZE; + if (cpu == 0) { /* others are initialized in smpboot.c */ pda->pcurrent = &init_task; @@ -112,18 +123,8 @@ void pda_init(int cpu) asm volatile("movq %0,%%cr3" :: "r" (__pa(level4))); pda->irqstackptr += IRQSTACKSIZE-64; - pda->cpunumber = cpu; - pda->irqcount = -1; - pda->kernelstack = - (unsigned long)stack_thread_info() - PDA_STACKOFFSET + THREAD_SIZE; - pda->me = pda; - pda->cpudata_offset = 0; - pda->active_mm = &init_mm; pda->mmu_state = 0; - - asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); - wrmsrl(MSR_GS_BASE, cpu_pda + cpu); } #define EXCEPTION_STK_ORDER 0 /* >= N_EXCEPTION_STACKS*EXCEPTION_STKSZ */ @@ -150,10 +151,10 @@ void __init cpu_init (void) /* CPU 0 is initialised in head64.c */ if (cpu != 0) { + pda_init(cpu); estacks = (char *)__get_free_pages(GFP_ATOMIC, 0); if (!estacks) panic("Can't allocate exception stacks for CPU %d\n",cpu); - pda_init(cpu); } else estacks = boot_exception_stacks; diff --git a/arch/x86_64/kernel/signal.c b/arch/x86_64/kernel/signal.c index 1cc967923691..ebff00989f58 100644 --- a/arch/x86_64/kernel/signal.c +++ b/arch/x86_64/kernel/signal.c @@ -353,7 +353,7 @@ static void handle_signal(unsigned long sig, siginfo_t *info, sigset_t *oldset, struct pt_regs * regs) { - struct k_sigaction *ka = &current->sig->action[sig-1]; + struct k_sigaction *ka = &current->sighand->action[sig-1]; #if DEBUG_SIG printk("handle_signal pid:%d sig:%lu rip:%lx rsp:%lx regs=%p\n", current->pid, sig, diff --git a/arch/x86_64/kernel/smp.c
b/arch/x86_64/kernel/smp.c index 816c40f9308f..8a4522f2d126 100644 --- a/arch/x86_64/kernel/smp.c +++ b/arch/x86_64/kernel/smp.c @@ -3,6 +3,7 @@ * * (c) 1995 Alan Cox, Building #3 * (c) 1998-99, 2000 Ingo Molnar + * (c) 2002,2003 Andi Kleen, SuSE Labs. * * This code is released under the GNU General Public License version 2 or * later. @@ -491,3 +492,24 @@ asmlinkage void smp_call_function_interrupt(void) } } + +/* Slow. Should be only used for debugging. */ +int slow_smp_processor_id(void) +{ + int stack_location; + unsigned long sp = (unsigned long)&stack_location; + int cpu; + unsigned long mask; + + for_each_cpu(cpu, mask) { + if (sp >= (u64)cpu_pda[cpu].irqstackptr - IRQSTACKSIZE && + sp <= (u64)cpu_pda[cpu].irqstackptr) + return cpu; + + unsigned long estack = init_tss[cpu].ist[0] - EXCEPTION_STKSZ; + if (sp >= estack && sp <= estack+(1<<(PAGE_SHIFT+EXCEPTION_STK_ORDER))) + return cpu; + } + + return stack_smp_processor_id(); +} diff --git a/arch/x86_64/kernel/smpboot.c b/arch/x86_64/kernel/smpboot.c index 84c8ee9e6427..45b9999581fa 100644 --- a/arch/x86_64/kernel/smpboot.c +++ b/arch/x86_64/kernel/smpboot.c @@ -51,13 +51,10 @@ #include #include -/* Bitmask of currently online CPUs */ -unsigned long cpu_online_map; +extern int disable_apic; -/* which CPU (physical APIC ID) maps to which logical CPU number */ -volatile int x86_apicid_to_cpu[NR_CPUS]; -/* which logical CPU number maps to which CPU (physical APIC ID) */ -volatile int x86_cpu_to_apicid[NR_CPUS]; +/* Bitmask of currently online CPUs */ +unsigned long cpu_online_map = 1; static volatile unsigned long cpu_callin_map; volatile unsigned long cpu_callout_map; @@ -75,7 +72,6 @@ int smp_threads_ready; extern unsigned char trampoline_data []; extern unsigned char trampoline_end []; -static unsigned char *trampoline_base; /* * Currently trivial. 
Write the real->protected mode @@ -85,25 +81,11 @@ static unsigned char *trampoline_base; static unsigned long __init setup_trampoline(void) { + void *tramp = __va(SMP_TRAMPOLINE_BASE); extern volatile __u32 tramp_gdt_ptr; tramp_gdt_ptr = __pa_symbol(&cpu_gdt_table); - memcpy(trampoline_base, trampoline_data, trampoline_end - trampoline_data); - return virt_to_phys(trampoline_base); -} - -/* - * We are called very early to get the low memory for the - * SMP bootup trampoline page. - */ -void __init smp_alloc_memory(void) -{ - trampoline_base = (void *) alloc_bootmem_low_pages(PAGE_SIZE); - /* - * Has to be in very low memory so we can execute - * real-mode AP code. - */ - if (__pa(trampoline_base) >= 0x9F000) - BUG(); + memcpy(tramp, trampoline_data, trampoline_end - trampoline_data); + return virt_to_phys(tramp); } /* @@ -174,6 +156,7 @@ static void __init synchronize_tsc_bp (void) */ atomic_inc(&tsc_count_start); + sync_core(); rdtscll(tsc_values[smp_processor_id()]); /* * We clear the TSC in the last loop: @@ -245,6 +228,7 @@ static void __init synchronize_tsc_ap (void) atomic_inc(&tsc_count_start); while (atomic_read(&tsc_count_start) != num_booting_cpus()) mb(); + sync_core(); rdtscll(tsc_values[smp_processor_id()]); if (i == NR_LOOPS-1) write_tsc(0, 0); @@ -369,6 +353,9 @@ int __init start_secondary(void *unused) cpu_init(); smp_callin(); + /* otherwise gcc will move up the smp_processor_id before the cpu_init */ + barrier(); + Dprintk("cpu %d: waiting for commence\n", smp_processor_id()); while (!test_bit(smp_processor_id(), &smp_commenced_mask)) rep_nop(); @@ -620,8 +607,6 @@ static void __init do_boot_cpu (int apicid) */ init_idle(idle,cpu); - x86_cpu_to_apicid[cpu] = apicid; - x86_apicid_to_cpu[apicid] = cpu; idle->thread.rip = (unsigned long)start_secondary; // idle->thread.rsp = (unsigned long)idle->thread_info + THREAD_SIZE - 512; @@ -713,8 +698,6 @@ static void __init do_boot_cpu (int apicid) } } if (boot_error) { - x86_cpu_to_apicid[cpu] = -1; - 
x86_apicid_to_cpu[apicid] = -1; clear_bit(cpu, &cpu_callout_map); /* was set here (do_boot_cpu()) */ clear_bit(cpu, &cpu_initialized); /* was set by cpu_init() */ cpucount--; @@ -776,14 +759,6 @@ static void __init smp_boot_cpus(unsigned int max_cpus) { int apicid, cpu; - /* - * Initialize the logical to physical CPU number mapping - */ - - for (apicid = 0; apicid < NR_CPUS; apicid++) { - x86_apicid_to_cpu[apicid] = -1; - } - /* * Setup boot CPU information */ @@ -791,8 +766,6 @@ static void __init smp_boot_cpus(unsigned int max_cpus) printk("CPU%d: ", 0); print_cpu_info(&cpu_data[0]); - x86_apicid_to_cpu[boot_cpu_id] = 0; - x86_cpu_to_apicid[0] = boot_cpu_id; current_thread_info()->cpu = 0; smp_tune_scheduling(); @@ -837,6 +810,7 @@ static void __init smp_boot_cpus(unsigned int max_cpus) io_apic_irqs = 0; cpu_online_map = phys_cpu_present_map = 1; phys_cpu_present_map = 1; + disable_apic = 1; return; } @@ -851,6 +825,7 @@ static void __init smp_boot_cpus(unsigned int max_cpus) io_apic_irqs = 0; cpu_online_map = phys_cpu_present_map = 1; phys_cpu_present_map = 1; + disable_apic = 1; return; } @@ -878,13 +853,6 @@ static void __init smp_boot_cpus(unsigned int max_cpus) continue; do_boot_cpu(apicid); - - /* - * Make sure we unmap all failed CPUs - */ - if ((x86_apicid_to_cpu[apicid] == -1) && - (phys_cpu_present_map & (1 << apicid))) - printk("phys CPU #%d not responding - cannot use it.\n",apicid); } /* diff --git a/arch/x86_64/kernel/sys_x86_64.c b/arch/x86_64/kernel/sys_x86_64.c index 72a22946be65..b33cc16a8294 100644 --- a/arch/x86_64/kernel/sys_x86_64.c +++ b/arch/x86_64/kernel/sys_x86_64.c @@ -55,7 +55,6 @@ long sys_mmap(unsigned long addr, unsigned long len, unsigned long prot, unsigne if (!file) goto out; } - down_write(&current->mm->mmap_sem); error = do_mmap_pgoff(file, addr, len, prot, flags, off >> PAGE_SHIFT); up_write(&current->mm->mmap_sem); diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c index 32f8e0b2bf18..8e5604b92a2e 100644 ---
a/arch/x86_64/kernel/time.c +++ b/arch/x86_64/kernel/time.c @@ -9,6 +9,7 @@ * Copyright (c) 1996 Ingo Molnar * Copyright (c) 1998 Andrea Arcangeli * Copyright (c) 2002 Vojtech Pavlik + * Copyright (c) 2003 Andi Kleen * */ @@ -25,9 +26,14 @@ #include #include #include +#ifdef CONFIG_X86_LOCAL_APIC +#include +#endif u64 jiffies_64; +extern int using_apic_timer; + spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED; extern int using_apic_timer; @@ -56,12 +62,10 @@ struct timezone __sys_tz __section_sys_tz; * together by xtime_lock. */ -static spinlock_t time_offset_lock = SPIN_LOCK_UNLOCKED; -static unsigned long timeoffset = 0; - inline unsigned int do_gettimeoffset(void) { unsigned long t; + sync_core(); rdtscll(t); return (t - hpet.last_tsc) * (1000000L / HZ) / hpet.ticks + hpet.offset; } @@ -74,10 +78,9 @@ inline unsigned int do_gettimeoffset(void) void do_gettimeofday(struct timeval *tv) { - unsigned long flags, t, seq; + unsigned long seq, t; unsigned int sec, usec; - spin_lock_irqsave(&time_offset_lock, flags); do { seq = read_seqbegin(&xtime_lock); @@ -85,11 +88,9 @@ void do_gettimeofday(struct timeval *tv) usec = xtime.tv_nsec / 1000; t = (jiffies - wall_jiffies) * (1000000L / HZ) + do_gettimeoffset(); - if (t > timeoffset) timeoffset = t; - usec += timeoffset; + usec += t; } while (read_seqretry(&xtime_lock, seq)); - spin_unlock_irqrestore(&time_offset_lock, flags); tv->tv_sec = sec + usec / 1000000; tv->tv_usec = usec % 1000000; @@ -104,7 +105,6 @@ void do_gettimeofday(struct timeval *tv) void do_settimeofday(struct timeval *tv) { write_seqlock_irq(&xtime_lock); - vxtime_lock(); tv->tv_usec -= do_gettimeoffset() + (jiffies - wall_jiffies) * tick_usec; @@ -116,7 +116,6 @@ void do_settimeofday(struct timeval *tv) xtime.tv_sec = tv->tv_sec; xtime.tv_nsec = (tv->tv_usec * 1000); - vxtime_unlock(); time_adjust = 0; /* stop active adjtime() */ time_status |= STA_UNSYNC; @@ -207,11 +206,11 @@ static void timer_interrupt(int irq, void *dev_id, struct pt_regs *regs) */ 
write_seqlock(&xtime_lock); - vxtime_lock(); { unsigned long t; + sync_core(); rdtscll(t); hpet.offset = (t - hpet.last_tsc) * (1000000L / HZ) / hpet.ticks + hpet.offset - 1000000L / HZ; if (hpet.offset >= 1000000L / HZ) @@ -219,7 +218,6 @@ static void timer_interrupt(int irq, void *dev_id, struct pt_regs *regs) hpet.ticks = min_t(long, max_t(long, (t - hpet.last_tsc) * (1000000L / HZ) / (1000000L / HZ - hpet.offset), cpu_khz * 1000/HZ * 15 / 16), cpu_khz * 1000/HZ * 16 / 15); hpet.last_tsc = t; - timeoffset = 0; } /* @@ -255,7 +253,6 @@ static void timer_interrupt(int irq, void *dev_id, struct pt_regs *regs) rtc_update = xtime.tv_sec + 660; } - vxtime_unlock(); write_sequnlock(&xtime_lock); } @@ -348,8 +345,9 @@ static unsigned int __init pit_calibrate_tsc(void) outb((1193182 / (1000 / 50)) & 0xff, 0x42); outb((1193182 / (1000 / 50)) >> 8, 0x42); rdtscll(start); - + sync_core(); while ((inb(0x61) & 0x20) == 0); + sync_core(); rdtscll(end); @@ -382,12 +380,12 @@ void __init time_init(void) pit_init(); printk(KERN_INFO "time.c: Using 1.1931816 MHz PIT timer.\n"); - setup_irq(0, &irq0); cpu_khz = pit_calibrate_tsc(); printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000); hpet.ticks = cpu_khz * (1000 / HZ); rdtscll(hpet.last_tsc); + setup_irq(0, &irq0); } __setup("report_lost_ticks", time_setup); diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c index c35d487e37a4..a45bca9515be 100644 --- a/arch/x86_64/kernel/traps.c +++ b/arch/x86_64/kernel/traps.c @@ -77,6 +77,12 @@ extern int exception_trace; struct notifier_block *die_chain; +static inline void conditional_sti(struct pt_regs *regs) +{ + if (regs->eflags & X86_EFLAGS_IF) + local_irq_enable(); +} + static int kstack_depth_to_print = 10; #ifdef CONFIG_KALLSYMS @@ -128,8 +134,7 @@ void show_trace(unsigned long *stack) { unsigned long addr; unsigned long *irqstack, *irqstack_end, *estack_end; - /* FIXME: should read the cpuid from the APIC; to still work with 
bogus %gs */ - const int cpu = smp_processor_id(); + const int cpu = safe_smp_processor_id(); int i; printk("\nCall Trace:"); @@ -210,7 +215,7 @@ void show_stack(unsigned long * rsp) { unsigned long *stack; int i; - const int cpu = smp_processor_id(); + const int cpu = safe_smp_processor_id(); unsigned long *irqstack_end = (unsigned long *) (cpu_pda[cpu].irqstackptr); unsigned long *irqstack = (unsigned long *) (cpu_pda[cpu].irqstackptr - IRQSTACKSIZE); @@ -252,12 +257,7 @@ void show_registers(struct pt_regs *regs) int i; int in_kernel = (regs->cs & 3) == 0; unsigned long rsp; -#ifdef CONFIG_SMP - /* For SMP should get the APIC id here, just to protect against corrupted GS */ - const int cpu = smp_processor_id(); -#else - const int cpu = 0; -#endif + const int cpu = safe_smp_processor_id(); struct task_struct *cur = cpu_pda[cpu].pcurrent; rsp = regs->rsp; @@ -330,7 +330,7 @@ void die(const char * str, struct pt_regs * regs, long err) bust_spinlocks(1); handle_BUG(regs); printk("%s: %04lx\n", str, err & 0xffff); - cpu = smp_processor_id(); + cpu = safe_smp_processor_id(); /* racy, but better than risking deadlock. 
*/ local_irq_disable(); if (!spin_trylock(&die_lock)) { @@ -365,10 +365,12 @@ static inline unsigned long get_cr2(void) static void do_trap(int trapnr, int signr, char *str, struct pt_regs * regs, long error_code, siginfo_t *info) { + conditional_sti(regs); + #ifdef CONFIG_CHECKING { unsigned long gs; - struct x8664_pda *pda = cpu_pda + stack_smp_processor_id(); + struct x8664_pda *pda = cpu_pda + safe_smp_processor_id(); rdmsrl(MSR_GS_BASE, gs); if (gs != (unsigned long)pda) { wrmsrl(MSR_GS_BASE, pda); @@ -454,10 +456,12 @@ extern void dump_pagetable(unsigned long); asmlinkage void do_general_protection(struct pt_regs * regs, long error_code) { + conditional_sti(regs); + #ifdef CONFIG_CHECKING { unsigned long gs; - struct x8664_pda *pda = cpu_pda + hard_smp_processor_id(); + struct x8664_pda *pda = cpu_pda + safe_smp_processor_id(); rdmsrl(MSR_GS_BASE, gs); if (gs != (unsigned long)pda) { wrmsrl(MSR_GS_BASE, pda); @@ -565,7 +569,7 @@ asmlinkage void do_debug(struct pt_regs * regs, long error_code) #ifdef CONFIG_CHECKING { unsigned long gs; - struct x8664_pda *pda = cpu_pda + stack_smp_processor_id(); + struct x8664_pda *pda = cpu_pda + safe_smp_processor_id(); rdmsrl(MSR_GS_BASE, gs); if (gs != (unsigned long)pda) { wrmsrl(MSR_GS_BASE, pda); @@ -576,6 +580,8 @@ asmlinkage void do_debug(struct pt_regs * regs, long error_code) asm("movq %%db6,%0" : "=r" (condition)); + conditional_sti(regs); + if (notify_die(DIE_DEBUG, "debug", regs, error_code) == NOTIFY_BAD) return; @@ -636,7 +642,6 @@ void math_error(void *rip) struct task_struct * task; siginfo_t info; unsigned short cwd, swd; - /* * Save the info for the exception handler and clear the error. 
*/ @@ -688,6 +693,7 @@ void math_error(void *rip) asmlinkage void do_coprocessor_error(struct pt_regs * regs, long error_code) { + conditional_sti(regs); math_error((void *)regs->rip); } @@ -747,6 +753,7 @@ static inline void simd_math_error(void *rip) asmlinkage void do_simd_coprocessor_error(struct pt_regs * regs, long error_code) { + conditional_sti(regs); simd_math_error((void *)regs->rip); } diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c index efe34103b509..27a2744b1dfe 100644 --- a/arch/x86_64/kernel/vsyscall.c +++ b/arch/x86_64/kernel/vsyscall.c @@ -2,6 +2,7 @@ * linux/arch/x86_64/kernel/vsyscall.c * * Copyright (C) 2001 Andrea Arcangeli SuSE + * Copyright 2003 Andi Kleen, SuSE Labs. * * Thanks to hpa@transmeta.com for some useful hint. * Special thanks to Ingo Molnar for his early experience with @@ -12,7 +13,8 @@ * vsyscalls. One vsyscall can reserve more than 1 slot to avoid * jumping out of line if necessary. * - * $Id: vsyscall.c,v 1.9 2002/03/21 13:42:58 ak Exp $ + * Note: the concept clashes with user mode linux. If you use UML just + * set the kernel.vsyscall sysctl to 0. */ /* @@ -29,6 +31,9 @@ * broken programs will segfault and there's no security risk until we choose to * fix it. * + * Add HPET support (port from 2.4). Still needed? + * Nop out vsyscall syscall to avoid anchor for buffer overflows when sysctl off. + * * These are not urgent things that we need to address only before shipping the first * production binary kernels. 
*/ @@ -37,6 +42,7 @@ #include #include #include +#include #include #include @@ -44,19 +50,13 @@ #include #include - #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) -#define NO_VSYSCALL 1 +int __sysctl_vsyscall __section_sysctl_vsyscall = 1; +seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; -#ifdef NO_VSYSCALL #include -static int errno __section_vxtime_sequence; - -static inline _syscall2(int,gettimeofday,struct timeval *,tv,struct timezone *,tz) - -#else static inline void timeval_normalize(struct timeval * tv) { time_t __sec; @@ -69,63 +69,60 @@ static inline void timeval_normalize(struct timeval * tv) } } -long __vxtime_sequence[2] __section_vxtime_sequence; - - static inline void do_vgettimeofday(struct timeval * tv) { long sequence, t; unsigned long sec, usec; do { - sequence = __vxtime_sequence[1]; - rmb(); + sequence = read_seqbegin(&__xtime_lock); + sync_core(); rdtscll(t); sec = __xtime.tv_sec; - usec = __xtime.tv_usec + + usec = (__xtime.tv_nsec * 1000) + (__jiffies - __wall_jiffies) * (1000000 / HZ) + (t - __hpet.last_tsc) * (1000000 / HZ) / __hpet.ticks + __hpet.offset; - rmb(); - } while (sequence != __vxtime_sequence[0]); + } while (read_seqretry(&__xtime_lock, sequence)); tv->tv_sec = sec + usec / 1000000; tv->tv_usec = usec % 1000000; } +/* RED-PEN may want to readd seq locking, but then the variable should be write-once. 
*/ static inline void do_get_tz(struct timezone * tz) { - long sequence; - - do { - sequence = __vxtime_sequence[1]; - rmb(); - *tz = __sys_tz; +} - rmb(); - } while (sequence != __vxtime_sequence[0]); +static inline int gettimeofday(struct timeval *tv, struct timezone *tz) +{ + int ret; + asm volatile("syscall" + : "=a" (ret) + : "0" (__NR_gettimeofday),"D" (tv),"S" (tz) : __syscall_clobber ); + return ret; } -#endif static int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz) { -#ifdef NO_VSYSCALL + if (unlikely(!__sysctl_vsyscall)) return gettimeofday(tv,tz); -#else if (tv) do_vgettimeofday(tv); if (tz) do_get_tz(tz); return 0; -#endif } static time_t __vsyscall(1) vtime(time_t * t) { struct timeval tv; - vgettimeofday(&tv,NULL); + if (unlikely(!__sysctl_vsyscall)) + gettimeofday(&tv, NULL); + else + do_vgettimeofday(&tv); if (t) *t = tv.tv_sec; return tv.tv_sec; @@ -139,12 +136,13 @@ static long __vsyscall(2) venosys_0(void) static long __vsyscall(3) venosys_1(void) { return -ENOSYS; + } static void __init map_vsyscall(void) { extern char __vsyscall_0; - unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - __START_KERNEL_map; + unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0); __set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL); } diff --git a/arch/x86_64/kernel/wakeup.S b/arch/x86_64/kernel/wakeup.S new file mode 100644 index 000000000000..ba0b4f4e4d4b --- /dev/null +++ b/arch/x86_64/kernel/wakeup.S @@ -0,0 +1,306 @@ +/* + * ACPI S3 entry/exit handling. + * + * Notes: + * Relies on kernel being loaded below 4GB. + * Needs restore_low_mappings called before. + * + * Copyright 2003 by Andi Kleen, SuSE Labs. + * + * Long mode entry losely based on example code in chapter 14 of the x86-64 system + * programmer's manual. + * + * Notebook: + + FIXME need to interface with suspend.c properly. do_magic. check i386. 
rename to suspend64.S + + Need to fix vgacon,mtrr,bluesmoke to do resume + + Interrupts should be off until the io-apic code has reinited the APIC. + Need support for that in the pm frame work or a special hack? + + SMP support is non existent. Need to somehow restart the other CPUs again. + If CPU hotplug was working it could be used. Save/Restore needs to run on the same CPU. + + Should check magic like i386 code + + suspend code copies something. check what it is. + */ + +#include + +#include +#include +#include + +#define O(x) (x-acpi_wakeup) + + .text + .code16 +ENTRY(acpi_wakeup) + /* 16bit real mode entered from ACPI BIOS */ + /* The machine is just through BIOS setup after power down and everything set up + by Linux needs to be restored. */ + /* The code here needs to be position independent or manually relocated, + because it is copied to a <1MB page for real mode execution */ + + /* A20 enabled (according to ACPI spec) */ + /* cs = acpi_wakeup >> 4 ; eip = acpi_wakeup & 0xF */ + + movw %cs,%ax + movw %ax,%ds /* make %ds point to acpi_wakeup */ + movw %ax,%ss + movw $O(wakeup_stack),%sp /* setup stack */ + + pushl $0 + popfl /* clear EFLAGS */ + + lgdt %ds:O(pGDT) /* load kernel GDT */ + + movl $0x1,%eax /* enable protected mode */ + movl %eax,%cr0 + + movl %ds:O(wakeup_page_table),%edi + ljmpl $__KERNEL16_CS,$0 /* -> s3_prot16 (filled in earlier by caller) */ + + /* patched by s3_restore_state below */ +pGDT: + .short 0 + .quad 0 + + .align 4 + .globl wakeup_page_table +wakeup_page_table: + .long 0 + + .align 8 +wakeup_stack: + .fill 128,1,0 + .globl acpi_wakeup_end +acpi_wakeup_end: + /* end of real mode trampoline */ + + /* pointed to by __KERNEL16_CS:0 */ + .code16 +ENTRY(s3_prot16) + /* Now in 16bit protected mode, still no paging, stack/data segments invalid */ + + /* Prepare everything for 64bit paging, but still keep it turned off */ + movl %cr4,%eax + bts $5,%eax /* set PAE bit */ + movl %eax,%cr4 + + movl %edi,%cr3 /* load kernel page table */ + 
+ movl $0x80000001,%eax + cpuid /* no execute supported ? */ + movl %edx,%esi + + movl $MSR_EFER,%ecx + rdmsr + bts $8,%eax /* long mode */ + bt $20,%esi /* NX supported ? */ + jnc 1f + bt $_EFER_NX,%eax +1: + wrmsr /* set temporary efer - real one is restored a bit later */ + + movl %cr0,%eax + bts $31,%eax /* paging */ + movl %eax,%cr0 + + /* running in identity mapping now */ + + /* go to 64bit code segment */ + ljmpl $__KERNEL_CS,$s3_restore_state-__START_KERNEL_map + + .code64 + .macro SAVEMSR msr,target + movl $\msr,%ecx + rdmsr + shlq $32,%rdx + orq %rax,%rdx + movq %rdx,\target(%rip) + .endm + + .macro RESTMSR msr,src + movl $\msr,%ecx + movq \src(%rip),%rax + movq %rax,%rdx + shrq $32,%rdx + wrmsr + .endm + + .macro SAVECTL reg + movq %\reg,%rax + movq %rax,saved_\reg(%rip) + .endm + + .macro RESTCTL reg + movq saved_\reg(%rip),%rax + movq %rax,%\reg + .endm + + /* Running in identity mapping, long mode */ +s3_restore_state_low: + movq $s3_restore_state,%rax + jmpq *%rax + + /* Running in real kernel mapping now */ +s3_restore_state: + xorl %eax,%eax + movl %eax,%ds + movq saved_rsp(%rip),%rsp + movw saved_ss(%rip),%ss + movw saved_fs(%rip),%fs + movw saved_gs(%rip),%gs + movw saved_es(%rip),%es + movw saved_ds(%rip),%ds + + lidt saved_idt + ltr saved_tr + lldt saved_ldt + /* gdt is already loaded */ + + RESTCTL cr0 + RESTCTL cr4 + /* cr3 is already loaded */ + + RESTMSR MSR_EFER,saved_efer + RESTMSR MSR_LSTAR,saved_lstar + RESTMSR MSR_CSTAR,saved_cstar + RESTMSR MSR_FS_BASE,saved_fs_base + RESTMSR MSR_GS_BASE,saved_gs_base + RESTMSR MSR_KERNEL_GS_BASE,saved_kernel_gs_base + RESTMSR MSR_SYSCALL_MASK,saved_syscall_mask + + fxrstor fpustate(%rip) + + RESTCTL dr0 + RESTCTL dr1 + RESTCTL dr2 + RESTCTL dr3 + RESTCTL dr6 + RESTCTL dr7 + + movq saved_rflags(%rip),%rax + pushq %rax + popfq + + movq saved_rbp(%rip),%rbp + movq saved_rbx(%rip),%rbx + movq saved_r12(%rip),%r12 + movq saved_r13(%rip),%r13 + movq saved_r14(%rip),%r14 + movq saved_r15(%rip),%r15 + ret + 
+ENTRY(acpi_prepare_wakeup) + sgdt saved_gdt + + /* copy gdt descr and page table to low level wakeup code so that it can + reload them early. */ + movq acpi_wakeup_address(%rip),%rax + movw saved_gdt+8(%rip),%cx + movw %cx,O(pGDT)+8(%rax) + movq saved_gdt(%rip),%rcx + movq %rcx,O(pGDT)(%rax) + + movq %cr3,%rdi + movl %edi,O(wakeup_page_table)(%rax) + ret + + /* Save CPU state. */ + /* Everything saved here needs to be restored above. */ +ENTRY(do_suspend_lowlevel) + testl %edi,%edi + jnz s3_restore_state + + SAVECTL cr0 + SAVECTL cr4 + SAVECTL cr3 + + str saved_tr + sidt saved_idt + sgdt saved_gdt + sldt saved_ldt + + SAVEMSR MSR_EFER,saved_efer + SAVEMSR MSR_LSTAR,saved_lstar + SAVEMSR MSR_CSTAR,saved_cstar + SAVEMSR MSR_FS_BASE,saved_fs_base + SAVEMSR MSR_GS_BASE,saved_gs_base + SAVEMSR MSR_KERNEL_GS_BASE,saved_kernel_gs_base + SAVEMSR MSR_SYSCALL_MASK,saved_syscall_mask + + movw %ds,saved_ds(%rip) + movw %es,saved_es(%rip) + movw %fs,saved_fs(%rip) + movw %gs,saved_gs(%rip) + movw %ss,saved_ss(%rip) + movq %rsp,saved_rsp(%rip) + + pushfq + popq %rax + movq %rax,saved_rflags(%rip) + + SAVECTL dr0 + SAVECTL dr1 + SAVECTL dr2 + SAVECTL dr3 + SAVECTL dr6 + SAVECTL dr7 + + fxsave fpustate(%rip) + + /* finally save callee saved registers */ + movq %rbp,saved_rbp(%rip) + movq %rbx,saved_rbx(%rip) + movq %r12,saved_r12(%rip) + movq %r13,saved_r13(%rip) + movq %r14,saved_r14(%rip) + movq %r15,saved_r15(%rip) + movq $3,%rdi + call acpi_enter_sleep_state + ret /* should not happen */ + + .data + .align 8 +saved_efer: .quad 0 +saved_lstar: .quad 0 +saved_cstar: .quad 0 +saved_cr4: .quad 0 +saved_cr3: .quad 0 +saved_cr0: .quad 0 +saved_rbp: .quad 0 +saved_rbx: .quad 0 +saved_rsp: .quad 0 +saved_r12: .quad 0 +saved_r13: .quad 0 +saved_r14: .quad 0 +saved_r15: .quad 0 +saved_rflags: .quad 0 +saved_gs_base: .quad 0 +saved_fs_base: .quad 0 +saved_kernel_gs_base: .quad 0 +saved_syscall_mask: .quad 0 +saved_dr0: .quad 0 +saved_dr1: .quad 0 +saved_dr2: .quad 0 +saved_dr3: .quad 0 
+saved_dr6: .quad 0 +saved_dr7: .quad 0 +saved_ds: .short 0 +saved_fs: .short 0 +saved_gs: .short 0 +saved_es: .short 0 +saved_ss: .short 0 +saved_idt: .short 0 + .quad 0 +saved_ldt: .short 0 +saved_gdt: .short 0 + .quad 0 +saved_tr: .short 0 + + .align 16 +fpustate: .fill 512,1,0 diff --git a/arch/x86_64/mm/Makefile b/arch/x86_64/mm/Makefile index fb244bc0b6ba..e991f691d64b 100644 --- a/arch/x86_64/mm/Makefile +++ b/arch/x86_64/mm/Makefile @@ -1,6 +1,8 @@ # -# Makefile for the linux i386-specific parts of the memory manager. +# Makefile for the linux x86_64-specific parts of the memory manager. # obj-y := init.o fault.o ioremap.o extable.o pageattr.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o +obj-$(CONFIG_DISCONTIGMEM) += numa.o +obj-$(CONFIG_K8_NUMA) += k8topology.o diff --git a/arch/x86_64/mm/fault.c b/arch/x86_64/mm/fault.c index c662d6726339..d588c9d07812 100644 --- a/arch/x86_64/mm/fault.c +++ b/arch/x86_64/mm/fault.c @@ -121,7 +121,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code) /* get the address */ __asm__("movq %%cr2,%0":"=r" (address)); - if (page_fault_trace) + if (likely(regs->eflags & X86_EFLAGS_IF)) + local_irq_enable(); + + if (unlikely(page_fault_trace)) printk("pagefault rip:%lx rsp:%lx cs:%lu ss:%lu address %lx error %lx\n", regs->rip,regs->rsp,regs->cs,regs->ss,address,error_code); @@ -139,7 +142,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code) * If we're in an interrupt or have no user * context, we must not take the fault.. 
*/ - if (in_atomic() || !mm) + if (unlikely(in_atomic() || !mm)) goto no_context; again: @@ -148,7 +151,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code) vma = find_vma(mm, address); if (!vma) goto bad_area; - if (vma->vm_start <= address) + if (likely(vma->vm_start <= address)) goto good_area; if (!(vma->vm_flags & VM_GROWSDOWN)) goto bad_area; @@ -222,7 +225,8 @@ bad_area_nosemaphore: return; } #endif - printk("%s[%d] segfault at rip:%lx rsp:%lx adr:%lx err:%lx\n", + printk(KERN_INFO + "%s[%d] segfault at rip:%lx rsp:%lx adr:%lx err:%lx\n", tsk->comm, tsk->pid, regs->rip, regs->rsp, address, error_code); diff --git a/arch/x86_64/mm/hugetlbpage.c b/arch/x86_64/mm/hugetlbpage.c index e1c31afb196e..adf3346bc82b 100644 --- a/arch/x86_64/mm/hugetlbpage.c +++ b/arch/x86_64/mm/hugetlbpage.c @@ -162,6 +162,37 @@ back1: return i; } +struct page * +follow_huge_addr(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, int write) +{ + return NULL; +} + +struct vm_area_struct *hugepage_vma(struct mm_struct *mm, unsigned long addr) +{ + return NULL; +} + +int pmd_huge(pmd_t pmd) +{ + return !!(pmd_val(pmd) & _PAGE_PSE); +} + +struct page * +follow_huge_pmd(struct mm_struct *mm, unsigned long address, + pmd_t *pmd, int write) +{ + struct page *page; + + page = pte_page(*(pte_t *)pmd); + if (page) { + page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT); + get_page(page); + } + return page; +} + void free_huge_page(struct page *page) { BUG_ON(page_count(page)); @@ -193,8 +224,6 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsig BUG_ON(start & (HPAGE_SIZE - 1)); BUG_ON(end & (HPAGE_SIZE - 1)); - spin_lock(&htlbpage_lock); - spin_unlock(&htlbpage_lock); for (address = start; address < end; address += HPAGE_SIZE) { pte = huge_pte_offset(mm, address); page = pte_page(*pte); @@ -216,7 +245,7 @@ void zap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigne int hugetlb_prefault(struct 
address_space *mapping, struct vm_area_struct *vma) { struct mm_struct *mm = current->mm; - struct inode = mapping->host; + struct inode *inode = mapping->host; unsigned long addr; int ret = 0; diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c index 44c9c6a6e8f9..5ee4ffecb5ba 100644 --- a/arch/x86_64/mm/init.c +++ b/arch/x86_64/mm/init.c @@ -3,7 +3,7 @@ * * Copyright (C) 1995 Linus Torvalds * Copyright (C) 2000 Pavel Machek - * Copyright (C) 2002 Andi Kleen + * Copyright (C) 2002,2003 Andi Kleen */ #include @@ -37,8 +37,9 @@ #include #include #include +#include -unsigned long start_pfn, end_pfn; +#define Dprintk(x...) printk(x) struct mmu_gather mmu_gathers[NR_CPUS]; @@ -90,9 +91,11 @@ static void *spp_getpage(void) if (after_bootmem) ptr = (void *) get_zeroed_page(GFP_ATOMIC); else - ptr = alloc_bootmem_low(PAGE_SIZE); - if (!ptr) + ptr = alloc_bootmem_pages(PAGE_SIZE); + if (!ptr || ((unsigned long)ptr & ~PAGE_MASK)) panic("set_pte_phys: cannot allocate page data %s\n", after_bootmem?"after bootmem":""); + + Dprintk("spp_getpage %p\n", ptr); return ptr; } @@ -104,6 +107,8 @@ static void set_pte_phys(unsigned long vaddr, pmd_t *pmd; pte_t *pte; + Dprintk("set_pte_phys %lx to %lx\n", vaddr, phys); + level4 = pml4_offset_k(vaddr); if (pml4_none(*level4)) { printk("PML4 FIXMAP MISSING, it should be setup in head.S!\n"); @@ -114,7 +119,7 @@ static void set_pte_phys(unsigned long vaddr, pmd = (pmd_t *) spp_getpage(); set_pgd(pgd, __pgd(__pa(pmd) | _KERNPG_TABLE | _PAGE_USER)); if (pmd != pmd_offset(pgd, 0)) { - printk("PAGETABLE BUG #01!\n"); + printk("PAGETABLE BUG #01! 
%p <-> %p\n", pmd, pmd_offset(pgd,0)); return; } } @@ -128,6 +133,7 @@ static void set_pte_phys(unsigned long vaddr, } } pte = pte_offset_kernel(pmd, vaddr); + /* CHECKME: */ if (pte_val(*pte)) pte_ERROR(*pte); set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, prot)); @@ -151,7 +157,8 @@ void __set_fixmap (enum fixed_addresses idx, unsigned long phys, pgprot_t prot) set_pte_phys(address, phys, prot); } -extern unsigned long start_pfn, end_pfn; +unsigned long __initdata table_start, table_end; + extern pmd_t temp_boot_pmds[]; static struct temp_map { @@ -168,21 +175,21 @@ static __init void *alloc_low_page(int *index, unsigned long *phys) { struct temp_map *ti; int i; - unsigned long pfn = start_pfn++, paddr; + unsigned long pfn = table_end++, paddr; void *adr; - if (pfn >= end_pfn_map) + if (pfn >= end_pfn) panic("alloc_low_page: ran out of memory"); for (i = 0; temp_mappings[i].allocated; i++) { if (!temp_mappings[i].pmd) panic("alloc_low_page: ran out of temp mappings"); } ti = &temp_mappings[i]; - paddr = (pfn & (~511)) << PAGE_SHIFT; + paddr = (pfn << PAGE_SHIFT) & PMD_MASK; set_pmd(ti->pmd, __pmd(paddr | _KERNPG_TABLE | _PAGE_PSE)); ti->allocated = 1; __flush_tlb(); - adr = ti->address + (pfn & 511)*PAGE_SIZE; + adr = ti->address + ((pfn << PAGE_SHIFT) & ~PMD_MASK); *index = i; *phys = pfn * PAGE_SIZE; return adr; @@ -203,20 +210,26 @@ static void __init phys_pgd_init(pgd_t *pgd, unsigned long address, unsigned lon pgd = pgd + i; for (; i < PTRS_PER_PGD; pgd++, i++) { int map; - unsigned long paddr = i*PGDIR_SIZE, pmd_phys; + unsigned long paddr, pmd_phys; pmd_t *pmd; + paddr = (address & PML4_MASK) + i*PGDIR_SIZE; if (paddr >= end) { for (; i < PTRS_PER_PGD; i++, pgd++) set_pgd(pgd, __pgd(0)); break; } + + if (!e820_mapped(paddr, paddr+PGDIR_SIZE, 0)) { + set_pgd(pgd, __pgd(0)); + continue; + } + pmd = alloc_low_page(&map, &pmd_phys); set_pgd(pgd, __pgd(pmd_phys | _KERNPG_TABLE)); - for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { + for (j = 0; j < PTRS_PER_PMD; pmd++, 
j++, paddr += PMD_SIZE) { unsigned long pe; - paddr = i*PGDIR_SIZE + j*PMD_SIZE; if (paddr >= end) { for (; j < PTRS_PER_PMD; j++, pmd++) set_pmd(pmd, __pmd(0)); @@ -239,13 +252,37 @@ void __init init_memory_mapping(void) unsigned long adr; unsigned long end; unsigned long next; + unsigned long pgds, pmds, tables; + + Dprintk("init_memory_mapping\n"); + + end = end_pfn_map << PAGE_SHIFT; + + /* + * Find space for the kernel direct mapping tables. + * Later we should allocate these tables in the local node of the memory + * mapped. Unfortunately this is done currently before the nodes are + * discovered. + */ + + pgds = (end + PGDIR_SIZE - 1) >> PGDIR_SHIFT; + pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT; + tables = round_up(pgds*8, PAGE_SIZE) + round_up(pmds * 8, PAGE_SIZE); + + table_start = find_e820_area(0x8000, __pa_symbol(&_text), tables); + if (table_start == -1UL) + panic("Cannot find space for the kernel page tables"); + + table_start >>= PAGE_SHIFT; + table_end = table_start; + + end += __PAGE_OFFSET; /* turn virtual */ - end = PAGE_OFFSET + (end_pfn_map * PAGE_SIZE); for (adr = PAGE_OFFSET; adr < end; adr = next) { int map; unsigned long pgd_phys; pgd_t *pgd = alloc_low_page(&map, &pgd_phys); - next = adr + (512UL * 1024 * 1024 * 1024); + next = adr + PML4_SIZE; if (next > end) next = end; phys_pgd_init(pgd, adr-PAGE_OFFSET, next-PAGE_OFFSET); @@ -254,20 +291,35 @@ void __init init_memory_mapping(void) } asm volatile("movq %%cr4,%0" : "=r" (mmu_cr4_features)); __flush_tlb_all(); + early_printk("kernel direct mapping tables upto %lx @ %lx-%lx\n", end, + table_start<> 10, reservedpages << (PAGE_SHIFT-10), datasize >> 10, @@ -392,3 +453,16 @@ void free_initrd_mem(unsigned long start, unsigned long end) } } #endif + +void __init reserve_bootmem_generic(unsigned long phys, unsigned len) +{ + /* Should check here against the e820 map to avoid double free */ +#ifdef CONFIG_DISCONTIGMEM + int nid = phys_to_nid(phys); + if (phys < HIGH_MEMORY && nid) + panic("reserve 
of %lx at node %d", phys, nid); + reserve_bootmem_node(NODE_DATA(nid), phys, len); +#else + reserve_bootmem(phys, len); +#endif +} diff --git a/arch/x86_64/mm/ioremap.c b/arch/x86_64/mm/ioremap.c index c095886d9b16..2f10ba92beaf 100644 --- a/arch/x86_64/mm/ioremap.c +++ b/arch/x86_64/mm/ioremap.c @@ -133,14 +133,16 @@ void * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flag */ if (phys_addr < virt_to_phys(high_memory)) { char *t_addr, *t_end; - struct page *page; t_addr = __va(phys_addr); t_end = t_addr + (size - 1); +#ifndef CONFIG_DISCONTIGMEM + struct page *page; for(page = virt_to_page(t_addr); page <= virt_to_page(t_end); page++) if(!PageReserved(page)) return NULL; +#endif } /* diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c new file mode 100644 index 000000000000..ed2da1470216 --- /dev/null +++ b/arch/x86_64/mm/k8topology.c @@ -0,0 +1,141 @@ +/* + * AMD K8 NUMA support. + * Discover the memory map and associated nodes. + * + * Doesn't use the ACPI SRAT table because it has a questionable license. + * Instead the northbridge registers are read directly. + * XXX in 2.5 we could use the generic SRAT code + * + * Copyright 2002,2003 Andi Kleen, SuSE Labs. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int find_northbridge(void) +{ + int num; + + for (num = 0; num < 32; num++) { + u32 header; + + header = read_pci_config(0, num, 0, 0x00); + if (header != (PCI_VENDOR_ID_AMD | (0x1100<<16))) + continue; + + header = read_pci_config(0, num, 1, 0x00); + if (header != (PCI_VENDOR_ID_AMD | (0x1101<<16))) + continue; + return num; + } + + return -1; +} + +int __init k8_scan_nodes(unsigned long start, unsigned long end) +{ + unsigned long prevbase; + struct node nodes[MAXNODE]; + int nodeid, numnodes, maxnode, i, nb; + + nb = find_northbridge(); + if (nb < 0) + return nb; + + printk(KERN_INFO "Scanning NUMA topology in Northbridge %d\n", nb); + + numnodes = (read_pci_config(0, nb, 0, 0x60 ) >> 4) & 3; + + memset(&nodes,0,sizeof(nodes)); + prevbase = 0; + maxnode = -1; + for (i = 0; i < MAXNODE; i++) { + unsigned long base,limit; + + base = read_pci_config(0, nb, 1, 0x40 + i*8); + limit = read_pci_config(0, nb, 1, 0x44 + i*8); + + nodeid = limit & 3; + if (!limit) { + printk(KERN_INFO "Skipping node entry %d (base %lx)\n", i, base); + continue; + } + if ((base >> 8) & 3 || (limit >> 8) & 3) { + printk(KERN_ERR "Node %d using interleaving mode %lx/%lx\n", + nodeid, (base>>8)&3, (limit>>8) & 3); + return -1; + } + if (nodeid > maxnode) + maxnode = nodeid; + if ((1UL << nodeid) & nodes_present) { + printk("Node %d already present. Skipping\n", nodeid); + continue; + } + + limit >>= 16; + limit <<= 24; + + if (limit > end_pfn_map << PAGE_SHIFT) + limit = end_pfn_map << PAGE_SHIFT; + if (limit <= base) { + printk(KERN_INFO "Node %d beyond memory map\n", nodeid); + continue; + } + + base >>= 16; + base <<= 24; + + if (base < start) + base = start; + if (limit > end) + limit = end; + if (limit == base) + continue; + if (limit < base) { + printk(KERN_INFO"Node %d bogus settings %lx-%lx. 
Ignored.\n", + nodeid, base, limit); + continue; + } + + /* Could sort here, but pun for now. Should not happen anyroads. */ + if (prevbase > base) { + printk(KERN_INFO "Node map not sorted %lx,%lx\n", + prevbase,base); + return -1; + } + + printk(KERN_INFO "Node %d MemBase %016lx Limit %016lx\n", + nodeid, base, limit); + + nodes[nodeid].start = base; + nodes[nodeid].end = limit; + + prevbase = base; + } + + if (maxnode <= 0) + return -1; + + memnode_shift = compute_hash_shift(nodes,maxnode,end); + if (memnode_shift < 0) { + printk(KERN_ERR "No NUMA node hash function found. Contact maintainer\n"); + return -1; + } + printk(KERN_INFO "Using node hash shift of %d\n", memnode_shift); + + early_for_all_nodes(i) { + setup_node_bootmem(i, nodes[i].start, nodes[i].end); + } + + return 0; +} + diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c new file mode 100644 index 000000000000..8135efbf522d --- /dev/null +++ b/arch/x86_64/mm/numa.c @@ -0,0 +1,207 @@ +/* + * Generic VM initialization for x86-64 NUMA setups. + * Copyright 2002,2003 Andi Kleen, SuSE Labs. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define Dprintk(x...) printk(x) + +struct pglist_data *node_data[MAXNODE]; +bootmem_data_t plat_node_bdata[MAX_NUMNODES]; + +int memnode_shift; +u8 memnodemap[NODEMAPSIZE]; + +static int numa_off __initdata; + +unsigned long nodes_present; +int maxnode; + +static int emunodes __initdata; + +int compute_hash_shift(struct node *nodes, int numnodes, u64 maxmem) +{ + int i; + int shift = 24; + u64 addr; + + /* When in doubt use brute force. 
*/ + while (shift < 48) { + memset(memnodemap,0xff,sizeof(*memnodemap) * NODEMAPSIZE); + early_for_all_nodes (i) { + for (addr = nodes[i].start; + addr < nodes[i].end; + addr += (1UL << shift)) { + if (memnodemap[addr >> shift] != 0xff) { + printk("node %d shift %d addr %Lx conflict %d\n", + i, shift, addr, memnodemap[addr>>shift]); + goto next; + } + memnodemap[addr >> shift] = i; + } + } + return shift; + next: + shift++; + } + memset(memnodemap,0,sizeof(*memnodemap) * NODEMAPSIZE); + return -1; +} + +/* Initialize bootmem allocator for a node */ +void __init setup_node_bootmem(int nodeid, unsigned long start, unsigned long end) +{ + unsigned long start_pfn, end_pfn, bootmap_pages, bootmap_size, bootmap_start; + unsigned long nodedata_phys; + const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE); + + start = round_up(start, ZONE_ALIGN); + + printk("Bootmem setup node %d %016lx-%016lx\n", nodeid, start, end); + + start_pfn = start >> PAGE_SHIFT; + end_pfn = end >> PAGE_SHIFT; + + nodedata_phys = find_e820_area(start, end, pgdat_size); + if (nodedata_phys == -1L) + panic("Cannot find memory pgdat in node %d\n", nodeid); + + Dprintk("nodedata_phys %lx\n", nodedata_phys); + + node_data[nodeid] = phys_to_virt(nodedata_phys); + memset(NODE_DATA(nodeid), 0, sizeof(pg_data_t)); + NODE_DATA(nodeid)->bdata = &plat_node_bdata[nodeid]; + NODE_DATA(nodeid)->node_start_pfn = start_pfn; + NODE_DATA(nodeid)->node_size = end_pfn - start_pfn; + + /* Find a place for the bootmem map */ + bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn); + bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE); + bootmap_start = find_e820_area(bootmap_start, end, bootmap_pages<> PAGE_SHIFT, + start_pfn, end_pfn); + + e820_bootmem_free(NODE_DATA(nodeid), start, end); + + reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); + reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages< maxnode) + maxnode = nodeid; + nodes_present |= (1UL << nodeid); 
+} + +/* Initialize final allocator for a zone */ +void __init setup_node_zones(int nodeid) +{ + unsigned long start_pfn, end_pfn; + unsigned long zones[MAX_NR_ZONES]; + unsigned long dma_end_pfn; + + memset(zones, 0, sizeof(unsigned long) * MAX_NR_ZONES); + + start_pfn = node_start_pfn(nodeid); + end_pfn = node_end_pfn(nodeid); + + printk("setting up node %d %lx-%lx\n", nodeid, start_pfn, end_pfn); + + /* All nodes > 0 have a zero length zone DMA */ + dma_end_pfn = __pa(MAX_DMA_ADDRESS) >> PAGE_SHIFT; + if (start_pfn < dma_end_pfn) { + zones[ZONE_DMA] = dma_end_pfn - start_pfn; + zones[ZONE_NORMAL] = end_pfn - dma_end_pfn; + } else { + zones[ZONE_NORMAL] = end_pfn - start_pfn; + } + + free_area_init_node(nodeid, NODE_DATA(nodeid), NULL, zones, + start_pfn, NULL); +} + +int fake_node; + +int __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn) +{ +#ifdef CONFIG_K8_NUMA + if (!numa_off && !k8_scan_nodes(start_pfn< 0) { + struct node nodes[MAXNODE]; + unsigned long nodesize = (end_pfn << PAGE_SHIFT) / emunodes; + int i; + if (emunodes > MAXNODE) + emunodes = MAXNODE; + printk(KERN_INFO "Faking %d nodes of size %ld MB\n", emunodes, nodesize>>20); + for (i = 0; i < emunodes; i++) { + unsigned long end = (i+1)*nodesize; + if (i == emunodes-1) + end = end_pfn << PAGE_SHIFT; + nodes[i].start = i * nodesize; + nodes[i].end = end; + setup_node_bootmem(i, nodes[i].start, nodes[i].end); + } + memnode_shift = compute_hash_shift(nodes, emunodes, nodes[i-1].end); + return 0; + } + + printk(KERN_INFO "Faking a node at %016lx-%016lx\n", + start_pfn << PAGE_SHIFT, + end_pfn << PAGE_SHIFT); + /* setup dummy node covering all memory */ + fake_node = 1; + memnode_shift = 63; + memnodemap[0] = 0; + setup_node_bootmem(0, start_pfn<task; \ -}) - - #define current get_current() #else diff --git a/include/asm-x86_64/desc.h b/include/asm-x86_64/desc.h index 724777a946f6..ae54a8810268 100644 --- a/include/asm-x86_64/desc.h +++ b/include/asm-x86_64/desc.h @@ -135,6 +135,14 
@@ static inline void set_ldt_desc(unsigned cpu, void *addr, int size) DESC_LDT, size); } +static inline void set_seg_base(unsigned cpu, int entry, void *base) +{ + struct desc_struct *d = &cpu_gdt_table[cpu][entry]; + d->base0 = PTR_LOW(base); + d->base1 = PTR_MIDDLE(base); + d->base2 = PTR_HIGH(base); +} + #define LDT_entry_a(info) \ ((((info)->base_addr & 0x0000ffff) << 16) | ((info)->limit & 0x0ffff)) #define LDT_entry_b(info) \ diff --git a/include/asm-x86_64/dma-mapping.h b/include/asm-x86_64/dma-mapping.h index 48ada1b2956f..414efa3c3bcb 100644 --- a/include/asm-x86_64/dma-mapping.h +++ b/include/asm-x86_64/dma-mapping.h @@ -1,5 +1,5 @@ -#ifndef _ASM_X8664_DMA_MAPPING_H -#define _ASM_X8664_DMA_MAPPING_H +#ifndef _X8664_DMA_MAPPING_H +#define _X8664_DMA_MAPPING_H 1 #include diff --git a/include/asm-x86_64/e820.h b/include/asm-x86_64/e820.h index ff2df2ed7f42..9b447dd69663 100644 --- a/include/asm-x86_64/e820.h +++ b/include/asm-x86_64/e820.h @@ -47,7 +47,7 @@ extern void add_memory_region(unsigned long start, unsigned long size, int type); extern void setup_memory_region(void); extern void contig_e820_setup(void); -extern void e820_end_of_ram(void); +extern unsigned long e820_end_of_ram(void); extern void e820_reserve_resources(void); extern void e820_print_map(char *who); extern int e820_mapped(unsigned long start, unsigned long end, int type); diff --git a/include/asm-x86_64/i387.h b/include/asm-x86_64/i387.h index f3d20c00741f..a07d79a9aef2 100644 --- a/include/asm-x86_64/i387.h +++ b/include/asm-x86_64/i387.h @@ -39,16 +39,10 @@ static inline int need_signal_i387(struct task_struct *me) #define kernel_fpu_end() stts() #define unlazy_fpu(tsk) do { \ - if (test_tsk_thread_flag(tsk, TIF_USEDFPU)) \ + if ((tsk)->thread_info->flags & TIF_USEDFPU) \ save_init_fpu(tsk); \ } while (0) -#define unlazy_current_fpu() do { \ - if (test_thread_flag(TIF_USEDFPU)) \ - save_init_fpu(tsk); \ -} while (0) - - #define clear_fpu(tsk) do { \ if (test_tsk_thread_flag(tsk, 
TIF_USEDFPU)) { \ asm volatile("fwait"); \ @@ -134,7 +128,7 @@ static inline void save_init_fpu( struct task_struct *tsk ) { asm volatile( "fxsave %0 ; fnclex" : "=m" (tsk->thread.i387.fxsave)); - clear_tsk_thread_flag(tsk, TIF_USEDFPU); + tsk->thread_info->flags &= ~TIF_USEDFPU; stts(); } diff --git a/include/asm-x86_64/io.h b/include/asm-x86_64/io.h index aabef395f935..e6614c5ebb1f 100644 --- a/include/asm-x86_64/io.h +++ b/include/asm-x86_64/io.h @@ -1,6 +1,8 @@ #ifndef _ASM_IO_H #define _ASM_IO_H +#include + /* * This file contains the definitions for the x86 IO instructions * inb/inw/inl/outb/outw/outl and the "string versions" of the same @@ -135,7 +137,12 @@ extern inline void * phys_to_virt(unsigned long address) /* * Change "struct page" to physical address. */ +#ifdef CONFIG_DISCONTIGMEM +#include +#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT) +#else #define page_to_phys(page) ((page - mem_map) << PAGE_SHIFT) +#endif extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags); diff --git a/include/asm-x86_64/mmsegment.h b/include/asm-x86_64/mmsegment.h new file mode 100644 index 000000000000..d3f80c996330 --- /dev/null +++ b/include/asm-x86_64/mmsegment.h @@ -0,0 +1,8 @@ +#ifndef _ASM_MMSEGMENT_H +#define _ASM_MMSEGMENT_H 1 + +typedef struct { + unsigned long seg; +} mm_segment_t; + +#endif diff --git a/include/asm-x86_64/mmzone.h b/include/asm-x86_64/mmzone.h new file mode 100644 index 000000000000..9d5b9772f81c --- /dev/null +++ b/include/asm-x86_64/mmzone.h @@ -0,0 +1,79 @@ +/* K8 NUMA support */ +/* Copyright 2002,2003 by Andi Kleen, SuSE Labs */ +/* 2.5 Version loosely based on the NUMAQ Code by Pat Gaughen.
*/ +#ifndef _ASM_X86_64_MMZONE_H +#define _ASM_X86_64_MMZONE_H 1 + +#include + +#ifdef CONFIG_DISCONTIGMEM + +#define VIRTUAL_BUG_ON(x) + +#include +#include + +#define MAXNODE 8 +#define NODEMAPSIZE 0xff + +/* Simple perfect hash to map physical addresses to node numbers */ +extern int memnode_shift; +extern u8 memnodemap[NODEMAPSIZE]; +extern int maxnode; + +extern struct pglist_data *node_data[]; + +/* kern_addr_valid below hardcodes the same algorithm */ +static inline __attribute__((pure)) int phys_to_nid(unsigned long addr) +{ + int nid; + VIRTUAL_BUG_ON((addr >> memnode_shift) >= NODEMAPSIZE); + nid = memnodemap[addr >> memnode_shift]; + VIRTUAL_BUG_ON(nid > maxnode); + return nid; +} + +#define kvaddr_to_nid(kaddr) phys_to_nid(__pa(kaddr)) +#define NODE_DATA(nid) (node_data[nid]) + +#define node_mem_map(nid) (NODE_DATA(nid)->node_mem_map) +#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) +#define node_end_pfn(nid) (NODE_DATA(nid)->node_start_pfn + \ + NODE_DATA(nid)->node_size) +#define node_size(nid) (NODE_DATA(nid)->node_size) + +#define local_mapnr(kvaddr) \ + ( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) ) +#define kern_addr_valid(kvaddr) ({ \ + int ok = 0; \ + unsigned long index = __pa(kvaddr) >> memnode_shift; \ + if (index <= NODEMAPSIZE) { \ + unsigned nodeid = memnodemap[index]; \ + unsigned long pfn = __pa(kvaddr) >> PAGE_SHIFT; \ + unsigned long start_pfn = node_start_pfn(nodeid); \ + ok = (nodeid != 0xff) && \ + (pfn >= start_pfn) && \ + (pfn < start_pfn + node_size(nodeid)); \ + } \ + ok; \ +}) + +/* AK: this currently doesn't deal with invalid addresses. We'll see + if the 2.5 kernel doesn't pass them + (2.4 used to).
*/ +#define pfn_to_page(pfn) ({ \ + int nid = phys_to_nid(((unsigned long)(pfn)) << PAGE_SHIFT); \ + ((pfn) - node_start_pfn(nid)) + node_mem_map(nid); \ +}) + +#define page_to_pfn(page) \ + (long)(((page) - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn) + +/* AK: !DISCONTIGMEM just forces it to 1. Can't we too? */ +#define pfn_valid(pfn) ((pfn) < num_physpages) + + +#endif +#endif diff --git a/include/asm-x86_64/mpspec.h b/include/asm-x86_64/mpspec.h index 528fd8a9a272..8ef120fe2274 100644 --- a/include/asm-x86_64/mpspec.h +++ b/include/asm-x86_64/mpspec.h @@ -185,7 +185,6 @@ extern int mp_bus_id_to_pci_bus [MAX_MP_BUSSES]; extern int mp_current_pci_id; extern unsigned long mp_lapic_addr; extern int pic_mode; -extern int using_apic_timer; #ifdef CONFIG_ACPI_BOOT extern void mp_register_lapic (u8 id, u8 enabled); @@ -199,5 +198,7 @@ extern void mp_parse_prt (void); #endif /*CONFIG_X86_IO_APIC*/ #endif +extern int using_apic_timer; + #endif diff --git a/include/asm-x86_64/msr.h b/include/asm-x86_64/msr.h index 4085cc8c5dbe..c57f9da6efab 100644 --- a/include/asm-x86_64/msr.h +++ b/include/asm-x86_64/msr.h @@ -67,6 +67,61 @@ : "=a" (low), "=d" (high) \ : "c" (counter)) +extern inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx) +{ + __asm__("cpuid" + : "=a" (*eax), + "=b" (*ebx), + "=c" (*ecx), + "=d" (*edx) + : "0" (op)); +} + +/* + * CPUID functions returning a single datum + */ +extern inline unsigned int cpuid_eax(unsigned int op) +{ + unsigned int eax; + + __asm__("cpuid" + : "=a" (eax) + : "0" (op) + : "bx", "cx", "dx"); + return eax; +} +extern inline unsigned int cpuid_ebx(unsigned int op) +{ + unsigned int eax, ebx; + + __asm__("cpuid" + : "=a" (eax), "=b" (ebx) + : "0" (op) + : "cx", "dx" ); + return ebx; +} +extern inline unsigned int cpuid_ecx(unsigned int op) +{ + unsigned int eax, ecx; + + __asm__("cpuid" + : "=a" (eax), "=c" (ecx) + : "0" (op) + : "bx", "dx" ); + return ecx; +} +extern inline unsigned int 
cpuid_edx(unsigned int op) +{ + unsigned int eax, edx; + + __asm__("cpuid" + : "=a" (eax), "=d" (edx) + : "0" (op) + : "bx", "cx"); + return edx; +} + + #endif /* AMD/K8 specific MSRs */ diff --git a/include/asm-x86_64/numa.h b/include/asm-x86_64/numa.h new file mode 100644 index 000000000000..7686e4dfd9f4 --- /dev/null +++ b/include/asm-x86_64/numa.h @@ -0,0 +1,22 @@ +#ifndef _ASM_X8664_NUMA_H +#define _ASM_X8664_NUMA_H 1 + +#define MAXNODE 8 +#define NODEMASK 0xff + +struct node { + u64 start,end; +}; + +#define for_all_nodes(x) for ((x) = 0; (x) <= maxnode; (x)++) \ + if ((1UL << (x)) & nodes_present) + +#define early_for_all_nodes(n) \ + for (n=0; n + +#ifdef CONFIG_DISCONTIGMEM +#define MAX_NUMNODES 8 /* APIC limit currently */ +#else +#define MAX_NUMNODES 1 +#endif + +#endif diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h index 2e65d509ec25..d954802230bb 100644 --- a/include/asm-x86_64/page.h +++ b/include/asm-x86_64/page.h @@ -1,6 +1,8 @@ #ifndef _X86_64_PAGE_H #define _X86_64_PAGE_H +#include + /* PAGE_SHIFT determines the page size */ #define PAGE_SHIFT 12 #ifdef __ASSEMBLY__ @@ -10,7 +12,13 @@ #endif #define PAGE_MASK (~(PAGE_SIZE-1)) #define PHYSICAL_PAGE_MASK (~(PAGE_SIZE-1) & (__PHYSICAL_MASK << PAGE_SHIFT)) -#define THREAD_SIZE (2*PAGE_SIZE) + +#define THREAD_ORDER 1 +#ifdef __ASSEMBLY__ +#define THREAD_SIZE (1 << (PAGE_SHIFT + THREAD_ORDER)) +#else +#define THREAD_SIZE (1UL << (PAGE_SHIFT + THREAD_ORDER)) +#endif #define CURRENT_MASK (~(THREAD_SIZE-1)) #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) @@ -58,7 +66,7 @@ typedef struct { unsigned long pgprot; } pgprot_t; /* to align the pointer to the (next) page boundary */ #define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK) -/* See Documentation/x86_64/mm.txt for a description of the layout. */ +/* See Documentation/x86_64/mm.txt for a description of the memory map. 
*/ #define __START_KERNEL 0xffffffff80100000 #define __START_KERNEL_map 0xffffffff80000000 #define __PAGE_OFFSET 0x0000010000000000 @@ -100,10 +108,13 @@ extern __inline__ int get_order(unsigned long size) __pa(v); }) #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) +#ifndef CONFIG_DISCONTIGMEM #define pfn_to_page(pfn) (mem_map + (pfn)) #define page_to_pfn(page) ((unsigned long)((page) - mem_map)) -#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) #define pfn_valid(pfn) ((pfn) < max_mapnr) +#endif + +#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) diff --git a/include/asm-x86_64/pda.h b/include/asm-x86_64/pda.h index 47d243c980cc..2af44c17ba3a 100644 --- a/include/asm-x86_64/pda.h +++ b/include/asm-x86_64/pda.h @@ -55,7 +55,10 @@ asm volatile(op "q %0,%%gs:%c1"::"r" (val),"i"(pda_offset(field)):"memory"); bre } \ } while (0) - +/* + * AK: PDA read accesses should be neither volatile nor have a memory clobber. + * Unfortunately removing them causes all hell to break loose currently.
+ */ #define pda_from_op(op,field) ({ \ typedef typeof_field(struct x8664_pda, field) T__; T__ ret__; \ switch (sizeof_field(struct x8664_pda, field)) { \ diff --git a/include/asm-x86_64/pgalloc.h b/include/asm-x86_64/pgalloc.h index 899f603e69fa..4cae8e6a37a0 100644 --- a/include/asm-x86_64/pgalloc.h +++ b/include/asm-x86_64/pgalloc.h @@ -14,8 +14,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { - set_pmd(pmd, __pmd(_PAGE_TABLE | - ((u64)(pte - mem_map) << PAGE_SHIFT))); + set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } extern __inline__ pmd_t *get_pmd(void) @@ -76,6 +75,6 @@ extern inline void pte_free(struct page *pte) } #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte)) -#define __pmd_free_tlb(tlb,x) do { } while (0) +#define __pmd_free_tlb(tlb,x) pmd_free(x) #endif /* _X86_64_PGALLOC_H */ diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h index fd60629f3386..44d57e317dbc 100644 --- a/include/asm-x86_64/pgtable.h +++ b/include/asm-x86_64/pgtable.h @@ -103,6 +103,8 @@ static inline void set_pml4(pml4_t *dst, pml4_t val) #define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte, 0)) #define pte_same(a, b) ((a).pte == (b).pte) +#define PML4_SIZE (1UL << PML4_SHIFT) +#define PML4_MASK (~(PML4_SIZE-1)) #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) @@ -317,7 +319,8 @@ static inline pgd_t *current_pgd_offset_k(unsigned long address) /* PMD - Level 2 access */ #define pmd_page_kernel(pmd) ((unsigned long) __va(pmd_val(pmd) & PTE_MASK)) -#define pmd_page(pmd) (mem_map + ((pmd_val(pmd) & PTE_MASK)>>PAGE_SHIFT)) +#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)) + #define __pmd_offset(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)) #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \ __pmd_offset(address)) @@ -372,7 +375,9 @@ typedef pte_t *pte_addr_t; #endif /* !__ASSEMBLY__ */ +#ifndef 
CONFIG_DISCONTIGMEM #define kern_addr_valid(addr) (1) +#endif #define io_remap_page_range remap_page_range diff --git a/include/asm-x86_64/processor.h b/include/asm-x86_64/processor.h index 9c84b14139f2..7739379bbe12 100644 --- a/include/asm-x86_64/processor.h +++ b/include/asm-x86_64/processor.h @@ -17,6 +17,7 @@ #include #include #include +#include #define TF_MASK 0x00000100 #define IF_MASK 0x00000200 @@ -109,64 +110,6 @@ extern void dodgy_tsc(void); #define X86_EFLAGS_VIP 0x00100000 /* Virtual Interrupt Pending */ #define X86_EFLAGS_ID 0x00200000 /* CPUID detection flag */ -/* - * Generic CPUID function - * FIXME: This really belongs to msr.h - */ -extern inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx) -{ - __asm__("cpuid" - : "=a" (*eax), - "=b" (*ebx), - "=c" (*ecx), - "=d" (*edx) - : "0" (op)); -} - -/* - * CPUID functions returning a single datum - */ -extern inline unsigned int cpuid_eax(unsigned int op) -{ - unsigned int eax; - - __asm__("cpuid" - : "=a" (eax) - : "0" (op) - : "bx", "cx", "dx"); - return eax; -} -extern inline unsigned int cpuid_ebx(unsigned int op) -{ - unsigned int eax, ebx; - - __asm__("cpuid" - : "=a" (eax), "=b" (ebx) - : "0" (op) - : "cx", "dx" ); - return ebx; -} -extern inline unsigned int cpuid_ecx(unsigned int op) -{ - unsigned int eax, ecx; - - __asm__("cpuid" - : "=a" (eax), "=c" (ecx) - : "0" (op) - : "bx", "dx" ); - return ecx; -} -extern inline unsigned int cpuid_edx(unsigned int op) -{ - unsigned int eax, edx; - - __asm__("cpuid" - : "=a" (eax), "=d" (edx) - : "0" (op) - : "bx", "cx"); - return edx; -} - /* * Intel CPU features in CR4 */ @@ -210,36 +153,6 @@ static inline void clear_in_cr4 (unsigned long mask) :"ax"); } -#if 0 -/* - * Cyrix CPU configuration register indexes - */ -#define CX86_CCR0 0xc0 -#define CX86_CCR1 0xc1 -#define CX86_CCR2 0xc2 -#define CX86_CCR3 0xc3 -#define CX86_CCR4 0xe8 -#define CX86_CCR5 0xe9 -#define CX86_CCR6 0xea -#define CX86_CCR7 0xeb -#define CX86_DIR0 0xfe -#define 
CX86_DIR1 0xff -#define CX86_ARR_BASE 0xc4 -#define CX86_RCR_BASE 0xdc - -/* - * Cyrix CPU indexed register access macros - */ - -#define getCx86(reg) ({ outb((reg), 0x22); inb(0x23); }) - -#define setCx86(reg, data) do { \ - outb((reg), 0x22); \ - outb((data), 0x23); \ -} while (0) - -#endif - /* * Bus types */ @@ -286,10 +199,6 @@ union i387_union { struct i387_fxsave_struct fxsave; }; -typedef struct { - unsigned long seg; -} mm_segment_t; - struct tss_struct { u32 reserved1; u64 rsp0; @@ -302,7 +211,7 @@ struct tss_struct { u16 reserved5; u16 io_map_base; u32 io_bitmap[IO_BITMAP_SIZE]; -} __attribute__((packed)); +} __attribute__((packed)) ____cacheline_aligned; struct thread_struct { unsigned long rsp0; @@ -336,6 +245,7 @@ struct thread_struct { #define NMI_STACK 3 #define N_EXCEPTION_STACKS 3 /* hw limit: 7 */ #define EXCEPTION_STKSZ 1024 +#define EXCEPTION_STK_ORDER 0 #define start_thread(regs,new_rip,new_rsp) do { \ asm volatile("movl %0,%%fs; movl %0,%%es; movl %0,%%ds": :"r" (0)); \ @@ -378,6 +288,13 @@ extern inline void rep_nop(void) __asm__ __volatile__("rep;nop": : :"memory"); } +/* Stop speculative execution */ +extern inline void sync_core(void) +{ + int tmp; + asm volatile("cpuid" : "=a" (tmp) : "0" (1) : "ebx","ecx","edx","memory"); +} + #define cpu_has_fpu 1 #define ARCH_HAS_PREFETCH @@ -389,7 +306,6 @@ extern inline void rep_nop(void) #define spin_lock_prefetch(x) prefetchw(x) #define cpu_relax() rep_nop() - /* * NSC/Cyrix CPU configuration register indexes */ @@ -417,4 +333,11 @@ extern inline void rep_nop(void) outb((data), 0x23); \ } while (0) +#define stack_current() \ +({ \ + struct thread_info *ti; \ + asm("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK)); \ + ti->task; \ +}) + #endif /* __ASM_X86_64_PROCESSOR_H */ diff --git a/include/asm-x86_64/proto.h b/include/asm-x86_64/proto.h index 2d56397b90e1..1f40821a0b7a 100644 --- a/include/asm-x86_64/proto.h +++ b/include/asm-x86_64/proto.h @@ -25,6 +25,8 @@ extern void iommu_hole_init(void); 
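The stack_current() macro added above recovers the task pointer by masking %rsp: because the kernel stack is a THREAD_SIZE-sized, THREAD_SIZE-aligned block with struct thread_info at its base, clearing the low bits of any in-stack address lands on that base. A user-space sketch of the masking trick (aligned_alloc standing in for the kernel's stack allocation, and a cut-down thread_info):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_ORDER 1
#define PAGE_SHIFT   12
#define THREAD_SIZE  (1UL << (PAGE_SHIFT + THREAD_ORDER))  /* 8 KB, as in page.h */
#define CURRENT_MASK (~(THREAD_SIZE - 1))

struct thread_info {
    void *task;   /* stand-in for the real struct's task pointer */
    int cpu;
};

/* Equivalent of "andq %rsp,%0" with CURRENT_MASK: round any in-stack
 * address down to the aligned stack base where thread_info lives. */
static struct thread_info *ti_from_sp(uintptr_t sp)
{
    return (struct thread_info *)(sp & CURRENT_MASK);
}
```

This is also why THREAD_SIZE must stay a power of two: the patch replaces the hardcoded ~8191UL with CURRENT_MASK so the mask keeps tracking THREAD_ORDER.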
extern void do_softirq_thunk(void); +extern int numa_setup(char *opt); + extern int setup_early_printk(char *); extern void early_printk(const char *fmt, ...) __attribute__((format(printf,1,2))); @@ -36,18 +38,27 @@ extern unsigned long numa_free_all_bootmem(void); extern void reserve_bootmem_generic(unsigned long phys, unsigned len); extern void free_bootmem_generic(unsigned long phys, unsigned len); -extern unsigned long start_pfn, end_pfn, end_pfn_map; +extern unsigned long end_pfn_map; extern void show_stack(unsigned long * rsp); extern void exception_table_check(void); -extern int acpi_boot_init(char *); +extern void acpi_reserve_bootmem(void); + +extern void swap_low_mappings(void); extern int map_syscall32(struct mm_struct *mm, unsigned long address); extern char *syscall32_page; +void setup_node_bootmem(int nodeid, unsigned long start, unsigned long end); + +extern unsigned long max_mapnr; +extern unsigned long end_pfn; +extern unsigned long table_start, table_end; + struct thread_struct; +struct user_desc; int do_set_thread_area(struct thread_struct *t, struct user_desc *u_info); int do_get_thread_area(struct thread_struct *t, struct user_desc *u_info); diff --git a/include/asm-x86_64/segment.h b/include/asm-x86_64/segment.h index 6992086cbe7c..d867ab520e02 100644 --- a/include/asm-x86_64/segment.h +++ b/include/asm-x86_64/segment.h @@ -19,13 +19,15 @@ #define __USER_DS 0x2b /* 5*8+3 */ #define __USER_CS 0x33 /* 6*8+3 */ #define __USER32_DS __USER_DS +#define __KERNEL16_CS (GDT_ENTRY_KERNELCS16 * 8) #define GDT_ENTRY_TLS 1 #define GDT_ENTRY_TSS 8 /* needs two entries */ #define GDT_ENTRY_LDT 10 #define GDT_ENTRY_TLS_MIN 11 #define GDT_ENTRY_TLS_MAX 13 -#define GDT_ENTRY_LONGBASE 14 +/* 14 free */ +#define GDT_ENTRY_KERNELCS16 15 #define GDT_ENTRY_TLS_ENTRIES 3 diff --git a/include/asm-x86_64/smp.h b/include/asm-x86_64/smp.h index b6a8eb92d2b1..d8a44adfda50 100644 --- a/include/asm-x86_64/smp.h +++ b/include/asm-x86_64/smp.h @@ -44,7 +44,9 @@ extern void 
smp_send_reschedule(int cpu); extern void smp_send_reschedule_all(void); extern void smp_invalidate_rcv(void); /* Process an NMI */ extern void (*mtrr_hook) (void); -extern void zap_low_mappings (void); +extern void zap_low_mappings(void); + +#define SMP_TRAMPOLINE_BASE 0x6000 /* * On x86 all CPUs are mapped 1:1 to the APIC space. @@ -55,38 +57,26 @@ extern void zap_low_mappings (void); extern volatile unsigned long cpu_callout_map; #define cpu_possible(cpu) (cpu_callout_map & (1<<(cpu))) +#define cpu_online(cpu) (cpu_online_map & (1<<(cpu))) -extern inline int cpu_logical_map(int cpu) -{ - return cpu; -} -extern inline int cpu_number_map(int cpu) -{ - return cpu; -} +#define for_each_cpu(cpu, mask) \ + for(mask = cpu_online_map; \ + cpu = __ffs(mask), mask != 0; \ + mask &= ~(1UL<<cpu)) -extern inline int find_next_cpu(int cpu) -{ - unsigned long left = cpu_online_map >> (cpu+1); - if (!left) return -1; - return ffz(~left) + cpu; } -extern inline int find_first_cpu(void) +extern inline unsigned int num_online_cpus(void) { - return ffz(~cpu_online_map); + return hweight32(cpu_online_map); } -/* RED-PEN different from i386 */ -#define for_each_cpu(i) \ - for((i) = find_first_cpu(); (i)>=0; (i)=find_next_cpu(i)) - static inline int num_booting_cpus(void) { return hweight32(cpu_callout_map); @@ -94,28 +84,25 @@ static inline int num_booting_cpus(void) extern volatile unsigned long cpu_callout_map; -/* - * Some lowlevel functions might want to know about - * the real APIC ID <-> CPU # mapping. - */ -extern volatile int x86_apicid_to_cpu[NR_CPUS]; -extern volatile int x86_cpu_to_apicid[NR_CPUS]; - -/* - * This function is needed by all SMP systems. It must _always_ be valid - * from the initial startup. We map APIC_BASE very early in page_setup(), - * so this is correct in the x86 case.
- */ - #define smp_processor_id() read_pda(cpunumber) - extern __inline int hard_smp_processor_id(void) { /* we don't want to mark this access volatile - bad code generation */ return GET_APIC_ID(*(unsigned int *)(APIC_BASE+APIC_ID)); } +extern int disable_apic; +extern int slow_smp_processor_id(void); + +extern inline int safe_smp_processor_id(void) +{ + if (disable_apic) + return slow_smp_processor_id(); + else + return hard_smp_processor_id(); +} + #define cpu_online(cpu) (cpu_online_map & (1<<(cpu))) #endif /* !ASSEMBLY */ @@ -128,6 +115,7 @@ extern __inline int hard_smp_processor_id(void) #ifndef CONFIG_SMP #define stack_smp_processor_id() 0 +#define safe_smp_processor_id() 0 #define for_each_cpu(x) (x)=0; #define cpu_logical_map(x) (x) #else @@ -135,7 +123,7 @@ extern __inline int hard_smp_processor_id(void) #define stack_smp_processor_id() \ ({ \ struct thread_info *ti; \ - __asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (~8191UL)); \ + __asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK)); \ ti->cpu; \ }) #endif diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h index 00ae8043a534..ae3615feecdb 100644 --- a/include/asm-x86_64/spinlock.h +++ b/include/asm-x86_64/spinlock.h @@ -15,7 +15,7 @@ extern int printk(const char * fmt, ...) typedef struct { volatile unsigned int lock; -#if CONFIG_DEBUG_SPINLOCK +#ifdef CONFIG_DEBUG_SPINLOCK unsigned magic; #endif } spinlock_t; @@ -56,13 +56,56 @@ typedef struct { /* * This works. Despite all the confusion. 
+ * (except on PPro SMP or if we are using OOSTORE) + * (PPro errata 66, 92) */ + +#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) + #define spin_unlock_string \ - "movb $1,%0" + "movb $1,%0" \ + :"=m" (lock->lock) : : "memory" + + +static inline void _raw_spin_unlock(spinlock_t *lock) +{ +#ifdef CONFIG_DEBUG_SPINLOCK + if (lock->magic != SPINLOCK_MAGIC) + BUG(); + if (!spin_is_locked(lock)) + BUG(); +#endif + __asm__ __volatile__( + spin_unlock_string + ); +} + +#else + +#define spin_unlock_string \ + "xchgb %b0, %1" \ + :"=q" (oldval), "=m" (lock->lock) \ + :"0" (oldval) : "memory" + +static inline void _raw_spin_unlock(spinlock_t *lock) +{ + char oldval = 1; +#ifdef CONFIG_DEBUG_SPINLOCK + if (lock->magic != SPINLOCK_MAGIC) + BUG(); + if (!spin_is_locked(lock)) + BUG(); +#endif + __asm__ __volatile__( + spin_unlock_string + ); +} + +#endif static inline int _raw_spin_trylock(spinlock_t *lock) { - signed char oldval; + char oldval; __asm__ __volatile__( "xchgb %b0,%1" :"=q" (oldval), "=m" (lock->lock) @@ -85,18 +128,6 @@ printk("eip: %p\n", &&here); :"=m" (lock->lock) : : "memory"); } -static inline void _raw_spin_unlock(spinlock_t *lock) -{ -#ifdef CONFIG_DEBUG_SPINLOCK - if (lock->magic != SPINLOCK_MAGIC) - BUG(); - if (!spin_is_locked(lock)) - BUG(); -#endif - __asm__ __volatile__( - spin_unlock_string - :"=m" (lock->lock) : : "memory"); -} /* * Read-write spinlocks, allowing multiple readers @@ -127,6 +158,8 @@ typedef struct { #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0) +#define rwlock_is_locked(x) ((x)->lock != RW_LOCK_BIAS) + /* * On x86, we implement read-write locks as a 32-bit counter * with the high bit (sign) being the "contended" bit. @@ -136,9 +169,9 @@ typedef struct { * Changed to use the same technique as rw semaphores. See * semaphore.h for details. 
-ben */ -/* the spinlock helpers are in arch/x86_64/kernel/semaphore.S */ +/* the spinlock helpers are in arch/i386/kernel/semaphore.c */ -extern inline void _raw_read_lock(rwlock_t *rw) +static inline void _raw_read_lock(rwlock_t *rw) { #ifdef CONFIG_DEBUG_SPINLOCK if (rw->magic != RWLOCK_MAGIC) @@ -168,6 +201,4 @@ static inline int _raw_write_trylock(rwlock_t *lock) return 0; } -#define rwlock_is_locked(x) ((x)->lock != RW_LOCK_BIAS) - #endif /* __ASM_SPINLOCK_H */ diff --git a/include/asm-x86_64/system.h b/include/asm-x86_64/system.h index 268865b34785..585e13370873 100644 --- a/include/asm-x86_64/system.h +++ b/include/asm-x86_64/system.h @@ -83,7 +83,7 @@ extern void load_gs_index(unsigned); #define loadsegment(seg,value) \ asm volatile("\n" \ "1:\t" \ - "movl %0,%%" #seg "\n" \ + "movl %k0,%%" #seg "\n" \ "2:\n" \ ".section .fixup,\"ax\"\n" \ "3:\t" \ @@ -94,7 +94,7 @@ extern void load_gs_index(unsigned); ".align 8\n\t" \ ".quad 1b,3b\n" \ ".previous" \ - : :"r" ((int)(value))) + : :"r" (value)) #define set_debug(value,register) \ __asm__("movq %0,%%db" #register \ @@ -119,6 +119,13 @@ static inline void write_cr0(unsigned long val) asm volatile("movq %0,%%cr0" :: "r" (val)); } +static inline unsigned long read_cr3(void) +{ + unsigned long cr3; + asm("movq %%cr3,%0" : "=r" (cr3)); + return cr3; +} + static inline unsigned long read_cr4(void) { unsigned long cr4; diff --git a/include/asm-x86_64/thread_info.h b/include/asm-x86_64/thread_info.h index 9f034cf938d1..334982074c51 100644 --- a/include/asm-x86_64/thread_info.h +++ b/include/asm-x86_64/thread_info.h @@ -9,11 +9,8 @@ #ifdef __KERNEL__ -#ifndef __ASSEMBLY__ -#include -#include -#include -#endif +#include +#include /* * low level task data that entry.S needs immediate access to @@ -21,6 +18,10 @@ * - this struct shares the supervisor stack pages */ #ifndef __ASSEMBLY__ +struct task_struct; +struct exec_domain; +#include + struct thread_info { struct task_struct *task; /* main task structure */ struct 
exec_domain *exec_domain; /* execution domain */ @@ -31,7 +32,6 @@ struct thread_info { mm_segment_t addr_limit; struct restart_block restart_block; }; - #endif /* @@ -55,27 +55,17 @@ struct thread_info { #define init_thread_info (init_thread_union.thread_info) #define init_stack (init_thread_union.stack) -/* how to get the thread information struct from C */ - -#define THREAD_SIZE (2*PAGE_SIZE) - static inline struct thread_info *current_thread_info(void) { struct thread_info *ti; - ti = (void *)read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE; - return ti; -} - -static inline struct thread_info *stack_thread_info(void) -{ - struct thread_info *ti; - __asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (~8191UL)); + asm("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK)); return ti; } /* thread information allocation */ -#define alloc_thread_info() ((struct thread_info *) __get_free_pages(GFP_KERNEL,1)) -#define free_thread_info(ti) free_pages((unsigned long) (ti), 1) +#define alloc_thread_info() \ + ((struct thread_info *) __get_free_pages(GFP_KERNEL,THREAD_ORDER)) +#define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER) #define get_thread_info(ti) get_task_struct((ti)->task) #define put_thread_info(ti) put_task_struct((ti)->task) @@ -84,7 +74,7 @@ static inline struct thread_info *stack_thread_info(void) /* how to get the thread information struct from ASM */ /* only works on the process stack. otherwise get it via the PDA. 
*/ #define GET_THREAD_INFO(reg) \ - movq $-8192, reg; \ + movq $CURRENT_MASK, reg; \ andq %rsp, reg #endif diff --git a/include/asm-x86_64/timex.h b/include/asm-x86_64/timex.h index 8180474c3758..7e0a1309bfce 100644 --- a/include/asm-x86_64/timex.h +++ b/include/asm-x86_64/timex.h @@ -1,7 +1,7 @@ /* - * linux/include/asm-x8664/timex.h + * linux/include/asm-x86_64/timex.h * - * x8664 architecture timex specifications + * x86-64 architecture timex specifications */ #ifndef _ASMx8664_TIMEX_H #define _ASMx8664_TIMEX_H @@ -16,20 +16,6 @@ (1000000/CLOCK_TICK_FACTOR) / (CLOCK_TICK_RATE/CLOCK_TICK_FACTOR)) \ << (SHIFT_SCALE-SHIFT_HZ)) / HZ) -/* - * Standard way to access the cycle counter on i586+ CPUs. - * Currently only used on SMP. - * - * If you really have a SMP machine with i486 chips or older, - * compile for that, and this will just always return zero. - * That's ok, it just means that the nicer scheduling heuristics - * won't work for you. - * - * We only use the low 32 bits, and we'd simply better make sure - * that we reschedule before that wraps. Scheduling at least every - * four billion cycles just basically sounds like a good idea, - * regardless of how fast the machine is. - */ typedef unsigned long long cycles_t; extern cycles_t cacheflush_time; diff --git a/include/asm-x86_64/topology.h b/include/asm-x86_64/topology.h index 2ea9ab1b9423..702b71fd2f64 100644 --- a/include/asm-x86_64/topology.h +++ b/include/asm-x86_64/topology.h @@ -1,6 +1,26 @@ #ifndef _ASM_X86_64_TOPOLOGY_H #define _ASM_X86_64_TOPOLOGY_H +#include + +#ifdef CONFIG_DISCONTIGMEM + +/* Map the K8 CPU local memory controllers to a simple 1:1 CPU:NODE topology */ + +extern int fake_node; +extern unsigned long cpu_online_map; + +#define cpu_to_node(cpu) (fake_node ? 0 : (cpu)) +#define memblk_to_node(memblk) (fake_node ? 0 : (memblk)) +#define parent_node(node) (node) +#define node_to_first_cpu(node) (fake_node ? 0 : (node)) +#define node_to_cpu_mask(node) (fake_node ? 
cpu_online_map : (1UL << (node))) +#define node_to_memblk(node) (node) + +#define NODE_BALANCE_RATE 30 /* CHECKME */ + +#endif + #include -#endif /* _ASM_X86_64_TOPOLOGY_H */ +#endif diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 2377301d8453..f1aa8fd2986c 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -500,8 +500,10 @@ __SYSCALL(__NR_set_tid_address, sys_set_tid_address) __SYSCALL(__NR_restart_syscall, sys_restart_syscall) #define __NR_semtimedop 220 __SYSCALL(__NR_semtimedop, sys_semtimedop) +#define __NR_fadvise64 221 +__SYSCALL(__NR_fadvise64, sys_fadvise64) -#define __NR_syscall_max __NR_semtimedop +#define __NR_syscall_max __NR_fadvise64 #ifndef __NO_STUBS /* user-visible error numbers are in the range -1 - -4095 */ diff --git a/include/asm-x86_64/vsyscall.h b/include/asm-x86_64/vsyscall.h index f6ad9ef886c8..4c6a539ca7dc 100644 --- a/include/asm-x86_64/vsyscall.h +++ b/include/asm-x86_64/vsyscall.h @@ -2,6 +2,7 @@ #define _ASM_X86_64_VSYSCALL_H_ #include +#include enum vsyscall_num { __NR_vgettimeofday, @@ -19,8 +20,10 @@ enum vsyscall_num { #define __section_wall_jiffies __attribute__ ((unused, __section__ (".wall_jiffies"), aligned(16))) #define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) +#define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) #define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) -#define __section_vxtime_sequence __attribute__ ((unused, __section__ (".vxtime_sequence"), aligned(16))) +#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(L1_CACHE_BYTES))) + struct hpet_data { long address; /* base address */ @@ -36,21 +39,21 @@ struct hpet_data { #define hpet_writel(d,a) writel(d, fix_to_virt(FIX_HPET_BASE) + a) /* vsyscall space 
(readonly) */ -extern long __vxtime_sequence[2]; extern struct hpet_data __hpet; extern struct timespec __xtime; extern volatile unsigned long __jiffies; extern unsigned long __wall_jiffies; extern struct timezone __sys_tz; +extern seqlock_t __xtime_lock; /* kernel space (writeable) */ -extern long vxtime_sequence[2]; extern struct hpet_data hpet; extern unsigned long wall_jiffies; extern struct timezone sys_tz; +extern int sysctl_vsyscall; +extern seqlock_t xtime_lock; -#define vxtime_lock() do { vxtime_sequence[0]++; wmb(); } while(0) -#define vxtime_unlock() do { wmb(); vxtime_sequence[1]++; } while (0) +#define ARCH_HAVE_XTIME_LOCK 1 #endif /* __KERNEL__ */
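The old vxtime_lock()/vxtime_unlock() counter pair is replaced above by a seqlock (xtime_lock) shared with the vsyscall page, which is what makes do_gettimeofday lockless for readers: a writer bumps the sequence to odd before updating and back to even after, and a reader retries whenever it saw an odd or changed sequence. A minimal single-threaded sketch of the read/retry protocol (toy types; a real seqlock additionally needs memory barriers and uses the kernel's seqlock_t API):

```c
#include <assert.h>

struct seq_time {
    unsigned seq;   /* even: stable; odd: write in progress */
    long sec;
    long usec;
};

static void write_time(struct seq_time *t, long sec, long usec)
{
    t->seq++;          /* odd: readers must retry */
    t->sec = sec;
    t->usec = usec;
    t->seq++;          /* even again: snapshot is consistent */
}

static void read_time(const struct seq_time *t, long *sec, long *usec)
{
    unsigned start;
    do {
        start = t->seq;
        *sec = t->sec;
        *usec = t->usec;
        /* retry if a write was in flight or completed meanwhile */
    } while ((start & 1) || start != t->seq);
}
```

The reader never blocks the writer, which is exactly the property a vsyscall needs: user space can only read the kernel's time variables, so a spinlock it could observe but never release is not an option.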