Memory Hotplug
==============
-Last Updated: Jul 28 2007
+Created: Jul 28 2007
+Add description of notifier of memory hotplug Oct 11 2007
This document is about memory hotplug including how-to-use and current status.
Because Memory Hotplug is still under development, contents of this text will
6.1 Memory offline and ZONE_MOVABLE
6.2. How to offline memory
7. Physical memory remove
-8. Future Work List
+8. Memory hotplug event notifier
+9. Future Work List
Note(1): x86_64's has special implementation for memory hotplug.
This text does not describe it.
(see Section 4.).
Logical Memory Hotplug phase is to change memory state into
-avaiable/unavailable for users. Amount of memory from user's view is
+available/unavailable for users. Amount of memory from user's view is
changed by this phase. The kernel makes all memory in it as free pages
when a memory range is available.
In this document, this phase is described as online/offline.
-Logical Memory Hotplug phase is triggred by write of sysfs file by system
+Logical Memory Hotplug phase is triggered by write of sysfs file by system
administrator. For the hot-add case, it must be executed after Physical Hotplug
phase by hand.
(However, if you writes udev's hotplug scripts for memory hotplug, these
This option can be kernel module too.
--------------------------------
-3 sysfs files for memory hotplug
+4 sysfs files for memory hotplug
--------------------------------
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs. Each section is part of
+a memory block under /sys/devices/system/memory as
/sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block. The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section id's
+for the memory block covered by the sysfs directory. It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
For example, assume 1GiB section size. A device for a memory starting at
0x100000000 is /sys/device/system/memory/memory4
(0x100000000 / 1Gib = 4)
This device covers address range [0x100000000 ... 0x140000000)
-Under each section, you can see 3 files.
+Under each section, you can see 4 or 5 files, the end_phys_index file being
+a recent addition and not present on older kernels.
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
/sys/devices/system/memory/memoryXXX/phys_device
/sys/devices/system/memory/memoryXXX/state
-
-'phys_index' : read-only and contains section id, same as XXX.
-'state' : read-write
- at read: contains online/offline state of memory.
- at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
- This is not well implemented now.
+/sys/devices/system/memory/memoryXXX/removable
+
+'phys_index' : read-only and contains section id of the first section
+ in the memory block, same as XXX.
+'end_phys_index' : read-only and contains section id of the last section
+ in the memory block.
+'state' : read-write
+ at read: contains online/offline state of memory.
+ at write: user can specify "online", "offline" command
+ which will be performed on al sections in the block.
+'phys_device' : read-only: designed to show the name of physical memory
+ device. This is not well implemented now.
+'removable' : read-only: contains an integer value indicating
+ whether the memory block is removable or not
+ removable. A value of 1 indicates that the memory
+ block is removable and a value of 0 indicates that
+ it is not removable. A memory block is removable only if
+ every section in the block is removable.
NOTE:
These directories/files appear after physical memory hotplug phase.
+If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
+via symbolic links located in the /sys/devices/system/node/node* directories.
+
+For example:
+/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+A backlink will also be created:
+/sys/devices/system/memory/memory9/node0 -> ../../node/node0
--------------------------------
4. Physical memory hot-add phase
- Notification completion of remove works by OS to firmware.
- Guard from remove if not yet.
+--------------------------------
+8. Memory hotplug event notifier
+--------------------------------
+Memory hotplug has event notifer. There are 6 types of notification.
+
+MEMORY_GOING_ONLINE
+ Generated before new memory becomes available in order to be able to
+ prepare subsystems to handle memory. The page allocator is still unable
+ to allocate from the new memory.
+
+MEMORY_CANCEL_ONLINE
+ Generated if MEMORY_GOING_ONLINE fails.
+
+MEMORY_ONLINE
+ Generated when memory has successfully brought online. The callback may
+ allocate pages from the new memory.
+
+MEMORY_GOING_OFFLINE
+ Generated to begin the process of offlining memory. Allocations are no
+ longer possible from the memory but some of the memory to be offlined
+ is still in use. The callback can be used to free memory known to a
+ subsystem from the indicated memory section.
+
+MEMORY_CANCEL_OFFLINE
+ Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
+ the section that we attempted to offline.
+
+MEMORY_OFFLINE
+ Generated after offlining memory is complete.
+
+A callback routine can be registered by
+ hotplug_memory_notifier(callback_func, priority)
+
+The second argument of callback function (action) is event types of above.
+The third argument is passed by pointer of struct memory_notify.
+
+struct memory_notify {
+ unsigned long start_pfn;
+ unsigned long nr_pages;
+ int status_change_nid;
+}
+
+start_pfn is start_pfn of online/offline memory.
+nr_pages is # of pages of online/offline memory.
+status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+set/clear. It means a new(memoryless) node gets new memory by online and a
+node loses all memory. If this is -1, then nodemask status is not changed.
+If status_changed_nid >= 0, callback should create/discard structures for the
+node if necessary.
+
--------------
-8. Future Work
+9. Future Work
--------------
- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
sysctl or new control file.
- showing memory section and physical device relationship.
- - showing memory section and node relationship (maybe good for NUMA)
- showing memory section is under ZONE_MOVABLE or not
- test and make it better memory offlining.
- support HugeTLB page migration and offlining.