.. SPDX-License-Identifier: GPL-2.0

===============
Physical Memory
===============

Linux is available for a wide range of architectures so there is a need for an
architecture-independent abstraction to represent the physical memory. This
chapter describes the structures used to manage physical memory in a running
system.

The first principal concept prevalent in memory management is
`Non-Uniform Memory Access (NUMA)
<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
With multi-core and multi-socket machines, memory may be arranged into banks
that incur a different cost to access depending on the “distance” from the
processor. For example, there might be a bank of memory assigned to each CPU or
a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by the ``NODE_DATA(nid)`` macro, where
``nid`` is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.
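
The difference between the NUMA and UMA lookups can be pictured with a small
user-space model. This is only a sketch, not kernel code: ``pg_data_t`` is
reduced to two fields, ``MAX_NUMNODES`` is set to an arbitrary 4, the
``node_data[]`` array follows the pattern many NUMA architectures use, and
``init_node()`` is a hypothetical helper invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct pglist_data; only two fields kept. */
typedef struct pglist_data {
	int node_id;
	unsigned long node_start_pfn;
} pg_data_t;

#define CONFIG_NUMA 1		/* set to 0 to model a UMA build */
#define MAX_NUMNODES 4		/* arbitrary value for this sketch */

#if CONFIG_NUMA
/* NUMA model: an array of pointers filled in by early boot code. */
static pg_data_t *node_data[MAX_NUMNODES];
#define NODE_DATA(nid) (node_data[(nid)])
#else
/* UMA model: one static structure, the node ID is ignored. */
static pg_data_t contig_page_data;
#define NODE_DATA(nid) (&contig_page_data)
#endif

/* Hypothetical helper: make @storage the structure for node @nid. */
static pg_data_t *init_node(int nid, pg_data_t *storage,
			    unsigned long start_pfn)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
#if CONFIG_NUMA
	node_data[nid] = storage;
#else
	(void)storage;	/* UMA: the single static structure is always used */
#endif
	NODE_DATA(nid)->node_id = nid;
	NODE_DATA(nid)->node_start_pfn = start_pfn;
	return NODE_DATA(nid);
}
```

With ``CONFIG_NUMA`` set to 1, ``NODE_DATA(1)`` returns whatever structure was
registered for node 1. With it set to 0, every ``nid`` resolves to the same
static structure.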

The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a ``struct zone``. Each zone has one of the types described
below.

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either or even both of these zone types can
  be disabled at build time using the ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.

* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  the time. DMA operations can be performed on pages in this zone if the DMA
  devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  always enabled.

* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  permanent mapping in the kernel page tables. The memory in this zone is only
  accessible to the kernel using temporary mappings. This zone is available
  only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may also be
  populated on boot using one of the ``kernelcore``, ``movablecore`` and
  ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory-hotplug.rst for additional details.

* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPUs.
  It has different characteristics than RAM zone types and it exists to provide
  :ref:`struct page <Pages>` and memory map services for device driver
  identified physical address ranges. ``ZONE_DEVICE`` is enabled with the
  ``CONFIG_ZONE_DEVICE`` configuration option.

It is important to note that many kernel operations can only take place using
``ZONE_NORMAL``, so it is the most performance critical zone. Zones are
discussed further in Section :ref:`Zones <zones>`.
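
The zone types above correspond to ``enum zone_type`` in
``include/linux/mmzone.h``, where an entry exists only if the matching
configuration option is enabled. The sketch below reproduces that layout in
user space. The ``CONFIG_*`` defines at the top are picked for the example (a
64-bit build without ``CONFIG_HIGHMEM``), and ``zone_name()`` is a hypothetical
helper, not a kernel function.

```c
#include <assert.h>
#include <string.h>

/* Example configuration, chosen for this sketch: 64-bit, no HIGHMEM. */
#define CONFIG_ZONE_DMA 1
#define CONFIG_ZONE_DMA32 1
#define CONFIG_ZONE_DEVICE 1

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	ZONE_DMA32,
#endif
	ZONE_NORMAL,		/* always present */
#ifdef CONFIG_HIGHMEM
	ZONE_HIGHMEM,
#endif
	ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
	ZONE_DEVICE,
#endif
	__MAX_NR_ZONES
};

/* Hypothetical helper for the sketch: printable name of a zone type. */
static const char *zone_name(enum zone_type type)
{
	switch (type) {
#ifdef CONFIG_ZONE_DMA
	case ZONE_DMA:		return "DMA";
#endif
#ifdef CONFIG_ZONE_DMA32
	case ZONE_DMA32:	return "DMA32";
#endif
	case ZONE_NORMAL:	return "Normal";
#ifdef CONFIG_HIGHMEM
	case ZONE_HIGHMEM:	return "HighMem";
#endif
	case ZONE_MOVABLE:	return "Movable";
#ifdef CONFIG_ZONE_DEVICE
	case ZONE_DEVICE:	return "Device";
#endif
	default:		return "?";
	}
}
```

Because disabled zones drop out of the enum, the numeric value of a zone such
as ``ZONE_NORMAL`` depends on the configuration. Code should therefore compare
against the symbolic names rather than fixed numbers.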

The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                    896M                        2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+
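
The layout in the diagram above can be written as a small classification
function. This is an illustrative user-space sketch: the 16M and 896M
boundaries and the 2G RAM size are read off the diagram, while ``zone_of()``
and the constant names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MiB		(1024ULL * 1024ULL)

/* Values taken from the diagram above. */
#define DMA_LIMIT	(16 * MiB)
#define NORMAL_LIMIT	(896 * MiB)
#define RAM_SIZE	(2048 * MiB)

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Return the zone that covers physical address @phys in this layout. */
static enum zone_type zone_of(uint64_t phys)
{
	assert(phys < RAM_SIZE);

	if (phys < DMA_LIMIT)
		return ZONE_DMA;
	if (phys < NORMAL_LIMIT)
		return ZONE_NORMAL;
	return ZONE_HIGHMEM;
}
```

For instance, ``zone_of(16 * MiB)`` is the first byte of ``ZONE_NORMAL`` and
``zone_of(896 * MiB)`` is the first byte of ``ZONE_HIGHMEM``.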

With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
booted with the ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes
of RAM equally split between two nodes, there will be ``ZONE_DMA32``,
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
``ZONE_MOVABLE`` on node 1::


  1G                                9G                         17G
  +--------------------------------+ +--------------------------+
  |              node 0            | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G       4G        4200M          9G          9320M          17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+
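
As a quick sanity check of the diagram above, the two ``ZONE_MOVABLE`` ranges
can be added up and compared with the ``movablecore=80%`` request. The sketch
only does arithmetic on the boundaries shown in the diagram, in MiB. It does
not model how the kernel distributes movable memory between nodes, and all
names in it are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A physical range in MiB, [start, end). */
struct range {
	unsigned long start, end;
};

/* ZONE_MOVABLE boundaries read off the diagram above. */
static const struct range movable[] = {
	{ 4200, 9 * 1024 },	/* node 0: 4200M .. 9G */
	{ 9320, 17 * 1024 },	/* node 1: 9320M .. 17G */
};

/* Percentage of @total_mib covered by the movable ranges, rounded. */
static unsigned long movable_percent(unsigned long total_mib)
{
	unsigned long sum = 0;
	size_t i;

	assert(total_mib > 0);
	for (i = 0; i < sizeof(movable) / sizeof(movable[0]); i++)
		sum += movable[i].end - movable[i].start;

	return (sum * 100 + total_mib / 2) / total_mib;
}
```

The two ranges add up to 5016M + 8088M = 13104M, which is 80% of the 16384M of
RAM after rounding.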


Memory banks may belong to interleaving nodes. In the example below an x86
machine has 16 Gbytes of RAM in 4 memory banks; even banks belong to node 0
and odd banks belong to node 1::


  0              4G              8G             12G            16G
  +-------------+ +-------------+ +-------------+ +-------------+
  |    node 0   | |    node 1   | |    node 0   | |    node 1   |
  +-------------+ +-------------+ +-------------+ +-------------+

  0   16M      4G
  +-----+-------+ +-------------+ +-------------+ +-------------+
  | DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
  +-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
4 to 16 Gbytes.
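
The spans quoted above follow from taking the lowest start and the highest end
of the banks that belong to each node. The following user-space sketch derives
them from the bank layout in the diagram. ``struct bank``, ``struct extent``
and ``node_extent()`` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One memory bank: [start_gb, end_gb) owned by node @nid. */
struct bank {
	unsigned long start_gb, end_gb;
	int nid;
};

/* Bank layout from the diagram above. */
static const struct bank banks[] = {
	{ 0, 4, 0 }, { 4, 8, 1 }, { 8, 12, 0 }, { 12, 16, 1 },
};
#define NR_BANKS (sizeof(banks) / sizeof(banks[0]))

struct extent {
	unsigned long start_gb, end_gb;	/* span, including foreign banks */
	unsigned long present_gb;	/* memory that really belongs to the node */
};

/* Compute the span and the present memory of node @nid. */
static struct extent node_extent(int nid)
{
	struct extent e = { 0, 0, 0 };
	int found = 0;
	size_t i;

	assert(nid >= 0);
	for (i = 0; i < NR_BANKS; i++) {
		if (banks[i].nid != nid)
			continue;
		if (!found || banks[i].start_gb < e.start_gb)
			e.start_gb = banks[i].start_gb;
		if (!found || banks[i].end_gb > e.end_gb)
			e.end_gb = banks[i].end_gb;
		e.present_gb += banks[i].end_gb - banks[i].start_gb;
		found = 1;
	}
	return e;
}
```

Each node spans 12 Gbytes but only has 8 Gbytes present. The remaining
4 Gbytes inside the span belong to the other node.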

.. _nodes:

Nodes
=====

As we have mentioned, each node in memory is described by a ``pg_data_t`` which
is a typedef for a ``struct pglist_data``. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
architecture specific code parses the physical memory map reported by the
firmware. The bulk of the node initialization is done slightly later in the
boot process by the free_area_init() function, described later in Section
:ref:`Initialization <initialization>`.


Along with the node structures, the kernel maintains an array of ``nodemask_t``
bitmasks called ``node_states``. Each bitmask in this array represents a set of
nodes with particular properties as defined by ``enum node_states``:

``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
  this is aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high, movable).
``N_CPU``
  The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the
node ID in the ``node_states[<property>]`` bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]
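
The bit bookkeeping described above can be modelled in user space as follows.
This is a simplified sketch: the real ``nodemask_t`` is a multi-word bitmap,
whereas here it is assumed to fit in a single 64-bit word. The enum lists only
the states described in this section and models a ``CONFIG_HIGHMEM`` build
where ``N_HIGH_MEMORY`` is a separate entry. The helpers mirror the kernel's
``node_set_state()`` and ``node_state()`` in name only.

```c
#include <assert.h>
#include <stdbool.h>

enum node_states {
	N_POSSIBLE,
	N_ONLINE,
	N_NORMAL_MEMORY,
	N_HIGH_MEMORY,
	N_MEMORY,
	N_CPU,
	NR_NODE_STATES
};

#define MAX_NUMNODES 64		/* assumption: one 64-bit word is enough */

/* Simplified nodemask: a single word instead of a bitmap array. */
typedef struct {
	unsigned long long bits;
} nodemask_t;

static nodemask_t node_states[NR_NODE_STATES];

/* Mark node @nid as having property @state. */
static void node_set_state(int nid, enum node_states state)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	node_states[state].bits |= 1ULL << nid;
}

/* Test whether node @nid has property @state. */
static bool node_state(int nid, enum node_states state)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return (node_states[state].bits & (1ULL << nid)) != 0;
}
```

Setting all six states for node 2 leaves the value 4 (only bit 2 set) in every
mask, which corresponds to the example above.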

For various operations possible with nodemasks please refer to
``include/linux/nodemask.h``.

Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.

For instance, to call a function foo() for each online node::

        int nid;

        /* Visit every node whose bit is set in node_states[N_ONLINE]. */
        for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);

                foo(pgdat);
        }
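
To see how such a traversal macro can be built on top of a bitmask, consider
the following user-space model. It is a sketch under simplifying assumptions:
the online mask is a single ``unsigned long``, ``MAX_NUMNODES`` is set to 8,
and ``next_online_node()`` and ``sum_online_node_ids()`` are hypothetical
helpers. The kernel's own macros are implemented with its bitmap helpers in
``include/linux/nodemask.h``.

```c
#include <assert.h>

#define MAX_NUMNODES 8		/* arbitrary value for this sketch */

static unsigned long online_mask;	/* bit N set: node N is online */

/* First online node at or after @nid, or MAX_NUMNODES if there is none. */
static int next_online_node(int nid)
{
	assert(nid >= 0);
	for (; nid < MAX_NUMNODES; nid++)
		if (online_mask & (1UL << nid))
			return nid;
	return MAX_NUMNODES;
}

/* Model of the traversal macro: step from one set bit to the next. */
#define for_each_online_node(nid)			\
	for ((nid) = next_online_node(0);		\
	     (nid) < MAX_NUMNODES;			\
	     (nid) = next_online_node((nid) + 1))

/* Hypothetical user of the macro: add up the IDs of all online nodes. */
static int sum_online_node_ids(void)
{
	int nid, sum = 0;

	for_each_online_node(nid)
		sum += nid;
	return sum;
}
```

The loop body only ever sees node IDs whose bit is set, so offline nodes are
skipped without any explicit test in the caller.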

Node structure
--------------

The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe fields of this
structure:

General
~~~~~~~

``node_zones``
  The zones for this node.  Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's node_zonelists as well as
  other nodes' node_zonelists.

``node_zonelists``
  The list of all zones in all nodes. This list defines the order of zones
  that allocations are preferred from. The ``node_zonelists`` is set up by
  ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  core memory management structures.

``nr_zones``
  Number of populated zones in this node.

``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.

``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.

``node_start_pfn``
  The page frame number of the starting page frame in this node.

``node_present_pages``
  Total number of physical pages present in this node.

``node_spanned_pages``
  Total size of physical page range, including holes.

``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of the ``CONFIG_MEMORY_HOTPLUG`` or
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.

``node_id``
  The Node ID (NID) of the node, starts at 0.

``totalreserve_pages``
  This is a per-node reserve of pages that are not available to userspace
  allocations.

``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the first
  PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.

``deferred_split_queue``
  Per-node queue of huge pages whose split was deferred. Defined only when
  ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.

``__lruvec``
  Per-node lruvec holding LRU lists and related parameters. Used only when
  memory cgroups are disabled. It should not be accessed directly; use
  ``mem_cgroup_lruvec()`` to look up lruvecs instead.
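
To make the role of ``node_zonelists`` more concrete, the sketch below builds
a fallback order for a two-node machine laid out like the arm64 example
earlier in this chapter. It is a simplified model and not the code of
``build_zonelists()``: the local node is visited first and the remaining
nodes follow in ID order as a stand-in for ordering by NUMA distance. Within
each node the zones are listed from the highest type down to the lowest. All
names are chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

enum zone_type { ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

#define MAX_NUMNODES 2

/* One entry of the fallback list: a zone type on a given node. */
struct zoneref {
	int nid;
	enum zone_type zone;
};

/* populated[nid][zone] is nonzero if that zone has pages on that node. */
static const int populated[MAX_NUMNODES][MAX_NR_ZONES] = {
	{ 1, 1, 1 },	/* node 0: DMA32, NORMAL, MOVABLE */
	{ 0, 1, 1 },	/* node 1: NORMAL, MOVABLE */
};

/*
 * Fill @list for allocations that prefer node @local and return the number
 * of entries written. @list must have room for MAX_NUMNODES * MAX_NR_ZONES
 * entries.
 */
static size_t build_zonelist(int local, struct zoneref *list)
{
	size_t n = 0;
	int i, z;

	assert(local >= 0 && local < MAX_NUMNODES);
	for (i = 0; i < MAX_NUMNODES; i++) {
		/* Stand-in for distance order: local node first. */
		int nid = (local + i) % MAX_NUMNODES;

		for (z = MAX_NR_ZONES - 1; z >= 0; z--) {
			if (!populated[nid][z])
				continue;
			list[n].nid = nid;
			list[n].zone = (enum zone_type)z;
			n++;
		}
	}
	return n;
}
```

For node 1 this yields MOVABLE and NORMAL on node 1, followed by MOVABLE,
NORMAL and DMA32 on node 0, so the scarce low zones are tried last.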

Reclaim control
~~~~~~~~~~~~~~~

See also Documentation/mm/page_reclaim.rst.

``kswapd``
  Per-node instance of the kswapd kernel thread.

``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Wait queues used to synchronize memory reclaim tasks.

``nr_writeback_throttled``
  Number of tasks that are throttled waiting on dirty pages to clean.

``nr_reclaim_start``
  Number of pages written while reclaim is throttled waiting for writeback.

``kswapd_order``
  Controls the order kswapd tries to reclaim.

``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.

``kswapd_failures``
  Number of runs in which kswapd was unable to reclaim any pages.

``min_unmapped_pages``
  Minimal number of unmapped file backed pages that cannot be reclaimed.
  Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
  ``CONFIG_NUMA`` is enabled.

``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.

``flags``
  Flags controlling reclaim behavior.
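
The way a ratio sysctl turns into a per-node page count, as with
``min_unmapped_pages`` and ``min_slab_pages`` above, can be sketched as
follows. The sketch assumes that the ratio is a percentage applied to the
managed pages of every zone in the node and that the per-zone results are
summed. The function name and the page counts used with it are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum over the zones of a node of managed_pages * ratio / 100.
 * @zone_managed holds the managed page count of each zone.
 */
static unsigned long node_min_pages(const unsigned long *zone_managed,
				    size_t nr_zones,
				    unsigned int ratio_percent)
{
	unsigned long total = 0;
	size_t i;

	assert(ratio_percent <= 100);
	for (i = 0; i < nr_zones; i++)
		total += zone_managed[i] * ratio_percent / 100;
	return total;
}
```

For a node with two zones of 786432 and 1310720 managed pages, a ratio of 1
gives 7864 + 13107 = 20971 pages and a ratio of 5 gives 39321 + 65536 = 104857
pages.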

Compaction control
~~~~~~~~~~~~~~~~~~

``kcompactd_max_order``
  Page order that kcompactd should try to achieve.

``kcompactd_highest_zoneidx``
  The highest zone index to be compacted by kcompactd.

``kcompactd_wait``
  Wait queue used to synchronize memory compaction tasks.

``kcompactd``
  Per-node instance of the kcompactd kernel thread.

``proactive_compact_trigger``
  Determines if proactive compaction is enabled. Controlled by
  ``vm.compaction_proactiveness`` sysctl.

Statistics
~~~~~~~~~~

``per_cpu_nodestats``
  Per-CPU VM statistics for the node.

``vm_stat``
  VM statistics for the node.
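
The split between per-CPU and node-wide statistics can be illustrated with a
small model of a single counter. It is a sketch under assumptions: updates are
accumulated in a per-CPU delta and folded into the node-wide value once the
delta exceeds a threshold. The threshold of 32, the CPU count and all names
are chosen for the example. The real counters are arrays indexed by statistic
item and use atomic and per-CPU operations.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS		4	/* arbitrary value for this sketch */
#define STAT_THRESHOLD	32	/* arbitrary; the kernel computes its own */

struct node_stats {
	long vm_stat;			/* node-wide counter */
	long per_cpu_diff[NR_CPUS];	/* pending per-CPU deltas */
};

/* Apply @delta on behalf of @cpu, folding large deltas into vm_stat. */
static void mod_node_stat(struct node_stats *ns, int cpu, long delta)
{
	long x;

	assert(cpu >= 0 && cpu < NR_CPUS);
	x = ns->per_cpu_diff[cpu] + delta;
	if (labs(x) > STAT_THRESHOLD) {
		ns->vm_stat += x;
		x = 0;
	}
	ns->per_cpu_diff[cpu] = x;
}

/* Exact value: the node-wide counter plus all pending per-CPU deltas. */
static long node_stat_exact(const struct node_stats *ns)
{
	long sum = ns->vm_stat;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += ns->per_cpu_diff[cpu];
	return sum;
}
```

Reading only ``vm_stat`` is cheap but may lag behind by up to the threshold per
CPU. Summing the pending deltas gives the exact value at a higher cost.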

.. _zones:

Zones
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _pages:

Pages
=====

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _folios:

Folios
======

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.

.. _initialization:

Initialization
==============

.. admonition:: Stub

   This section is incomplete. Please list and describe the appropriate fields.