Use NVDIMM in QEMU
tl;dr: The --qemu-pmem option in run.py will add persistent memory to the VM. If you want to customize further, read on.
QEMU has support for NVDIMM, backed by a memory-backend-file or memory-backend-ram object. A simple way to create a vNVDIMM device at startup time is via the following command-line options:
-machine pc,nvdimm
-m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
-object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
-device nvdimm,id=nvdimm1,memdev=mem1
Where:
- the nvdimm machine option enables the vNVDIMM feature.
- slots=$N should be equal to or larger than the total number of normal RAM devices and vNVDIMM devices, e.g., $N should be >= 2 here.
- maxmem=$MAX_SIZE should be equal to or larger than the total size of normal RAM devices and vNVDIMM devices.
- object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE creates a backend storage of size $NVDIMM_SIZE.
- share=on/off controls the visibility of guest writes. If share=on, then writes from multiple guests using the same backend will be visible to each other.
- device nvdimm,id=nvdimm1,memdev=mem1 creates a read/write virtual NVDIMM device whose storage is provided by the above memory backend device.
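For example, a minimal sketch of a full invocation, assuming 4 GiB of guest RAM and a 16 GiB vNVDIMM backed by a host file (the paths, sizes, and disk image are illustrative and should be adapted to your setup):
qemu-system-x86_64 -machine pc,accel=kvm,nvdimm \
    -m 4G,slots=2,maxmem=20G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/mnt/pmem0/guest0.img,size=16G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    -drive file=guest-disk.img,format=qcow2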
Guest Data Persistence
Though QEMU supports multiple types of vNVDIMM backends on Linux, the only backends that can guarantee guest write persistence are:
- a DAX device (e.g., /dev/dax0.0), or
- a DAX file (on a filesystem mounted with the dax option).
When using a DAX file (a file that supports direct mapping of persistent memory) as a backend, write persistence is guaranteed if the host kernel has support for the MAP_SYNC flag in the mmap system call and, additionally, both the 'pmem' and 'share' flags are set to 'on' on the backend.
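For example, a backend definition with both flags enabled might look like the following (the path and size are illustrative; pmem=on requires a QEMU build with libpmem support):
-object memory-backend-file,id=mem1,share=on,pmem=on,mem-path=/mnt/pmem0/guest0.img,size=16G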
NVDIMM Persistence
Users can provide a persistence value to a guest via the optional nvdimm-persistence machine command-line option:
-machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu
There are currently two valid values for this option:
- mem-ctrl - The platform supports flushing dirty data from the memory controller to the NVDIMMs in the event of power loss.
- cpu - The platform supports flushing dirty data from the CPU cache to the NVDIMMs in the event of power loss.
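For example, to advertise memory-controller-level flush support instead of the CPU-cache-level support shown above:
-machine pc,accel=kvm,nvdimm,nvdimm-persistence=mem-ctrl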
Emulate PMEM using DRAM
Linux systems allow emulating DRAM as PMEM. The OS sees such an emulated device as a persistent memory region. These devices are usually faster than actual PMEM devices but do not provide any persistence, so they are used only for development purposes.
On Linux, to find the DRAM region that can be used as PMEM, use dmesg:
dmesg | grep BIOS-e820
A viable region will have the word "usable" at the end of the line.
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000053fffffff] usable
This means that the memory region between 4 GiB (0x0000000100000000) and 21 GiB (0x000000053fffffff) is usable. Say we want to reserve a 16 GiB region starting from 4 GiB; we need to add this information to the grub configuration file.
sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=16G!4G"
sudo update-grub2
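On distributions that do not provide update-grub2 (e.g., Fedora or RHEL), regenerating the GRUB configuration directly achieves the same; the output path below is the common default and may differ on your system:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg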
After rebooting with the new kernel parameter, dmesg | grep user should show a persistent memory region like the following:
[ 0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] persistent (type 12)
We will see this reserved memory range as /dev/pmem0. Now the emulated PMEM region is ready to use. Mount it with the dax option:
sudo mkdir /mnt/pmem0
sudo mkfs.ext4 /dev/pmem0
sudo mount -o dax /dev/pmem0 /mnt/pmem0
Use it as mem-path=/mnt/pmem0 as explained earlier.
Configure and Provision NVDIMMs
The NVDIMMs need to be configured and provisioned before applications can use them. The Intel ipmctl tool can be used to discover and provision Intel persistent memory modules (PMem) on a Linux machine.
To show all the NVDIMMs attached to the machine, run:
sudo ipmctl show -dimm
To show all the NVDIMMs attached to a specific socket, run:
sudo ipmctl show -dimm -socket SocketID
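For example, assuming the DIMMs of interest sit on socket 0 (the socket ID is illustrative):
sudo ipmctl show -dimm -socket 0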
Provisioning
NVDIMMs can be configured in volatile (MemoryMode) mode, non-volatile (AppDirect) mode, or a mix of the two, using the ipmctl tool on Linux.
We are only interested in using the NVDIMMs in AppDirect mode. Even in AppDirect mode, the NVDIMMs can be configured in two ways: AppDirect and AppDirectNotInterleaved. In AppDirect mode, the data is interleaved across multiple DIMMs; to use each NVDIMM individually, AppDirectNotInterleaved is used. To configure multiple DIMMs in AppDirect interleaved mode, run:
sudo ipmctl create -goal PersistentMemoryType=AppDirect
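Alternatively, to configure each DIMM as its own non-interleaved region, the same command accepts the AppDirectNotInterleaved value:
sudo ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved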
Reboot the machine for the changes made with the -goal command to take effect. The command creates a region on each socket of the machine. To list the regions, run:
ndctl list --regions
To show the DIMMs included in each region, run:
ndctl list --regions --dimms
Each region can be divided into one or more namespaces in order to expose storage devices to the operating system. To create the namespace(s), run:
sudo ndctl create-namespace --mode=[raw/sector/fsdax/devdax]
The namespace can be created in different modes like raw, sector, fsdax, and devdax. The default mode is fsdax.
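For example, to create an fsdax namespace on a specific region (the region name is illustrative; list the available regions first as shown above):
sudo ndctl create-namespace --mode=fsdax --region=region0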
Reboot the machine after creating the namespaces, and the devices will show up in /dev, named according to the mode. For example, if the mode is fsdax, the devices will be named /dev/pmemN (e.g., /dev/pmem0).
Mount these devices:
sudo mkdir /mnt/pmem0
sudo mkfs.ext4 /dev/pmem0
sudo mount -o dax /dev/pmem0 /mnt/pmem0
These mount points can be used directly by userspace applications or as the mem-path for a QEMU virtual machine, as explained earlier.