A typical Lustre filesystem consists of a metadata server (MDS) and multiple object storage servers (OSSes). Each OSS stores file data on multiple object storage targets (OSTs), which in turn contain many conventional hard disks. Files can be striped over multiple OSTs to increase performance. The MDS stores striping information and other attributes on metadata targets (MDTs), which use much faster solid-state drives (SSDs). These servers -- forming the Lustre filesystem -- are designed to handle a large number of I/O requests (read, write, stat, open, close, etc.), serving batch jobs and interactive processes alike. But their capabilities are not unlimited: an excessive number of requests caused by one or more inefficient use cases can degrade the performance and overall health of the entire filesystem for all users.
For the data in /lustre/scratch we use a Lustre feature called Data-on-MDT (DoM): if a file is under a certain size, its data is not stored on OSTs but on the (much faster) MDTs. However, if computations create an excessive number of very small output files (hundreds of thousands), the MDTs can actually fill up with their data. The size ratio between the OSTs and the MDTs is designed for typical HPC use cases. If your use case deviates significantly from that (terabytes of tiny files), you will cause the MDTs to run out of space.
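To see how full the individual targets currently are, you can query them from any node where the filesystem is mounted. This is only a small sketch; it assumes the lfs client tools are available and that /lustre/scratch is the mount point described above:

# List all MDTs and OSTs of the filesystem with their capacity and current usage
lfs df -h /lustre/scratch

# Restrict the output to the metadata targets to see whether the MDTs are filling up
lfs df -h /lustre/scratch | grep MDT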
Find out if your data causes problems:
Use the command lfs quota -h -v -u <UserID_hpc> /lustre/ to see your usage of the Lustre filesystems!
The command generates a table of MDTs and OSTs:
- The first lines show your total usage of the Lustre filesystems, scratch and mlnvme combined.
- The rows lsf-MDT0000_UUID to lsf-MDT0003_UUID show your usage of the MDTs. Each MDT has a capacity of 6 TB and has to be shared with all other cluster users. So it is OK if you have thousands of files using up around a dozen GB, but if you have millions of files using hundreds of GB or more, you need to change your usage patterns.
- The rows lfs-OST0000_UUID to lfs-OST0027_UUID show your usage of the hard-disk-based OSTs (i.e. your data in the scratch filesystem).
- Error messages of the form quotactl ost.. failed. are normal and you can ignore them.
- The rows lfs-OST0064_UUID to lfs-OST006d_UUID show your usage of the OSTs making up the mlnvme filesystem. Each has a capacity of 27 TB.
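As a quick check, you can filter the quota output for the MDT rows. This is just a sketch using the command above; $USER is assumed to be your HPC login:

# Full per-target quota report for your user
lfs quota -h -v -u "$USER" /lustre/

# Only the MDT rows: how many files and how much (DoM) data you occupy on the metadata targets
lfs quota -h -v -u "$USER" /lustre/ | grep -i mdt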
Best practice:
If your data is filling up the mlnvme-OSTs, you need to put it elsewhere. Keep in mind that simply moving your data to a workspace in scratch using the mv command will not actually free up space on the mlnvme-OSTs. Just as described in the next section about striping, you need to first copy the data (cp), then remove it at the source (rm, or better yet munlink) and only then release the workspace.

The high performance that Lustre filesystems achieve comes mostly from the ability to stripe data over multiple targets. But the striping also dictates whether data is stored only on OSTs or also on MDTs. The default striping on /lustre/scratch uses the MDTs for all data and is thus highly relevant for the problem of clogged-up MDTs.
Commands:
- lfs getstripe <directory|file> shows the current striping.
- lfs setstripe [STRIPE_OPTIONS] <directory|file> sets the striping.
- A file that is moved (mv command) keeps its striping.
- A file has to be copied (cp command, to a location with different striping) to change its striping.
- Example of a progressive file layout: lfs setstripe -E 1G -c 1 -S 1M -E 10G -c 8 -S 1M -E -1 -c -1 -S 1M <directory|file>

Best practices:
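As an illustration only (the workspace path and the stripe settings below are hypothetical examples, not site recommendations), checking and adjusting striping could look like this:

# Show the current layout of a workspace directory; new files inherit it
lfs getstripe /lustre/scratch/ws/my_workspace

# Create a subdirectory whose new files are striped over 8 OSTs with a 1 MiB stripe size,
# e.g. for large files that are written in parallel
mkdir /lustre/scratch/ws/my_workspace/large_files
lfs setstripe -c 8 -S 1M /lustre/scratch/ws/my_workspace/large_files

# Verify the layout of the new directory
lfs getstripe /lustre/scratch/ws/my_workspace/large_files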
Using the standard rm command to delete a large amount of small files is not recommended on Lustre filesystems -- especially if it is performed recursively over many subfolders (rm -r). The problem is that rm requests the metadata (filetype, owner, permission, modification time, etc.) for each file before it removes it. This leads to unnecessary load on the MDS, resulting in lower performance for the user running the command (it takes much longer) and potentially also for other users.
Best practice:
- Use the find command to recursively discover files and then use the munlink command to remove them:
  find -P <dirname> -type f -print0 -o -type l -print0 | xargs -0 munlink
- Afterwards, use the -delete option of the find command to recursively remove all empty folders:
  find -P <dirname> -type d -empty -delete
- See https://pawsey.atlassian.net/wiki/spaces/US/pages/51925900/Deleting+Large+Numbers+of+Files+on+Lustre+Filesystems for a more detailed explanation of both commands.
Tip: You should also do this before you release a workspace with ws_release, as releasing a workspace by itself will not delete any data immediately: a released workspace is kept for 21 days for emergency recovery.
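Putting the pieces together, a cleanup before releasing a workspace might look like the following sketch (the workspace path is a hypothetical example, and the exact ws_release arguments depend on the site's workspace tools):

WS=/lustre/scratch/ws/my_workspace   # hypothetical workspace path

# Remove regular files and symlinks via munlink to avoid the extra stat requests of rm
find -P "$WS" -type f -print0 -o -type l -print0 | xargs -0 munlink

# Remove the remaining empty directories
find -P "$WS" -type d -empty -delete

# Finally release the workspace (arguments as documented for your site)
ws_release my_workspace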
Using the ls -l command shows the permissions, ownership, size and last modification time of files and directories. The MDS has this information readily available. But there is one exception: The size of files is only available on the OSTs. So using the -l option creates comparatively costly requests to the OSTs and can thus take a long time if it is used on a large number of files.
Best practice:
- Use ls without the -l option if you only want to see which files exist.
- Use ls -l <filename> if you need the additional information for a specific file.
- See https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html for additional explanations and more best practices.
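To make the difference concrete, here is a small sketch (directory and file names are hypothetical):

# Cheap: only the directory entries are requested from the MDS
ls /lustre/scratch/ws/my_workspace

# Expensive for many files: the sizes have to be gathered from the OSTs
ls -l /lustre/scratch/ws/my_workspace

# Fine: only a single file's size is requested
ls -l /lustre/scratch/ws/my_workspace/results.dat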
The University of Helsinki also uses Lustre for its HPC systems and has created a detailed user guide containing many useful commands and best practices: https://wiki.helsinki.fi/xwiki/bin/view/it4sci/IT for Science group/HPC Environment User Guide/Lustre User Guide/