System monitoring and best practice
Active users share the cyclone computing resources (CPU and memory). With several nodes and limited resources, it is necessary to monitor system resources and to distribute users and applications across the relevant servers. The following recommendations help everybody use the servers more efficiently.
Read and follow the motd (Message Of The Day) shown at ssh login. It is also available with the command: "motd"
Monitor CPU load and memory, including history (username:gfireader, pw: gfimonitor): CyclonesMonitor
Monitor GPU resources, including history (username:gfireader, pw: gfimonitor): Cyclone3 GPU
Monitor storage quotas and backup status: UiB Grafana web site
Other useful monitoring commands are:
• htop, top ('top -e -m'): Monitor CPU and memory usage for the entire system. CPU usage is given as a percentage of one CPU core, i.e., a usage of 800% corresponds to 8 CPU cores being fully used.
• (h)top -u <your_username>: Monitor CPU and memory usage for your programs only.
• free -h -t: Show total, used, free and available memory.
Be aware: The "used" memory value in htop more closely represents the memory actively used by processes, excluding buffers and cache. The "used" memory value in top often includes buffers and cache by default, which can be misleading because this memory can be freed if needed.
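The difference between the two "used" figures can be illustrated with a minimal Python sketch that derives both from /proc/meminfo-style text. The function names and the sample numbers are made up for illustration, and the two formulas are a simplification: the exact calculation in top and htop depends on the installed version.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in KiB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def used_memory_kib(info):
    """Return (used including buffers/cache, used excluding them) in KiB."""
    including_cache = info["MemTotal"] - info["MemFree"]
    excluding_cache = including_cache - info["Buffers"] - info["Cached"]
    return including_cache, excluding_cache


# Made-up sample values, for illustration only.
SAMPLE = """\
MemTotal:       1000000 kB
MemFree:         200000 kB
Buffers:          50000 kB
Cached:          450000 kB
"""

if __name__ == "__main__":
    # On a Linux machine, the live values could be read from /proc/meminfo
    # instead of using the sample text.
    incl, excl = used_memory_kib(parse_meminfo(SAMPLE))
    print(f"used incl. buffers/cache: {incl} KiB")
    print(f"used excl. buffers/cache: {excl} KiB")
```

With the sample values, the first figure is much larger than the second even though most of the difference is cache that the kernel can release on demand.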
Limit your CPU use
CPU usage on the cyclones is currently limited by the operating system to eight (8) CPU cores per user session. Trying to use more than that will not only slow down execution for everybody else, but also for yourself!
Please limit the number of CPUs your programs use.
Often, one program running on cyclone will use one CPU at a maximum of 100% CPU. However, some software, such as Matlab, some Python modules, and some model simulations, tries by default to occupy the entire machine it runs on (64 CPU cores for cyclone1 & 2). This drastically slows down the jobs of other users, especially if several such programs run at the same time.
- For Matlab: LASTN = maxNumCompThreads(N), with N set to a maximum of 8 (preferably less), since each user is allowed up to 8 physical cores.
https://se.mathworks.com/help/matlab/ref/maxnumcompthreads.html
- For Python (especially when using Pandas) and OpenMP-parallelised programs (written in Fortran, C, or any other language):
1) In the shell, before starting the program:
export OMP_NUM_THREADS=N
2) Alternatively, within Python:
import os
os.environ["OMP_NUM_THREADS"] = "N"
Set N to a maximum of 8, preferably less.
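The Python variant can be wrapped in a small helper that also enforces the cap of 8. This is a minimal sketch, not a site-provided tool: the helper name limit_threads is hypothetical, and MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are additions that only matter if the numerical libraries are built against MKL or OpenBLAS. The variables generally need to be set before the numerical libraries are imported, since many of them read the values only once at import time.

```python
import os

CORE_LIMIT = 8  # per-user core limit stated in the text above


def limit_threads(requested, limit=CORE_LIMIT):
    """Cap the requested thread count and export it for threaded libraries."""
    n = max(1, min(int(requested), limit))
    # OMP_NUM_THREADS is the variable named in the text. The other two are
    # assumptions that apply only to MKL- or OpenBLAS-backed libraries.
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = str(n)
    return n


if __name__ == "__main__":
    threads = limit_threads(4)
    print(f"thread limit set to {threads}")
    # Import numpy, pandas, etc. only after this point.
```

Calling limit_threads(64) would still set the variables to 8, so a script copied from a larger machine does not accidentally request every core.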
If these limits are too restrictive for your application, consider other options for running heavily parallelized jobs, such as Fram, part of the Norwegian e-infrastructure.
Limit your memory use
Cyclone1 and Cyclone2 have a physical limit of 512 GB RAM. In addition, the operating system imposes a hard limit: no user session can allocate more than 120 GB of memory. If a process attempts to exceed this limit, it will be killed by the operating system.
Sometimes, a job that uses a lot of memory is a sign that something is going wrong in your script. Therefore, keep an eye on your memory use and clear/delete all unused variables (to deallocate the memory). Also try to close Matlab or Python when the job is finished, in the evening, or before leaving for the weekend.
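For Python jobs, one way to keep an eye on memory from inside the script is sketched below. It uses the standard-library resource module (Unix only) to read the peak resident memory of the current process. The function names are hypothetical, and treating the 120 GB session limit as 120 GiB is an assumption made for simplicity.

```python
import gc
import resource
import sys

MEMORY_LIMIT_GIB = 120  # assumption: the 120 GB session limit read as GiB


def peak_rss_gib():
    """Return the peak resident memory of this process in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024**3 if sys.platform == "darwin" else 1024**2
    return peak / divisor


def fraction_of_limit(used_gib, limit_gib=MEMORY_LIMIT_GIB):
    """Return memory use as a fraction of the per-session limit."""
    return used_gib / limit_gib


if __name__ == "__main__":
    data = [0.0] * 10_000_000  # stand-in for a large intermediate result
    peak = peak_rss_gib()
    print(f"peak memory: {peak:.2f} GiB "
          f"({fraction_of_limit(peak):.1%} of the session limit)")
    del data      # drop the reference once the result is no longer needed
    gc.collect()  # also collect any unreachable reference cycles
```

Note that this reports the peak for one process only, while the limit applies to the whole user session, so the value is a lower bound on what counts against the limit.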
Thank you for following these recommendations.
Useful CPU and memory monitoring aliases for your login files:
• alias pstop-cpu='ps axcu --sort -%cpu | head -n30 | numfmt --invalid=ignore --header=1 --field=5,6 --from-unit=1Ki --to=iec | column -t'
• alias pstop-mem='ps axcu --sort -pmem | head -n30 | numfmt --invalid=ignore --header=1 --field=5,6 --from-unit=1Ki --to=iec | column -t'
• alias mypstop-cpu='ps xcu --sort -%cpu | head -n10 | numfmt --invalid=ignore --header=1 --field=5,6 --from-unit=1Ki --to=iec | column -t'
• alias mypstop-mem='ps xcu --sort -pmem | head -n10 | numfmt --invalid=ignore --header=1 --field=5,6 --from-unit=1Ki --to=iec | column -t'
• alias mysmem='smem -t -r | numfmt --invalid=ignore --header=1 --field=4,5,6,7 --from-unit=1Ki --to=iec --format="%.1f"'
