System monitoring and best practice: Difference between revisions

Revision as of 13:01, 17 March 2026

Users share the cyclone computing resources (CPU and memory) at the same time. With several nodes and limited resources, it is necessary to monitor system resources and distribute users and applications across the relevant servers. Here are a few recommendations to make sure that everybody can use the servers as efficiently as possible.

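Monitor CPU load and memory, including history (username: gfireader, pw: gfimonitor): [https://monitor.gfi.uib.no/d/rYdddlPWj/node-exporter-full?orgId=1&from=now-3h&to=now&timezone=browser&var-DS_PROMETHEUS=default&var-job=node_exporters&var-node=cyclone1.gfi.uib.no:9100 Cyclones]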

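Monitor GPU resources, including history (username: gfireader, pw: gfimonitor): [https://monitor.gfi.uib.no/d/ads6vth/nvidia-dcgm-exporter-dashboard-cyclone3?orgId=1&from=now-5m&to=now&timezone=browser&var-instance=cyclone3.gfi.uib.no:9400 Cyclone3 GPU]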

Monitor storage quotas and backup status: [https://grafse.app.uib.no/public-dashboards/22f87936e6e649f0806c36379562ebae UiB Grafana web site]


Other useful monitoring commands are:

   • htop and top: Monitor CPU and memory usage for the entire system. The CPU usage shown is a percentage of one CPU core, i.e., a usage of 800% corresponds to 8 CPU cores being fully used.
   • top -u <your_username>: Monitor CPU and memory usage for your programs only.
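The same figures can be read from a script. The following is a minimal sketch, assuming a Unix-like system and only the Python standard library; the function names `cores_in_use` and `load_summary` are hypothetical helpers for illustration, not part of any cyclone tooling.

```python
import os


def cores_in_use(cpu_percent):
    """Convert a top/htop %CPU value into the equivalent number of fully used cores."""
    return cpu_percent / 100.0


def load_summary():
    """Return the number of logical cores and the 1-minute load average."""
    n_cores = os.cpu_count() or 1        # logical cores visible to Python
    load_1min, _, _ = os.getloadavg()    # available on Unix-like systems only
    return n_cores, load_1min


if __name__ == "__main__":
    n_cores, load_1min = load_summary()
    print(f"{n_cores} cores, 1-minute load average {load_1min:.1f}")
    print(f"800% in top corresponds to {cores_in_use(800):.0f} fully used cores")
```

A load average close to or above the core count suggests the node is already busy, in which case another node may be a better choice.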


Limit your CPU use

Often, one program running on cyclone will use one CPU core, at a maximum of 100% CPU. However, some software, such as Matlab, some Python modules, and some model simulations, tries by default to occupy the entire machine it runs on (64 CPU cores for cyclone1 and cyclone2). This drastically slows down the jobs of other users, especially if several such programs run at the same time.

CPU usage on the cyclones is currently limited by the operating system to eight (8) CPU cores. Trying to use more than that will not only slow down execution for everybody else, but also for yourself!

Therefore, please limit the number of CPUs your programs use.

- For Matlab: LASTN = maxNumCompThreads(N), with N set to a maximum of 8 (preferably less), since each user is allowed 8 physical cores.

https://se.mathworks.com/help/matlab/ref/maxnumcompthreads.html

- For Python (especially when using Pandas) and OpenMP-parallelised programs (written in Fortran, C, or any other language):

1) In the shell, before starting the program:
 export OMP_NUM_THREADS=N

2) Alternatively, within Python, before importing any numerical libraries:
 import os
 os.environ["OMP_NUM_THREADS"] = "N"

Set N to a maximum of 8 (preferably less). Inside Python, the value must be given as a string, e.g. "8".
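As a minimal sketch of the Python variant above, the helper below caps a requested thread count at 8 and sets it before any numerical library is imported. The function name `limit_threads` is hypothetical, and the extra variables `OPENBLAS_NUM_THREADS` and `MKL_NUM_THREADS` are an assumption that is not in the recommendations above: they are read by the OpenBLAS and Intel MKL back ends that NumPy builds commonly use, and setting them is harmless if those back ends are absent.

```python
import os

MAX_THREADS = 8  # core limit taken from the text above

# OMP_NUM_THREADS is the variable named in the recommendations; the other two
# are assumptions covering common NumPy back ends (OpenBLAS, Intel MKL).
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_threads(requested):
    """Clamp the requested thread count to 1..MAX_THREADS and export it."""
    n = max(1, min(int(requested), MAX_THREADS))
    for name in THREAD_VARS:
        os.environ[name] = str(n)  # environment values must be strings
    return n


# Call this at the very top of the script, before importing numpy, pandas, etc.
limit_threads(4)
```

The call has to come first because these libraries typically read the variables once, when they are loaded; changing them afterwards usually has no effect on a running session.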

If these limits are too restrictive for your application, you may want to look at other options for running heavily parallelized jobs, such as Fram of the Norwegian e-infrastructure.


Limit your memory use

This is currently less critical, as cyclone has a large amount of memory. However, a job that uses a lot of memory can be a sign that something is going wrong in your script. Therefore, keep an eye on your memory use and clear or delete all unused variables to deallocate the memory. Also try to close Matlab or Python when the job is finished, in the evening, or before leaving for the weekend.

There is a hard limit imposed by the operating system: no user session can allocate more than 120 GB of memory. If you attempt to exceed this limit, the offending process will be killed by the operating system.
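To see how close a Python job gets to this limit, one option is to log its peak memory use. The sketch below assumes a Unix-like system and uses the standard `resource` module; the helper names `peak_memory_gb` and `fraction_of_limit` are hypothetical. Note that it reports the peak resident memory of the current process only, whereas the limit applies to the whole user session, so it gives a lower bound rather than an exact figure.

```python
import resource
import sys

LIMIT_GB = 120  # per-session memory limit taken from the text above


def peak_memory_gb():
    """Peak resident memory of the current process, in GB (10^9 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1e9


def fraction_of_limit(limit_gb=LIMIT_GB):
    """Peak memory of this process as a fraction of the session limit."""
    return peak_memory_gb() / limit_gb


if __name__ == "__main__":
    print(f"Peak memory: {peak_memory_gb():.2f} GB "
          f"({100 * fraction_of_limit():.1f}% of {LIMIT_GB} GB)")
```

Calling `peak_memory_gb()` at the end of a long script is a cheap way to notice unexpectedly large memory use before it becomes a problem.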

Thank you for following these recommendations.