ReFrame configuration file¶
In order for ReFrame to run tests on your system, it needs to know some properties about your system.
For example, it needs to know what kind of job scheduler you have, which partitions the system has,
how to submit to those partitions, etc.
All of this has to be described in a ReFrame configuration file (see also the section on $RFM_CONFIG_FILES).
This page is organized as follows:
- Available ReFrame configuration files
- Verifying your ReFrame configuration
- How to write a ReFrame configuration file
Available ReFrame configuration files¶
For inspiration, some ReFrame configuration files for HPC systems and public cloud are available in the config directory.
Below is a simple ReFrame configuration file with the minimal changes required to get you started using the test suite on a CPU partition. Please check that stagedir is set to a path on a (shared) scratch filesystem for storing (temporary) files related to the tests, and that access is set to a list of arguments that you would normally pass to the scheduler when submitting to this partition (for example '-p cpu' for submitting to a Slurm partition called cpu).
To write a ReFrame configuration file for your system, check the section How to write a ReFrame configuration file.
"""
simple ReFrame configuration file
"""
import os
from eessi.testsuite.common_config import common_logging_config, common_eessi_init, format_perfvars, perflog_format
from eessi.testsuite.constants import *
site_configuration = {
'systems': [
{
'name': 'cpu_partition',
'descr': 'CPU partition',
'modules_system': 'lmod',
'hostnames': ['*'],
# Note that the stagedir should be a shared directory available on all nodes running ReFrame tests
'stagedir': f'/some/shared/dir/{os.environ.get("USER")}/reframe_output/staging',
'partitions': [
{
'name': 'cpu_partition',
'descr': 'CPU partition',
'scheduler': 'slurm',
'launcher': 'mpirun',
'access': ['-p cpu', '--export=None'],
'prepare_cmds': ['source %s' % common_eessi_init()],
'environs': ['default'],
'max_jobs': 4,
'resources': [
{
'name': 'memory',
'options': ['--mem={size}'],
}
],
'features': [
FEATURES[CPU]
] + list(SCALES.keys()),
}
]
},
],
'environments': [
{
'name': 'default',
'cc': 'cc',
'cxx': '',
'ftn': '',
},
],
'logging': common_logging_config(),
'general': [
{
# Enable automatic detection of CPU architecture for each partition
# See https://reframe-hpc.readthedocs.io/en/stable/configure.html#auto-detecting-processor-information
'remote_detect': True,
}
],
}
# optional logging to syslog
site_configuration['logging'][0]['handlers_perflog'].append({
'type': 'syslog',
'address': '/dev/log',
'level': 'info',
'format': f'reframe: {perflog_format}',
'format_perfvars': format_perfvars,
'append': True,
})
Verifying your ReFrame configuration¶
To verify the ReFrame configuration, you can query the configuration using --show-config.
To see the full configuration, use:
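reframe --show-config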
To only show the configuration of a particular system partition, you can use the --system option. To query a specific setting, you can pass an argument to --show-config. For example, to show the configuration of the gpu partition of the example system:
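reframe --system example:gpu --show-config=systems/0/partitions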
You can drill down further to only show the value of a particular configuration setting. For example, to only show the launcher value for the gpu partition of the example system:
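reframe --system example:gpu --show-config=systems/0/partitions/@gpu/launcher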
How to write a ReFrame configuration file¶
The official ReFrame documentation provides the full description on configuring ReFrame for your site. However, some configuration settings are specifically required for the EESSI test suite. Also, there is a large number of configuration settings available in ReFrame, which can make the official documentation a bit overwhelming.
Here, we will describe how to create a configuration file that works with the EESSI test suite, starting from an example configuration file settings_example.py, which defines the most common configuration settings.
Python imports¶
The EESSI test suite standardizes a few string-based values as constants, as well as the logging format used by ReFrame. Every ReFrame configuration file used for running the EESSI test suite should therefore start with the following import statements:
from eessi.testsuite.common_config import common_logging_config, common_eessi_init
from eessi.testsuite.constants import *
High-level system info (systems)¶
First, we describe the system at its highest level through the systems keyword.
You can define multiple systems in a single configuration file (systems is a Python list value). We recommend defining just a single system in each configuration file, as it makes the configuration file a bit easier to digest (for humans).
An example of the systems section of the configuration file would be:
site_configuration = {
'systems': [
# We could list multiple systems. Here, we just define one
{
'name': 'example',
'descr': 'Example cluster',
'modules_system': 'lmod',
'hostnames': ['*'],
'stagedir': f'/some/shared/dir/{os.environ.get("USER")}/reframe_output/staging',
'partitions': [...],
}
]
}
The most common configuration items defined at this level are:
- name: The name of the system. Pick whatever makes sense for you.
- descr: Description of the system. Again, pick whatever you like.
- modules_system: The modules system used on your system. EESSI provides modules in lmod format. There is no need to change this, unless you want to run tests from the EESSI test suite with non-EESSI modules.
- hostnames: The names of the hosts on which you will run the ReFrame command, as a regular expression. Using these names, ReFrame can automatically determine which of the listed configurations in the systems list to use, which is useful if you're defining multiple systems in a single configuration file. If you follow our recommendation to limit yourself to one system per configuration file, simply define 'hostnames': ['*'].
- prefix: Prefix directory for a ReFrame run on this system. Any directories or files produced by ReFrame will use this prefix, if not specified otherwise. We recommend setting the $RFM_PREFIX environment variable rather than specifying prefix in your configuration file, so our common logging configuration can pick up on it (see also $RFM_PREFIX).
- stagedir: A shared directory that is available on all nodes that will execute ReFrame tests. This is used for storing (temporary) files related to the test. Typically, you want to set this to a path on a (shared) scratch filesystem. Defining this is optional: the default is a 'stage' directory inside the prefix directory.
- partitions: Details on system partitions, see below.
System partitions (systems.partitions)¶
The next step is to add the system partitions to the configuration file; these are also specified as a Python list, since a system can have multiple partitions.
The partitions section of the configuration for a system with two Slurm partitions (one CPU partition, and one GPU partition) could for example look something like this:
site_configuration = {
'systems': [
{
...
'partitions': [
{
'name': 'cpu_partition',
'descr': 'CPU partition',
'scheduler': 'slurm',
'prepare_cmds': ['source %s' % common_eessi_init()],
'launcher': 'mpirun',
'access': ['-p cpu'],
'environs': ['default'],
'max_jobs': 4,
'features': [
FEATURES[CPU]
] + list(SCALES.keys()),
},
{
'name': 'gpu_partition',
'descr': 'GPU partition',
'scheduler': 'slurm',
'prepare_cmds': ['source %s' % common_eessi_init()],
'launcher': 'mpirun',
'access': ['-p gpu'],
'environs': ['default'],
'max_jobs': 4,
'resources': [
{
'name': '_rfm_gpu',
'options': ['--gpus-per-node={num_gpus_per_node}'],
}
],
'devices': [
{
'type': DEVICE_TYPES[GPU],
'num_devices': 4,
}
],
'features': [
FEATURES[CPU],
FEATURES[GPU],
],
'extras': {
GPU_VENDOR: GPU_VENDORS[NVIDIA],
},
},
]
}
]
}
The most common configuration items defined at this level are:
- name: The name of the partition. Pick anything you like.
- descr: Description of the partition. Again, pick whatever you like.
- scheduler: The scheduler used to submit to this partition, for example slurm. All valid options can be found in the ReFrame documentation.
- launcher: The parallel launcher used on this partition, for example mpirun or srun. All valid options can be found in the ReFrame documentation.
- access: A list of arguments that you would normally pass to the scheduler when submitting to this partition (for example '-p cpu' for submitting to a Slurm partition called cpu). If supported by your scheduler, we recommend not exporting the submission environment (for example by using '--export=None' with Slurm). This avoids test failures due to environment variables set in the submission environment that are passed down to submitted jobs.
- prepare_cmds: Commands to execute at the start of every job that runs a test. If your batch scheduler does not export the environment of the submit host, this is typically where you can initialize the EESSI environment.
- environs: The names of the programming environments (to be defined later in the configuration file via environments) that may be used on this partition. A programming environment is required for tests that are compiled first, before they can run. The EESSI test suite however only tests existing software installations, so no compilation (or specific programming environment) is needed. Simply specify 'environs': ['default'], since ReFrame requires that a default environment is defined.
- max_jobs: The maximum number of jobs ReFrame is allowed to submit in parallel. Some batch systems limit how many jobs users are allowed to have in the queue. You can use this to make sure ReFrame doesn't exceed that limit.
- resources: This field defines how additional resources can be requested in a batch job. Specifically, on a GPU partition, you have to define a resource with the name '_rfm_gpu'. The options field should then contain the argument to be passed to the batch scheduler in order to request a certain number of GPUs per node, which could be different for different batch schedulers. For example, when using Slurm you would specify '--gpus-per-node={num_gpus_per_node}', as shown in the sketch after this list.
- processor: We recommend to NOT define this field, unless CPU autodetection is not working for you. The EESSI test suite relies on information about your processor topology to run. Using CPU autodetection is the easiest way to ensure that all processor-related information needed by the EESSI test suite is defined. Only if CPU autodetection is failing for you do we advise you to set the processor in the partition configuration as an alternative. Although additional fields might be used by future EESSI tests, at this point you'll have to specify at least the fields shown in the sketch after this list.
- features: The features field is used by the EESSI test suite to run tests on a partition only if it supports a certain feature (for example if GPUs are available). Feature names are standardized in the EESSI test suite in the eessi.testsuite.constants.FEATURES dictionary. Typically, you want to define features: [FEATURES[CPU]] + list(SCALES.keys()) for CPU-based partitions, and features: [FEATURES[GPU]] + list(SCALES.keys()) for GPU-based partitions. The first tells the EESSI test suite that this partition can only run CPU-based tests, whereas the second indicates that this partition can only run GPU-based tests. You can define a single partition to have both the CPU and GPU features (since features is a Python list). However, since the CPU-based tests will not ask your batch scheduler for GPU resources, this may fail on batch systems that force you to ask for at least one GPU on GPU-based nodes. Also, running CPU-only code on a GPU node is typically considered bad practice, so testing its functionality is typically not relevant. The list(SCALES.keys()) adds all the scales that may be used by EESSI tests to the features list. These scales are defined in eessi.testsuite.constants.SCALES and define at which scales tests should be run, e.g. single core, half a node, a full node, two nodes, etc. This can be used to exclude running at certain scales on systems that would not support it. E.g. some systems might not support requesting multiple partial nodes, which is what the 1cpn_2nodes (1 core per node, on two nodes) and 1cpn_4nodes scales do. One could exclude these by setting e.g. features: [FEATURES[CPU]] + [s for s in SCALES if s not in ['1cpn_2nodes', '1cpn_4nodes']]. With this configuration setting, ReFrame will run all the scales listed in eessi.testsuite.constants.SCALES except those two. In a similar way, one could exclude all multinode tests if one just has a single node available.
- devices: This field specifies information on devices (for example GPUs) present in the partition. Device types are standardized in the EESSI test suite in the eessi.testsuite.constants.DEVICE_TYPES dictionary. This is used by the EESSI test suite to determine how many of these devices it can/should use per node. Typically, there is no need to define devices for CPU partitions. For GPU partitions, you want to define something like the devices entry in the sketch after this list.
- extras: This field specifies extra information on the partition, such as the GPU vendor. Valid fields for extras are standardized as constants in eessi.testsuite.constants (for example GPU_VENDOR). This is used by the EESSI test suite to decide if a partition can run a test that specifically requires a certain brand of GPU. Typically, there is no need to define extras for CPU partitions. For GPU partitions, you typically want to specify the GPU vendor, for example as in the extras entry in the sketch after this list.
Note that as more tests are added to the EESSI test suite, the use of features, devices and extras by the EESSI test suite may be extended, which may require an update of your configuration file to define newly recognized fields.
Note
Keep in mind that ReFrame partitions are virtual entities: they may or may not correspond to a partition as it is configured in your batch system. One might for example have a single partition in the batch system, but configure it as two separate partitions in the ReFrame configuration file based on additional constraints that are passed to the scheduler, see for example the AWS CitC example configuration.
The EESSI test suite (and more generally, ReFrame) assumes the hardware within a partition defined in the ReFrame configuration file is homogeneous.
Environments¶
ReFrame needs a programming environment to be defined in its configuration file for tests that need to be compiled before they are run. While we don't have such tests in the EESSI test suite, ReFrame requires some programming environment to be defined:
site_configuration = {
...
'environments': [
{
'name': 'default', # Note: needs to match whatever we set for 'environs' in the partition
'cc': 'cc',
'cxx': '',
'ftn': '',
}
]
}
Note
The name here needs to match whatever we specified for the environs property of the partitions.
Logging¶
ReFrame allows a large degree of control over what gets logged, and where. For convenience, we have created a common logging configuration in eessi.testsuite.common_config that provides a reasonable default. It can be used by importing common_logging_config and calling it as a function to define the 'logging' setting:
from eessi.testsuite.common_config import common_logging_config
site_configuration = {
...
'logging': common_logging_config(),
}
By setting the $RFM_PREFIX environment variable, the output, performance log, and regular ReFrame logs will all end up in the directory specified by $RFM_PREFIX, which we recommend doing.
Alternatively, a prefix can be passed as an argument like common_logging_config(prefix), which will control where the regular ReFrame log ends up. Note that the performance logs do not respect this prefix: they will still end up in the standard ReFrame prefix (by default the current directory, unless otherwise set with $RFM_PREFIX or --prefix).
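For example, a minimal sketch (the /scratch path here is hypothetical; pick a writable location on your system):
import os
from eessi.testsuite.common_config import common_logging_config
site_configuration = {
    ...
    # Regular ReFrame logs end up under this prefix; performance logs do not
    'logging': common_logging_config(f'/scratch/{os.environ.get("USER")}/reframe_logs'),
}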
Auto-detection of processor information¶
You can let ReFrame auto-detect the processor information for your system.
Creation of topology file by ReFrame¶
ReFrame will automatically use auto-detection when two conditions are met:
- The partitions section of your configuration file does not specify processor information for a particular partition (as per our recommendation in the previous section);
- The remote_detect option is enabled in the general part of the configuration, as follows:
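site_configuration = {
    ...
    'general': [
        {
            # Enable automatic detection of CPU architecture for each partition
            'remote_detect': True,
        }
    ],
}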
To trigger the auto-detection of processor information, it is sufficient to let ReFrame list the available tests:
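reframe --list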
ReFrame will store the processor information for your system in ~/.reframe/topology/<system>-<partition>/processor.json.
Create topology file¶
You can also use the ReFrame option --detect-host-topology to create the topology file yourself.
Run the following command on the cluster for which you need the topology.
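reframe --detect-host-topology=topo.json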
The output is written to the given file if one is specified, and printed to standard output otherwise. It will look something like this:
{
"arch": "skylake_avx512",
"topology": {
"numa_nodes": [
"0x111111111",
"0x222222222",
"0x444444444",
"0x888888888"
],
"sockets": [
"0x555555555",
"0xaaaaaaaaa"
],
"cores": [
"0x000000001",
"0x000000002",
"0x000000004",
"0x000000008",
"0x000000010",
"0x000000020",
"0x000000040",
"0x000000080",
"0x000000100",
"0x000000200",
"0x000000400",
"0x000000800",
"0x000001000",
"0x000002000",
"0x000004000",
"0x000008000",
"0x000010000",
"0x000020000",
"0x000040000",
"0x000080000",
"0x000100000",
"0x000200000",
"0x000400000",
"0x000800000",
"0x001000000",
"0x002000000",
"0x004000000",
"0x008000000",
"0x010000000",
"0x020000000",
"0x040000000",
"0x080000000",
"0x100000000",
"0x200000000",
"0x400000000",
"0x800000000"
],
"caches": [
{
"type": "L2",
"size": 1048576,
"linesize": 64,
"associativity": 16,
"num_cpus": 1,
"cpusets": [
"0x000000001",
"0x000000002",
"0x000000004",
"0x000000008",
"0x000000010",
"0x000000020",
"0x000000040",
"0x000000080",
"0x000000100",
"0x000000200",
"0x000000400",
"0x000000800",
"0x000001000",
"0x000002000",
"0x000004000",
"0x000008000",
"0x000010000",
"0x000020000",
"0x000040000",
"0x000080000",
"0x000100000",
"0x000200000",
"0x000400000",
"0x000800000",
"0x001000000",
"0x002000000",
"0x004000000",
"0x008000000",
"0x010000000",
"0x020000000",
"0x040000000",
"0x080000000",
"0x100000000",
"0x200000000",
"0x400000000",
"0x800000000"
]
},
{
"type": "L1",
"size": 32768,
"linesize": 64,
"associativity": 8,
"num_cpus": 1,
"cpusets": [
"0x000000001",
"0x000000002",
"0x000000004",
"0x000000008",
"0x000000010",
"0x000000020",
"0x000000040",
"0x000000080",
"0x000000100",
"0x000000200",
"0x000000400",
"0x000000800",
"0x000001000",
"0x000002000",
"0x000004000",
"0x000008000",
"0x000010000",
"0x000020000",
"0x000040000",
"0x000080000",
"0x000100000",
"0x000200000",
"0x000400000",
"0x000800000",
"0x001000000",
"0x002000000",
"0x004000000",
"0x008000000",
"0x010000000",
"0x020000000",
"0x040000000",
"0x080000000",
"0x100000000",
"0x200000000",
"0x400000000",
"0x800000000"
]
},
{
"type": "L3",
"size": 25952256,
"linesize": 64,
"associativity": 11,
"num_cpus": 18,
"cpusets": [
"0x555555555",
"0xaaaaaaaaa"
]
}
]
},
"num_cpus": 36,
"num_cpus_per_core": 1,
"num_cpus_per_socket": 18,
"num_sockets": 2
}
Note
ReFrame 4.5.1 will generate more parameters than it can parse. To resolve this issue, you can remove the following parameters from the generated file: vendor, model and/or platform.
For ReFrame to find the topology file, it needs to be at the following path: ~/.reframe/topology/<system_name>-<partition_name>/processor.json