Copyright
Copyright © Enea Software AB 2014.
This User Documentation consists of confidential information and is protected by Trade Secret Law. This notice of copyright does not indicate any actual or intended publication of this information.
Except to the extent expressly stipulated in any software license agreement covering this User Documentation and/or corresponding software, no part of this User Documentation may be reproduced, transmitted, stored in a retrieval system, or translated, in any form or by any means, without the prior written permission of Enea Software AB. However, permission to print copies for personal use is hereby granted.
Trademarks
Enea®, Enea OSE®, and Polyhedra® are the registered trademarks of Enea AB and its subsidiaries. Enea OSE®ck, Enea OSE® Epsilon, Enea® Element, Enea® Optima, Enea® Linux, Enea® LINX, Enea® LWRT, Enea® Accelerator, Polyhedra® Flash DBMS, Polyhedra® Lite, Enea® dSPEED, Accelerating Network Convergence™, Device Software Optimized™, and Embedded for Leaders™ are unregistered trademarks of Enea AB or its subsidiaries. Any other company, product or service names mentioned in this document are the registered or unregistered trademarks of their respective owner.
As hardware has become ever more capable, there has been a trend to implement real-time applications on Linux. Linux was designed from the beginning for server and desktop use, not for real-time applications. This means that achieving real-time properties on Linux is not trivial. This document aims to guide anyone attempting to implement a real-time application using Linux.
This document is divided into the following sections:
This section is intended as a textbook-style explanation of areas of interest for designers of real-time systems.
This section lists a number of things that can be done to improve the real-time behaviour of Enea Linux. Some are of a general nature and easy to apply, while others suit fewer situations and/or require greater effort.
This section gives tips and hints about how to design your application for real-time requirements.
This section gives some hints about how to handle hardware aspects that impact the real-time properties.
Readers who already have a good understanding of how Linux works may want to go directly to Chapter 3, Improving the Real-Time Properties.
When a task is waiting for an event, the task is said to be blocked.
Kernel boot parameters, i.e. parameters given to the kernel when booting. See kernel source file https://www.kernel.org/doc/Documentation/kernel-parameters.txt.
Work initiated by an interrupt, but which can be done without interrupts being disabled. This is also referred to as deferred interrupt work, and it is typically scheduled using a soft IRQ, a tasklet, or a work queue. See also top half.
A hardware resource capable of executing a thread of code in parallel with other cores.
Either a core or a hyper-thread. In Linux this represents a hardware resource capable of executing a thread of code in parallel with other CPUs.
Tasks, interrupts, etc. can have a CPU affinity, which means that they will only run on the CPUs given as an affinity mask or affinity CPU list.
Kernel feature for adding and removing CPUs at runtime.
Making a CPU as independent as possible of work done on other CPUs.
CPU partitioning is about grouping CPUs together. In the scope of this document, the intention is to group a set of CPUs in order to perform CPU isolation on each CPU within this group.
Kernel feature used to perform CPU partitioning.
A code path which needs to be protected from concurrency. It could be a global state that takes several operations to update, where all those operations must appear to be one atomic update. Such sections are usually protected using, e.g., a mutex.
Used to denote idle dynamic ticks as well as full dynamic ticks.
Kernel feature to inhibit tick interrupts when running a single task on a CPU.
A CPU partition intended for general purpose applications, i.e. applications that do not have real-time requirements.
When a response to an event must be done within a well defined time limit, or else the application will fail miserably, this is referred to as hard real-time.
A technique that allows two executions to occur simultaneously on the same core as long as they do not need the same core resources. This means that these executions heavily affect each other, even though Linux presents them as two different CPUs.
Kernel feature to inhibit tick interrupts when CPU is idle.
Hardware functionality for indicating asynchronous events. Interrupts will preempt the current execution and start executing the top half of an interrupt handler.
Interrupt request - Each type of interrupt is assigned its own vector, which is called IRQ vector or simply IRQ.
Technique for making code execution occur in parallel with I/O latencies to make the execution less dependent on hardware latencies.
Since this document is about real-time, jitter here refers to variance in latencies. This includes jitter caused by work done by other threads of execution, application-induced jitter, and work done by the kernel.
This refers to the kernel configuration being done before compiling the kernel, e.g. using make menuconfig from the root directory of the kernel source.
Code executed in the kernel, either compiled into the kernel or as a kernel module.
Task executing in kernel space. All code executing in the kernel share the same memory.
Time from when an event occurs until a response has been produced. The most important part of latency in the scope of this document is the latency caused by anything other than the actual work needed to be done in response to the event.
Light-weight process, kernel side of a pthread. LWP is used to make it possible to put pthreads on different CPUs while still sharing memory with a single parent process.
Task with SCHED_OTHER, SCHED_IDLE, or SCHED_BATCH scheduling policy. SCHED_OTHER is by far the most common one.
Non-uniform memory access is a design of a multi-core system where the access time for a specific memory range can depend on which CPU is accessing it, excluding any effects that caches might have.
A tool for performing CPU partitioning. Available from https://github.com/OpenEneaLinux/rt-tools.
When an execution is involuntarily interrupted to begin another thread of execution, the execution is said to be preempted.
Set of patches to achieve a fully preemptible kernel. See https://rt.wiki.kernel.org/index.php/Main_Page for more information.
When a task is waiting for a lock owned by a lower-priority task and the lock has priority inheritance, the task owning the lock is raised to the same priority as the waiting task for as long as the lock is held. This technique avoids priority inversion problems, where a higher-priority task is forced to wait for a lower-priority task.
Task which does not share memory with its parent and is running in user space.
POSIX implementation of threads. In Linux, every pthread is associated with a kernel-side task called an LWP. Therefore each pthread has its own thread ID, which can be retrieved using the gettid() system call.
Read, copy, update. A lock mechanism that makes reads very cheap at the expense of more work when writing. See http://lwn.net/Articles/262464 for an article on how it works.
An application with real-time requirements. It is enough that one task within the application has those requirements for the application to be called a real-time application.
Properties of a system with predictable latency.
A task with scheduling policy SCHED_FIFO or SCHED_RR.
How to distribute resources as a function of time, e.g. I/O scheduling is how to distribute I/O accesses as a function of time and task scheduling is how to distribute CPU execution as a function of time.
A technique for implementing the bottom half handling of an interrupt. Also see tasklet.
Real-time partition, a CPU partition intended for real-time applications.
System management interrupt. This is an x86-specific feature used for CPU overheating protection and fixing microcode bugs. The interrupt is handled by the BIOS, which is outside Linux control.
System management mode. When a CPU issues an SMI, it enters system management mode. Code running in this mode is found in the BIOS.
Symmetric multiprocessor. This describes a multi-core system where all cores are handled by one operating system and treated as one resource.
When there are requirements on maximum latency, but where failing those requirements causes a graceful degradation, this is referred to as soft real-time.
Function calls that are implemented in the kernel are called system calls. When a system call is issued, the currently running task goes from user space to kernel space and runs the implementation in the kernel, and then back to user mode again.
The scheduling entity for a task scheduler. Can be a process, LWP (pthread) or kthread.
When a task stops running and another task gets to run on the same CPU. This can happen when a task is preempted or when it yields.
A bottom half implementation that is more generic than soft IRQ. It is implemented as a soft IRQ vector.
When registering an interrupt handler it is possible to register an associated interrupt thread. In this case the interrupt handler is expected to be very short, and the main part of the interrupt handling should be done in this interrupt thread.
Throughput is a measure of how much useful work can be achieved over a longer period of time, and is a way of measuring efficiency. It is often in conflict with latency.
A per-CPU periodic interrupt used to ensure that time slices are distributed fairly among the currently running tasks on the CPU.
Interrupt handling code that needs to be run with interrupts disabled. Also see threaded interrupts and bottom half.
Code executed in a process or pthread, as opposed to in kernel space.
Kernel feature for handling small work packages in kthread context. Can be used for bottom half work, or for work induced by a system call that can be deferred to a later time.
When a task voluntarily gives up execution this is referred to as yielding. This could either be because the task has requested to sleep for an amount of time, or because it is blocking on a lock owned by another task.
A system with real-time constraints aims to perform work with a guarantee on the time when the work will be finished. This time is often called a deadline, and the system is designed with the purpose of not missing any deadlines, or missing as few as possible.
A system where the consequences of missing deadlines are severe, for example with respect to danger for personnel or damage of equipment, is called a hard real-time system.
A system where deadlines occasionally can be missed is called a soft real-time system.
The work done by a real-time system is often initiated by an external event, such as an interrupt. The nature of the work often requires participation of one or more concurrently executing tasks. Each task participates by doing processing, combined with interaction with other tasks. The interaction typically leads to one or more task switches. A deadline for the work is often associated with the completion of the work, i.e. when the last participating task has finished its processing.
There are several challenges when implementing real-time systems. One challenge is to obtain as much determinism as possible. In this way, it becomes easier to make predictions, and calculations, of the actual execution time that will be required. From these predictions, the risk of the system missing deadlines can be evaluated.
When implementing a soft real-time system, and using Linux as an operating system, it is important to try to characterize possible sources of indeterminism. This knowledge can then be used to configure, and perhaps also modify Linux, so that its real-time properties become more deterministic, and hence that the risk of missing deadlines is minimized, although not guaranteed.
The remainder of this section gives a selection of areas in Linux where sources of indeterminism are present. The purpose is to give a brief technical overview, to serve as general advice for projects aiming at using Linux in real-time systems.
Each section in this chapter objectively describes:
The topic/mechanism
Default configuration
The real-time impact
It does not describe how to optimize for real-time behavior. That is done in Chapter 3, Improving the Real-Time Properties. The reason is to make it possible for a reader to skip this basic chapter.
A task switch occurs when the currently running task is replaced by another task. In Linux, a task switch can be the result of two types of events:
As a side effect of a kernel interaction, e.g. a system call, or when the kernel function schedule() is called. This is referred to as yield. The function schedule() can be used by kernel threads to explicitly suggest a yield.
As a result of an asynchronous event, e.g. an interrupt. This is referred to as preemption and occurs asynchronously from the preempted tasks point of view.
In the kernel documentation the term voluntary preemption is used instead of yield, and forced preemption for what is here called preemption. These terms were chosen since, strictly speaking, preemption means "to interrupt without the interrupted thread's cooperation", see http://en.wikipedia.org/wiki/Preemption_%28computing%29.
Note that the preemption model only determines when a task switch may occur. The algorithms used to determine whether a switch shall be done, and which task to swap in, belong to the scheduler and are not affected by the preemption model. See Section 2.2, Scheduling for more info.
Where a task can be preempted depends on whether it executes in user space or in kernel space. A task executes in user space if it is a thread in a user space application, except when it is in a system call. Otherwise it executes in kernel space, i.e. in system calls, kernel threads, etc.
In user space, tasks can always be preempted. In kernel space, you either allow preemption at specific places, or you disallow preemption at specific places, depending on the preemption model kernel configuration.
Simplified, the choice of preemption model is a balance between responsiveness (latency) and scheduler overhead. Lower latency requires more frequent opportunities for task switches which results in higher overhead and possibly more frequent task switches.
Linux offers several different models, specified at build time:
*) These models require the PREEMPT_RT patch plus kernel configuration. See the "Target Guide" section in the Enea® Linux User's Guide for build instructions.
The server and desktop configurations both rely entirely on yield (voluntary preemption). The difference is mainly that with the desktop option there are more system calls that may yield.
Low-latency desktop introduces kernel preemption. This means that the code is preemptible everywhere except in parts of the kernel where preemption has been explicitly disabled, as for example in spinlocks.
The preemption models RT and basic RT require the PREEMPT_RT patch, https://rt.wiki.kernel.org/index.php/Main_Page. They are not only additional preemption models as they also add a number of modifications that further improve the worst case latency. Read more about PREEMPT_RT in Section 2.4, The PREEMPT_RT Patch.
The basic RT model is mainly for debugging of RT; use RT instead. RT aims to minimize the parts of the kernel where preemption is explicitly disabled.
A scheduling policy is a set of rules that determines if a task shall be swapped out and which task to swap in. Linux supports a number of different scheduling policies:
The scheduling policy is a task property, i.e. different tasks can have different policies.
SCHED_FIFO and SCHED_RR are sometimes referred to as the real-time scheduling classes.
The standard way of scheduling tasks in Linux is known as fair scheduling. This means that Linux aims to give each task a fair share of the CPU time available in the system.
In a real-time system, work is deadline-constrained and the most important quality of task scheduling is to meet deadlines rather than fairness of CPU utilization. Fair scheduling may affect the time passed from when a task becomes ready for execution to when it actually starts to execute, since there may be other tasks that have not yet been allowed their share of the processor time. As an additional complication, the actual length of the delay depends on the number of tasks being active in the system and is therefore difficult to predict. In this way, the system becomes indeterministic.
There are other methods of scheduling in Linux that can be used. For instance, it is possible to use priority-based scheduling, similar in its policy to scheduling methods used in real-time operating systems. This scheduling method, referred to as SCHED_FIFO, increases the determinism when interacting tasks perform work together. It requires, however, that explicit priorities are assigned to the tasks.
SCHED_FIFO and SCHED_RR are the two real-time scheduling policies. Each task that is scheduled according to one of these policies has an associated static priority value that ranges from 1 (lowest priority) to 99 (highest priority).
The scheduler keeps a list of ready-to-run tasks for each priority level. Using these lists, the scheduling principles are quite simple:
SCHED_FIFO tasks are allowed to run until they have completed their work or voluntarily yield.
SCHED_RR tasks are allowed to run until they have completed their work, voluntarily yield, or have consumed a specified amount of CPU time.
When the currently running task is to be swapped out, it is put at the end of the list and the task to swap in is selected as described below:
From all non-empty lists, pick the one with the highest priority.
From that list, pick the task at the beginning of the list.
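From within an application, a real-time policy can be requested with the sched_setscheduler() system call. Below is a minimal sketch; the priority value 80 is an arbitrary example, and the call normally requires root privileges or CAP_SYS_NICE:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Priority 80 is an example value; the valid range is 1 (lowest) to 99 (highest). */
    struct sched_param sp = { .sched_priority = 80 };

    /* pid 0 means "the calling task". */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler"); /* typically EPERM without CAP_SYS_NICE */
        return 1;
    }

    /* ... real-time work runs here under SCHED_FIFO ... */
    return 0;
}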
As long as there are real-time tasks that are ready to run, they might consume all CPU power. There is a mechanism called RT throttling, that can help the system to avoid that problem; see Section 2.3, Real-Time Throttling.
SCHED_OTHER is the most widely used policy. These tasks do not have static priorities. Instead they have a "nice" value ranging from -20 (highest) to +19 (lowest). The scheduling policy is quite different from the real-time policies in that the scheduler aims at a "fair" distribution of the CPU. "Fair" means that each task shall get an average share of the execution time that relates to its nice value. See http://en.wikipedia.org/wiki/Completely_Fair_Scheduler and https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt for more info.
SCHED_BATCH is very similar to SCHED_OTHER. The difference is that SCHED_BATCH is optimized for throughput. The scheduler assumes that the process is CPU-intensive and treats it slightly differently. The tasks get the same CPU share as SCHED_OTHER tasks, but worse latency.
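A normal task can lower its own priority by raising its nice value with the setpriority() system call. A minimal sketch, where the nice value 10 is an example:

#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    /* PRIO_PROCESS with pid 0 means "the calling task"; 10 is an example nice value. */
    if (setpriority(PRIO_PROCESS, 0, 10) == -1) {
        perror("setpriority"); /* lowering the nice value again requires privileges */
        return 1;
    }

    /* ... background work runs here with a reduced CPU share ... */
    return 0;
}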
As long as there are real-time tasks, i.e. tasks scheduled as SCHED_FIFO or SCHED_RR, that are ready to run, they would consume all CPU power if the scheduling principles were followed. Sometimes that is the desired behaviour, but it also allows bugs in real-time threads to completely block the system.
To prevent this from happening, there is a real-time throttling mechanism, which makes it possible to limit the amount of CPU power that the real-time threads can consume.
The mechanism is controlled by two parameters: rt_period and rt_runtime. The semantics are that the total execution time for all real-time threads may not exceed rt_runtime during each rt_period. As a special case, rt_runtime can be set to -1 to disable the real-time throttling.
More specifically, the throttling mechanism allows the real-time tasks to consume rt_runtime times the number of CPUs for every rt_period of elapsed time. A consequence is that a real-time task can utilize 100% of a single CPU as long as the total utilization does not exceed the limit.
The default settings are: rt_period=1000000 µs, rt_runtime=950000 µs, which gives a limit of 95% CPU utilization.
The parameters are linked to two files in the /proc file system:
/proc/sys/kernel/sched_rt_period_us
/proc/sys/kernel/sched_rt_runtime_us
Changing a value is done by writing the new number to the corresponding file. E.g.:
$ echo 900000 > /proc/sys/kernel/sched_rt_runtime_us
$ echo 1200000 > /proc/sys/kernel/sched_rt_period_us
See also https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt and https://lwn.net/Articles/296419.
PREEMPT_RT is a set of changes to the Linux kernel source code which, when applied, make the Linux kernel more responsive to external interrupts and more deterministic in the time for performing a task involving cooperating processes.
PREEMPT_RT aims to minimize the amount of kernel code that is non-preemptible (http://lwn.net/Articles/146861). This is accomplished by adding and modifying functionality in the Linux kernel.
The main functional changes done by PREEMPT_RT are:
Converting spin locks to sleeping locks. This allows preemption while holding a lock.
Running interrupt handlers as threads. This allows preemption while servicing an interrupt.
Adding priority inheritance to different kinds of sleeping locks, and to semaphores. This avoids scenarios where a lower-priority process hinders the progress of a higher-priority process because the lower-priority process holds a lock. See http://info.quadros.com/blog/bid/103505/RTOS-Explained-Understanding-Priority-Inversion. A user-space sketch of priority inheritance is shown after this list.
Lazy preemption. This increases throughput for applications with tasks that use ordinary SCHED_OTHER scheduling.
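As a user-space illustration of the priority inheritance concept, a pthread mutex can be given the PTHREAD_PRIO_INHERIT protocol. This is a minimal sketch, available independently of PREEMPT_RT:

#include <pthread.h>

pthread_mutex_t lock;

void init_pi_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* While a task holds the lock, it inherits the priority of the
       highest-priority task blocked on the lock. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

With this protocol, a high-priority task blocking on the lock temporarily boosts the current owner, which avoids unbounded priority inversion.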
PREEMPT_RT is managed as a set of patches for the Linux kernel. It is available for a selection of kernel versions. They can be downloaded from https://www.kernel.org/pub/linux/kernel/projects/rt. General information about the patches can be found from the PREEMPT_RT wiki https://rt.wiki.kernel.org/index.php/Main_Page.
It is also possible to obtain a Linux kernel with the patches already applied. A central source is https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git. Another source is Linux from a hardware vendor, providing a Linux kernel with the patches applied. Enea Linux can be configured with or without the patches, depending on the specific customer use case.
Compared with the total number of lines in the Linux kernel, the PREEMPT_RT patches affect a small percentage. However, they change central parts of the kernel, which can be seen e.g. by noting that the latest patch set for the Linux 3.12 kernel contains 32 patches that affect the code in the file kernel/sched/core.c. The actual size of the PREEMPT_RT patches can be estimated from the number of files affected and from the total number of source code lines affected. As an example, the latest patch set for the Linux 3.12 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.12 contains 321 patches, affecting (by adding or removing lines of code) 14241 source code lines.
Functionality has been moved, over time, from the PREEMPT_RT patches into the mainline kernel. In this way, the PREEMPT_RT patch set has become smaller. As an example, the latest patch set for the older Linux 3.0 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.0 contains 385 patches, adding or removing 16928 lines of source code. Compared with the corresponding numbers for the Linux 3.12 kernel (321 patches, 14241 lines), this shows that the PREEMPT_RT patch set has decreased in size.
The PREEMPT_RT functionality is activated, after the PREEMPT_RT patches have been applied and the kernel has been built, by activating a kernel configuration menu alternative. The menu alternative is named Fully Preemptible Kernel (RT).
The performance of a Linux kernel with the PREEMPT_RT patches applied can then be evaluated. A common evaluation methodology involves measuring the interrupt latency. The interrupt latency is often measured from the time of an interrupt until the time when a task, activated as a result of the interrupt, begins executing in user space. A commonly used tool for this purpose is cyclictest.
Additional information on results from evaluating the PREEMPT_RT performance is given e.g. in http://www.mpi-sws.org/~bbb/papers/pdf/ospert13.pdf and http://sigbed.seas.upenn.edu/archives/2014-02/ewili13_submission_1.pdf.
When deciding on the use of PREEMPT_RT, its costs and benefits should be evaluated. The use of PREEMPT_RT implies a significant change to the kernel source code. This change involves code that may be less tested and therefore less proven in use than the remaining parts of the mainline kernel.
Another aspect is maintenance. The development of the PREEMPT_RT patch set follows the development of the kernel. When the kernel version is changed, the corresponding PREEMPT_RT patch set, which may be available only after a certain time period, must then be applied to the kernel and the associated tests must be performed.
On the other hand, the use of a PREEMPT_RT-enabled kernel can lead to a system with a decreased worst-case interrupt latency and more deterministic scheduling, which may be necessary to fulfil the real-time requirements for a specific product. On a uni-core system, many of the other methods, e.g. full dynamic ticks and CPU isolation, cannot be used, which can make PREEMPT_RT an attractive alternative.
For additional information about the technical aspects of PREEMPT_RT see e.g. http://elinux.org/images/b/ba/Elc2013_Rostedt.pdf.
For additional information about the PREEMPT_RT development status, see e.g. http://lwn.net/Articles/572740.
Linux performs allocation of tasks to CPUs. In the default setting, this is done automatically. In a similar spirit as the default fair task scheduling, which aims at dividing the CPU time on an individual CPU so that each task gets its fair share of processing time, the scheduler is free to move tasks around, with the goal of evenly distributing the processing load among the available CPUs.
The moving of a task is referred to as task migration. The migration is done by a part of the scheduler called the load balancer. It is invoked, on a regular basis, as a part of the processing done during the scheduler tick. The decision to move a task is based on a variety of input data such as CPU load, task behaviour etc.
For an application which requires determinism, load balancing is problematic. The time to activate a certain task, as a result of an interrupt being serviced, may depend not only on the scheduling method used, but also on where this task is currently executing.
It is possible to statically assign tasks to CPUs. One reason for doing this is to increase determinism, for example by making the response time to external events more predictable. Assigning a task to a CPU, or set of CPUs, is referred to as setting the affinity of the task.
A mechanism both for setting the affinity of tasks and for turning off automatic load balancing is cpuset. Cpusets make use of the Linux subsystem cgroup. See https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt and https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
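Affinity can also be set per task with the sched_setaffinity() system call. A minimal sketch that pins the calling task to CPU 2 (the CPU number is an example):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set); /* CPU 2 is an example; use a CPU dedicated to real-time work */

    /* pid 0 means "the calling task". */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From here on, the scheduler will only run this task on CPU 2. */
    return 0;
}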
To get real-time performance on single-CPU systems, it is necessary to adapt the entire system, e.g. using the PREEMPT_RT patch or an RTOS. This is not always necessary in a multicore system. Recently added features in the Linux kernel make it possible to aggressively migrate sources of kernel-introduced jitter away from selected CPUs. See Section 3.3.1, CPU Isolation for more information. Doing this provides bare-metal-like performance on the CPUs where sources of jitter have been removed. Note that you can also use a multicore system with CPU isolation to achieve higher throughput, although that is not the focus of this document.
One way to get real-time performance in Linux is by creating a bare-metal-like environment in Linux user space. On a default setup, this is not possible since the Linux kernel needs to do some regular housekeeping. It is possible to move much of this housekeeping to some dedicated CPUs, provided we have a multicore system. That leaves the other CPUs relatively untouched by the Linux kernel, unless user space triggers some kernel activity. The application that executes in this bare-metal environment should avoid using libc calls and Linux system calls. See Chapter 4, Designing Real-Time Applications for more information. Since the kernel is not real-time safe, executing kernel code can have serious impact on real-time performance.
The biggest sources of kernel jitter are the scheduler and external interrupts. The scheduler's main duty is to switch between tasks, which can of course cause a lot of jitter. This is only a problem if the performance-critical task runs with a completely fair scheduler (CFS) policy and therefore gets preempted because of time slicing. A solution to this problem can be to move all non-critical tasks to other CPUs and/or run the critical task with a real-time policy and an appropriate priority.
The load balancer will try to effectively utilize all the CPUs. That might be good for throughput, but it could damage real-time performance. The obvious problem is that general purpose tasks could be moved to the real-time CPUs and real-time tasks could be moved to general purpose CPUs. The other problem is the actual work of migrating threads. This is easily solved by disabling load balancing on the CPUs that should be isolated.
The scheduler tick causes significant jitter and has negative real-time impact unless the PREEMPT_RT kernel is used. The tick can be removed with the CONFIG_NO_HZ_FULL kernel configuration. Read more about NO_HZ and the tick in Section 2.7, Dynamic Ticks (NO_HZ).
Interrupts can be a big source of jitter. Some interrupts, like inter-processor interrupts (IPI) and per-CPU timer interrupts, need to be bound to a certain CPU. Other interrupts may be handled by any CPU in a multicore system and should be moved away from the isolated CPUs. Many timer interrupts can be removed by changing kernel configurations. See Section 2.11, Interrupts for more info.
The purpose of the tick is to balance CPU execution time between several tasks running on the same CPU. The tick is also used as a timer source for timeouts. Ticks are interrupts generated by a hardware timer and occur at a regular interval determined by the CONFIG_HZ kernel configuration, which for most architectures can be configured when compiling the kernel. The tick interrupt is a per-CPU interrupt.
Starting from Linux 2.6.21, the idle dynamic ticks feature can be configured by using the CONFIG_NO_HZ kernel configuration option. The goal was to eliminate tick interrupts while in idle, to be able to go into deeper sleep modes. This is important for laptops but can also cut down power bills for server rooms.
Linux 3.10.0 introduced the full dynamic ticks feature to eliminate tick interrupts when running a single task on a CPU. The goal here was to better support high performance computing and real-time use cases by making sure that the thread would run undisturbed. The earlier configuration CONFIG_NO_HZ was renamed to CONFIG_NO_HZ_IDLE, and the new feature got the new configuration option CONFIG_NO_HZ_FULL.
The current implementation requires that ticks are kept on CPU 0 when using full dynamic ticks, which is not required for idle dynamic ticks. The only exception is when the whole system is idle, then the ticks can be turned off for CPU 0 as well.
Whether dynamic ticks turn tick interrupts off is a per-CPU decision.
The following timer tick options are available, extracted from the kernel configuration:
This option keeps the tick running periodically at a constant rate, even when the CPU doesn't need it.
This option enables a tickless idle system: timer interrupts will only trigger on an as-needed basis when the system is idle. This is usually interesting for energy saving. Most of the time you want to say Y here.
Adaptively try to shutdown the tick whenever possible, even when the CPU is running tasks. Typically this requires running a single task on the CPU. Chances for running tickless are maximized when the task mostly runs in user space and has few kernel activity. You need to fill up the nohz_full boot parameter with the desired range of dynticks CPUs. This is implemented at the expense of some overhead in user <-> kernel transitions: syscalls, exceptions and interrupts. Even when it's dynamically off. Say N.
In order to make use of the full dynamic ticks system configuration, you must ensure that only one task, including kernel threads, is running on a given CPU at any time. Furthermore, there must not be any pending RCU callbacks or timeouts attached to the tick.
The official documentation for dynamic ticks can be found in the Linux kernel source tree https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt. There is also an excellent article about it at LWN, http://lwn.net/Articles/549580.
This section describes several power-saving techniques available in Linux. These techniques often have an impact on the system's real-time properties.
A technique not described here is hibernation, e.g. suspend to RAM or disk. The reason is that it is difficult to combine with real-time properties and therefore outside the scope of this manual.
When there is little CPU bound work to be done, the CPU frequency can be reduced as a way to reduce power consumption. This is known as dynamic frequency scaling, see http://en.wikipedia.org/wiki/Dynamic_frequency_scaling.
The function is enabled at compile time by the configuration parameter CONFIG_CPU_FREQ. If enabled, the system will include functionality, called a governor, for controlling the frequency. There are several governors optimized for different types of systems. Which governors are available in the system is also chosen with compile-time configurations with names starting with CONFIG_CPU_FREQ_GOV_.
The possibility to use dynamic frequency scaling in a real-time system is strongly related to the time it takes to ramp up the frequency, and that time's relation to the latency requirements.
If the CPU is idle, i.e. there is no task ready to run, the CPU can be put in a sleep state. In a sleep state the CPU does not execute anything, while still being ready to respond to certain events, e.g. an external interrupt.
CPUs usually have a range of power modes. See http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Processor_states for an example. Deeper sleep means lower power consumption at the price of increased wake-up time. As with dynamic frequency scaling, the transition between the power states is controlled by a governor.
To configure the functionality for entering sleep states when idle, use the compile-time configuration parameter CONFIG_CPU_IDLE.
The Linux kernel's CPU hotplug facility allows CPUs to be added to or removed from a running kernel. CPU hotplug can for example be used to isolate failing CPUs, or be used in systems where the hardware itself is hot-pluggable. Linux's CPU hotplug implementation is based on notifiers, which are callbacks into the subsystems that need to be aware of CPUs coming and going.
While the power-saving benefits of hotplugging CPUs are well known, the most obvious disadvantage from a real-time computing perspective is latency jitter. In a perfect system, a CPU could go online or offline quickly and without alarming the rest of the system. Unfortunately, the handling of per-CPU kthreads was one of the reasons that hotplug could not be used in a real-time context: operations took a long time, close to seconds. Nowadays, the CPU hotplug implementation (http://sigbed.seas.upenn.edu/archives/2012-11/7.pdf) makes use of the CPU affinity of kthreads: instead of removing and re-creating them, they are migrated. In this way the hotplug latencies are around 5 ms, making Linux CPU hotplug more usable for both power saving and real-time applications.
To enable the usage of this feature, the kernel supports the CONFIG_CPU_HOTPLUG parameter. This opens the availability of runtime isolation through the file /sys/devices/system/cpu/«cpu id»/online.
I/O scheduling determines in which order block I/O reads and writes are done. The algorithms require collection of statistics concerning block device activity, which decreases determinism for any real-time application writing to or reading from a block device.
In such a scenario, you may want to select the Noop I/O elevator for the block device which your determinism-sensitive application is reading from or writing to. However, the effect is expected to be small, and this will have side effects for other applications accessing the same block device. It may even have negative side effects on your application, depending on the type of block device and the read/write behaviour of the application. Fortunately, the I/O scheduler can be switched at runtime and should be selected based on the user-specific I/O load scenario.
Table 2.1 Available I/O schedulers in kernel 3.14:
Name | Description |
---|---|
Noop | |
Anticipatory | |
Deadline | http://en.wikipedia.org/wiki/Deadline_scheduler https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt |
Completely Fair Queuing | http://en.wikipedia.org/wiki/CFQ https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt |
User-space programs access services provided by the kernel by using system calls. Usually, applications do not invoke system calls directly, but rather use library wrapper functions that in turn execute system calls. A system call always involves a transition from user-space to kernel-space and back. It also involves passing parameters from the application program into the kernel, and back.
A system call becomes a source of indeterminism for the calling task, since the duration of a system call depends on the work done while executing in the kernel. The work may for example involve allocation of resources, such as memory. This may take different amounts of time, depending on the availability of the resources. A system call may even result in the calling task being blocked as a result of a resource not being available. For example, a task reading data from a file may be blocked if there is no data immediately available. At some later time, when the requested resource becomes available, the task resumes its execution. The system call can then be completed and the task can return to user space.
There may also be other tasks wanting to execute during the execution of a system call. If it is possible to switch task during the system call, one or more of these other tasks can be allowed to execute. This can clearly be advantageous if these tasks have deadlines. This type of task switch is referred to as kernel preemption.
It is possible to configure the Linux kernel so that it allows more or fewer possibilities of kernel preemption. In general, if the level of preemptibility is increased, the Linux kernel becomes more complex and consumes a larger part of the CPU time. As a consequence, the application gets a smaller part of the CPU time and the throughput performance is decreased.
Hardware indicates events to the software using interrupts. When an interrupt occurs, the following happens:
The interrupt is mapped to a registered interrupt handler.
The registered interrupt handler is run. There can be several handlers for the same interrupt. Therefore, this step is repeated for all handlers registered for this interrupt as long as they return IRQ_NONE.
If a registered interrupt handler returns IRQ_WAKE_THREAD, the interrupt thread corresponding to that handler is set in ready state, i.e. it is now a schedulable task.
The interrupt is acknowledged.
All steps above are executed with all interrupts disabled, i.e. interrupt nesting is not supported. See http://lwn.net/Articles/380931/ for a discussion of why nested interrupts were removed from Linux. The patch found at https://lkml.org/lkml/2010/3/25/434 removes nested interrupts.
Interrupt handlers can either be registered using request_irq(), or using request_threaded_irq() which registers a threaded interrupt. In both cases the interrupt handler will determine whether the interrupt is to be handled by this interrupt handler or not. The handler returns IRQ_NONE if not.
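As an illustration, a minimal sketch of a conventional interrupt handler registered with request_irq(); the mydev_* helpers are hypothetical placeholders for device-specific code:

#include <linux/interrupt.h>

static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    /* On a shared IRQ line, first check whether our device raised the interrupt. */
    if (!mydev_irq_pending(dev_id))   /* hypothetical device-specific check */
        return IRQ_NONE;              /* not ours; the next registered handler is tried */

    mydev_ack_and_handle(dev_id);     /* hypothetical: acknowledge and do the top half work */
    return IRQ_HANDLED;
}

/* Registration, e.g. from the driver's probe function: */
int mydev_setup_irq(unsigned int irq, void *dev)
{
    return request_irq(irq, mydev_isr, IRQF_SHARED, "mydev", dev);
}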
Interrupt work is normally divided into two parts: Top half and bottom half. The top half is implemented by the interrupt handler. The bottom half is implemented by soft IRQs, tasklets or work queues initiated from the top half, or by the interrupt thread in case of threaded interrupt.
See also Section 2.12, Soft IRQs and Tasklets, Section 2.13, Work Queues and Section 2.14, Threaded Interrupts.
The latency that interrupts induce on a real-time application is determined by the top half, soft IRQs and tasklets. For threaded interrupts, the priority of the interrupt thread can be adjusted to only affect the latency of less critical tasks.
Soft IRQs are intended for deferred interrupt work that should be run without all interrupts being disabled.
Soft IRQ work is executed on certain occasions, such as when interrupts are enabled, or when calling certain functions in the kernel, e.g. local_bh_enable() and spin_unlock_bh(). The soft IRQs can also be executed in the ksoftirqd kernel thread. All this makes it very hard to know when the soft IRQ work will actually be executed. See https://lwn.net/Articles/520076/ for more information about soft IRQs and real-time.
Tasklets build on soft IRQs and are basically a generic interface to soft IRQs. While tasklets are much preferred over raw soft IRQs, the synchronization between interrupt handlers and the corresponding soft IRQ or tasklet is non-trivial. For this reason, and to achieve better real-time properties, it is recommended to use work queues or threaded interrupts whenever possible.
Work queues execute in kernel threads. This means that for work queues, preemption can occur. It also means that work performed in a work queue may block, which may be desirable in situations where resources are requested but not currently available. The reason for using work queues rather than having your own kernel thread is basically to keep things simple. Small jobs are better handled in batches in a few kernel threads than by a large number of kernel threads each doing their little thing.
See http://lwn.net/Articles/239633/ for a discussion about why tasklets and soft IRQs are bad and why work queues are better.
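A minimal kernel-space sketch of deferring work to the system work queue; the names and the work function body are hypothetical examples:

#include <linux/workqueue.h>

static void my_deferred_fn(struct work_struct *work)
{
    /* Runs in a kernel thread: preemptible and allowed to sleep. */
    pr_info("deferred work running\n");
}

static DECLARE_WORK(my_deferred, my_deferred_fn);

/* Called e.g. from an interrupt handler (top half) to kick the bottom half: */
static void mydev_kick_bottom_half(void)
{
    schedule_work(&my_deferred); /* queue the work on the system work queue */
}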
When an interrupt handler is registered using request_threaded_irq(), it will also have an associated interrupt thread. In this case, the interrupt handler returns IRQ_WAKE_THREAD to invoke the associated interrupt thread. If the interrupt thread is already running, the interrupt will simply be ignored.
Even if the interrupt handler has an associated interrupt thread, it may still return IRQ_HANDLED to indicate that the interrupt thread does not need to be invoked this time.
There are two main advantages of using threaded interrupts:
No synchronization is needed between the top half and bottom half of the interrupt handler; the kernel does it for you.
The possibility to adjust the priority of each interrupt handler, even when interrupt handlers share the same IRQ.
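A minimal sketch of the same pattern using request_threaded_irq(); the mydev_* helpers are hypothetical placeholders:

#include <linux/interrupt.h>

static irqreturn_t mydev_quick_check(int irq, void *dev_id)
{
    /* Top half: runs with interrupts disabled, so keep it short. */
    if (!mydev_irq_pending(dev_id))   /* hypothetical device-specific check */
        return IRQ_NONE;              /* not our interrupt */

    mydev_silence_irq(dev_id);        /* hypothetical: stop the device from re-raising */
    return IRQ_WAKE_THREAD;           /* defer the real work to the interrupt thread */
}

static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
    /* Bottom half: runs in a schedulable, preemptible kernel thread. */
    mydev_process_data(dev_id);       /* hypothetical */
    return IRQ_HANDLED;
}

int mydev_setup_irq(unsigned int irq, void *dev)
{
    return request_threaded_irq(irq, mydev_quick_check, mydev_thread_fn,
                                IRQF_SHARED, "mydev", dev);
}

The priority of the resulting interrupt thread can then be adjusted with the normal real-time scheduling interfaces.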
An interrupt handler that was registered using request_irq() can be forced to run as a threaded interrupt using a kernel boot parameter named threadirqs. This results in a short default interrupt handler being executed instead of the registered interrupt handler, while the registered interrupt handler is run in an interrupt thread. But since the registered interrupt handler is not designed for this, it may still invoke soft IRQs, causing hard-to-predict latencies. Interrupt handlers that have been forced to be threaded run with soft IRQ handling disabled, since they do not expect to be preempted by them.
The Linux scheduler, which handles the process scheduling, is invoked periodically. This invocation is referred to as the Linux scheduler tick.
The Linux scheduler is also invoked on demand, for example when a task voluntarily gives up the CPU. This happens when a task decides to block, for example when data that the task needs are not available.
The frequency of the scheduler tick can be configured when building the kernel. The value typically varies between 100 and 1000 Hz, depending on target.
The scheduler tick is triggered by an interrupt. During the execution of this interrupt kernel preemption is disabled. The amount of time passing, before the tick execution is finished, depends on the amount of work that needs to be done, which in turn depends on the number of tasks in the system and how these tasks are allocated among the CPUs.
In this way, the presence of ticks, with varying completion times, contributes to the indeterminism of Linux, in the sense that a task with a deadline cannot know beforehand how long it will take to complete its work, due to it being interrupted by the periodic ticks.
By default, the Linux kernel allows applications to allocate (but not use) more memory than is actually available in the system. This feature is known as memory overcommit. The idea is to provide a more efficient memory usage, under the assumption that processes typically ask for more memory than they will actually need.
However, overcommitting also means there is a risk that processes will try to utilize more memory than is available. If this happens, the kernel invokes the Out-Of-Memory Killer (OOM killer). The OOM killer scans through the tasklist and selects a task to kill to reclaim memory, based on a set of heuristics.
When an out of memory situation occurs, the whole system may become unresponsive for a significant amount of time, or even end up in a deadlock.
For embedded and real-time critical systems, the allocation policy should be changed so that memory overcommit is not allowed. In this mode, malloc() will fail if an application tries to allocate more memory than is strictly available, and the OOM killer is avoided. To disable memory overcommit:
$ echo 2 > /proc/sys/vm/overcommit_memory
For more information, see the man page for proc(5) and https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
RCU is an algorithm for updating non-trivial structures, e.g. linked lists, in a way that does not enforce any locks on the readers. This is done by allowing modifications to the structures to be done on a copy, and then have a publish method that atomically replaces the old version with the new.
After the data has been replaced there will still be readers that hold references to the old data. The period during which the readers can hold references to the old data is called a grace period.
This grace period ends when the last reader seeing the old version has finished reading. When this happens, an RCU callback is issued in order to free resources allocated by the old version of the structure.
These callbacks are called within the context of a kernel thread. By default these callbacks are executed as a soft IRQ, adding hard-to-predict latencies to applications. There is a kernel configuration option named CONFIG_RCU_NOCB_CPU which, combined with the boot parameter rcu_nocbs=<cpu list>, relocates RCU callbacks to kernel threads. The threads can be migrated away from the CPUs in the <cpu list>, giving these CPUs better real-time properties.
When using the nohz_full kernel boot parameter, nohz_full implies rcu_nocbs.
For further reading, see the following links:
RCU concepts: https://www.kernel.org/doc/Documentation/RCU/rcu.txt
What is RCU (contains a lot of good links to LWN articles): https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt
Relocating RCU callbacks: http://lwn.net/Articles/522262/
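For readers unfamiliar with the pattern, below is a minimal kernel-space sketch of an RCU-protected pointer; the structure, names, and the assumption that updaters are serialized by a lock are illustrative:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cfg {
    int value;
};

static struct cfg __rcu *active_cfg;

/* Reader: cheap, takes no locks. */
int cfg_read_value(void)
{
    struct cfg *c;
    int v;

    rcu_read_lock();
    c = rcu_dereference(active_cfg);
    v = c ? c->value : 0;
    rcu_read_unlock();
    return v;
}

/* Updater: publish a new copy, then free the old one after the grace period.
   Assumes callers are serialized, e.g. by a mutex. */
int cfg_update_value(int v)
{
    struct cfg *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    struct cfg *oldc;

    if (!newc)
        return -ENOMEM;
    newc->value = v;

    oldc = rcu_dereference_protected(active_cfg, 1);
    rcu_assign_pointer(active_cfg, newc); /* atomic publish */
    synchronize_rcu();                    /* wait out pre-existing readers */
    kfree(oldc);
    return 0;
}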
Improving the real-time properties is very much about reducing latency. In a hard real-time system, it is all about the worst case, while in a soft real-time system it is about reducing the probability of high latency numbers. Often, the improvements come at a price, e.g. lower average throughput, increased average latency, reduced functionality, etc.
This chapter starts with a set of actions that are of low complexity, easy to apply and/or well established. For some systems these actions will be enough. If not, it is suggested to isolate the real-time critical application(s). The idea is to avoid sources of jitter on the real-time applications. Finally, a set of actions with larger impact are presented.
Finding the right balance is an iterative process that preferably starts with establishing a benchmark. A common way to evaluate real-time properties is to measure interrupt latency by using the test application cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest), which sets a timer and measures the time interval from the expected expiration time until the task that set the timer becomes ready for execution.
For the result to be relevant, the system should also be put under stress, see e.g. http://people.seas.harvard.edu/~apw/stress, which can generate various types of stress.
Lack of preemption can result in very high, possibly unbound, latencies. That makes the preemption models without preemption, server and desktop, unsuitable for real-time systems. See Section 2.1, Kernel Preemption Model for a description of the kernel preemption model.
The recommendation is to start with the low-latency desktop model.
While the RT model in many cases gives a lower worst-case latency, this comes at the cost of larger overhead. The RT model also requires the PREEMPT_RT patch, in contrast to low-latency desktop, which is part of the standard kernel. Therefore the RT model may have implications on quality and functional stability.
The preemption model is set at compile time by kernel configuration parameters, menu Kernel options/Preemption Model. To use low-latency desktop, set CONFIG_PREEMPT=y or CONFIG_PREEMPT__LL=y (the latter from the PREEMPT_RT patch), and set the other CONFIG_PREEMPT_* parameters to "n".
Table 3.1 Kernel Configuration Parameters for the Preemption Model
*) These models require the PREEMPT_RT patch. See the "Target Guide" section in the Enea® Linux User's Guide for build instructions.
Benchmarks
The table below shows the worst case latency for the different preemption models, measured with various stress scenarios.
Table 3.2 Worst Case Latency [µs] for the Different Preemption Models on P2041RDB
Preemption model | Stress type (see Appendix B, Benchmark Details) | ||||
---|---|---|---|---|---|
cpu | hdd | io | vm | full | |
Server | 48 | 431 | 155 | 6247 | 8473 |
Desktop | 52 | 161 | 139 | 6343 | 8423 |
Low-latency desktop | 72 | 396 | 711 | 1084 | 977 |
RT (from PREEMPT_RT patch) | 29 | 72 | 62 | 74 | 64 |
As described in Section 2.8, Power Save, power-save mechanisms interact poorly with real-time requirements. The reason is that exiting a power-save state cannot be done instantly, e.g. 200 µs wake-up latency from sleep mode C3 and 3 µs from C1 on a 2 GHz Intel i5.
This does not have to be a problem in e.g. a soft real-time system where the accepted latency is longer than the wake-up time or in a multicore system where power save techniques may be used in a subset of the cores. It is however recommended to start with the power save mechanisms disabled.
It may be noted that the dynamic ticks function described in Section 2.7, Dynamic Ticks (NO_HZ), originated as a power save function, idle dynamic ticks. We will later describe how to use it for real-time purposes, full dynamic ticks, Section 3.3, Isolate the Application.
Disable Dynamic Frequency Scaling
Frequency scaling is disabled by setting the kernel configuration parameter CONFIG_CPU_FREQ=n. See Section 2.8.1, Dynamic Frequency Scaling, and https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt for more information.
Disable Transitions to Low-Power States
Disable transitions to low-power states by setting CONFIG_CPU_IDLE=n. See Section 2.8.2, CPU Power States.
Interrupt handlers can choose, when registering, to run most of their interrupt handling in a kernel thread. This is called threaded interrupts. It is also possible to force interrupt handlers that have not made this choice to become threaded interrupts by using the threadirqs boot parameter. Interrupt handlers that have registered using IRQF_NO_THREAD will not be threaded even if the boot parameter is given.
This will make interrupt handlers more preemptible, but some interrupt handlers might not work with this option, so make sure to test well when using it.
Benchmarks of threaded interrupts
Table 3.3 Benchmarking of threaded interrupts on P2041RDB, times in µs
 | cpu | full | hdd | io | no_stress | vm |
---|---|---|---|---|---|---|
Threaded irqs | 46 | 994 | 705 | 88 | 185 | 1018 |
Regular irqs | 59 | 961 | 481 | 217 | 110 | 1029 |
Ratio [%] | 78.0 | 103.4 | 146.6 | 40.6 | 168.2 | 98.9 |
As described in Section 2.3, Real-Time Throttling, the default settings for the real-time throttling allow real-time tasks to run for 950000 µs every 1000000 µs. This may lead to a situation where real-time tasks are blocked for 50000 µs at the end of the throttling period. In the generic case, execution of the real-time tasks may be blocked for a time equal to the difference between rt_runtime and rt_period. This situation should however be quite rare, since it requires that there are real-time tasks (i.e. tasks scheduled as SCHED_FIFO or SCHED_RR) ready to run on all CPUs, a condition that should rarely be met since real-time systems are typically designed to have an average real-time load of significantly less than 100%.
Consequently, it is recommended to keep the real-time throttling enabled.
For systems that do not have any real-time tasks, the real-time throttling will never be activated and the settings will not have any impact.
An alternative when using CPU isolation is to avoid using SCHED_FIFO and SCHED_RR, since the CPU is supposed to run a single task anyway. In this case, real-time throttling should not be activated.
Benchmarks of real-time throttling
The benchmarks were done using an application that repeatedly reads the time base register and keeps track of the largest increment of it:
set cpu affinity
set scheduling class to FIFO/99
get current time (T0) and current value of the time base register (TB0)
for some number of cycles {
    read time base register
    calculate the diff between the current and the previous value (delta_TB)
    if the diff is the largest so far, update delta_TBmax
}
get current time (T1) and current value of the time base register (TB1)
use T0, T1, TB0 & TB1 to translate delta_TBmax to a delay in microseconds
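A user-space approximation of the same idea, using the portable clock_gettime() instead of reading the time base register directly (an assumption; the original test reads the register itself, and the affinity and SCHED_FIFO setup shown earlier is omitted here):

#include <stdio.h>
#include <time.h>

static long long to_ns(const struct timespec *t)
{
    return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
    struct timespec prev, now;
    long long delta, delta_max = 0;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &prev);
    for (i = 0; i < 100000000L; i++) {    /* iteration count is an example */
        clock_gettime(CLOCK_MONOTONIC, &now);
        delta = to_ns(&now) - to_ns(&prev);
        if (delta > delta_max)
            delta_max = delta;            /* longest gap = longest preemption */
        prev = now;
    }
    printf("max gap: %lld ns\n", delta_max);
    return 0;
}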
The longest delay is interpreted as the longest time the task has been preempted. Multiple instances of the test application are started on different cores. The results are summarised below:
Table 3.4 Max preemption period for different throttling parameter settings, run on P2041RDB
Test Id | Throttling parameters [µs] | Max preemption period [µs] | |||||
---|---|---|---|---|---|---|---|
 | rt_runtime | rt_period | CPU 0 | CPU 1 | CPU 2 | CPU 3 | Average |
1 | 900 000 | 1 000 000 | 101 961 | 101 951 | 97 883 | 101 961 | 100 939 |
2 | 950 000 | 1 000 000 | 48 959 | 48 958 | 53 043 | 53 032 | 50 998 |
3 | 1 000 000 | 1 000 000 | 44 | 56 | 118 | 59 | 69 |
4 | -1 | 1 000 000 | 45 | 144 | 38 | 56 | 71 |
5 | 900 000 | 1 100 000 | 203 860 | 203 856 | 203 862 | 203 847 | 203 856 |
6 | 95 000 | 100 000 | 4 132 | 4 132 | 4 121 | 8 210 | 5 149 |
7 | 500000 | 1 000 000 | 342 445 | 342 440 | 338 369 | N/A. (No real-time task was started on this cpu) | 341 085 (CPU0-2) |
There are some observations that can be done:
The real-time throttling is evenly distributed among the cores. This will be the case as long as there is real-time work available on all cores. See test 7 for an exception.
There is no significant difference between setting rt_runtime to -1 and setting rt_runtime = rt_period (tests 3 & 4).
In test 6, the preemption periods are quite far from the expected 5000 µs. The reason is that the system frequency was set to 250 Hz, which implies a granularity of 4000 µs.
In test 7, real-time tasks on CPU 0 to CPU 2 use on average 66% of the CPU, which is higher than the expected 50%. CPU 3, on the other hand, uses 0%, which keeps the system as a whole in line with the limit.
CPU isolation is a topic that has been discussed in Section 2.6, CPU Isolation. By using this method correctly, it is possible to enhance throughput and real-time performance by reducing the overhead of interrupt handling. This is also useful for applications using high-throughput devices with frequent interrupts (e.g. 10GbE Ethernet). This section explains how to achieve basic CPU isolation.
Linux has a feature called cpusets that associates tasks with CPUs and memory nodes. By adding and configuring cpusets on a system, a user can control what memory and CPU resources a task is allowed to use. It is also possible to tune how the scheduler behaves for different cpusets. The kernel configuration parameter is called CONFIG_CPUSETS, see https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt for more information on cpusets.
This section describes how to set up CPU isolation using an imaginary SMP system with 4 CPUs. CPU0-CPU1 will be used for general purpose tasks, while CPU2-CPU3 will be dedicated to real-time tasks. The system is a NUMA system (non-uniform memory access, http://en.wikipedia.org/wiki/Non-uniform_memory_access) with CPU0-CPU1 belonging to node 0 and CPU2-CPU3 belonging to node 1. The cpuset used for the general purpose domain is here called nRT, the non-RT set.
Setting up a Partitioned System
This section describes, step by step, how to set up basic CPU isolation. Two cpusets will be created: one for non-real-time tasks (nRT, for general purpose use) and one for real-time tasks (RT). The setup can also be done using a tool (Section 3.3.6, The CPU Partitioning Tool - partrt) that wraps the sequence described in this chapter.
Configure the CPU sets
Create the cpusets (RT & nRT):
# enable the creation of the cpuset folder
$ mount -t tmpfs none /sys/fs/cgroup
# create the cpuset folder and mount the cgroup filesystem
$ mkdir /sys/fs/cgroup/cpuset/
$ mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset/
# create the partitions
$ mkdir /sys/fs/cgroup/cpuset/rt
$ mkdir /sys/fs/cgroup/cpuset/nrt
Add the general purpose CPUs to the nRT set.
$ echo 0,1 > /sys/fs/cgroup/cpuset/nrt/cpuset.cpus
Add the real-time CPUs to the RT set.
$ echo 2,3 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
Make the CPUs in the RT set exclusive, i.e. do not allow tasks in other sets to use them.
$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive
Restart Real-Time CPUs
If the system supports CPU hotplug, it could be worthwhile to restart the real-time CPUs in order to migrate all CPU-specific timers away from them. If you choose to restart the hotplug CPUs, you need to re-create the RT partition afterwards. The reason for having to migrate twice is that it might not be possible to restart a CPU while tasks are running on it. Restart the hotplug CPUs as follows. For each CPU in the real-time partition, turn it off (example for CPU3):
$ echo 0 > /sys/devices/system/cpu/cpu3/online
Then turn them on:
$ echo 1 > /sys/devices/system/cpu/cpu3/online
Configure NUMA
The following sequence sets up a suitable configuration for NUMA-based systems.
Associate the nRT set with NUMA node 0.
$ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems
Associate the RT set with NUMA node 1.
$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mems
Make NUMA node 1 exclusive to the RT cpuset, i.e. only tasks in the real-time cpuset will be able to allocate memory from node 1.
$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mem_exclusive
Note that tasks in nRT can still "access" memory controlled by NUMA node 1.
Note that it is important to configure the memory nodes even if the system is not NUMA-based, since the initial value assigned at cpuset creation does not grant any memory access. On a non-NUMA system, associate both sets with node 0:
Associate the nRT set with NUMA node 0.
$ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems
Associate the RT set with NUMA node 0.
$ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.mems
Configure load balancing
Load balancing, i.e. task migration, is an activity that introduces nondeterministic jitter and should therefore be disabled in the real-time cpuset. This means that the correct affinity must be specified explicitly for the threads that should execute on the real-time CPUs.
Disable load balancing in the root cpuset. This is necessary for the settings in the child cpusets to take effect:
$ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
Disable load balancing in the RT cpuset:
$ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.sched_load_balance
Enable load balancing in the nRT cpuset:
$ echo 1 > /sys/fs/cgroup/cpuset/nrt/cpuset.sched_load_balance
Move general purpose tasks to the general purpose (nRT) partition
For each task in the root cpuset, run the following command. Only one PID can be written per invocation:
$ echo pid_of_task > /sys/fs/cgroup/cpuset/nrt/tasks
Note that it is not possible to move all tasks; some kernel tasks require that they can execute on all available CPUs. All future child tasks created from the nRT partition will also be placed in the nRT partition. That includes tasks started from the current shell, since it should have been moved to nRT as well.
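The loop over all tasks can be scripted or coded; below is a minimal C sketch (not part of the original sequence) that tries to move every PID found in the root cpuset into nRT and simply reports the tasks that refuse to move. The paths assume the cpuset mount used in the steps above:

#include <stdio.h>

int main(void)
{
    FILE *src = fopen("/sys/fs/cgroup/cpuset/tasks", "r");
    int pid;

    if (!src) {
        perror("open root tasks file");
        return 1;
    }
    while (fscanf(src, "%d", &pid) == 1) {
        /* One PID per write; tasks bound to all CPUs will fail. */
        FILE *dst = fopen("/sys/fs/cgroup/cpuset/nrt/tasks", "w");
        if (!dst)
            break;
        int ok = fprintf(dst, "%d\n", pid) > 0;
        if (fclose(dst) != 0 || !ok)
            fprintf(stderr, "could not move pid %d\n", pid);
    }
    fclose(src);
    return 0;
}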
Move IRQs to the general purpose CPUs
Some interrupts are not CPU bound. Unwanted interrupts introduce jitter and can have a serious negative impact on real-time performance. They should be handled on the general purpose CPUs whenever possible. The affinity of these interrupts can be controlled using the proc file system.
Set the default affinity to CPU0 or CPU1 to make sure that new interrupts won't be handled by the real-time CPUs. The set {CPU0, CPU1} is represented as a bitmask set to 3 (2^0 + 2^1).
$ echo 3 > /proc/irq/default_smp_affinity
Move IRQs to the nRT partition
$ echo 3 > /proc/irq/<irq>/smp_affinity
If it is not known which interrupts to move (this is highly architecture dependent), try to move all of them. Interrupts that cannot be moved will be reported on stderr. When it is known which interrupts cannot be moved, consult the hardware and driver documentation to see whether this will be an issue. It might be possible to disable the device that causes the interrupt. Typical interrupts that should and can be moved are certain timer interrupts, network related interrupts and serial interface interrupts. If there are any interrupts that are part of the real-time application, they should of course be configured to fire in the real-time partition.
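As a sketch of "try to move all of them", the following C program walks /proc/irq, writes the nRT mask (3, i.e. CPU0 and CPU1, as assumed above) to every smp_affinity file, and reports the IRQs that cannot be moved on stderr:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/proc/irq");
    struct dirent *d;

    if (!dir) {
        perror("/proc/irq");
        return 1;
    }
    while ((d = readdir(dir)) != NULL) {
        char path[280];
        if (d->d_name[0] < '0' || d->d_name[0] > '9')
            continue;  /* skip entries that are not IRQ numbers */
        snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", d->d_name);
        FILE *f = fopen(path, "w");
        if (!f)
            continue;
        int ok = fprintf(f, "3\n") > 0;
        if (fclose(f) != 0 || !ok)
            fprintf(stderr, "IRQ %s could not be moved\n", d->d_name);
    }
    closedir(dir);
    return 0;
}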
Execute a task in the real-time partition
Now it is possible to run a real-time task in the real-time partition:
$ echo pid_of_task > /sys/fs/cgroup/cpuset/rt/tasks
Since the RT partition contains more than one CPU, we might want to choose a specific CPU to run on. Change the task affinity to only include CPU3 in the real-time partition:
$ taskset -p 0x8 pid_of_task
The system should now be partitioned into two sets. The next step to further improve the real-time properties is to get rid of the tick interrupts, which is described in Section 3.3.2, Full Dynamic Ticks.
The full dynamic ticks feature is described in Section 2.7, Dynamic Ticks (NO_HZ). Several conditions must be met to really turn off the ticks on a CPU while running a task on it. Some of these conditions are of a static nature, such as kernel configurations, and some are runtime conditions, such as having only one runnable task, no POSIX timers, etc.
The current implementation of full dynamic ticks will not disable the ticks entirely, but rather reduce them to 1 Hz, because the system still needs to synchronize every now and then. For those wanting to experiment with turning off the ticks entirely, there is a patch from Kevin Hilman that does this.
Note that CPU partitioning will be needed to make sure that only one task is running on a specific CPU. See Section 3.3.1, CPU Isolation for more information about CPU isolation.
Prerequisites for full dynamic ticks
To be able to enable full dynamic ticks, the following prerequisites need to be met:
Linux kernel 3.10.0 or newer.
SMP capable hardware with at least two real cores, excluding hyperthreads if any.
No more perf events active on the CPU than what the hardware supports.
Kernel configuration
To select at boot time which CPUs should use the full dynamic ticks feature, the following kernel configurations need to be enabled:
CONFIG_NO_HZ_FULL=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_NO_HZ_FULL is the kernel name for full dynamic ticks.
If you want all CPUs except CPU 0 to use the full dynamic ticks feature, enable the following kernel configurations:
CONFIG_NO_HZ_FULL_ALL=y
In this latter case, CONFIG_RCU_NOCB_CPU_ALL should be selected by default.
RCU is a synchronization mechanism in Linux that uses kernel helper threads to complete updates to shared data. For more information, read the excellent LWN article found at http://lwn.net/Articles/262464.
Kernel boot parameters
Linux has a number of boot parameters that enhance core isolation:
isolcpus: This parameter specifies a set of CPUs that will be excluded from the Linux scheduler load balancing algorithm. The set is specified as a comma-separated list of CPU numbers or ranges, e.g. "0", "1-2" or "0,3-4". The set specification must not contain any spaces. It is definitely recommended to use this parameter if the target kernel lacks support for CPU hotplug.
nohz_full: This boot parameter expects a list of CPUs for which full dynamic ticks should be enabled. If the kernel configuration CONFIG_NO_HZ_FULL_ALL was given, this list will be all CPUs except CPU 0, and this boot option is not needed.
To achieve isolation in the RT domain (CPU2 and CPU3), use the following parameter:
isolcpus=2,3 nohz_full=2,3
See https://www.kernel.org/doc/Documentation/kernel-parameters.txt for more about boot parameters.
After the system has booted, check in the boot messages that full dynamic ticks was enabled, e.g. using the shell command dmesg. Search for entries similar to the following:
NO_HZ: Full dynticks CPUs: 2-3.
Also make sure there is an entry similar to the following:
Experimental no-CBs CPUs: 0-7.
The no-CB CPU list must include the CPU list for full dynticks.
When choosing the CPU lists on hardware with simulated CPUs, such as hyperthreads, make sure to include whole physical cores and not half a core. The latter could occur if one hyperthread is in the set of CPUs using the full dynamic ticks feature while the other hyperthread on the same core is not. This can cause problems when pinning interrupts to a CPU, and the two hyperthreads might also affect each other depending on load.
Application considerations
To achieve full dynamic ticks on a CPU, there are some requirements on the application running on that CPU. First of all, it must not run more than one thread on each CPU. It must also not use any POSIX timers, directly or indirectly. This usually excludes any kernel calls that access the network, but also a number of other kernel calls. Keeping the kernel calls to a minimum will maximize the likelihood of achieving full dynamic ticks.
The application must utilize the CPU partitioning described in the previous section, which is done by writing the application thread's PID into the file /sys/fs/cgroup/cpuset/rt/tasks (assuming the real-time partition is called "rt"). After this, the shell command taskset can be used to bind the task to a specific CPU in the "rt" partition. Binding can also be done by the application itself, using the sched_setaffinity() function declared in sched.h.
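A minimal sketch of such a binding, assuming the task should run on CPU 3 as in the partitioning example above:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread to CPU 3 in the "rt" partition. */
int bind_to_cpu3(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}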
Cost of enabling full dynamic ticks
Full dynamic ticks incurs the following costs:
Transitions to and from idle are more expensive. This is inherited from CONFIG_NO_HZ_IDLE, since CONFIG_NO_HZ_FULL builds on the same code as CONFIG_NO_HZ_IDLE.
Transitions between user and kernel space are slightly more expensive, since some book-keeping must be done.
Scheduling statistics normally involve periodic timeouts, and are therefore implemented slightly differently for CONFIG_NO_HZ_FULL.
Benchmarks
Below is an example trace log for a call to the scheduler_tick() function in the kernel:
0)            |  scheduler_tick() {
0)            |    _raw_spin_lock() {
0)  0.113 us  |      add_preempt_count();
0)  0.830 us  |    }
0)  0.085 us  |    update_rq_clock.part.72();
0)  0.146 us  |    __update_cpu_load();
0)  0.071 us  |    task_tick_idle();
0)            |    _raw_spin_unlock() {
0)  0.076 us  |      sub_preempt_count();
0)  0.577 us  |    }
0)            |    trigger_load_balance() {
0)  0.098 us  |      raise_softirq();
0)  0.065 us  |      idle_cpu();
0)  1.715 us  |    }
0)  6.585 us  |  }
As can be seen from the trace above, the tick took more than 6 µs, excluding interrupt overhead. This was a typical time for this target, an HP Compaq Elite 8300 with an Intel Core i5 3570.
If the above sections do not offer enough real-time properties, this chapter gives some extra hints on what can be done.
tsc boot parameter - x86 only
The time stamp counter is a per-CPU counter used for producing time stamps. Since the counters might drift apart slightly, Linux will periodically check that they are synchronized. This periodicity means that the tick might appear despite full dynamic ticks being used.
If Linux is told that the counters are reliable, it will no longer perform the periodic synchronization. The side-effect of this is that the counters may start to drift, something that can be visible in trace logs, for example.
Here is an example of how to use it:
isolcpus=2,3 nohz_full=2,3 tsc=reliable
See https://www.kernel.org/doc/Documentation/kernel-parameters.txt for more about boot parameters.
Delay vmstat Timer
The vmstat timer is used for collecting virtual memory statistics. The statistics are updated at an interval specified in seconds in /proc/sys/vm/stat_interval. The amount of jitter can be reduced by writing a large value to this file. However, that will not solve the issue with worst case latency.
Example (1000 seconds):
$ echo 1000 > /proc/sys/vm/stat_interval
There is one kernel patch (https://lkml.org/lkml/2013/9/4/379) that removes the periodic statistics collection and replaces it with a solution that only triggers if there is actual activity that needs to be monitored.
BDI Writeback Affinity
It is possible to configure the affinity of the block device writeback flusher threads. Since block I/O can have a serious negative impact on real-time performance, it should be moved to the general purpose partition. Disable NUMA affinity for the writeback threads:
$ echo 0 > /sys/bus/workqueue/devices/writeback/numa
Set the affinity to only include the general purpose CPUs (CPU0 and CPU1).
$ echo 3 > /sys/bus/workqueue/devices/writeback/cpumask
Real-time throttling in partitioned system
Real-time throttling (RTT) is a kernel feature that limits the amount of CPU time given to Linux tasks with real-time priority. If any process that executes on an isolated CPU runs with real-time priority, the CPU will get interrupts with the interval specified in /proc/sys/kernel/sched_rt_period_us. If the system is configured with CONFIG_NO_HZ_FULL and a real-time process executes on a CONFIG_NO_HZ_FULL CPU, note that real-time throttling will cause the kernel to schedule extra ticks. See Section 2.3, Real-Time Throttling and Section 3.2.4, Optimize Real-Time Throttling for more information.
Disable real-time throttling with the following command:
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Disable Power Management
The CPU frequency governor (https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt) causes jitter because it periodically monitors the CPUs. The actual activity of changing the frequency can also have a serious impact. Disable the frequency governor with the kernel configuration CONFIG_CPU_FREQ=n.
An alternative is to change the governor policy to performance at runtime. The advantage is that the policy can be set per CPU:
$ echo "performance" > /sys/devices/system/cpu/cpu<cpu_id>/cpufreq/scaling_governor
An example based on the RT partition:
$ echo "performance" > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor $ echo "performance" > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
Note that this could damage your hardware due to overheating. Make sure that you understand what works for your specific hardware.
Machine Check - x86 only
The x86 architecture has a periodic check for corrected machine check errors (MCE). The periodic machine check requires a timer that causes unwanted jitter, so the periodic check can be disabled. Note that this might lead to silently corrected MCEs going unlogged. Turn it off on the RT CPUs. For each CPU in the real-time partition, do the following:
$ echo 0 > /sys/devices/system/machinecheck/machinecheck<cpu>/check_interval
For example, for CPU2 and CPU3:
$ echo 0 > /sys/devices/system/machinecheck/machinecheck2/check_interval
$ echo 0 > /sys/devices/system/machinecheck/machinecheck3/check_interval
It has been observed that it is enough to disable this for CPU0 only; it will then be disabled on all CPUs. Read more about the x86 machine-check exception at https://www.kernel.org/doc/Documentation/x86/x86_64/machinecheck.
Disable the Watchdog
The watchdog timer is used to detect and recover from software faults. It requires a regular timer interrupt, which is a jitter source that can be removed, at the obvious cost of reduced error detection.
The watchdog can be disabled at compile time by setting CONFIG_LOCKUP_DETECTOR=n, or at runtime via the proc file system:
$ echo 0 > /proc/sys/kernel/watchdog
For more information, see http://en.wikipedia.org/wiki/Watchdog_timer.
Disabling the NMI Watchdog - x86 only
The NMI watchdog is a debugging feature for catching hardware hangs, causing a kernel panic when one is detected. On some systems it can generate a lot of interrupts, causing a noticeable increase in power usage. Disable it like this:
$ echo 0 > /proc/sys/kernel/nmi_watchdog
Increase flush time to disk
To make write-backs of dirty memory pages occur less often than by default, you can do the following (the value is in centiseconds, so 1500 means 15 seconds):
$ echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
Disable tick maximum deferment
To get a fully tickless configuration, the patch https://lkml.org/lkml/2013/9/16/499 should be included. It allows the maximum tick deferment to be controlled through the sched_tick_max_deferment variable in the debugfs file system. To disable the maximum deferment, set it to -1:
$ echo -1 > /sys/kernel/debug/sched_tick_max_deferment
Network queues affinity
Linux can distribute packet handling over different CPUs in an SMP system. This handling can also create timers on specific CPUs; one example is the ARP timer management, based on neigh_timer. There are a couple of solutions that can be adopted to minimize the effect of packets being rerouted to different CPUs, such as migrating all the timers to the non-real-time partition if possible, or specifying the affinity of the network queues, which is supported on some architectures.
If the application needs packets to be received only in the nRT partition, the affinity should be set as follows:
$ echo <NRT cpus mask> > /sys/class/net/<ethernet interface>/queues/<queue>/<x/r>ps_cpus
Example for TCI6638k2k board:
$ echo 1 > /sys/class/net/eth1/queues/tx-0/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-1/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-2/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-3/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-4/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-5/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-6/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/rx-0/rps_cpus
Note that if network traffic is needed in both partitions, the affinity should not be set. In that case, the neigh_timer can be handled by any CPU (including those in the RT partition).
The measurements below compare latency numbers with and without isolation on a stressed system. The benchmark program used is cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and the load was generated with the stress application (http://people.seas.harvard.edu/~apw/stress).
This is how the non-isolated benchmark was executed:
$ ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ ./cyclictest -m -p 99 -l 300000 -q
In the isolated case the following boot parameter is used:
isolcpus=3
This is how the isolated benchmark was executed:
$ partrt create 0x8
$ partrt run nrt ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ partrt run rt ./cyclictest -m -p 99 -l 300000 -q
Please read more about partrt in Section 3.3.6, The CPU Partitioning Tool - partrt. The benchmark was executed on a TCI6638k2k board.
Table 3.5 Latencies on partitioned system
Latency (µs) | Min | Max | Avg
---|---|---|---
Not isolated | 43 | 1 428 | 68
Isolated | 40 | 151 | 59
The benchmark shows that the worst case latency drops by roughly a factor of 10 when the benchmark runs on an isolated CPU.
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
https://www.kernel.org/doc/Documentation/kernel-per-CPU-kthreads.txt
Enea Linux includes a tool, partrt, for dividing an SMP Linux system into partitions. Using the methods described in the previous section, this tool provides an easy-to-use interface for setting up CPU isolation in an intelligent way. The tool can be downloaded from https://github.com/OpenEneaLinux/rt-tools.
Usage Examples
Create the RT partition on CPU 2 and CPU 3. Default names will be "rt" for the real-time partition and "nrt" for the general purpose partition:
$ partrt create 0xc
Or on the NUMA system assumed in previous section:
$ partrt create -n 1
Show the recently created partitions like this:
$ partrt list
Run cyclictest on CPU 3 in the RT partition:
$ partrt run -c 0x8 rt cyclictest -n -i 10000 -l 10000
Move cyclictest to NRT partition:
$ partrt move «pid of cyclictest» nrt
Undo the partitioning (restore the environment):
$ partrt undo
See full partrt help text like this:
$ partrt -h
If the attempts in the previous sections to improve the real-time performance are not enough, consider those described in this section.
As mentioned in several places in this manual, the standard Linux kernel is not real-time safe. Read more about this in Chapter 2, Basic Linux from a Real-Time Perspective.
Two proposed solutions for tackling real-time in Linux are PREEMPT_RT (Section 2.4, The PREEMPT_RT Patch) and CPU isolation (Section 2.6, CPU Isolation). Both solutions have consequences: PREEMPT_RT has throughput issues, and CPU isolation on a standard kernel requires that only a subset of libc be used if real-time behavior shall be maintained. One approach that can improve both throughput and real-time properties is to not use the kernel at all, and instead use a runtime that executes entirely in user space.
Below are some issues that appear with PREEMPT_RT and/or CPU isolation.
For application writers that move from a bare metal or RTOS environment to Linux, the overhead of the Linux API might be unacceptable. PREEMPT_RT adds even more overhead.
CPU isolation can be a good way to get real-time performance on a standard Linux kernel. The big problem is that the Linux API is not real-time safe and has to be avoided whenever real-time determinism is required. This limits the application developer, since standard APIs and programming models cannot be used. IPC between real-time tasks and general purpose tasks is another issue. Most IPC implementations on Linux rely on system calls and can add real-time problems if used on a standard kernel. Some IPC implementations might be unsafe to use on PREEMPT_RT, depending on how the IPC implementation handles dynamic memory. This subject is discussed in Chapter 4, Designing Real-Time Applications.
Running a real-time deterministic runtime completely in user space can be a good way to increase determinism and throughput. One example is Enea LWRT, which provides deterministic multithreading, memory management and LINX IPC. See http://www.enea.com/Embedded-hub/documents/Linux/Enea_LWRT_datasheet/ for more details. A specialized user space runtime can solve the mentioned problems:
The runtime can implement lightweight threads in user space. This can greatly decrease overhead and increase determinism. Voluntary context switches can be done entirely in user space. A comparison between LWRT lightweight user space threads and standard Linux pthreads can be seen below. The benchmark was done on a Texas Instruments TCI6638k2k (1.2 GHz).
Table 3.6 Task switch latency, LWRT vs. pthreads (microseconds)
Latency (µs) | Min | Max | Avg
---|---|---|---
Linux pthread | 4.25 | 31.35 | 8.58
LWRT process | 0.26 | 1.88 | 0.41
The table shows that the pthread context switch has about a factor of 30 higher overhead compared with LWRT processes in the average case. This could be a problem in, for example, a telecommunication application that uses a high number of threads.
A real-time safe user space runtime can provide a deterministic API that replaces the nondeterministic parts of the glibc API. Typical nondeterministic calls that can be replaced are related to dynamic memory management, multi-threading and IPC.
For completeness, it should be mentioned that using an RTOS may be the best alternative if all ways to improve the real-time properties in a Linux system have been exhausted without reaching acceptable performance. Doing that is outside the scope of this manual, but an example is using one of the RTOS products provided by Enea.
Table of Contents
Optimizing the Linux system itself for real-time is only half the solution to give applications optimal real-time properties. The Linux applications themselves must also be designed in a proper way to allow for real-time properties. This section provides some hints on how to do this.
The C function library, libc, is a part of the runtime for applications under Linux. It provides basic facilities like fopen, malloc, printf, exit, etc. The C library shall provide all functions that are specified by the ISO C standard. Usually, additional functions specific to POSIX are also supported.
GNU libc (https://www.gnu.org/software/libc) is the most widely used libc, but there are alternative implementations like newlib (https://www.sourceware.org/newlib) and uClibc (http://www.uclibc.org/other_libs.html), all of them supporting the POSIX.1b (IEEE 1003.1b) real-time extensions.
The libc provides a powerful toolbox with many useful and frequently used features, but from a real-time perspective it must be used with some caution.
The first problem to deal with is the level of real-time support in the libc code itself. The code is often considered to be proven in use and is therefore used without deeper analysis. This is probably a valid assumption for a typical Linux system where average performance is more important than worst case behavior, but for real-time systems it might be an unreasonable attitude. This should however not be a major issue, since the source code is available for analysis.
Another issue is the fact that the functions in libc may use system calls interacting with the kernel. Depending on the kernel preemption model, this may lead to execution of different non-preemptible sections. The kernel can also decide to execute other tasks, like soft IRQs, on its way back to user space.
Further application design challenges are:
Memory handling - memory access latency, shared memory
Multi-threaded applications - task switch overhead, locks
I/O - blocking, driver-induced jitter
For further reading, see https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO.
Linux applications access memory by using virtual addresses. Each virtual address translates into a physical address with the help of hardware that uses translation tables. This feature makes it possible to address more virtual memory than there is physical memory available, assuming that not all applications need all their allocated memory at the same time.
Allocating memory will by default only reserve a virtual memory range. When the first memory access to this newly allocated virtual memory range occurs, this causes a page fault, which is a hardware interrupt indicating that the translation table does not contain the addressed virtual memory. The page fault interrupt will be handled by the Linux kernel, which will provide the virtual-to-physical memory mapping. Then the program execution can continue.
Most architectures use a cache called translation lookaside buffer, TLB, for the translation table. The TLB cache is used to speed up virtual-to-physical memory translations. A translation causes latency jitter when a looked-up address is not in the TLB, which is referred to as a TLB miss.
Virtual memory makes it possible for Linux to have memory content stored in a file, e.g. by loading an executed binary in an on-demand fashion or by swapping out seldom used memory. This is called demand paging, see http://en.wikipedia.org/wiki/Demand_paging for more information on this topic. Demand paging can cause unbounded latency since it involves accessing a file or a device. Therefore, the application needs to disable demand paging by using the mlockall() function call:
mlockall(MCL_CURRENT | MCL_FUTURE)
The MCL_CURRENT flag will make sure that all physical memory has the expected content and that the translation table contains the needed virtual-to-physical memory mapping. This includes code, global variables, shared libraries, shared memory, stack and heap.
MCL_FUTURE means that updating the translation table and initializing the physical memory, if applicable, is done during future allocations, not when accessing the memory. Future allocations can be stack growth, heap growth, shm_open(), malloc(), or similar calls like mmap().
When using mlockall(), there is a risk that Linux will allow less memory to be allocated, since the allocated memory must be available as physical memory.
Note that a call to malloc() can still show large latency variation, since the translation table update is now done within this function call instead of when accessing the memory, not to mention that malloc() may or may not need to ask the kernel for more virtual memory. It is therefore recommended to allocate all needed dynamic memory at start, to avoid this jitter.
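A minimal sketch of such a startup sequence is shown below. The pool size is a hypothetical worst-case figure; a real application would size it from its own requirements:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE (8 * 1024 * 1024)  /* hypothetical worst-case need */

int main(void)
{
    /* Lock current and future memory to rule out demand paging. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* Allocate all dynamic memory up front. With MCL_FUTURE the pages
       are populated at allocation time; the memset also verifies that
       the whole pool is writable before the real-time phase starts. */
    char *pool = malloc(POOL_SIZE);
    if (pool == NULL) {
        perror("malloc");
        return 1;
    }
    memset(pool, 0, POOL_SIZE);

    /* ... real-time processing, allocating only from the pool ... */

    free(pool);
    return 0;
}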
In case dynamic memory allocation needs to be done within the real-time application, there are some actions that can be performed to mitigate the malloc() latency variation. The glibc has a function called mallopt() which can be used to change the behavior of malloc(). Two interesting options are M_MMAP_MAX and M_TRIM_THRESHOLD.
M_MMAP_MAX controls at what allocation size the mmap() function will be used instead of sbrk(). The advantage with mmap() is that the memory can be returned to the system as soon as free() is called, since each allocation is separate. The disadvantage with mmap() is that it is slower than sbrk().
M_TRIM_THRESHOLD controls when part of an sbrk() allocated memory area, where a large contiguous area of memory at the top has been freed, shall be returned to the system. Turning this feature off will not allow the application to release any memory to the system once it has been allocated, but it can somewhat improve the real-time properties.
See the following table for some measurements done on malloc() call time and memory access time. The measurements have been done on a system with CPU isolation.
Table 4.1 malloc() and memory access measurements on TCI6638k2k board
Partition | Scenario | Operation | Min [µs] | Max [µs] | Avg [µs]
---|---|---|---|---|---
RT | normal | Mem access | 4 | 178 | 108.8
RT | normal | Malloc call | 4 | 331 | 19.6
RT | mlockall() | Mem access | 4 | 13 | 5.0
RT | mlockall() | Malloc call | 4 | 515 | 96.5
RT | mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1 | Mem access | 4 | 11 | 4.8
RT | mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1 | Malloc call | 4 | 417 | 5.7
GP | normal | Mem access | 4 | 1384 | 109.0
GP | normal | Malloc call | 7 | 164 | 19.7
GP | mlockall() | Mem access | 4 | 125 | 5.1
GP | mlockall() | Malloc call | 4 | 1607 | 95.4
GP | mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1 | Mem access | 4 | 91 | 4.8
GP | mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1 | Malloc call | 4 | 463 | 5.7
There is a large number of "4" values in the minimum latency column. This is due to the timer resolution, which does not allow any smaller values. For a real-time application, the maximum latency is of the greatest importance. The measurements show that when using mlockall(), the memory access becomes more predictable, while malloc() will always have a latency that is hard to predict. Letting the real-time tasks run on dedicated CPUs, here referred to as an RT partition, also results in lower latency jitter.
An alternative is to use another heap allocator than the one provided by glibc, one that is better adapted to embedded and real-time requirements. Here are two examples:
TLSF: O(1) complexity for allocation and free calls, and only 4 bytes of overhead per allocation. It can be slower than the glibc default allocator on average, but should have better worst case behavior. See http://www.gii.upv.es/tlsf/ for more information.
O(1) complexity for allocation and free. It uses bitmaps and segregated lists. See http://www.gii.upv.es/tlsf/alloc/others for more information.
The glibc itself uses ptmalloc (http://malloc.de/en/), which is a more SMP-friendly version of DLMalloc (http://gee.cs.oswego.edu/dl/html/malloc.html).
Since memory is slow compared to the CPU, there is usually a fast but small memory area called cache between the CPU and the memory. When an access is done to non-cached memory, the CPU needs to wait until the cache has been updated. Accessing a memory location that has not been accessed for a long time will therefore take more time than accessing a recently accessed memory location.
One obvious way to deal with the cache problem is to disable the cache, given that the selected architecture supports this. While it would help the worst case latency, it would make the average latency horrible. You can also use CPU partitioning to let the real-time application run alone on a CPU, making sure that the application does not access more memory than fits in the cache.
Another consequence of the per-CPU caches becomes obvious when using shared memory among tasks running on different CPUs. Such memory can be shared either by threads in the same process or by using the shm_open() function call. When such memory is updated by one task and read by another task running on another CPU, the memory contents need to be copied. This copy usually has a minimum size, called a cache line, which for many architectures is 32 bytes. A one byte write will therefore end up copying 32 bytes in this situation, which is 32 times more than might be expected.
Summary:
Use mlockall(MCL_CURRENT | MCL_FUTURE) to lower memory access latency jitter.
Pre-allocate all needed dynamic memory, since the time needed for memory allocation is hard to predict.
Avoid sparse memory accesses to better utilize hardware caches.
Be careful about sharing memory between tasks running on different CPUs, as each access may end up copying 32 bytes on many architectures.
Consider using the M_MMAP_MAX and M_TRIM_THRESHOLD options in mallopt() in case dynamic memory cannot be pre-allocated.
Consider using the TLSF heap allocator if lower worst-case latency for malloc() is needed.
There are two driving forces to make applications multi-threaded:
Easier to design event driven applications.
Make use of concurrency offered by multicore systems.
A rule of thumb when designing a multi-threaded application is to assign one thread to each source of events, each output destination, and each state machine in the controller logic. This usually leads to many threads, which can result in the application spending a lot of time doing task switches between those threads. A user-space scheduler can solve that problem, but then the threads cannot run on different cores. You can also merge threads, so that work with higher real-time requirements is kept in separate threads from work with lower real-time requirements.
To make the application truly concurrent, a good choice is to use POSIX threads, pthreads. Each pthread can run on its own core, and all pthreads within the same process share memory, making it easy to communicate between the threads. Be aware, however, that if the threads need a lot of synchronization, the application will no longer be concurrent despite the fact that several CPUs are used. In that case the application might even become slower than a single-threaded one.
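A minimal sketch of a per-core worker setup using pthreads; the worker body and the CPU count are placeholders:

#define _GNU_SOURCE
#include <pthread.h>

#define NUM_CPUS 4  /* placeholder; query the system in real code */

static void *worker(void *arg)
{
    /* ... per-core event loop ... */
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_CPUS];

    for (long cpu = 0; cpu < NUM_CPUS; cpu++) {
        pthread_attr_t attr;
        cpu_set_t set;

        /* Pin each worker to its own core before it starts. */
        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[cpu], &attr, worker, NULL);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}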
Asynchronous message passing can solve some of the synchronization problems, especially in use cases where a thread with high real-time requirements can delegate work to a thread with lower real-time requirements. An example of such a mechanism is the POSIX functions mq_send() and mq_receive(). One challenge with message passing is how to handle flow control, i.e. what to do when the queue is full.
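A sketch of a sender that delegates work and detects a full queue immediately instead of blocking; the queue name and sizes are illustrative (link with -lrt):

#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int delegate_work(const char *msg)
{
    struct mq_attr attr = { .mq_maxmsg = 16, .mq_msgsize = 128 };

    /* O_NONBLOCK: a full queue returns EAGAIN instead of blocking
       the real-time sender. */
    mqd_t q = mq_open("/worker", O_WRONLY | O_CREAT | O_NONBLOCK,
                      0600, &attr);
    if (q == (mqd_t)-1) {
        perror("mq_open");
        return -1;
    }
    if (mq_send(q, msg, strlen(msg) + 1, 0) != 0) {
        if (errno == EAGAIN)
            fprintf(stderr, "queue full: apply flow control policy\n");
        mq_close(q);
        return -1;
    }
    return mq_close(q);
}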
Message passing, as well as other synchronization mechanisms, e.g. the POSIX pthread_mutex_lock(), can suffer from priority inversion, where a high priority task is forced to wait for a low priority task. For pthread_mutex_lock() it is possible to enable priority inheritance to avoid this problem, while for message passing this needs to be considered when designing the messaging protocol.
Example of how to set priority inheritance for a pthread mutex:
pthread_mutex_t mutex;
pthread_mutexattr_t attr;

pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
When using gcc, you need to compile the code with the options "-pthread -D_GNU_SOURCE".
Summary:
When designing an event-driven application, it is often easier to use multiple threads rather than callback functions.
Compared to single-threaded applications, multi-threaded applications using pthreads can scale better on multicore systems.
In a multi-threaded application where the threads require heavy synchronization, the threads will spend most of the time waiting for each other. This will make the application less scalable.
A user-space scheduler, compared to scheduling in the kernel, will allow more synchronization and more threads before efficiency goes down.
Properly designed message passing applications can make the synchronization less painful.
When synchronizing threads, beware of priority inversion. Message passing protocols need proper design, and POSIX mutexes can use priority inheritance to avoid priority inversion.
I/O accesses usually end up as system calls accessing a kernel driver. This kernel driver then accesses hardware, which can have an unbounded latency. The calling task will by default block during this period. It is however possible to perform asynchronous I/O to avoid being blocked. See the aio(7) man page for more information about asynchronous I/O.
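A sketch of POSIX asynchronous I/O, where the write is started without blocking and the result is collected later; the file name and the surrounding function are illustrative (link with -lrt):

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int log_async(const char *text)
{
    static struct aiocb cb;  /* must stay valid until the I/O is done */
    const struct aiocb *list[1] = { &cb };

    int fd = open("/tmp/rt.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)text;  /* buffer must also stay valid */
    cb.aio_nbytes = strlen(text);

    if (aio_write(&cb) != 0) {  /* returns immediately */
        close(fd);
        return -1;
    }

    /* ... do real-time work here ... */

    /* Block only when the result is actually needed. */
    aio_suspend(list, 1, NULL);
    int ret = (aio_return(&cb) < 0) ? -1 : 0;
    close(fd);
    return ret;
}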
The driver can add deferred work to a work queue or to soft IRQs, which can make it hard to predict latencies for a real-time application.
Furthermore, the driver might need some timeouts. If using the full dynamic ticks feature, such timeouts may cause tick interrupts to be triggered. One possible solution to this is to delegate the I/O calls to a task running on another CPU than the real-time tasks. Then the latencies caused by deferred driver work will only affect the delegated task, but not the real-time task.
Note that the I/O concerns also apply to text messages sent to stdout or stderr, or text read from stdin. If a device driver writes a diagnostic message from the kernel, e.g. by using the kernel function printk(), the I/O concerns apply to this message as well.
Summary:
Delegate I/O to a task running on another CPU than the real-time task.
If delegation is not possible, asynchronous I/O might help; see the aio(7) man page.
Table of Contents
The real-time properties of a system do not only depend on the operating system. The hardware is also important. This chapter contains a brief discussion on what to consider regarding real-time capable hardware.
Real-time capable hardware requires, at the minimum, that resources can be accessed within a deterministic time interval. If there is a risk for resource congestion, the hardware must implement deterministic multiple access algorithms which guarantee that there will be no starvation. Another typical hardware requirement in a real-time system is that CPUs have access to reliable high resolution timers.
Modern CPUs use a number of techniques to speed up code execution, such as instruction pipelines, out-of-order execution, and branch prediction. They all contribute to better average speed in code execution, but will also cause latency jitter. Read more about these topics on http://en.wikipedia.org/wiki/Instruction_pipeline, http://en.wikipedia.org/wiki/Out-of-order_execution, and http://en.wikipedia.org/wiki/Branch_prediction.
Designing real-time applications for SMP systems is more complex than designing for single-CPU systems. Typically, on SMP, all hardware resources such as I/O devices and memory are shared by the CPUs. Since all resources are shared, a low priority task could starve a high priority task, i.e. cause a priority inversion.
There are methods for scheduling multiple access to I/O devices, see Section 2.9, I/O Scheduling, but it is more tricky to manage shared memory. Currently there are no good tools for controlling multiple access to shared memory resources as there are for multiple access to I/O devices. Multiple access to shared memory is more hidden from a software perspective and is largely implemented in hardware.
A proper real-time design should consider how to deal with shared memory and memory congestion, how hyper-threading affects real-time performance, how to use NUMA to improve the execution time, and how to decrease impact of shared resource congestion; all of these topics are covered in this section.
For a deeper study in the area of resource sharing on multi-core devices, see http://atcproyectos.ugr.es/wicert/downloads/wicert_papers/wicert2013_submission_2.pdf.
SMP systems typically share a system bus, cache, MMU and one or several memory channels. An application that causes heavy load on a shared resource can significantly degrade performance for other CPUs. Not only CPUs, but also other devices that support DMA, will increase the congestion.
Memory congestion can have various sources. In Chapter 4, Designing Real-Time Applications you can study the software driven congestion caused by a shared-memory programming model. This section covers the impact from hardware resource congestion due to shared memory and how it affects the worst case execution time.
A real-time application designer probably wants to test the system to find a worst case execution time. The methods below describe how to stress the shared memory and measure an approximate worst case execution time. The methods are suitable for soft and firm real-time systems.
A pessimistic indication of what impact shared memory congestion has on worst case latency can be estimated like this:
Turn off caches. On some architectures it is possible to disable the caches. Above all, this will give the absolutely worst impact that cache misses can have on worst case execution time. It will also make it possible to measure the impact of congestion on the memory bus and on the off-chip memory.
Start memory load on each general-purpose CPU by calling the command specified below. This gives an indication about what impact the memory bus and off-chip memory congestion has on worst case execution time. This will give a good indication even if it isn't possible to disable the caches.
The application that is used to generate load is called stress and is described in Appendix B, Benchmark Details. Start the stress application on each non-real-time CPU. Use memory stress with a stride that equals the cache line size of the target architecture. Make sure that the value passed to --vm is larger than the last level shared cache.
taskset <GP-CPU-n> ./stress --vm <LAST_LEVEL_CACHE_SIZE> \
    --vm-stride <CACHE_LINE_SIZE>
That will likely thrash the shared cache and cause a minimal number of cache hits for the real-time application.
If the impact of MMU congestion is of interest, repeat step 2 but use a stride size that is equal to the system page size.
taskset <GP-CPU-n> ./stress --vm <LAST_LEVEL_CACHE_SIZE> --vm-stride <PAGE_SIZE>
The impact of the generated load in the above examples will vary significantly depending on the CPU clock speed, memory clock speed, cache size, coherency algorithm and cache/memory hierarchy. Changing these hardware parameters will create different congestion thresholds. Processor architectures that cannot guarantee CPU access to a specific bus or device within a deterministic amount of time cannot be used for real-time applications.
For hard real-time systems, Linux is probably not a suitable operating system. If you nevertheless choose Linux, a static analysis should be done instead of using the methods above. The static analysis is needed to calculate a theoretical worst case execution time, based on the number of clock cycles for a worst case scenario, which also takes the hardware latency into account.
Hyper-threading means that there are multiple virtual CPUs per physical CPU. Two or more instruction pipelines will share hardware resources. A low priority process and a high priority process can run on separate virtual CPUs belonging to the same physical CPU. This can lead to a situation where the low priority process decreases the performance of the high priority process. It is recommended to disable hyper-threading when real-time performance is required. Another approach is to make sure that each real-time task has exclusive access to a physical CPU.
The negative impact on worst case execution time can largely be eliminated if the target hardware supports NUMA. By using the Linux cpuset feature, it is easy to give the real-time application its own memory node. Read more about this in Section 2.6, CPU Isolation. Note that memory congestion will still occur if the real-time application runs on multiple CPUs in the real-time partition. However, that should be more manageable.
Below is a list with suggestions on how to decrease the impact of shared resource congestion.
If the platform has NUMA: Dedicate one NUMA node to the RT application. See Section 3.3.1, CPU Isolation.
Disable hyper-threading. If that isn't possible, use CPU isolation with static affinity so that only one real-time task executes per physical CPU.
Disable the cache if the architecture allows it. Do this to avoid possible indeterminism added by cache misses. If this is needed, it could indicate that Linux as operating system or the hardware platform is unsuitable for the application.
On some architectures it might be possible to lock real-time application pages into the cache. Consult the processor manual and, if available, the hardware platform specific SDK manual.
The x86 architecture has an operating mode called System Management Mode, also known as SMM. It is "a special-purpose operating mode provided for handling system-wide functions like power management, system hardware control, or proprietary OEM designed code." The SMM is entered via an event called system management interrupt, SMI. SMM/SMI has a higher priority than the operating system and will therefore affect the latency. It cannot be disabled by the OS, and even if there might be other ways to disable it, it should probably be kept since it also handles thermal protection. Consequently, there is not much that can be done about it except for adding enough margins to tolerate it. Read more about SMM in http://en.wikipedia.org/wiki/System_Management_Mode.
Table of Contents
When someone states the goal to "optimize a specific Linux target for real-time" and provides a benchmark result, it is very important to be clear on what capabilities the measured system actually has. Benchmark results may be interesting to read, but they are only valid and relevant if they are somewhat comparable with each other and if the setup is relevant for real-world use cases.
This appendix states the goal to optimize for real-time, but it actually tries to reach as far as possible regarding both throughput performance and low worst-case latency response time since the use case we focus on is an embedded control system within the communications domain, which normally has both fairly high soft real-time requirements and performance requirements.
In a real-time system, the characteristic behavior of the entire operating system is very important. To start with, a deterministic response time from an external event until the application is invoked is what we normally refer to when talking about real-time characteristics of an operating system. This implies not only the interrupt latency, but also the event chain until the application gets scheduled. Since a chain is not stronger than its weakest link, it is also important to provide a deterministic runtime environment for the entire application execution flow so that it can respond within the specified operational deadline on system level. This implies that also the task scheduler and the resource handling API in the OS must behave in a deterministic way. When designing a system for high throughput performance, the goal is to keep down the average latency, while designing a real-time system aims to keep the worst case latency under a specified limit. As a consequence, the design of a both high-performing and real-time capable system must take both average and maximum latency into account, and this is something we will strive for in this application note.
The selected target for this exercise is the Enea Linux PowerPC kernel for the p2041rdb board. The p2041rdb kernel (and most other Enea Linux PPC kernels) is built in two different ways:
A high-performance kernel image that is optimized for high throughput performance and low footprint, for example intended for IP benchmarking. This image, and its corresponding kernel configuration file, is tagged "RELEASE".
A demo/debug kernel image that has the goal to be fully configured regarding all kinds of online and offline debug capabilities. The demo/debug image is not tagged, and this is also the one that you can modify via the command bitbake -c menuconfig virtual/kernel, and rebuild via the command bitbake virtual/kernel.
Both kernels are configured as Low-Latency Desktop (LLD) kernels, i.e. the most preemptive standard kernel variant (selected by CONFIG_PREEMPT).
The strategy we will follow in the tuning effort is to go through a number of steps; for each step we briefly describe the configuration level and the latency benchmark result:
The first attempt is with the default "demo/debug" kernel image, mainly to highlight the difference from the end result caused by debug overhead and by selective tuning of important parameters.
The performance-optimized "RELEASE" kernel image, which is clearly configured for speed. However, as will be shown here, it is not optimized for latency, and we can do additional tuning to improve both performance and worst case latency.
A standard LLD kernel image highly tuned and optimized for real-time AND performance. This kernel is based on the RELEASE kernel configuration but with additional build configuration changes and boot time configuration options that give the smallest worst-case latency figure possible while not compromising the performance.
Finally, we enable the PREEMPT_RT patch as an "overlay" on the previous LLD kernel in 3) in order to see our possible best result regarding worst case latency.
Note that the goal here is to optimize for performance and real-time; in the development, deployment and maintenance phases of a real-life production system, we will very likely have to add back some of the tracing and debugging features we have now removed, because the system would otherwise become unmaintainable. This has a price in overhead, but the performance/debug tradeoffs differ from case to case.
The Enea Linux PowerPC kernel is built so called "staged"; first the "RELEASE" image is configured and built, and after that the normal demo/debug image is configured and built. The recipe for the kernel build can be found under poky/meta-enea-recipes-kernel/linux. The file linux-qoriq-sdk.bbappend is essential here; it describes exactly what configuration shall go into both the RELEASE kernel and the demo kernel. The kernel configuration file (.config) is built up incrementally by merging configuration fragments in a specific order from the sub-directory files/cfg, according to specific variable definitions.
The configuration file for the high-performance RELEASE kernel is defined by the incremental merge of the fragments specified in the KERNEL_FEATURES variable, and the resulting .config file can also be found as the config-p2041rdb-RELEASE.config file in the deployment images directory. The default demo/debug kernel has additional configuration fragments merged into its .config file, specified by the STAGING_KERNEL_FEATURES variable, and the aggregated .config file is named config-p2041rdb.config.
The worst case latency benchmark uses a combination of cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and stress (http://people.seas.harvard.edu/~apw/stress). The buffer sizes used in stress are chosen both to generate a large stress load on the network via NFS traffic in the hdd test, and in an attempt to resemble a real live embedded application. The values are:
Table A.1 Details of stress scenarios
Test scenario | Stress command
---|---
hdd | ./stress -d 4 --hdd-bytes 1M
vm | ./stress -m 4 --vm-bytes 4096
full | ./stress -c 4 -i 4 -m 4 --vm-bytes 4096 -d 4 --hdd-bytes 4096
The benchmark runs one stress instance per core in parallel with the cyclictest program:
./cyclictest -S -m -p99 -l 100000
This kernel is configured to contain all kinds of debug features, and thus it has a lot of overhead. Below is an enumeration of the added features, briefly described by the name of the configuration fragment:
files/cfg/00020-debug.cfg: a misc collection of numerous ftrace options, performance events, counters, stacktrace support, and file system debug options.
files/cfg/00033-kprobes.cfg
files/cfg/00007-oprofile.cfg
files/cfg/00019-i2c.cfg
files/cfg/00027-lttng.cfg
files/cfg/00025-powertop.cfg
files/cfg/00004-systemtap.cfg
files/cfg/00014-kgdb.cfg
Benchmark results for the default build uImage-p2041rdb.bin image (prebuilt in the distribution):
Table A.2 Benchmark numbers for the "default" LLD kernel (demo/debug)
Latency [µs] | no stress | cpu | io | vm | hdd | full
---|---|---|---|---|---|---
Min | 12 | 10 | 19 | 10 | 11 | 10
Average | 24 | 17 | 32 | 18 | 30 | 25
Max | 382 | 33 | 90 | 39 | 388 | 230
The table above shows the resulting min, average, and max latency in microseconds, as printed by the cyclictest program that ran one instance on each core in parallel with the stress program.
The result shows a fairly long and fluctuating worst case latency, as well as significant overhead in the min and average values. The conclusion is that this is neither a suitable kernel configuration for production systems with any kind of real-time requirements, nor for systems where performance is important. This kernel is fully featured, with its pros and cons.
As described earlier, this is a kernel configuration without the debug features found in the default demo/debug kernel. Benchmark results for the build uImage-p2041rdb-RELEASE.bin (prebuilt in the distribution):
Table A.3 Benchmark numbers for the performance-optimized RELEASE kernel
Latency [µs] | no stress | cpu | io | vm | hdd | full
---|---|---|---|---|---|---
Min | 4 | 4 | 5 | 4 | 4 | 4
Average | 7 | 5 | 6 | 4 | 7 | 6
Max | 67 | 32 | 18 | 25 | 126 | 56
The result still shows a somewhat fluctuating worst case latency, but the min and average values are now significantly improved. The conclusion is that this is still not a suitable kernel configuration for production systems with any kind of real-time requirements. However, for systems where you want to minimize the kernel overhead in order to maximize application performance, this is a potential base configuration to which you may add the debug features deemed necessary in the field.
The two kernel builds previously described exist "out-of-the-box" in the Enea Linux p2041rdb distribution, so in order to do further benchmarking we need to describe how to modify the kernel configuration for new builds. We can do this either by using the command bitbake -c menuconfig, or by temporarily modifying the kernel recipe in the meta-enea layer. We choose to modify the kernel recipe, mainly in order to enable reproduction, but also because we would otherwise have to do a substantial amount of "reversing" of options in the menuconfig command, since the .config file we have to work with is the demo/debug one with all debug features enabled. The modifications can be described in three steps:
Add a new, latency-optimizing fragment file 000xx-latency.cfg under files/cfg that contains:
# CONFIG_HIGHMEM is not set
# CONFIG_FSL_ERRATUM_A_004801 is not set
# CONFIG_FSL_ERRATUM_A_005337 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_JUMP_LABEL=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
CONFIG_RCU_NOCB_CPU=y
Our intended target system for this exercise is still an embedded 32-bit target, and we want a tick resolution of 1 ms (HZ=1000). As of today, most p2041 processor devices are rev 2 or later, which means that we can safely disable the HW errata workarounds; these workarounds disable the HW MMU table walk and revert to a SW MMU table walk, which makes the kernel slower. GCC for PowerPC produces slightly better code when very likely/unlikely branches are optimized, so we enable CONFIG_JUMP_LABEL. We also remove some additional potential tracing overhead. A fundamental contribution to improving the real-time characteristics and reducing OS jitter is to enable priority boosting for RCU readers and to offload RCU callback invocation to kernel threads (available from kernel version 3.7).
2. Edit the file linux-qoriq-sdk.bbappend. Replace the content of STAGING_KERNEL_FEATURES (debug, kprobes, oprofile, i2c, lttng, powertop, systemtap, kgdb) with the one single fragment cfg/00040-latency from step 1 above; a sketch of the resulting statement is shown after this list.
3. Add the argument threadirqs to the Linux kernel boot argument list using U-Boot, e.g.:
setenv bootargs threadirqs root=/dev/nfs rw …
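For reference, a minimal sketch of what the edited statement in step 2 could look like. This is an assumption about the recipe syntax; the exact variable assignment and fragment file suffix in the real linux-qoriq-sdk.bbappend may differ:

# Sketch only - the actual syntax in linux-qoriq-sdk.bbappend may differ.
# All debug fragments are replaced by the single latency fragment from step 1:
STAGING_KERNEL_FEATURES = "cfg/00040-latency.cfg"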
The actions above will further improve the worst-case latency figures as much as possible for a standard LLD PowerPC Linux kernel. The RCU callbacks can be fairly heavy, and if their execution is offloaded to, and balanced among, pre-emptible kernel threads, we get lower jitter in the response to external interrupts and thus better worst-case latency figures. Similarly, some ISRs (Interrupt Service Routines) can be very long, and since they normally execute with interrupts (and thus preemption) disabled, the risk of such ISRs adding to the worst-case latency is very high. Since kernel version 2.6.39 (May 2011, as a "spin-off" from the PREEMPT_RT patch) it is possible to give the boot-time parameter threadirqs, which instructs the kernel to defer the execution of each ISR from hardirq context to a pre-emptible kernel thread, one per ISR. This removes much of the drivers' ISR execution time from the sources of latency jitter and thus contributes to the improvement of overall determinism. Due to the added context switch it may increase the overhead slightly, but since this is a subset of the PREEMPT_RT patch, we know that the corresponding overhead is also less than in the PREEMPT_RT case.
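Once the rebuilt kernel is booted with threadirqs, the effect of these changes can be sanity-checked directly on the target. A minimal sketch, assuming the kernel exposes its configuration via /proc/config.gz (requires CONFIG_IKCONFIG_PROC) and that a standard shell is available:

# Verify that the latency-related options took effect:
$ zcat /proc/config.gz | grep -E 'CONFIG_HZ=|CONFIG_RCU_BOOST|CONFIG_RCU_NOCB_CPU|CONFIG_JUMP_LABEL'
# With threadirqs, interrupt handlers run in kernel threads named irq/<nr>-<name>,
# and the offloaded RCU callbacks run in rcuo kthreads:
$ ps ax | grep -e 'irq/' -e 'rcuo'

If needed, the priority of an individual IRQ thread can then be adjusted with chrt, e.g. chrt -f -p 85 <pid>.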
Benchmark results for the uImage build configured as described above and booted with threadirqs:
Table A.4 Benchmark numbers for the "rt-optimized" mainstream LLD kernel

Latency [µs] | no stress | cpu | io | vm | hdd | full
---|---|---|---|---|---|---
Min | 3 | 3 | 4 | 3 | 3 | 3
Average | 6 | 3 | 5 | 3 | 6 | 4
Max | 31 | 16 | 19 | 20 | 45 | 44
The resulting figures demonstrate a significant decrease in jitter, and we end up with a fairly good worst-case latency of around 45 µs, even while the system is loaded with network traffic via NFS file system operations in the hdd test. The minimum and average values are also very low, which indicates that the average code path executed with preemption disabled has become shorter. The conclusion is that this kernel configuration would be a possible option for production systems with soft real-time requirements in the region of 100 µs.
Just as for the rt-optimized LLD kernel in the previous section, we have to modify the kernel recipe temporarily in order to configure and build this kernel. The modifications we have to do can be described as follows:
1. Repeat steps 1 & 2 in Section A.7, The "rt-optimized" mainstream Low-Latency Desktop kernel.
2. Make another copy of the
STAGING_KERNEL_FEATURES_RT_p4080ds
statement and call
it STAGING_KERNEL_FEATURES_p2041rdb
. This will
generate the merge of the fragment 00018-rt on top of the LLD kernel and
enable the preempt_rt patch.
3. The existence of the _RT statement triggers a third stage kernel build, named uImage-p2041rdb-rt.bin and a corresponding config file.
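As a sketch, and assuming that the copied statement follows the same pattern as the p4080ds one (the exact fragment list in the real recipe may differ), the result could look like:

# Sketch only - copied and renamed from the p4080ds statement; the _RT suffix
# triggers the third-stage -rt build with fragment 00018-rt merged on top:
STAGING_KERNEL_FEATURES_RT_p2041rdb = "cfg/00018-rt.cfg"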
Benchmark results for the build uImage-p2041rdb-rt.bin:
Table A.5 Benchmark numbers for the "rt-optimized" PREEMPT_RT kernel

Latency [µs] | no stress | cpu | io | vm | hdd | full
---|---|---|---|---|---|---
Min | 3 | 3 | 3 | 3 | 3 | 3
Average | 6 | 3 | 7 | 3 | 7 | 4
Max | 16 | 12 | 19 | 13 | 27 | 18
The benchmark results show that the preempt_rt patched kernel improves the worst-case latency figures even further: 27 µs, compared to the 45 µs we could reach with the standard LLD kernel. The other observation is that the minimum and average figures are very similar, perhaps with a slightly longer average latency and thus overhead for the preempt_rt kernel, but the difference is not significant. The conclusion is that a preempt_rt kernel configuration would be a possible option for production systems with soft real-time requirements in the region of 50-100 µs.
As a short summary, the worst-case latency we have seen ranges from close to 400 µs for the demo/debug LLD kernel, down to as low as 40-45 µs for the rt-optimized LLD kernel with threadirqs, and 25-30 µs when PREEMPT_RT is enabled. We have also seen a significant decrease in the minimum and average latency, from around 20-25 µs down to about 3-6 µs, which implies that we have also gained a significant overall increase in throughput performance.
The benchmark indicates that the last few years' development in the mainline Low-Latency Desktop kernel, with for example the threaded IRQs feature and the offloaded RCU callback feature, has made it possible to reduce OS jitter and worst-case latency to a level where the mainline kernel actually starts to be a real alternative to the preempt_rt patched kernel as the OS choice for an embedded Linux system with soft real-time requirements.
The benchmarks above are constructed only to indicate potential ways forward to reach soft real-time requirements. The chosen test case does not in any way guarantee that the results are valid for every type of BSP or system. It is important to note that other drivers, or different versions of them, may affect the result, as may different kernel versions or application use case patterns.
The combination of cyclictest and stress has been used for the benchmarking presented in this document. Details are given in the following tables:
Table B.1 Parameters to cyclictest used in this document

Parameter | Description
---|---
-S | Alias for '-t -a -n', i.e. one thread bound to each core. Use clock_nanosleep instead of POSIX timers.
-m | Lock current and future memory allocations in memory.
-p99 | Set worker thread priority to 99.
-l 100000 | Set the number of cycles.
-q | Quiet run; print only a summary on exit.
Table B.2 Stress scenarios

Id | Command | Description
---|---|---
cpu | ./stress -c <n> | <n> worker threads spinning on sqrt()
hdd | ./stress -d <n> --hdd-bytes 20M | <n> worker threads spinning on write()/unlink(). Each worker writes 20MB.
io | ./stress -i <n> | <n> worker threads spinning on sync()
vm | ./stress -m <n> --vm-bytes 10M | <n> worker threads spinning on malloc()/free(). Buffer size: 10MB.
full | ./stress -c <n> -i <n> -m <n> --vm-bytes 15M | <n> worker threads each for the cpu, io, and vm stress. malloc buffer size: 15MB.
The number of worker threads (<n>) is set to the number of cores for the target.
Example

$ ./stress -c 4 &
$ ./cyclictest -S -m -p99 -l 300000 -q
...
$ killall stress
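When scripting the benchmark, the number of worker threads can be derived from the target itself instead of being hard-coded. A minimal sketch, assuming the nproc utility (coreutils) is available on the target:

# Run the "full" stress scenario with one worker per core:
$ n=$(nproc)
$ ./stress -c $n -i $n -m $n --vm-bytes 15M &
$ ./cyclictest -S -m -p99 -l 100000 -q
$ killall stress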