Enea® Linux Real-Time Guide

4.0-docupdate1

Copyright

Copyright © Enea Software AB 2014.

This User Documentation consists of confidential information and is protected by Trade Secret Law. This notice of copyright does not indicate any actual or intended publication of this information.

Except to the extent expressly stipulated in any software license agreement covering this User Documentation and/or corresponding software, no part of this User Documentation may be reproduced, transmitted, stored in a retrieval system, or translated, in any form or by any means, without the prior written permission of Enea Software AB. However, permission to print copies for personal use is hereby granted.

Disclaimer

The information in this User Documentation is subject to change without notice, and unless stipulated in any software license agreement covering this User Documentation and/or corresponding software, should not be construed as a commitment of Enea Software AB.

Trademarks

Enea®, Enea OSE®, and Polyhedra® are the registered trademarks of Enea AB and its subsidiaries. Enea OSE®ck, Enea OSE® Epsilon, Enea® Element, Enea® Optima, Enea® Linux, Enea® LINX, Enea® LWRT, Enea® Accelerator, Polyhedra® Flash DBMS, Polyhedra® Lite, Enea® dSPEED, Accelerating Network Convergence™, Device Software Optimized™, and Embedded for Leaders™ are unregistered trademarks of Enea AB or its subsidiaries. Any other company, product or service names mentioned in this document are the registered or unregistered trademarks of their respective owner.

Acknowledgements and Open Source License Conditions

Information is found in the Release Information manual.

Table of Contents

1 - Introduction
1.1 - How to read this document
1.2 - Terminology
2 - Basic Linux from a Real-Time Perspective
2.1 - Kernel Preemption Model
2.2 - Scheduling
2.2.1 - SCHED_FIFO and SCHED_RR
2.2.2 - SCHED_OTHER
2.2.3 - SCHED_BATCH
2.2.4 - SCHED_IDLE
2.3 - Real-Time Throttling
2.4 - The PREEMPT_RT Patch
2.5 - CPU Load Balancing
2.6 - CPU Isolation
2.7 - Dynamic Ticks (NO_HZ)
2.8 - Power Save
2.8.1 - Dynamic Frequency Scaling
2.8.2 - CPU Power States
2.8.3 - CPU Hotplug
2.9 - I/O Scheduling
2.10 - System calls
2.11 - Interrupts
2.12 - Soft IRQs and Tasklets
2.13 - Work Queues
2.14 - Threaded Interrupts
2.15 - Ticks
2.16 - Memory Overcommit
2.17 - RCU - Read, Copy and Update
3 - Improving the Real-Time Properties
3.1 - Evaluating Real-Time Properties
3.2 - First Attempt
3.2.1 - Configure the Proper Kernel Preemption Model
3.2.2 - Optimize Power Save
3.2.3 - Use Threaded Interrupts
3.2.4 - Optimize Real-Time Throttling
3.3 - Isolate the Application
3.3.1 - CPU Isolation
3.3.2 - Full Dynamic Ticks
3.3.3 - Optimizing a Partitioned System
3.3.4 - Benchmarks for CPU isolation
3.3.5 - Further reading about CPU isolation
3.3.6 - The CPU Partitioning Tool - partrt
3.4 - Further Actions if Needed
3.4.1 - User Space Runtime
3.4.2 - Use an RTOS
4 - Designing Real-Time Applications
4.1 - Application Memory Handling
4.2 - Multi-Threaded Applications
4.3 - Application I/O
5 - Hardware Aspects
5.1 - CPU
5.2 - Shared Resources
5.2.1 - Shared Memory and Memory Congestion
5.2.2 - Hyper-Threading
5.2.3 - NUMA
5.2.4 - Decrease Impact of Shared Resource Congestion
5.3 - The System Management Mode (x86)
A - Optimizing Example - P2041
A.1 - In reality, "real-time" is not only about interrupt latency
A.2 - Performance and latency tuning strategy
A.3 - Configuring and building the Enea Linux PowerPC p2041rdb kernel(s)
A.4 - Benchmark description
A.5 - The "default" LLD kernel targeting demo/debug
A.6 - The performance-optimized RELEASE kernel
A.7 - The "rt-optimized" mainstream Low-Latency Desktop kernel
A.8 - The "rt-optimized" PREEMPT_RT kernel
A.9 - Summary and Conclusion
B - Benchmark Details
Index

1. Introduction

As hardware has become ever more capable, there has been a trend to implement real-time applications on Linux. Linux was designed from the beginning for server and desktop applications, not for real-time applications. This means that achieving real-time properties on Linux is not trivial. This document is a guide for anyone attempting to implement a real-time application using Linux.

1.1 How to read this document

This document is divided into the following sections:

Chapter 2, Basic Linux from a Real-Time Perspective

This section is intended as a textbook, explaining areas of interest for designers of real-time systems.

Chapter 3, Improving the Real-Time Properties

This section lists a number of things that can be done to improve the real-time behaviour of Enea Linux. Some are of a general nature and easy to apply, while others are suitable in fewer situations and/or require a greater effort.

Chapter 4, Designing Real-Time Applications

This section gives tips and hints about how to design your application for real-time requirements.

Chapter 5, Hardware Aspects

This section gives some hints about how to handle some hardware aspects that do impact the real-time properties.

Readers who already have a good understanding of how Linux works may want to go directly to Chapter 3, Improving the Real-Time Properties.

1.2 Terminology

blocked

When a task is waiting for an event, the task is said to be blocked.

boot parameters

Kernel boot parameters, i.e. parameters given to the kernel when booting. See kernel source file https://www.kernel.org/doc/Documentation/kernel-parameters.txt.

bottom half

Work that is initiated by an interrupt, but which can be done without interrupts being disabled. This is also referred to as deferred interrupt work, and this work is typically scheduled using a soft IRQ, a tasklet, or a work queue. See also top half.

core

A hardware resource capable of executing a thread of code in parallel with other cores.

CPU

Either a core or a hyper-thread. In Linux this represents a hardware resource capable of executing a thread of code in parallel with other CPUs.

CPU affinity

Tasks, interrupts, etc. can have a CPU affinity, which means that they will only run on the CPUs given as an affinity mask or affinity CPU list.

CPU hotplug

Kernel feature for adding and removing CPUs at runtime.

CPU isolation

Make a CPU as independent as possible of work done on other CPUs.

CPU partitioning

CPU partitioning is about grouping CPUs together. In the scope of this document, the intention is to group a set of CPUs in order to perform CPU isolation on each CPU within this group.

cpuset

Kernel feature used to perform CPU partitioning.

critical section

A code path which needs to be protected from concurrency. It could be a global state that takes several operations to update and where all those operations must appear to be one atomic update. Such sections are usually protected using, for example, a mutex.

dynamic ticks

Used to denote idle dynamic ticks as well as full dynamic ticks.

full dynamic ticks

Kernel feature to inhibit tick interrupts when running a single task on a CPU.

GP partition

A CPU partition intended for general purpose applications, i.e. applications that do not have real-time requirements.

hard real-time

When a response to an event must be done within a well defined time limit, or else the application will fail miserably, this is referred to as hard real-time.

hyper-threading

A technique to allow two executions to occur simultaneously on the same core as long as they don't need the same core resources. This means that these executions heavily affect each other, even though Linux presents them as being two different CPUs.

idle dynamic ticks

Kernel feature to inhibit tick interrupts when CPU is idle.

interrupt

Hardware functionality for indicating asynchronous events. Interrupts will preempt the current execution and start executing the top half of an interrupt handler.

IRQ

Interrupt request - Each type of interrupt is assigned its own vector, which is called IRQ vector or simply IRQ.

I/O isolation

Technique for making code execution occur in parallel with I/O latencies to make the execution less dependent on hardware latencies.

jitter

In this document, jitter refers to the variance in latencies. This includes jitter caused by work done by other threads of execution, jitter induced by the application itself, and jitter caused by work done by the kernel.

kernel configuration

This refers to the kernel configuration being done before compiling the kernel, e.g. using make menuconfig from the root directory of the kernel source.

kernel space

Code executed in the kernel, either compiled into the kernel or as a kernel module.

kthread

Task executing in kernel space. All code executing in the kernel shares the same memory.

latency

Time from when an event occurs until a response has been produced. The most important part of latency for the scope of this document is the latency caused by anything other than the actual work needed to respond to the event.

LWP

Light-weight process, kernel side of a pthread. LWP is used to make it possible to put pthreads on different CPUs while still sharing memory with a single parent process.

normal tasks

Task with SCHED_OTHER, SCHED_IDLE or SCHED_BATCH scheduling policy. SCHED_OTHER is by far the most common one.

NUMA

Non-uniform memory access is a design of a multi-core system where the access time for a specific memory range can depend on which CPU is accessing it, excluding any effects that caches might have.

partrt

A tool for performing CPU partitioning. Available from https://github.com/OpenEneaLinux/rt-tools.

preemption

When an execution is involuntarily interrupted to begin another thread of execution, it is said that the execution is preempted.

PREEMPT_RT

Set of patches to achieve a fully pre-emptible kernel. See https://rt.wiki.kernel.org/index.php/Main_Page for more information.

priority inheritance

When a task is waiting for a lock owned by a less prioritised task and the lock has priority inheritance, then the task owning the lock will be raised to the same priority as the waiting task for as long as the lock is being held. This technique avoids priority inversion problems, where a more prioritised task is forced to wait for a less prioritised task.

process

Task which does not share memory with its parent and is running in user space.

pthreads

POSIX implementation of threads. In Linux, every pthread is associated with a kernel side task called LWP. Therefore each pthread has its own kernel task ID (thread ID), which can be retrieved using the gettid() system call.

RCU

Read, copy, update. A lock mechanism that makes reading very cheap at the expense of more work when writing. See http://lwn.net/Articles/262464 for an article about how it works.

real-time application

An application with real-time requirements. It is enough that one task within the application has those requirements for the application to be called a real-time application.

real-time properties

Properties of a system with predictable latency.

real-time tasks

A task with scheduling policy SCHED_FIFO or SCHED_RR.

scheduling

How to distribute resources as a function of time, e.g. I/O scheduling is how to distribute I/O accesses as a function of time and task scheduling is how to distribute CPU execution as a function of time.

soft IRQ

A technique for implementing the bottom half handling of an interrupt. Also see tasklet.

RT partition

Real-time partition, a CPU partition intended for real-time applications.

SMI

System management interrupt. This is an x86 specific feature used for CPU overheating protection and fixing microcode bugs. The interrupt is handled by BIOS, which is outside Linux control.

SMM

System management mode. When a CPU receives an SMI, it enters system management mode. Code running in this mode is found in the BIOS.

SMP

Symmetric multiprocessor. This describes a multi-core system when all cores are handled by one operating system and treated as one resource.

soft real-time

When there are requirements on maximum latency, but where failing those requirements causes a graceful degradation, this is referred to as soft real-time.

system call

Function calls that are implemented in the kernel are called system calls. When a system call is issued, the currently running task goes from user space to kernel space and runs the implementation in the kernel, and then back to user mode again.

task

The scheduling entity for a task scheduler. Can be a process, LWP (pthread) or kthread.

task switch

When a task stops running and another task gets to run on the same CPU. This can happen when a task is preempted or when it yields.

tasklet

A bottom half implementation that is more generic than soft IRQ. It is implemented as a soft IRQ vector.

threaded interrupts

When registering an interrupt handler it is possible to register an associated interrupt thread. In this case the interrupt handler is expected to be very short, and the main part of the interrupt handling should be done in this interrupt thread.

throughput

Throughput is measured by how much useful work can be achieved over a longer period of time, and is a way of measuring efficiency. It is often in conflict with latency.

ticks

A per-CPU periodic interrupt used to ensure that time slices are distributed fairly among the currently running tasks on the CPU.

top half

Interrupt handling code that needs to be run with interrupts disabled. Also see threaded interrupts and bottom half.

user space

Code executed in a process or pthread, as opposed to in kernel space.

work queue

Kernel feature for handling small work packages in a kthread scope. Can be used for bottom half work or for work induced by a system call, but that can be deferred to a later time.

yielding

When a task voluntarily gives up execution this is referred to as yielding. This could either be because the task has requested to sleep for an amount of time, or because it is blocking on a lock owned by another task.

2. Basic Linux from a Real-Time Perspective

A system with real-time constraints aims to perform work with a guarantee on the time when the work will be finished. This time is often called a deadline, and the system is designed with the purpose of missing no deadlines, or as few as possible.

A system where the consequences of missing deadlines are severe, for example with respect to danger for personnel or damage of equipment, is called a hard real-time system.

A system where deadlines occasionally can be missed is called a soft real-time system.

The work done by a real-time system is often initiated by an external event, such as an interrupt. The nature of the work often requires participation of one or more concurrently executing tasks. Each task participates by doing processing, combined with interaction with other tasks. The interaction typically leads to one or more task switches. A deadline for the work is often associated with the completion of the work, i.e. when the last participating task has finished its processing.

There are several challenges when implementing real-time systems. One challenge is to obtain as much determinism as possible. In this way, it becomes easier to make predictions, and calculations, of the actual execution time that will be required. From these predictions, the risk of the system missing deadlines can be evaluated.

When implementing a soft real-time system, and using Linux as an operating system, it is important to try to characterize possible sources of indeterminism. This knowledge can then be used to configure, and perhaps also modify Linux, so that its real-time properties become more deterministic, and hence that the risk of missing deadlines is minimized, although not guaranteed.

The remainder of this section gives a selection of areas in Linux where sources of indeterminism are present. The purpose is to give a brief technical overview, to serve as general advice for projects aiming at using Linux in real-time systems.

Each section in this chapter objectively describes:

  • The topic/mechanism

  • Default configuration

  • The real-time impact

It does not describe how to optimize for real-time behavior. That is done in Chapter 3, Improving the Real-Time Properties. The reason is to make it possible for a reader to skip this basic chapter.

2.1 Kernel Preemption Model

A task switch occurs when the currently running task is replaced by another task. In Linux, a task switch can be the result of two types of events:

  1. As a side effect of a kernel interaction, e.g. a system call or when the kernel function schedule() is called. This is referred to as yield. The function schedule() can be used by kernel threads to explicitly suggest a yield.

  2. As a result of an asynchronous event, e.g. an interrupt. This is referred to as preemption and occurs asynchronously from the preempted task's point of view.

In the kernel documentation, the term voluntary preemption is used instead of yield, and forced preemption is used for what is here called preemption. The terms in this document were chosen since, strictly speaking, preemption means "to interrupt without the interrupted thread's cooperation", see http://en.wikipedia.org/wiki/Preemption_%28computing%29.

Note that the preemption model only determines when a task switch may occur. The algorithms used to determine if a switch shall be done and then which task to swap in belong to the scheduler and are not affected by the preemption model. See Section 2.2, Scheduling for more info.

Where a task can be preempted depends on whether it executes in user space or in kernel space. A task executes in user space if it is a thread in a user space application, except while it is executing a system call. Otherwise it executes in kernel space, i.e. in system calls, kernel threads, etc.

In user space, tasks can always be preempted. In kernel space, you either allow preemption at specific places, or you disallow preemption at specific places, depending on the preemption model kernel configuration.

Simplified, the choice of preemption model is a balance between responsiveness (latency) and scheduler overhead. Lower latency requires more frequent opportunities for task switches which results in higher overhead and possibly more frequent task switches.

Linux offers several different models, specified at build time:

  • No Forced Preemption (Server)

  • Voluntary Kernel Preemption (Desktop)

  • Preemptible Kernel (Low-Latency Desktop)

  • Preemptible Kernel (Basic RT) *

  • Fully Preemptible Kernel (RT) *

*) These models require the PREEMPT_RT patch plus kernel configuration. See the "Target Guide" section in the Enea® Linux User's Guide for build instructions.

The server and desktop configurations both rely entirely on yield (voluntary preemption). The difference is mainly that with the desktop option there are more system calls that may yield.

Low-latency desktop introduces kernel preemption. This means that the code is preemptible everywhere except in parts of the kernel where preemption has been explicitly disabled, as for example in spinlocks.

The preemption models RT and basic RT require the PREEMPT_RT patch, https://rt.wiki.kernel.org/index.php/Main_Page. They are not only additional preemption models as they also add a number of modifications that further improve the worst case latency. Read more about PREEMPT_RT in Section 2.4, The PREEMPT_RT Patch.

The Basic RT model is mainly intended for debugging of RT; use RT instead. RT aims to minimize the parts of the kernel where preemption is explicitly disabled.
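As a rough aid for checking which of the models above a running kernel was built with, the following Python sketch inspects the kernel version string (the text printed by uname -v). It assumes that preemptible kernels advertise themselves with a PREEMPT marker and PREEMPT_RT kernels with PREEMPT_RT (or "PREEMPT RT" in older patch sets); these markers, the PREEMPT_DYNAMIC case found in newer kernels, and the function name preemption_hint are not taken from this guide. The kernel configuration remains the authoritative source.

```python
import platform


def preemption_hint(version_string):
    """Guess the preemption model from a `uname -v` style string.

    This is a heuristic: it only looks for well-known markers and cannot
    tell the Server and Desktop models apart.
    """
    # Older PREEMPT_RT patch sets print "PREEMPT RT" as two words.
    tokens = version_string.replace("PREEMPT RT", "PREEMPT_RT").split()
    if "PREEMPT_RT" in tokens:
        return "Fully Preemptible Kernel (RT)"
    if "PREEMPT_DYNAMIC" in tokens:
        # Assumption: newer kernels can select the model at boot time.
        return "Preemption model selectable at boot"
    if "PREEMPT" in tokens:
        return "Preemptible Kernel (Low-Latency Desktop)"
    return "No Forced Preemption (Server) or Voluntary Kernel Preemption (Desktop)"


if __name__ == "__main__":
    # platform.version() returns the same text as `uname -v` on Linux.
    print(preemption_hint(platform.version()))
```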

2.2 Scheduling

A scheduling policy is a set of rules that determines if a task shall be swapped out and then which task to swap in. Linux supports a number of different scheduling policies:

  • SCHED_FIFO

  • SCHED_RR (also called Round-Robin scheduling)

  • SCHED_OTHER (also called SCHED_NORMAL)

  • SCHED_BATCH

  • SCHED_IDLE

The scheduling policy is a task property, i.e. different tasks can have different policies.

SCHED_FIFO and SCHED_RR are sometimes referred to as the real-time scheduling classes.

The standard way of scheduling tasks in Linux is known as fair scheduling. This means that Linux aims to give each task a fair share of the CPU time available in the system.

In a real-time system, work is deadline-constrained and the most important quality of task scheduling is to meet deadlines rather than fairness of CPU utilization. Fair scheduling may affect the time passed from when a task becomes ready for execution to when it actually starts to execute, since there may be other tasks that have not yet been allowed their share of the processor time. As an additional complication, the actual length of the delay depends on the number of tasks being active in the system and is therefore difficult to predict. In this way, the system becomes indeterministic.

There are other methods of scheduling in Linux that can be used. For instance, it is possible to use priority-based scheduling, similar in its policy to scheduling methods used in real-time operating systems. This scheduling method, referred to as SCHED_FIFO, increases the determinism when interacting tasks perform work together. It requires, however, that explicit priorities are assigned to the tasks.

2.2.1 SCHED_FIFO and SCHED_RR

SCHED_FIFO and SCHED_RR are the two real-time scheduling policies. Each task that is scheduled according to one of these policies has an associated static priority value that ranges from 1 (lowest priority) to 99 (highest priority).
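The priority range and the policy of a task can also be handled programmatically. The following Python sketch, which assumes a Linux system and uses the scheduler calls in the standard os module, queries the valid SCHED_FIFO priority range and then tries to switch the calling task to SCHED_FIFO. The function names are illustrative only, and the request normally fails without sufficient privileges (for example root or CAP_SYS_NICE), which the sketch reports instead of treating as an error.

```python
import os


def fifo_priority_range():
    """Return (lowest, highest) static priority accepted for SCHED_FIFO."""
    return (os.sched_get_priority_min(os.SCHED_FIFO),
            os.sched_get_priority_max(os.SCHED_FIFO))


def try_make_fifo(priority):
    """Try to make the calling task SCHED_FIFO with the given priority.

    Returns True on success and False if the kernel refuses the request
    because of missing privileges.
    """
    lowest, highest = fifo_priority_range()
    if not lowest <= priority <= highest:
        raise ValueError(f"priority must be in {lowest}..{highest}")
    try:
        # pid 0 means "the calling task".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False
    return True


if __name__ == "__main__":
    print("SCHED_FIFO priority range:", fifo_priority_range())
    print("switched to SCHED_FIFO:", try_make_fifo(10))
```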

The scheduler keeps a list of ready-to-run tasks for each priority level. Using these lists, the scheduling principles are quite simple:

  • SCHED_FIFO tasks are allowed to run until they have completed their work or voluntarily yield.

  • SCHED_RR tasks are allowed to run until they have completed their work, voluntarily yield, or have consumed a specified amount of CPU time.

  • When the currently running task is to be swapped out, it is put at the end of the list and the task to swap in is selected as described below:

    1. From all non-empty lists, pick the one with the highest priority.

    2. From that list, pick the task at the beginning of the list.
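The selection rule in the list above can be illustrated with a small toy model in Python. It is only a sketch of the principle, not the kernel's data structure: the class RtRunQueue, its methods, and the task names are made up for this example.

```python
from collections import deque


class RtRunQueue:
    """Toy model of one ready list per real-time priority level."""

    def __init__(self):
        self.lists = {}  # priority -> deque of ready task names

    def make_ready(self, name, priority):
        """Put a task at the end of the list for its priority."""
        if not 1 <= priority <= 99:
            raise ValueError("real-time priority must be in 1..99")
        self.lists.setdefault(priority, deque()).append(name)

    def pick_next(self):
        """Return (priority, name) of the task to swap in, or None."""
        non_empty = [prio for prio, tasks in self.lists.items() if tasks]
        if not non_empty:
            return None
        highest = max(non_empty)                       # step 1
        return highest, self.lists[highest].popleft()  # step 2


if __name__ == "__main__":
    rq = RtRunQueue()
    rq.make_ready("logger", 10)
    rq.make_ready("control_a", 50)
    rq.make_ready("control_b", 50)
    prio, task = rq.pick_next()   # control_a runs first
    rq.make_ready(task, prio)     # swapped out: goes to the end of its list
    print(rq.pick_next())         # control_b is next at priority 50
```

A task with lower priority, such as logger in this example, is only picked once every list with higher priority is empty.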

As long as there are real-time tasks that are ready to run, they might consume all CPU power. There is a mechanism called RT throttling, that can help the system to avoid that problem; see Section 2.3, Real-Time Throttling.

2.2.2 SCHED_OTHER

SCHED_OTHER is the most widely used policy. These tasks do not have static priorities. Instead they have a "nice" value ranging from -20 (highest) to +19 (lowest). The scheduling policy is quite different from the real-time policies in that the scheduler aims at a "fair" distribution of the CPU. "Fair" means that each task shall get an average share of the execution time that relates to its nice value. See http://en.wikipedia.org/wiki/Completely_Fair_Scheduler and https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt for more info.
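To give a feeling for how the nice value relates to the CPU share, the following Python sketch approximates the weighting used by the Completely Fair Scheduler. It assumes that nice 0 corresponds to a weight of 1024 and that each nice step changes the weight by a factor of about 1.25; the kernel itself uses a precomputed table, so the numbers are approximate, and the function names are illustrative only.

```python
def approx_weight(nice):
    """Approximate scheduler weight for a nice value (nice 0 -> 1024)."""
    if not -20 <= nice <= 19:
        raise ValueError("nice value must be in -20..19")
    return 1024 / (1.25 ** nice)


def cpu_shares(nice_values):
    """Expected long-term CPU share of each always-runnable task on one CPU."""
    weights = [approx_weight(nice) for nice in nice_values]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    # Two CPU-bound tasks on the same CPU, one at nice 0 and one at nice 5.
    for nice, share in zip((0, 5), cpu_shares([0, 5])):
        print(f"nice {nice:+d}: about {share:.0%} of the CPU")
```

Under these assumptions a task at nice 0 competing with a task at nice 5 gets roughly three quarters of the CPU time.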

2.2.3 SCHED_BATCH

SCHED_BATCH is very similar to SCHED_OTHER. The difference is that SCHED_BATCH is optimized for throughput. The scheduler will assume that the process is CPU-intensive, and treat it slightly differently. The tasks will get the same CPU share as SCHED_OTHER tasks, but worse latency.

2.2.4 SCHED_IDLE

SCHED_IDLE is also similar to SCHED_OTHER and can be described as a SCHED_OTHER with a nice value even weaker than +19.

2.3 Real-Time Throttling

As long as there are real-time tasks, i.e. tasks scheduled as SCHED_FIFO or SCHED_RR that are ready to run, they would consume all CPU power if the scheduling principles were followed. Sometimes that is the wanted behaviour, but it will also allow bugs in real-time threads to completely block the system.

To prevent this from happening, there is a real time throttling mechanism. This makes it possible to limit the amount of CPU power that the real-time threads can consume.

The mechanism is controlled by two parameters: rt_period and rt_runtime. The semantics are that the total execution time for all real-time threads may not exceed rt_runtime during each rt_period. As a special case, rt_runtime can be set to -1 to disable the real-time throttling.

More specifically, the throttling mechanism allows the real-time tasks to consume rt_runtime times the number of CPUs for every rt_period of elapsed time. A consequence is that a real-time task can utilize 100% of a single CPU as long as the total utilization does not exceed the limit.

The default settings are: rt_period=1000000 µs, rt_runtime=950000 µs, which gives a limit of 95% CPU utilization.

The parameters are linked to two files in the /proc file system:

  • /proc/sys/kernel/sched_rt_period_us

  • /proc/sys/kernel/sched_rt_runtime_us

Changing a value is done by writing the new number to the corresponding file. E.g.:

$ echo 900000 > /proc/sys/kernel/sched_rt_runtime_us
$ echo 1200000 > /proc/sys/kernel/sched_rt_period_us
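What a given pair of values means in practice can be checked with a little arithmetic. The Python sketch below computes the fraction of CPU time available to real-time tasks and the total real-time budget per period for a number of CPUs, following the description above. The function names are illustrative, and read_current_settings assumes that the two /proc files listed above are readable.

```python
def rt_fraction(period_us, runtime_us):
    """Fraction of each period that real-time tasks may consume."""
    if runtime_us == -1:
        return 1.0  # throttling disabled
    return runtime_us / period_us


def rt_budget_us(runtime_us, num_cpus):
    """Total real-time execution time per period, summed over all CPUs."""
    return runtime_us * num_cpus


def read_current_settings(proc_dir="/proc/sys/kernel"):
    """Return (period_us, runtime_us) as currently configured."""
    with open(f"{proc_dir}/sched_rt_period_us") as f:
        period_us = int(f.read())
    with open(f"{proc_dir}/sched_rt_runtime_us") as f:
        runtime_us = int(f.read())
    return period_us, runtime_us


if __name__ == "__main__":
    print(f"default limit: {rt_fraction(1000000, 950000):.0%}")
    print(f"example above: {rt_fraction(1200000, 900000):.0%}")
    try:
        period, runtime = read_current_settings()
        print(f"this system:   {rt_fraction(period, runtime):.0%}")
    except OSError:
        print("throttling settings are not readable on this system")
```

With the values written in the example above (runtime 900000 µs, period 1200000 µs), real-time tasks are limited to 75% of the CPU time.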

See also https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt and https://lwn.net/Articles/296419

2.4 The PREEMPT_RT Patch

PREEMPT_RT is a set of changes to the Linux kernel source code, which when applied will make the Linux kernel more responsive to external interrupts, and more deterministic in the time for performing a task involving cooperating processes.

PREEMPT_RT aims to minimize the amount of kernel code that is non-preemptible (http://lwn.net/Articles/146861). This is accomplished by adding and modifying functionality in the Linux kernel.

The main functional changes done by PREEMPT_RT are:

  • Converting spin locks to sleeping locks. This allows preemption while holding a lock.

  • Running interrupt handlers as threads. This allows preemption while servicing an interrupt.

  • Adding priority inheritance to different kinds of sleeping locks, and to semaphores. This avoids scenarios where a lower prioritized process hinders the progress of a higher prioritized process, due to the lower priority process holding a lock. See http://info.quadros.com/blog/bid/103505/RTOS-Explained-Understanding-Priority-Inversion.

  • Lazy preemption. This increases throughput for applications with tasks that use ordinary SCHED_OTHER scheduling.

PREEMPT_RT is managed as a set of patches for the Linux kernel. It is available for a selection of kernel versions. They can be downloaded from https://www.kernel.org/pub/linux/kernel/projects/rt. General information about the patches can be found from the PREEMPT_RT wiki https://rt.wiki.kernel.org/index.php/Main_Page.

It is also possible to obtain a Linux kernel with the patches already applied. A central source is https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git. Another source is Linux from a hardware vendor, providing a Linux kernel with the patches applied. Enea Linux can be configured with or without the patches, depending on the specific customer use case.

Compared with the total number of lines in the Linux kernel, the PREEMPT_RT patches affect a small percentage. However, they change central parts of the kernel, which can be seen e.g. by noting that the latest patch set for the Linux 3.12 kernel contains 32 patches that affect the code in the file kernel/sched/core.c. The actual size of the PREEMPT_RT patches can be estimated from the number of files affected, and from the total number of source code lines affected. As an example, the latest patch set for the Linux 3.12 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.12 contains 321 patches, affecting (by adding or removing lines of code) 14241 source code lines.

Functionality has been moved, over time, from the PREEMPT_RT patches into the mainline kernel. In this way, the PREEMPT_RT patch set has become smaller. As an example, the latest patch set for the older Linux 3.0 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.0 contains 385 patches, adding or removing 16928 lines of source code, which can be compared with the corresponding numbers for the Linux 3.12 kernel (321 patches, 14241 lines), and shows that the PREEMPT_RT patch set has decreased in size.

After the PREEMPT_RT patches have been applied, the PREEMPT_RT functionality is activated by selecting a kernel configuration menu alternative before the kernel is built. The menu alternative is named Fully Preemptible Kernel (RT).

The performance of a Linux kernel with the PREEMPT_RT patches applied can then be evaluated. A common evaluation methodology involves measuring the interrupt latency. The interrupt latency is often measured from the time of an interrupt until the time when a task, that is activated as a result of the interrupt, begins executing in user space. A commonly used tool for this purpose is cyclictest.
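The principle of measuring wake-up latency from user space can be sketched in a few lines of Python. This is not how cyclictest is implemented and it is not a substitute for it: the sketch simply sleeps until a planned wake-up time and records how late the task actually resumed. The interpreter adds overhead of its own, so the numbers are only useful for relative comparisons. The function names are illustrative.

```python
import time


def measure_wakeup_overshoot(interval_us=1000, loops=200):
    """Sleep periodically and return how late each wake-up was, in ns."""
    interval_ns = interval_us * 1000
    overshoots = []
    next_wake = time.monotonic_ns() + interval_ns
    for _ in range(loops):
        remaining_ns = next_wake - time.monotonic_ns()
        if remaining_ns > 0:
            time.sleep(remaining_ns / 1e9)
        # Difference between actual and planned wake-up time.
        overshoots.append(time.monotonic_ns() - next_wake)
        next_wake += interval_ns
    return overshoots


def summarize(samples_ns):
    """Return minimum, average and maximum of the samples in microseconds."""
    return {
        "min_us": min(samples_ns) / 1000,
        "avg_us": sum(samples_ns) / len(samples_ns) / 1000,
        "max_us": max(samples_ns) / 1000,
    }


if __name__ == "__main__":
    print(summarize(measure_wakeup_overshoot()))
```

For real-time evaluation the maximum value is the interesting one, since it approximates the worst case observed during the run.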

Additional information on results from evaluating the PREEMPT_RT performance is given e.g. in http://www.mpi-sws.org/~bbb/papers/pdf/ospert13.pdf and http://sigbed.seas.upenn.edu/archives/2014-02/ewili13_submission_1.pdf.

When deciding on the use of PREEMPT_RT, its costs and benefits should be evaluated. The use of PREEMPT_RT implies a significant change to the kernel source code. This change involves code that may be less tested and therefore less proven in use than the remaining parts of the mainline kernel.

Another aspect is maintenance. The development of the PREEMPT_RT patch set follows the development of the kernel. When the kernel version is changed, the corresponding PREEMPT_RT patch set, which may be available only after a certain time period, must then be applied to the kernel and the associated tests must be performed.

On the other hand, the use of a PREEMPT_RT-enabled kernel can lead to a system with a decreased worst case interrupt latency and a more deterministic scheduling, which may be necessary to fulfil the real-time requirements for a specific product. For a uni-core system, many of the other methods, e.g. full dynamic ticks and CPU isolation, cannot be used, which can make PREEMPT_RT an attractive alternative.

For additional information about the technical aspects of PREEMPT_RT see e.g. http://elinux.org/images/b/ba/Elc2013_Rostedt.pdf.

For additional information about the PREEMPT_RT development status, see e.g. http://lwn.net/Articles/572740.

2.5 CPU Load Balancing

Linux performs allocation of tasks to CPUs. In the default setting, this is done automatically. In a similar spirit as the default fair task scheduling, which aims at dividing the CPU time for an individual CPU so that each task gets its fair share of processing time, the scheduler is free to move tasks around, with the goal of evenly distributing the processing load among the available CPUs.

The moving of a task is referred to as task migration. The migration is done by a part of the scheduler called the load balancer. It is invoked, on a regular basis, as a part of the processing done during the scheduler tick. The decision to move a task is based on a variety of input data such as CPU load, task behaviour etc.

For an application which requires determinism, load balancing is problematic. The time to activate a certain task, as a result of an interrupt being serviced, may depend, not only on the scheduling method used, but also on where this task is currently executing.

It is possible to statically assign tasks to CPUs. One reason for doing this is to increase determinism, for example by making the response time to external events more predictable. Assigning a task to a CPU, or set of CPUs, is referred to as setting the affinity of the task.

Cpuset is a mechanism both for setting the affinity of tasks and for turning off automatic load balancing. Cpusets make use of the Linux subsystem cgroup. See https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt and https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
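For a single task, the affinity can also be set directly from the application, without cpusets. The following Python sketch, which assumes a Linux system, uses the affinity calls in the standard os module to restrict the calling task to one CPU. The function name pin_to_cpu is illustrative; the sketch returns the previous affinity so that it can be restored.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling task to a single CPU.

    Returns the previous set of allowed CPUs so the caller can restore it.
    """
    previous = os.sched_getaffinity(0)  # 0 means "the calling task"
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    target = max(allowed)  # arbitrary choice for the example
    previous = pin_to_cpu(target)
    print("now allowed on:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, previous)  # restore the original affinity
```

Note that setting the affinity of one task does not keep other tasks away from that CPU; that is what cpusets and CPU isolation are for.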

2.6 CPU Isolation

If we would like to get real-time performance on a single-CPU system, it is necessary to adapt the entire system, e.g. using the PREEMPT_RT patch or an RTOS. This is not always necessary in a multicore system. Recently added features in the Linux kernel make it possible to aggressively migrate sources of kernel-introduced jitter away from selected CPUs. See Section 3.3.1, CPU Isolation for more information. Doing this provides bare metal-like performance on the CPUs where sources of jitter have been removed. Note that a multicore system with CPU isolation can also be used to achieve higher throughput, although that is not the focus of this document.

One way to get real-time performance in Linux is by creating a bare-metal-like environment in Linux user space. On a default setup, this is not possible since the Linux kernel needs to do some regular housekeeping. It is possible to move much of this housekeeping to some dedicated CPUs, provided we have a multicore system. That leaves the other CPUs relatively untouched by the Linux kernel, unless user space triggers some kernel activity. The application that executes in this bare-metal-like environment should avoid using libc calls and Linux system calls. See Chapter 4, Designing Real-Time Applications for more information. Since the kernel is not real-time safe, executing kernel code can have a serious impact on real-time performance.

The biggest sources of kernel jitter are the scheduler and external interrupts. The scheduler's main duty is to switch between tasks. Switching between tasks can of course cause a lot of jitter. This is only a problem if the performance critical task runs with a completely fair scheduler (CFS) policy and therefore gets preempted because of time slicing. A solution for this problem can be to move all non-critical tasks to other CPUs and/or run the critical task with real time policy and an appropriate priority.

The load balancer will try to effectively utilize all the CPUs. That might be good for throughput, but it could damage real-time performance. The obvious problem is that general purpose tasks could be moved to the real-time CPUs and real-time tasks could be moved to general purpose CPUs. The other problem is the actual work of migrating threads. This is easily solved by disabling load balancing on the CPUs that should be isolated.

The scheduler tick causes significant jitter and has negative real-time impact unless the PREEMPT_RT kernel is used. The tick can be removed with the CONFIG_NO_HZ_FULL kernel configuration. Read more about NO_HZ and the tick in Section 2.7, Dynamic Ticks (NO_HZ).

Interrupts can be a big source of jitter. Some interrupts like inter processor interrupts (IPI) and per-CPU timer interrupts need to be bound to a certain CPU. Other interrupts may be handled by any CPU in a multicore system and should be moved away from the isolated CPUs. Many timer interrupts can be removed by changing kernel configurations. See Section 2.11, Interrupts for more info.

2.7 Dynamic Ticks (NO_HZ)

The purpose of the tick is to balance CPU execution time between several tasks running on the same CPU. The tick is also used as a timer source for timeouts. Ticks are interrupts generated by a hardware timer and occur at a regular interval determined by the CONFIG_HZ kernel configuration, which for most architectures can be configured when compiling the kernel. The tick interrupt is a per-CPU interrupt.

Starting from Linux 2.6.21, the idle dynamic ticks feature can be configured by using the CONFIG_NO_HZ kernel configuration option. The goal was to eliminate tick interrupts while in idle, to be able to go into deeper sleep modes. This is important for laptops but can also cut down power bills for server rooms.

Linux 3.10.0 introduced the full dynamic ticks feature to eliminate tick interrupts when running a single task on a CPU. The goal here was to better support high performance computing and real-time use cases by making sure that the thread would be run undisturbed. The earlier configuration CONFIG_NO_HZ was renamed to CONFIG_NO_HZ_IDLE, and the new feature got the new configuration option CONFIG_NO_HZ_FULL.

The current implementation requires that ticks are kept on CPU 0 when using full dynamic ticks, which is not required for idle dynamic ticks. The only exception is when the whole system is idle, then the ticks can be turned off for CPU 0 as well.

Whether dynamic ticks turn tick interrupts off is a per-CPU decision.

The following timer tick options are available, extracted from the kernel configuration:

Periodic timer ticks (constant rate, no dynticks)

This option keeps the tick running periodically at a constant rate, even when the CPU doesn't need it.

Idle dynticks system (tickless idle)

This option enables a tickless idle system: timer interrupts will only trigger on an as-needed basis when the system is idle. This is usually interesting for energy saving. Most of the time you want to say Y here.

Full dynticks system (tickless)

Adaptively try to shutdown the tick whenever possible, even when the CPU is running tasks. Typically this requires running a single task on the CPU. Chances for running tickless are maximized when the task mostly runs in user space and has few kernel activity. You need to fill up the nohz_full boot parameter with the desired range of dynticks CPUs. This is implemented at the expense of some overhead in user <-> kernel transitions: syscalls, exceptions and interrupts. Even when it's dynamically off. Say N.

In order to make use of the full dynamic ticks system configuration you must ensure that only one task, including kernel threads, is running on a given CPU at any time. Furthermore, there must not be any pending RCU callbacks or timeouts attached to the tick.

The official documentation for dynamic ticks can be found in the Linux kernel source tree https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt. There is also an excellent article about it at LWN, http://lwn.net/Articles/549580.
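To find out which of the tick options a kernel was built with, its configuration can be searched for the corresponding symbols. The following is a minimal sketch: the function name tick_mode is hypothetical, and the locations of the configuration file mentioned in the comments are assumptions that depend on how the kernel was built and packaged.

```shell
# Read a kernel configuration on standard input and print which tick
# option it selects: full, idle, periodic or unknown.
tick_mode() {
    config=$(cat)
    case "$config" in
        *CONFIG_NO_HZ_FULL=y*) echo full ;;
        *CONFIG_NO_HZ_IDLE=y*) echo idle ;;
        *CONFIG_HZ_PERIODIC=y*) echo periodic ;;
        *) echo unknown ;;
    esac
}

# Example usage (file locations are assumptions):
#   zcat /proc/config.gz | tick_mode      # requires CONFIG_IKCONFIG_PROC
#   tick_mode < /boot/config-$(uname -r)  # common on desktop distributions
```

Note that a kernel built with full dynamic ticks still needs the nohz_full boot parameter before any CPU actually runs tickless.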

2.8 Power Save

This section describes several power saving techniques available in Linux. These techniques often have an impact on the system's real-time properties.

A technique not described here is hibernation, e.g. suspend to RAM or disk. The reason is that it is difficult to combine with real-time properties and therefore outside the scope of this manual.

2.8.1 Dynamic Frequency Scaling

When there is little CPU bound work to be done, the CPU frequency can be reduced as a way to reduce power consumption. This is known as dynamic frequency scaling, see http://en.wikipedia.org/wiki/Dynamic_frequency_scaling.

The function is enabled at compile time by the configuration parameter CONFIG_CPU_FREQ. If enabled, the system will include functionality, called a governor, for controlling the frequency. There are several governors optimized for different types of systems. Which governors are available in the system is also chosen at compile time, using configuration parameters with names starting with CONFIG_CPU_FREQ_GOV_.

The possibility to use dynamic frequency scaling in a real-time system is strongly related to the time it takes to ramp up the frequency and that time's relation to the latency requirements.

2.8.2 CPU Power States

If the CPU is idle, i.e. there is no task ready to run, the CPU can be put in a sleep state. In a sleep state the CPU does not execute anything, but is still ready to respond to certain events, e.g. an external interrupt.

CPUs usually have a range of power modes. See http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Processor_states for an example. Deeper sleep means lower power consumption at the price of increased wake-up time. As with dynamic frequency scaling, the transition between the power states is controlled by a governor.

To configure the functionality for entering sleep states when idle, use the compile time configuration parameter CONFIG_CPU_IDLE.

2.8.3 CPU Hotplug

The Linux kernel's CPU hotplug facility allows CPUs to be added to or removed from a running kernel. CPU hotplug can for example be used to isolate failing CPUs, or in systems where the hardware itself is hot-pluggable. Linux's CPU hotplug implementation is based on notifiers, which are callbacks into the subsystems that need to be aware of CPUs coming and going.

The benefits of hotplugging CPUs are well known from a power saving point of view. From a real-time computing perspective, the most obvious disadvantage is latency jitter. In a perfect system, a CPU could go online or offline quickly and without disturbing the rest of the system. Unfortunately, the handling of per-CPU kthreads was one of the reasons hotplug was not used in real-time systems, since the operations took a long time (close to seconds). Nowadays, the CPU hotplug implementation (see http://sigbed.seas.upenn.edu/archives/2012-11/7.pdf) relies on the CPU affinity of kthreads and migrates them instead of removing and recreating them. In this way the hotplug latencies are around 5 ms, making Linux more usable for both power saving and real-time applications.

To enable this feature, set the kernel configuration parameter CONFIG_HOTPLUG_CPU. This makes runtime isolation available through the file /sys/devices/system/cpu/«cpu id»/online.

2.9 I/O Scheduling

I/O scheduling determines in which order block I/O reads and writes are done. The algorithms require collection of statistics concerning block device activity, which decreases determinism for any real-time application that reads from or writes to a block device.

In such a scenario, you may want to select the Noop I/O elevator for the block device that your determinism-sensitive application is reading from or writing to. However, the effect is expected to be small, and this will have side effects for other applications accessing the same block device. It may even have negative side effects on your application, depending on the type of block device and the read/write behaviour of the application. Fortunately, the I/O scheduler can be switched at runtime, and should be selected based on the user-specific I/O load scenario.
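The I/O scheduler of a block device is exposed through sysfs, where the currently active scheduler is shown within square brackets. The sketch below shows one way to extract it. The function name active_scheduler is hypothetical, and the device name sda in the comments is only an example.

```shell
# Print the active I/O scheduler given the contents of a
# /sys/block/<device>/queue/scheduler file, e.g. "noop deadline [cfq]".
# Prints nothing if no scheduler is marked with brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example usage (sda is an assumed device name):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
#   echo noop > /sys/block/sda/queue/scheduler   # switch at runtime, as root
```

The set of schedulers listed in the file depends on the kernel configuration, so check the file contents before attempting to switch.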


2.10 System calls

User-space programs access services provided by the kernel by using system calls. Usually, applications do not invoke system calls directly, but rather use library wrapper functions that in turn execute system calls. A system call always involves a transition from user-space to kernel-space and back. It also involves passing parameters from the application program into the kernel, and back.

A system call becomes a source of indeterminism for the calling task, since the duration of a system call depends on the work done while executing in the kernel. The work may for example involve allocation of resources, such as memory. This may take different amounts of time, depending on the availability of the resources. A system call may even result in the calling task being blocked as a result of a resource not being available. For example, a task reading data from a file may be blocked if there is no data immediately available. At some later time, when the requested resource becomes available, the task resumes its execution. The system call can then be completed and the task can return to user space.

There may also be other tasks wanting to execute during the execution of a system call. If it is possible to switch task during the system call, one or more of these other tasks can be allowed to execute. This can clearly be advantageous if these tasks have deadlines. This type of task switch is referred to as kernel preemption.

It is possible to configure the Linux kernel so that it allows more or fewer opportunities for kernel preemption. In general, if the level of preemptibility is increased, the Linux kernel becomes more complex and consumes a larger part of the CPU time. As a consequence, the application gets a smaller part of the CPU time and throughput decreases.

2.11 Interrupts

Hardware indicates events to the software using interrupts. When an interrupt occurs, the following is done:

  1. The interrupt is mapped to a registered interrupt handler.

  2. The registered interrupt handler is run. There can be several handlers for the same interrupt. Therefore, this step is retried for all handlers registered for this interrupt as long as they return IRQ_NONE.

  3. If the registered interrupt handler returns IRQ_WAKE_THREAD, the interrupt thread corresponding to the registered interrupt handler is set in ready state, i.e. it is now a schedulable task.

  4. The interrupt is acknowledged.

All steps above are executed with all interrupts disabled, i.e. interrupt nesting is not supported. See http://lwn.net/Articles/380931/ for a discussion of why nested interrupts were removed from Linux. The patch found at https://lkml.org/lkml/2010/3/25/434 removes nested interrupts.

Interrupt handlers can either be registered using request_irq(), or using request_threaded_irq() which registers a threaded interrupt. In both cases the interrupt handler will determine whether the interrupt is to be handled by this interrupt handler or not. The handler returns IRQ_NONE if not.

Interrupt work is normally divided into two parts: Top half and bottom half. The top half is implemented by the interrupt handler. The bottom half is implemented by soft IRQs, tasklets or work queues initiated from the top half, or by the interrupt thread in case of threaded interrupt.

See also Section 2.12, Soft IRQs and Tasklets, Section 2.13, Work Queues and Section 2.14, Threaded Interrupts.

The latency that interrupts induce on a real-time application is determined by the top half, soft IRQs and tasklets. For threaded interrupts, the priority of the interrupt thread can be adjusted to only affect the latency of less critical tasks.

2.12 Soft IRQs and Tasklets

Soft IRQs are intended for deferred interrupt work that should be run without all interrupts being disabled.

Soft IRQ work is executed at certain occasions, such as when interrupts are enabled, or when calling certain functions in the kernel, e.g. local_bh_enable() and spin_unlock_bh(). The soft IRQs can also be executed in the ksoftirqd kernel thread. All this makes it very hard to know when the soft IRQ work will actually be executed. See https://lwn.net/Articles/520076/ for more information about soft IRQ and real-time.

Tasklets build on soft IRQs, and are basically a generic interface for soft IRQs. While tasklets are much preferred over soft IRQs, the synchronization between interrupt handlers and the corresponding soft IRQ or tasklet is non-trivial. For this reason, and to achieve better real-time properties, it is recommended to use work queues or threaded interrupts whenever possible.

2.13 Work Queues

Work queues execute in kernel threads. This means that for work queues, preemption can occur. It also means that work performed in a work queue may block, which may be desirable in situations where resources are requested but not currently available. The reason for using work queues rather than having your own kernel thread is basically to keep things simple. Small jobs are better handled in batches in a few kernel threads than in a large number of kernel threads each doing their little thing.

See http://lwn.net/Articles/239633/ for a discussion about why tasklets and soft IRQs are bad and why work queues are better.

2.14 Threaded Interrupts

When an interrupt handler is registered using request_threaded_irq() it will also have an associated interrupt thread. In this case, the interrupt handler returns IRQ_WAKE_THREAD to invoke the associated interrupt thread. If the interrupt thread is already running, the interrupt will simply be ignored.

Even if the interrupt handler has an associated interrupt thread, it may still return IRQ_HANDLED to indicate that the interrupt thread does not need to be invoked this time.

There are two main advantages with using threaded interrupts:

  1. No synchronization needed between the top half and bottom half of the interrupt handler, the kernel will do it for you.

  2. Possibility to adjust priority of each interrupt handler, even when interrupt handlers share the same IRQ.

An interrupt handler that was registered using request_irq() can be forced to run as a threaded interrupt by using the kernel boot parameter threadirqs. This results in a short default interrupt handler being executed instead of the registered interrupt handler, while the registered interrupt handler runs in an interrupt thread. Since the registered interrupt handler is not designed for this, it may still invoke soft IRQs, causing hard-to-predict latencies. Interrupt handlers that have been forced to be threaded run with soft IRQ handling disabled, since they do not expect to be preempted by soft IRQs.

2.15 Ticks

The Linux scheduler, which handles the process scheduling, is invoked periodically. This invocation is referred to as the Linux scheduler tick.

The Linux scheduler is also invoked on demand, for example when a task voluntarily gives up the CPU. This happens when a task decides to block, for example when data that the task needs are not available.

The frequency of the scheduler tick can be configured when building the kernel. The value typically varies between 100 and 1000 Hz, depending on the target.

The scheduler tick is triggered by an interrupt. During the execution of this interrupt kernel preemption is disabled. The amount of time passing, before the tick execution is finished, depends on the amount of work that needs to be done, which in turn depends on the number of tasks in the system and how these tasks are allocated among the CPUs.

In this way, the presence of ticks with varying completion times contributes to the indeterminism of Linux, in the sense that a task with a deadline cannot know beforehand how long it will take to complete its work, since it is interrupted by the periodic ticks.

2.16 Memory Overcommit

By default, the Linux kernel allows applications to allocate (but not use) more memory than is actually available in the system. This feature is known as memory overcommit. The idea is to provide a more efficient memory usage, under the assumption that processes typically ask for more memory than they will actually need.

However, overcommitting also means there is a risk that processes will try to utilize more memory than is available. If this happens, the kernel invokes the Out-Of-Memory Killer (OOM killer). The OOM killer scans through the tasklist and selects a task to kill to reclaim memory, based on a set of heuristics.

When an out of memory situation occurs, the whole system may become unresponsive for a significant amount of time, or even end up in a deadlock.

For embedded and real-time critical systems, the allocation policy should be changed so that memory overcommit is not allowed. In this mode, malloc() will fail if an application tries to allocate more memory than is strictly available, and the OOM killer is avoided. To disable memory overcommit:

$ echo 2 > /proc/sys/vm/overcommit_memory

For more information, see the man page for proc (5) and https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
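With memory overcommit disabled, the kernel limits the total amount of committed memory. According to the overcommit accounting documentation referenced above, this limit is the swap space plus a configurable percentage of physical RAM, set through /proc/sys/vm/overcommit_ratio (50 by default). The sketch below estimates that limit; the function name commit_limit_kb is hypothetical, and the calculation is a simplification that ignores huge pages.

```shell
# Estimate the commit limit in kB when overcommit_memory is set to 2.
# Arguments: physical RAM in kB, swap in kB, overcommit_ratio in percent.
# Simplified: memory reserved for huge pages is not taken into account.
commit_limit_kb() {
    echo $(( $2 + $1 * $3 / 100 ))
}

# Example: 1 GiB of RAM, no swap, default ratio of 50 percent.
#   commit_limit_kb 1048576 0 50
# The value used by the running kernel is shown as CommitLimit in
# /proc/meminfo.
```

On a target without swap, the default ratio therefore leaves only about half of the RAM available for allocation, so the ratio may need to be raised together with the overcommit mode.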

2.17 RCU - Read, Copy and Update

RCU is an algorithm for updating non-trivial structures, e.g. linked lists, in a way that does not enforce any locks on the readers. This is done by allowing modifications to the structures to be done on a copy, and then have a publish method that atomically replaces the old version with the new.

After the data has been replaced there will still be readers that hold references to the old data. The period during which the readers can hold references to the old data is called a grace period.

This grace period ends when the last reader seeing the old version has finished reading. When this happens, an RCU callback is issued in order to free resources allocated by the old version of the structure.

These callbacks are executed by the kernel. By default this is done as a soft IRQ, adding hard-to-predict latencies to applications. The kernel configuration option CONFIG_RCU_NOCB_CPU, combined with the boot parameter rcu_nocbs=<cpu list>, relocates the RCU callbacks to kernel threads. These threads can be migrated away from the CPUs in the <cpu list>, giving these CPUs better real-time properties.

When using the nohz_full kernel boot parameter, nohz_full implies rcu_nocbs.
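Whether a boot parameter such as nohz_full was actually passed to the running kernel can be verified by inspecting /proc/cmdline. The following sketch extracts the value of a named parameter from a command line string. The function name cmdline_param is hypothetical, and the command line in the example is made up for illustration.

```shell
# Print the value of a boot parameter given its name and a kernel
# command line. Returns non-zero if the parameter is not present.
cmdline_param() {
    for word in $2; do
        case "$word" in
            "$1"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Example usage:
#   cmdline_param nohz_full "$(cat /proc/cmdline)"
#   cmdline_param nohz_full "console=ttyS0 nohz_full=2-3"   # prints 2-3
```

The same helper works for any other parameter that takes the name=value form.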

For further reading, see the following links:

RCU concepts: https://www.kernel.org/doc/Documentation/RCU/rcu.txt

What is RCU (contains a lot of good links to LWN articles): https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

Relocating RCU callbacks: http://lwn.net/Articles/522262/

3. Improving the Real-Time Properties

Improving the real-time properties is very much about reducing latency. In a hard real-time system, it is all about worst case, while in a soft real-time system, it is about reducing the probability for high latency numbers. Often, the improvements come at a price, e.g. lower average throughput, increased average latency, reduced functionality, etc.

This chapter starts with a set of actions that are of low complexity, easy to apply and/or well established. For some systems these actions will be enough. If not, it is suggested to isolate the real-time critical application(s). The idea is to avoid sources of jitter on the real-time applications. Finally, a set of actions with larger impact are presented.

3.1 Evaluating Real-Time Properties

Finding the right balance is an iterative process that preferably starts with establishing some benchmark. A common way to evaluate real-time properties is to measure interrupt latency by using the test application cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest), which sets a timer and measures the time interval from the expected expiration time until the task that set the timer becomes ready for execution.

For the result to be relevant, the system should also be put under stress, see e.g. http://people.seas.harvard.edu/~apw/stress, which can generate various types of stress.
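When comparing configurations, the figure of interest is usually the worst case latency over all measurement threads. The sketch below extracts it from cyclictest output. It assumes the result line format shown on the cyclictest wiki page, where each thread reports a field named Max: followed by the value in microseconds; the function name cyclictest_worst_case is hypothetical.

```shell
# Read cyclictest result lines on standard input and print the largest
# "Max:" value found. Prints 0 if no such field is present.
cyclictest_worst_case() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "Max:" && $(i + 1) + 0 > max)
                max = $(i + 1) + 0
    }
    END { print max + 0 }'
}

# Example usage (option values are only an illustration):
#   cyclictest -m -q -p 99 -i 1000 -l 100000 | cyclictest_worst_case
```

Running the measurement both with and without stress gives a first indication of how sensitive the system is to load.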

3.2 First Attempt

3.2.1 Configure the Proper Kernel Preemption Model

Lack of preemption can result in very high, possibly unbounded, latencies. That makes the preemption models without forced preemption, server and desktop, unsuitable for real-time systems. See Section 2.1, Kernel Preemption Model for a description of the kernel preemption model.

The recommendation is to start with the low-latency desktop model.

While the RT model in many cases gives a lower worst case latency, this comes at a cost of larger overhead. The RT model also requires the PREEMPT_RT patch in contrast to low-latency desktop, which is a part of the standard kernel. Therefore the RT model may have implications on quality and functional stability.

The preemption model is set at compile time by kernel configuration parameters, menu Kernel options/Preemption Model. To use low-latency desktop, set CONFIG_PREEMPT=y or CONFIG_PREEMPT__LL=y (the latter from the PREEMPT_RT patch), and other CONFIG_PREEMPT_* parameters to "n".

Table 3.1 Kernel Configuration Parameters for the Preemption Model

Name                                      | Configuration parameter
No Forced Preemption (Server)             | CONFIG_PREEMPT_NONE
Voluntary Kernel Preemption (Desktop)     | CONFIG_PREEMPT_VOLUNTARY
Preemptible Kernel (Low-Latency Desktop)  | CONFIG_PREEMPT or CONFIG_PREEMPT__LL *
Preemptible Kernel (Basic RT) *           | CONFIG_PREEMPT_RTB
Fully Preemptible Kernel (RT) *           | CONFIG_PREEMPT_RT_FULL

*) These models require the PREEMPT_RT patch. See the "Target Guide" section in the Enea® Linux User's Guide for build instructions.

Benchmarks

The table below shows the worst case latency for the different preemption models, measured with various stress scenarios.

Table 3.2 Worst Case Latency [µs] for the Different Preemption Models on P2041RDB

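The columns cpu, hdd, io, vm and full are stress types (see Appendix B, Benchmark Details).

Preemption model            | cpu | hdd |   io |   vm | full
Server                      |  48 |  43 | 1155 | 6247 | 8473
Desktop                     |  52 |  16 | 1139 | 6343 | 8423
Low-latency desktop         |  72 |  39 |  671 | 1108 | 4977
RT (from PREEMPT_RT patch)  |  29 |  72 |   62 |   74 |   64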

3.2.2 Optimize Power Save

As described in Section 2.8, Power Save, power save mechanisms interact poorly with real-time requirements. The reason is that exiting a power save state cannot be done instantly, e.g. 200 µs wake-up latency from sleep mode C3 and 3 µs from C1 on an Intel i5 - 2GHz.

This does not have to be a problem in e.g. a soft real-time system where the accepted latency is longer than the wake-up time or in a multicore system where power save techniques may be used in a subset of the cores. It is however recommended to start with the power save mechanisms disabled.

Note that the dynamic ticks function described in Section 2.7, Dynamic Ticks (NO_HZ), originated as a power save function (idle dynamic ticks). Section 3.3, Isolate the Application, describes how to use it for real-time purposes (full dynamic ticks).

Disable Dynamic Frequency Scaling

Frequency scaling is disabled by setting the kernel configuration parameter CONFIG_CPU_FREQ=n. See Section 2.8.1, Dynamic Frequency Scaling and https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt for more information.

Disable Transitions to Low-Power States

Disable transitions to low-power states by setting CONFIG_CPU_IDLE=n. See Section 2.8.2, CPU Power States.

3.2.3 Use Threaded Interrupts

When registering, an interrupt handler can choose to run most of its interrupt handling in a kernel thread. This is called a threaded interrupt. It is also possible to force interrupt handlers that have not made this choice to become threaded interrupts by using the threadirqs boot parameter. Interrupt handlers that have been registered with IRQF_NO_THREAD will not be threaded even if the boot parameter is given.

This will make interrupt handlers more preemptible, but some interrupt handlers might not work with this option, so make sure to test well when using it.

Benchmarks of threaded interrupts

Table 3.3 Benchmarking of threaded interrupts on P2041RDB, times in µs

              |  cpu |  full |   hdd |   io | no_stress |   vm
Threaded irqs |   46 |   994 |   705 |   88 |       185 | 1018
Regular irqs  |   59 |   961 |   481 |  217 |       110 | 1029
Ratio [%]     | 78.0 | 103.4 | 146.6 | 40.6 |     168.2 | 98.9

3.2.4 Optimize Real-Time Throttling

As described in Section 2.3, Real-Time Throttling, the default settings for the real-time throttling allow real-time tasks to run for 950000 µs every 1000000 µs. This may lead to a situation where real-time tasks are blocked for 50000 µs at the end of the throttling period. In the general case, execution of the real-time tasks may be blocked for a time equal to the difference between rt_period and rt_runtime. This situation should however be quite rare, since it requires that there are real-time tasks (i.e. tasks scheduled as SCHED_FIFO or SCHED_RR) that are ready to run on all CPUs. That condition should rarely be met, since real-time systems are typically designed to have an average real-time load of significantly less than 100%.

Consequently, it is recommended to keep the real-time throttling enabled.

For systems that do not have any real-time tasks, the real-time throttling will never be activated and the settings will not have any impact.

An alternative when using CPU isolation is to avoid using SCHED_FIFO and SCHED_RR, since the CPU is supposed to run a single task anyway. In this case, real-time throttling should not be activated.
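The longest time that real-time tasks can be blocked by throttling follows directly from the two throttling parameters. The sketch below computes it. The function name rt_throttle_gap_us is hypothetical; the two files mentioned in the comments are the standard sysctl entries for rt_runtime and rt_period.

```shell
# Print the time in microseconds that real-time tasks may be blocked
# during each throttling period.
# Arguments: rt_runtime and rt_period, both in microseconds.
# An rt_runtime of -1, or one that is not less than rt_period, means
# that no time is reserved for other tasks.
rt_throttle_gap_us() {
    if [ "$1" -lt 0 ] || [ "$1" -ge "$2" ]; then
        echo 0
    else
        echo $(( $2 - $1 ))
    fi
}

# Example usage with the current system settings:
#   rt_throttle_gap_us "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
#                      "$(cat /proc/sys/kernel/sched_rt_period_us)"
```

With the default values of 950000 and 1000000 the result is 50000, which matches the blocking time described above.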

Benchmarks of real-time throttling

The benchmarks were done using an application that repeatedly reads the time base register and keeps track of its largest increment:

set cpu affinity
set scheduling class to FIFO/99

get current time (T0) and current value of the time base register (TB0)

for some number of cycles {
    read time base register
    calculate the diff between the current and the previous value. (delta_TB)
    if the diff is the largest so far, update delta_TBmax
}

get current time (T1) and current value of the time base register (TB1)

use T0, T1, TB0 & TB1 to translate delta_TBmax to a delay in microseconds.

The longest delay is interpreted as the longest time the task has been preempted. Multiple instances of the test application are started on different cores. The results are summarised below:

Table 3.4 Max preemption period for different throttling parameter settings, run on P2041RDB

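The columns rt_runtime and rt_period are the throttling parameters [µs]. The columns CPU 0 to CPU 3 and Average show the max preemption period [µs].

Test Id | rt_runtime | rt_period |   CPU 0 |   CPU 1 |   CPU 2 |   CPU 3 | Average
1       |    900 000 | 1 000 000 | 101 961 | 101 951 |  97 883 | 101 961 | 100 939
2       |    950 000 | 1 000 000 |  48 959 |  48 958 |  53 043 |  53 032 |  50 998
3       |  1 000 000 | 1 000 000 |      44 |      56 |     118 |      59 |      69
4       |         -1 | 1 000 000 |      45 |     144 |      38 |      56 |      71
5       |    900 000 | 1 100 000 | 203 860 | 203 856 | 203 862 | 203 847 | 203 856
6       |     95 000 |   100 000 |   4 132 |   4 132 |   4 121 |   8 210 |   5 149
7       |    500 000 | 1 000 000 | 342 445 | 342 440 | 338 369 |     N/A | 341 085 (CPU0-2)

N/A: No real-time task was started on this CPU.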

Some observations can be made:

  • The real-time throttling is evenly distributed among the cores. This will be the case as long as there is real-time work available to all cores. See test 7 for an exception.

  • There is no significant difference between setting rt_runtime to -1 and setting rt_runtime = rt_period. (Tests 3 & 4)

  • In test 6, the preemption periods are quite far from the expected 5000 µs. The reason is that the system frequency was set to 250 Hz, which implies a granularity of 4 000 µs.

  • In test 7, real-time tasks on CPU 0 to CPU 2 use on average 66 % of the CPU, which is higher than the expected 50 %. CPU 3 on the other hand uses 0 %, which keeps the system as a whole in line with the limit.
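The granularity effect seen in test 6 can be illustrated with simple arithmetic. The sketch below assumes that throttling only takes effect on tick boundaries, as the observation above suggests, and prints the nearest multiples of the tick period below and above a requested time. The function names are hypothetical.

```shell
# Length of one scheduler tick in microseconds for a given tick
# frequency in Hz.
tick_period_us() {
    echo $(( 1000000 / $1 ))
}

# Nearest multiples of the tick period below and above a requested time.
# Arguments: requested time in microseconds, tick frequency in Hz.
tick_bounds_us() {
    tick=$(tick_period_us "$2")
    lower=$(( $1 / tick * tick ))
    upper=$(( ($1 + tick - 1) / tick * tick ))
    echo "$lower $upper"
}

# Example for test 6: a 5000 us throttling gap with a 250 Hz tick.
#   tick_bounds_us 5000 250    # prints "4000 8000"
```

The measured values in test 6, slightly above 4 000 µs on three CPUs and slightly above 8 000 µs on one, lie close to these two multiples.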

3.3 Isolate the Application

3.3.1 CPU Isolation

CPU isolation was introduced in Section 2.6, CPU Isolation. Used correctly, it can improve both throughput and real-time performance by reducing the overhead of interrupt handling. This also benefits applications that use high-throughput devices generating frequent interrupts (e.g. 10GbE Ethernet). This section explains how to achieve basic CPU isolation.

Linux has a function called cpuset that associates tasks with CPUs and memory nodes. By adding and configuring cpusets on a system, a user can control what memory and CPU resources a task is allowed to use. It is also possible to tune how the scheduler behaves for different cpusets. The configuration parameter is called CONFIG_CPUSETS, see https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt for more info on cpusets.

This section describes how to set up CPU isolation using an imaginary SMP system with 4 CPUs. CPU0-CPU1 will be used for general purpose tasks and CPU2-CPU3 will be dedicated to real-time tasks. The system is a NUMA (non-uniform memory access, http://en.wikipedia.org/wiki/Non-uniform_memory_access) system with CPU0-CPU1 belonging to node 0 and CPU2-CPU3 belonging to node 1. The cpuset used for the general purpose domain is called nRT, the non-RT set.

Setting up a Partitioned System

This section describes step by step, how to set up basic CPU isolation. Two cpusets will be created. One for non real-time (nRT, for general purpose use) and one for real-time tasks. The setup can also be done using a tool (Section 3.3.6, The CPU Partitioning Tool - partrt) that wraps the sequence in this chapter.

  1. Configure the CPU sets

    1. Create the cpusets (RT & nRT):

      #enable the creation of cpuset folder
      $ mount -t tmpfs none /sys/fs/cgroup
      #create the cpuset folder and mount the cgroup filesystem
      $ mkdir /sys/fs/cgroup/cpuset/
      $ mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset/
      #create the partitions
      $ mkdir /sys/fs/cgroup/cpuset/rt
      $ mkdir /sys/fs/cgroup/cpuset/nrt
    2. Add the general purpose CPUs to the nRT set.

      $ echo 0,1 > /sys/fs/cgroup/cpuset/nrt/cpuset.cpus
    3. Add the real-time CPUs to the RT set

      $ echo 2,3 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
    4. Make the CPUs in the RT set exclusive, i.e. do not let tasks in other sets use them.

      $ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive
  2. Restart Real-Time CPUs

    If the system supports CPU hotplug, it can be worthwhile to restart the real-time CPUs in order to migrate all CPU-specific timers away from them. If you choose to restart the CPUs, you need to re-create the RT partition afterwards. The partition has to be set up twice because it might not be possible to restart a CPU while tasks are running on it. For each CPU in the real-time partition, first turn it off (example for CPU3):

    $ echo 0 > /sys/devices/system/cpu/cpu3/online

    Then turn them on:

    $ echo 1 > /sys/devices/system/cpu/cpu3/online
  3. Configure NUMA

    The following sequence sets up a suitable configuration for NUMA-based systems.

    1. Associate the nRT set with NUMA node 0.

      $ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems
    2. Associate the RT set with NUMA node 1.

      $ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mems
    3. Make NUMA node 1 exclusive to the RT cpuset. I.e. only tasks in the real-time cpuset will be able to allocate memory from node 1.

      $ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mem_exclusive

      Note that tasks in nRT can still "access" memory controlled by NUMA node 1.

    Note that it is important to configure the memory nodes even if the system is not NUMA-based, because the initial value assigned at cpuset creation (no memory nodes available) does not grant any memory access. On a system that is not NUMA-based, associate both sets with node 0:

    1. Associate the nRT set with NUMA node 0.

      $ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems
    2. Associate the RT set with NUMA node 0.

      $ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.mems
  4. Configure load balancing

    Load balancing, i.e. task migration, is an activity that introduces nondeterministic jitter. It is therefore necessary to disable load balancing in the real time cpuset. This means that it is necessary to specify the correct affinity for the threads that should execute within the real time CPUs.

    1. Disable load balancing in the root cpuset. This is necessary for the settings in the child cpusets to take effect.

      $ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
    2. Disable load balancing in the RT cpuset:

      $ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.sched_load_balance
    3. Enable load balancing in the nRT cpuset:

      $ echo 1 > /sys/fs/cgroup/cpuset/nrt/cpuset.sched_load_balance
  5. Move general purpose tasks to the nRT partition

    For each task in the root cpuset, run the following command. Write one PID at a time; each PID must be on its own line:

    $ echo pid_of_task > /sys/fs/cgroup/cpuset/nrt/tasks

    Note that it is not possible to move all tasks. Some tasks require that they can execute on all available CPUs. All future child tasks that are created from the nRT partition will also be placed in the nRT partition. That includes tasks started from the current shell, since it should have been moved to nRT as well.

  6. Move IRQs to the general purpose CPUs

    Some interrupts are not CPU bound. Unwanted interrupts introduce jitter and can have serious negative impact on real time performance. They should be handled on the general purpose CPUs whenever possible. The affinity of these interrupts can be controlled using the proc file system.

    1. Set the default affinity to CPU0 or CPU1 to make sure that new interrupts won’t be handled by the real-time CPUs. The set {CPU0, CPU1} is represented as a bitmask set to 3 (2^0 + 2^1).

      $ echo 3 > /proc/irq/default_smp_affinity
    2. Move IRQs to the nRT partition

      $ echo 3 > /proc/irq/<irq>/smp_affinity

      If it is not known what interrupts to move, since this is highly architecture dependent, try to move all of them.

    Interrupts that can not be moved will be printed to stderr. When it is known what interrupts can not be moved, consult the hardware and driver documentation to see if this will be an issue. It might be possible to disable the device that causes the interrupt. Typical interrupts that should and can be moved are: certain timer interrupts, network related interrupts and serial interface interrupts. If there are any interrupts that are part of the real-time application, they should of course be configured to fire in the real-time partition.

  7. Execute a task in the real-time partition

    Now it is possible to run a real-time task in the real-time partition:

    $ echo pid_of_task > /sys/fs/cgroup/cpuset/rt/tasks

    Since the RT partition has more than one CPU, we might want to choose a specific CPU to run on. Change the task affinity to only include CPU3 in the real-time partition:

    $ taskset -p 0x8 pid_of_task

The system should now be partitioned in two sets. The next step to further improve real-time properties is to get rid of the tick interrupts, which is described in Section 3.3.2, Full Dynamic Ticks.
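
Steps 5 and 6 above apply one command per task and one per interrupt. The following is a minimal sketch of how those loops could be scripted; it is not part of Enea Linux. The function names move_tasks and move_irqs and the optional directory argument of move_irqs are assumptions made for illustration, while the paths in the usage comments are the ones used in the steps above.

```sh
#!/bin/sh
# Sketch only: bulk versions of step 5 (move tasks) and step 6 (move IRQs).
# move_tasks and move_irqs are hypothetical helper names.

# move_tasks SRC_TASKS_FILE DST_TASKS_FILE
# Writes each PID listed in SRC to DST, one write per PID, and prints
# a summary. PIDs that cannot be moved are reported on stderr.
move_tasks() (
    moved=0
    failed=0
    # The whole list is read first, since SRC changes as tasks move.
    for pid in $(cat "$1"); do
        if { echo "$pid" > "$2"; } 2>/dev/null; then
            moved=$((moved + 1))
        else
            echo "could not move task $pid" >&2
            failed=$((failed + 1))
        fi
    done
    echo "moved=$moved failed=$failed"
)

# move_irqs MASK [IRQ_DIR]
# Writes the hexadecimal CPU MASK to default_smp_affinity and to the
# smp_affinity file of every IRQ under IRQ_DIR (default /proc/irq).
# IRQs that cannot be moved are listed on stderr.
move_irqs() (
    mask=$1
    irq_dir=${2:-/proc/irq}
    echo "$mask" > "$irq_dir/default_smp_affinity"
    for f in "$irq_dir"/*/smp_affinity; do
        [ -e "$f" ] || continue
        if ! { echo "$mask" > "$f"; } 2>/dev/null; then
            irq=${f%/smp_affinity}
            echo "IRQ ${irq##*/} cannot be moved" >&2
        fi
    done
)

# Example (as root, after the cpusets have been created):
#   move_tasks /sys/fs/cgroup/cpuset/tasks /sys/fs/cgroup/cpuset/nrt/tasks
#   move_irqs 3
```

Reporting the failures instead of stopping at the first one matches the advice above: some tasks and interrupts cannot be moved, and the remaining ones should still be handled.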

3.3.2 Full Dynamic Ticks

The full dynamic ticks feature is described in Section 2.7, Dynamic Ticks (NO_HZ). Several conditions must be met to really turn ticks off on a CPU while a task is running on it. Some of these conditions are static, such as kernel configurations, and some are runtime conditions, such as having only one runnable task, no POSIX timers, etc.

The current implementation of full dynamic ticks will not disable ticks entirely, but rather set it to 1 Hz. This is because the system still needs to synchronize every now and then. For those wanting to experiment with turning off the ticks entirely there is a patch from Kevin Hilman to do this.

Note that CPU partitioning will be needed to make sure that only one task is running on a specific CPU. See Section 3.3.1, CPU Isolation for more information about CPU isolation.

Prerequisites for full dynamic ticks

To be able to enable full dynamic ticks, the following prerequisites need to be met:

  • Linux kernel 3.10.0 or newer.

  • SMP capable hardware with at least two real cores, excluding hyperthreads if any.

  • No more perf events than what the hardware supports being active on the CPU.

Kernel configuration

To select at boot time which CPUs should use the full dynamic ticks feature, the following kernel configurations need to be enabled:

CONFIG_NO_HZ_FULL=y
CONFIG_RCU_NOCB_CPU=y

CONFIG_NO_HZ_FULL is the kernel name for full dynamic ticks.

If you want all CPUs except CPU 0 to use the full dynamic ticks feature, enable the following kernel configurations:

CONFIG_NO_HZ_FULL_ALL=y

In this latter case CONFIG_RCU_NOCB_CPU_ALL should be selected by default.

RCU is a synchronization mechanism used by Linux which uses kernel helper threads to finish updates to shared data. For more information read the excellent LWN article found at http://lwn.net/Articles/262464.

Kernel boot parameters

Linux has a number of boot parameters that enhance core isolation.

isolcpus=<cpu set>

This parameter specifies a set of CPUs that will be excluded from the Linux scheduler load balancing algorithm. The set is specified as a comma separated list of cpu numbers or ranges. E.g. "0", "1-2" or "0,3-4". The set specification must not contain any spaces.

Using this parameter is strongly recommended if the target kernel lacks support for CPU hotplug.

nohz_full=<cpu set>

This boot parameter expects a list of CPUs that full dynamic ticks should be enabled for. If the kernel configuration CONFIG_NO_HZ_FULL_ALL was given, then this list will be all CPUs except CPU 0, and this boot option is not needed.

To achieve isolation in the RT domain (CPU2 and CPU3), use the following parameter:

isolcpus=2,3 nohz_full=2,3

See https://www.kernel.org/doc/Documentation/kernel-parameters.txt for more about boot parameters.
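
Whether the parameters actually reached the kernel can be checked in /proc/cmdline. The helper below is a small sketch for that purpose; the function name boot_param and its optional file argument are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch only: boot_param is a hypothetical helper name.

# boot_param NAME [CMDLINE_FILE]
# Prints the value of the boot parameter NAME and returns 0 if it is
# present in CMDLINE_FILE (default /proc/cmdline), otherwise returns 1.
boot_param() (
    set -f    # do not expand wildcard characters in the command line
    for word in $(cat "${2:-/proc/cmdline}"); do
        case $word in
            "$1"=*) echo "${word#*=}"; exit 0 ;;
        esac
    done
    exit 1
)

# Example:
#   boot_param isolcpus  || echo "isolcpus is not set"
#   boot_param nohz_full || echo "nohz_full is not set"
```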

After the system has booted, check the boot messages to verify that full dynamic ticks was enabled, e.g. using the shell command dmesg. Search for entries similar to the following:

NO_HZ: Full dynticks CPUs: 2-3.

Also make sure there is an entry similar to the following:

Experimental no-CBs CPUs: 0-7.

The no-CB CPU list must include the CPU list for full dynticks.

When choosing the CPU lists on hardware using simulated CPUs, such as hyperthreads, ensure you include real cores and not half a core. The latter could occur if one hyperthread is in the set of CPUs using full dynamic ticks feature while the other hyperthread on the same core does not. This can cause problems when pinning interrupts to a CPU and the two hyperthreads might also affect each other depending on load.
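
On many systems the hyperthread siblings of a CPU are listed in the sysfs file topology/thread_siblings_list. The sketch below uses that file to find CPUs in a chosen list whose sibling is outside the list. The function names expand_cpu_list and split_cores and the optional directory argument are assumptions made for illustration, and the file may be missing on some architectures, in which case the CPU is skipped.

```sh
#!/bin/sh
# Sketch only: expand_cpu_list and split_cores are hypothetical helper names.

# expand_cpu_list LIST
# Expands a CPU list such as "0,3-4" into "0 3 4".
expand_cpu_list() (
    out=
    IFS=,
    for part in $1; do
        cpu=${part%-*}
        last=${part#*-}
        while [ "$cpu" -le "$last" ]; do
            out="${out:+$out }$cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "$out"
)

# split_cores CPU_LIST [CPU_DIR]
# Prints one line for every CPU in CPU_LIST that shares a core with a
# CPU outside the list. No output means that no core is split.
split_cores() (
    cpu_dir=${2:-/sys/devices/system/cpu}
    set_cpus=" $(expand_cpu_list "$1") "
    for cpu in $set_cpus; do
        f="$cpu_dir/cpu$cpu/topology/thread_siblings_list"
        [ -r "$f" ] || continue
        for sib in $(expand_cpu_list "$(cat "$f")"); do
            case $set_cpus in
                *" $sib "*) ;;
                *) echo "cpu$cpu shares a core with cpu$sib" ;;
            esac
        done
    done
)

# Example: check the list intended for isolcpus= and nohz_full=
#   split_cores 2,3
```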

Application considerations

To achieve full dynamic ticks on a CPU, there are some requirements on the application being run on this CPU. First of all, it must not run more than one thread on each CPU. It must also not use any POSIX timers, directly or indirectly. This usually excludes any kernel calls that will access the network, but also excludes a number of other kernel calls. Keeping the kernel calls to a minimum will maximize the likelihood of achieving full dynamic ticks.

The application must utilize the CPU partitioning described in the previous section, which is done by writing the application thread's PID into the file /sys/fs/cgroup/cpuset/rt/tasks (assuming the real-time partition is called "rt"). After this, the shell command taskset can be used to bind the task to a specific CPU in the "rt" partition. Binding can also be done by the application itself using the sched_setaffinity() function defined in sched.h.

Cost of enabling full dynamic ticks

Full dynamic ticks incurs the following costs:

  • Transitions to and from idle are more expensive. This is inherited from CONFIG_NO_HZ_IDLE since CONFIG_NO_HZ_FULL builds on the same code as CONFIG_NO_HZ_IDLE.

  • Transitions between user and kernel space are slightly more expensive, since some book-keeping must be done.

  • Scheduling statistics normally involve periodic timeouts, and are therefore implemented slightly differently for CONFIG_NO_HZ_FULL.

Benchmarks

Below is an example trace log for a call to the scheduler_tick() function in the kernel:

          0)               |  scheduler_tick() {
          0)               |    _raw_spin_lock() {
          0)   0.113 us    |      add_preempt_count();
          0)   0.830 us    |    }
          0)   0.085 us    |    update_rq_clock.part.72();
          0)   0.146 us    |    __update_cpu_load();
          0)   0.071 us    |    task_tick_idle();
          0)               |    _raw_spin_unlock() {
          0)   0.076 us    |      sub_preempt_count();
          0)   0.577 us    |    }
          0)               |    trigger_load_balance() {
          0)   0.098 us    |      raise_softirq();
          0)   0.065 us    |      idle_cpu();
          0)   1.715 us    |    }
          0)   6.585 us    |  }

As can be seen from the above trace, the tick took more than 6 us, excluding interrupt overhead. This was a typical time for this target, an HP Compaq Elite 8300 with an Intel Core i5 3570.

3.3.3 Optimizing a Partitioned System

If the above sections do not offer enough real-time properties, this section gives some extra hints on what can be done.

tsc boot parameter - x86 only

The time stamp counter is a per-CPU counter for producing time stamps. Since the counters might drift a bit Linux will periodically check that they are synchronized. But this periodicity means that the tick might appear despite using full dynamic ticks.

By telling Linux that the counters are reliable, Linux will no longer perform the periodic synchronization. The side-effect of this is that the counters may start to drift, something that can be visible in trace logs for example.

Here is an example of how to use it:

isolcpus=2,3 nohz_full=2,3 tsc=reliable

See https://www.kernel.org/doc/Documentation/kernel-parameters.txt for more about boot parameters.

Delay vmstat Timer

The vmstat timer is used for collecting virtual memory statistics. The statistics are updated at an interval specified in seconds in /proc/sys/vm/stat_interval. The amount of jitter can be reduced by writing a large value to this file. However, that will not solve the issue with worst case latency.

Example (1000 seconds):

$ echo 1000 > /proc/sys/vm/stat_interval

There is one kernel patch (https://lkml.org/lkml/2013/9/4/379) that removes the periodic statistics collection and replaces it with a solution that only triggers if there is actual activity that needs to be monitored.

BDI Writeback Affinity

It is possible to configure the affinity of the block device writeback flusher threads. Since block I/O can have a serious negative impact on real-time performance, it should be moved to the general purpose partition. Disable NUMA affinity for the writeback threads:

$ echo 0 > /sys/bus/workqueue/devices/writeback/numa

Set the affinity to only include the general purpose CPUs (CPU0 and CPU1).

$ echo 3 > /sys/bus/workqueue/devices/writeback/cpumask
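
The value 3 is the same kind of bitmask as the one used for the interrupt affinity in Section 3.3.1: bit n represents CPU n. The helper below is a small sketch that converts a CPU list into such a mask. The function name cpu_mask is an assumption made for illustration, and the sketch is only meant for small CPU counts that fit in one shell integer.

```sh
#!/bin/sh
# Sketch only: cpu_mask is a hypothetical helper name.

# cpu_mask LIST
# Prints the hexadecimal bitmask (without 0x prefix) for a CPU list
# such as "0,1" or "2-3". Bit n of the mask represents CPU n.
cpu_mask() (
    mask=0
    IFS=,
    for part in $1; do
        cpu=${part%-*}
        last=${part#*-}
        while [ "$cpu" -le "$last" ]; do
            mask=$((mask | (1 << cpu)))
            cpu=$((cpu + 1))
        done
    done
    printf '%x\n' "$mask"
)

# Example:
#   cpu_mask 0,1    prints 3 (general purpose CPUs)
#   cpu_mask 2-3    prints c (real-time CPUs)
```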

Real-time throttling in partitioned system

Real time throttling (RTT) is a kernel feature that limits the amount of CPU time given to Linux tasks with real-time priority. If any process that executes on an isolated CPU runs with real-time priority, the CPU will get interrupts with the interval specified in /proc/sys/kernel/sched_rt_period_us. If the system is configured with CONFIG_NO_HZ_FULL and a real-time process executes on a CONFIG_NO_HZ_FULL CPU, note that real-time throttling will cause the kernel to schedule extra ticks. See Section 2.3, Real-Time Throttling and Section 3.2.4, Optimize Real-Time Throttling for more information.

Disable real-time throttling by the following command:

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Disable Power Management

The CPU frequency governor (https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt) causes jitter because it periodically monitors the CPUs. The actual activity of changing the frequency can also have a serious impact. Disable the frequency governor with the kernel configuration CONFIG_CPU_FREQ=n.

An alternative is to change the governor policy to performance at runtime. The advantage is that the policy can be set per CPU:

$ echo "performance" > /sys/devices/system/cpu/cpu<cpu_id>/cpufreq/scaling_governor

An example based on the RT partition:

$ echo "performance" > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
$ echo "performance" > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

Note that this could damage your hardware because of overheating. Make sure that you understand what works for your specific hardware.

Machine Check - x86 only

The x86 architecture has a periodic check for corrected machine check errors (MCE). The periodic machine check requires a timer that causes unwanted jitter. The periodic check can be disabled, but note that silently corrected MCEs might then go unlogged. Turn it off on the RT CPUs. For each CPU in the real-time partition, do the following (the first line shows the general form, followed by the commands for CPU2 and CPU3):

$ echo 0 >/sys/devices/system/machinecheck/machinecheck<cpu>/check_interval
$ echo 0 >/sys/devices/system/machinecheck/machinecheck2/check_interval
$ echo 0 >/sys/devices/system/machinecheck/machinecheck3/check_interval

It has been observed that it is enough to disable this for CPU0 only; it will then be disabled on all CPUs. Read more about the x86 machine-check exception at https://www.kernel.org/doc/Documentation/x86/x86_64/machinecheck.

Disable the Watchdog

The watchdog timer is used to detect and recover from software faults. It requires a regular timer interrupt. This interrupt is a jitter source that can be removed, at the obvious cost of less error detection.

The watchdog can be disabled at compile time by setting CONFIG_LOCKUP_DETECTOR=n, or at runtime:

$ echo 0 > /proc/sys/kernel/watchdog

For more information, see http://en.wikipedia.org/wiki/Watchdog_timer.

Disabling the NMI Watchdog - x86 only

The NMI watchdog is a debugging feature that catches hardware hangs and causes a kernel panic. On some systems it can generate a lot of interrupts, causing a noticeable increase in power usage. Disable it like this:

$ echo 0 > /proc/sys/kernel/nmi_watchdog

Increase flush time to disk

To make write-backs of dirty memory pages occur less often than default, you can do the following:

$ echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
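
Several of the runtime settings described so far in this section are plain writes to files in /proc and /sys, and some of the files only exist on x86. The following sketch collects them in one place and skips files that are not present. The function names set_knob and apply_rt_tuning and the ROOT argument are assumptions made for illustration; the values are the ones used in the examples above, with the costs described for each setting.

```sh
#!/bin/sh
# Sketch only: set_knob and apply_rt_tuning are hypothetical helper names.

# set_knob FILE VALUE
# Writes VALUE to FILE if FILE exists, otherwise reports that it was skipped.
set_knob() {
    if [ -e "$1" ]; then
        printf '%s\n' "$2" > "$1" && echo "set $1 = $2"
    else
        echo "skipped $1 (not present)"
    fi
}

# apply_rt_tuning ROOT CPU...
# ROOT is a path prefix, normally the empty string. CPU... are the
# numbers of the real-time CPUs, used for the machine check interval.
apply_rt_tuning() (
    root=$1
    shift
    set_knob "$root/proc/sys/vm/stat_interval" 1000
    set_knob "$root/proc/sys/kernel/sched_rt_runtime_us" -1
    set_knob "$root/proc/sys/kernel/watchdog" 0
    set_knob "$root/proc/sys/kernel/nmi_watchdog" 0
    set_knob "$root/proc/sys/vm/dirty_writeback_centisecs" 1500
    for cpu in "$@"; do
        set_knob "$root/sys/devices/system/machinecheck/machinecheck$cpu/check_interval" 0
    done
)

# Example (as root, real-time partition on CPU2 and CPU3):
#   apply_rt_tuning "" 2 3
```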

Disable tick maximum deferment

To get a fully tickless configuration, the patch at https://lkml.org/lkml/2013/9/16/499 should be included. It allows the tick interval to be maximized by setting the sched_tick_max_deferment variable, which is exposed in the debug file system (the command below assumes that debugfs is mounted at /sys/kernel/debug). To disable the maximum deferment, set it to -1.

$ echo -1 > /sys/kernel/debug/sched_tick_max_deferment

Network queues affinity

Linux can spread packet handling over different CPUs in an SMP system. This handling can also create timers on those CPUs; one example is the ARP timer management, which is based on neigh_timer. There are a couple of ways to minimize the effect of packets being handled on different CPUs, such as migrating all the timers to the non-real-time partition if possible, or, on some architectures, specifying the affinity of the network queues.

If the application needs the packets to be received only in the nRT partition then the affinity should be set as follows:

$ echo <NRT cpus mask> > /sys/class/net/<ethernet interface>/queues/<queue>/<x/r>ps_cpus

Example for TCI6638k2k board:

$ echo 1 > /sys/class/net/eth1/queues/tx-0/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-1/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-2/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-3/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-4/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-5/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-6/xps_cpus

$ echo 1 > /sys/class/net/eth1/queues/rx-0/rps_cpus

Note that if there is a need of network traffic on both partitions, the affinity should not be set. In this case, the neigh_timer can be handled by any cpu (including those in the RT partition).
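
When an interface has many queues, the per-queue commands can be generated with a loop. The following is a minimal sketch under the assumption that all transmit and receive queues of the interface should get the same mask. The function name set_queue_affinity and its optional directory argument are hypothetical.

```sh
#!/bin/sh
# Sketch only: set_queue_affinity is a hypothetical helper name.

# set_queue_affinity IFACE MASK [NET_DIR]
# Writes MASK to xps_cpus of every tx queue and to rps_cpus of every
# rx queue of IFACE. NET_DIR defaults to /sys/class/net.
set_queue_affinity() (
    queues="${3:-/sys/class/net}/$1/queues"
    for f in "$queues"/tx-*/xps_cpus "$queues"/rx-*/rps_cpus; do
        [ -e "$f" ] || continue
        { echo "$2" > "$f"; } 2>/dev/null || echo "could not set $f" >&2
    done
)

# Example (as root, same effect as the eth1 commands above):
#   set_queue_affinity eth1 1
```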

3.3.4 Benchmarks for CPU isolation

Measurements comparing latency with and without isolation on a stressed system are shown below. The benchmark program used is cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and the load was generated with the stress application (http://people.seas.harvard.edu/~apw/stress).

This is how the non-isolated benchmark was executed:

$ ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ ./cyclictest -m -p 99 -l 300000 -q;

In the isolated case the following boot parameter is used:

isolcpus=3

This is how the isolated benchmark was executed:

$ partrt create 0x8;
$ partrt run nrt ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ partrt run rt ./cyclictest -m -p 99 -l 300000 -q;

Please read more about partrt in Section 3.3.6, The CPU Partitioning Tool - partrt. The benchmark was executed on a TCI6638k2k board.

Table 3.5 Latencies on partitioned system

                 Latency (µs)
                 Min     Max    Avg
Not isolated      43    1428     68
Isolated          40     151     59

The benchmark shows that the worst case latency drops by about a factor of 10 when the benchmark runs on an isolated CPU.

3.3.6 The CPU Partitioning Tool - partrt

Enea Linux includes a tool, partrt, for dividing an SMP Linux system into partitions. By using the methods described in the previous section, this tool provides an easy to use interface to set up CPU isolation in an intelligent way. The tool can be downloaded from https://github.com/OpenEneaLinux/rt-tools.

Usage Examples

Create the RT partition on CPU 2 and CPU 3. Default names will be "rt" for the real time partition and "nrt" for the general purpose partition:

$ partrt create 0xc

Or on the NUMA system assumed in the previous sections:

$ partrt create -n 1

Show the recently created partitions like this:

$ partrt list

Run cyclictest on CPU 3 in the RT partition:

$ partrt run -c 0x8 rt cyclictest -n -i 10000 -l 10000

Move cyclictest to NRT partition:

$ partrt move «pid of cyclictest» nrt

Undo partitioning (restore environment)

$ partrt undo

See full partrt help text like this:

$ partrt -h

3.4 Further Actions if Needed

If the attempts in the previous sections to improve the real-time performance are not enough, consider those described in this section.

3.4.1 User Space Runtime

As mentioned in several places in this manual, the standard Linux kernel is not real-time safe. Read more about this in Chapter 2, Basic Linux from a Real-Time Perspective.

Two proposed solutions to tackle real-time in Linux are PREEMPT_RT (Section 2.4, The PREEMPT_RT Patch) and CPU isolation (Section 2.6, CPU Isolation). Both solutions have consequences. PREEMPT_RT has throughput issues, and CPU isolation on a standard kernel requires that only a subset of the libc is used if real-time behavior is to be maintained. One approach that can improve throughput and real-time properties is to not use the kernel at all. Instead, a runtime entirely in user space can be used.

Below are some issues that appear with PREEMPT_RT and/or CPU isolation.

Linux kernel adds too much overhead

For application writers that move from a bare metal or RTOS environment to Linux, the overhead of the Linux API might be unacceptable. PREEMPT_RT adds even more overhead.

Glibc does not provide a real-time deterministic API

CPU isolation can be a good way to get real-time performance on a standard Linux kernel. The big problem is that the Linux API is not real-time safe and has to be avoided whenever real-time determinism is required. This limits the application developer, since standard APIs and programming models cannot be used. IPC between real-time tasks and general purpose tasks is another issue. Most IPC implementations on Linux rely on system calls and can add real-time problems if used on a standard kernel. Some IPC implementations might be unsafe to use on PREEMPT_RT, depending on how the IPC implementation handles dynamic memory. This subject is discussed in Chapter 4, Designing Real-Time Applications.

Running a real-time deterministic runtime completely in user space can be a good way to increase determinism and throughput. One example is Enea LWRT which provides deterministic multithreading, memory management and LINX IPC. See http://www.enea.com/Embedded-hub/documents/Linux/Enea_LWRT_datasheet/ for more details. A specialized user space runtime can solve the mentioned problems:

Decrease overhead by avoiding the Linux kernel

The runtime can implement lightweight threads in user space. This can greatly decrease overhead and increase determinism. Voluntary context switches can be done entirely in user space. A comparison between LWRT lightweight user space threads and standard Linux pthreads can be seen below. The benchmark is done on a Texas Instruments TCI6638k2k (1.2 GHz).

Table 3.6 Task switch latency, LWRT vs. pthreads (microseconds)

                 Latency (µs)
                 Min     Max    Avg
Linux pthread   4.25   31.35   8.58
LWRT process    0.26    1.88   0.41


The table shows that the pthread context switch has roughly 20 times higher overhead than LWRT processes in the average case (8.58 µs compared with 0.41 µs). This could be a problem in, for example, a telecommunication application that uses a large number of threads.

Replace glibc API with a real-time deterministic API

A real-time safe user space runtime can provide a real-time deterministic API that can replace the non-deterministic parts of the glibc API. Typical non-deterministic calls that can be replaced are related to dynamic memory management, multi-threading and IPC.

3.4.2 Use an RTOS

For completeness, it should be mentioned that using an RTOS may be the best alternative if all ways to improve the real-time properties in a Linux system have been exhausted without reaching acceptable performance. Doing that is outside the scope of this manual, but an example is using one of the RTOS products provided by Enea.

4. Designing Real-Time Applications

Optimizing the Linux system itself for real-time is only half the solution to give applications optimal real-time properties. The Linux applications themselves must be designed in a proper way to allow for real-time properties. This section will provide some hints how to do this.

The C function library, libc, is a part of the runtime for applications under Linux. It provides basic facilities like fopen, malloc, printf, exit, etc. The C library shall provide all functions that are specified by the ISO C standard. Usually, additional functions specific to POSIX are also supported.

GNU libc (https://www.gnu.org/software/libc) is the most widely used libc, but there are alternative implementations like newlib (https://www.sourceware.org/newlib) and uClibc (http://www.uclibc.org/other_libs.html), all of them supporting the POSIX.1b (IEEE 1003.1b) real-time extensions.

The libc provides a powerful toolbox with many useful and frequently used features, but from a real-time perspective, it must be used with some caution.

The first problem to deal with is the level of real-time support in the libc code itself. The code is often considered to be proven in use and therefore used without deeper analysis. This is probably a valid assumption for a typical Linux system where average performance is more important than worst case behavior, but for real-time systems it might be an unreasonable attitude. This should however not be a major issue since the source code is available for analysis.

Another issue is the fact that the functions in libc may use system calls interacting with the kernel. Depending on the kernel preemption model, this may lead to execution of different non-preemptible sections. The kernel can also decide to execute other tasks like soft IRQs on its way back to user space.

Further application design challenges are:

For further reading, see https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO.

4.1 Application Memory Handling

Linux applications access memory by using virtual addresses. Each virtual address translates into a physical address with the help of hardware that uses translation tables. This feature makes it possible to address more virtual memory than there is physical memory available, assuming that not all applications need all their allocated memory at the same time.

Allocating memory will by default only reserve a virtual memory range. When the first memory access to this newly allocated virtual memory range occurs, this causes a page fault, which is a hardware interrupt indicating that the translation table does not contain the addressed virtual memory. The page fault interrupt will be handled by the Linux kernel, which will provide the virtual-to-physical memory mapping. Then the program execution can continue.

Most architectures use a cache called translation lookaside buffer, TLB, for the translation table. The TLB cache is used to speed up virtual-to-physical memory translations. A translation causes latency jitter when a looked-up address is not in the TLB, which is referred to as a TLB miss.

Virtual memory makes it possible for Linux to have memory content stored in a file, e.g. by loading an executed binary in an on-demand fashion or by swapping out seldom used memory. This is called demand paging, see http://en.wikipedia.org/wiki/Demand_paging for more information on this topic. Demand paging can cause unbound latency since it involves accessing a file or a device. Therefore, the application needs to disable demand paging by using the mlockall() function call:

mlockall(MCL_CURRENT | MCL_FUTURE)

The MCL_CURRENT will make sure that all physical memory has the expected content and that the translation table contains the needed virtual-to-physical memory mapping. This includes code, global variables, shared libraries, shared memory, stack and heap. MCL_FUTURE means that updating the translation table and initializing the physical memory, if applicable, is done during future allocations, not when accessing the memory. Future allocations can be stack growth, heap growth, shm_open(), malloc(), or similar calls like mmap().

When using mlockall(), there is a risk that Linux will allow less memory to be allocated, since all allocated memory must then be available as physical memory.
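
From outside the application, the amount of memory that a process has locked can be read from the VmLck field in /proc/<pid>/status, and the shell built-in ulimit -l shows the locked-memory limit that applies to unprivileged processes started from that shell. The helper below is a small sketch for reading the field. The function name locked_kb and its optional directory argument are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch only: locked_kb is a hypothetical helper name.

# locked_kb PID [PROC_DIR]
# Prints the amount of locked memory of process PID in kB, as reported
# by the VmLck field of its status file. PROC_DIR defaults to /proc.
locked_kb() (
    awk '$1 == "VmLck:" { print $2 }' "${2:-/proc}/$1/status"
)

# Example:
#   ulimit -l                 # locked-memory limit for this shell
#   locked_kb pid_of_task     # should be non-zero after mlockall()
```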

Note that a call to malloc() can still show large latency variation, since now the translation table update is done within this function call instead of when accessing the memory. Not to mention that a malloc() may or may not need to ask for more virtual memory from the kernel. It is therefore recommended to allocate all needed dynamic memory at start, to avoid this jitter.

In case dynamic memory allocation needs to be done within the real-time application, there are some actions that can be performed to mitigate the malloc() latency variation. The glibc has a function called mallopt() which can be used to change the behavior of malloc(). Two interesting options are M_MMAP_MAX and M_TRIM_THRESHOLD.

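  • M_MMAP_MAX sets the maximum number of allocation requests that may be served with mmap() at the same time. Setting it to 0 disables the use of mmap(), so that sbrk() is always used. (The allocation size at which mmap() is chosen instead of sbrk() is controlled by the separate M_MMAP_THRESHOLD option.) The advantage with mmap() is that the memory can be returned to the system as soon as free() is called, since each allocation is separate. The disadvantage with mmap() is that it is slower than sbrk().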

  • M_TRIM_THRESHOLD controls when part of an sbrk() allocated memory area, where a large contiguous area of memory at the top has been freed, shall be returned to the system. Turning this feature off will not allow the application to release any memory to the system once it has been allocated, but it can somewhat improve the real-time properties.

See the following table for some measurements done on malloc() call time and memory access time. The measurements have been done on a system with CPU isolation.

Table 4.1 malloc() and memory access measurements on TCI6638k2k board

                                                               Latency (µs)
Partition  Scenario                                            Min   Max    Avg  Operation
RT         normal                                                4   178  108.8  Mem access
                                                                 4   331   19.6  Malloc call
           mlockall()                                            4    13    5.0  Mem access
                                                                 4   515   96.5  Malloc call
           mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1     4    11    4.8  Mem access
                                                                 4   417    5.7  Malloc call
GP         normal                                                4  1384  109.0  Mem access
                                                                 7   164   19.7  Malloc call
           mlockall()                                            4   125    5.1  Mem access
                                                                 4  1607   95.4  Malloc call
           mlockall(), M_MMAP_MAX = 0, M_TRIM_THRESHOLD = -1     4    91    4.8  Mem access
                                                                 4   463    5.7  Malloc call

There are a large number of "4" in the minimum latency column. This is due to the timer resolution that will not allow any smaller values. For a real-time application, the maximum latency is of the greatest importance. The measurements show that when using mlockall(), the memory access becomes more predictable, while malloc() will always have a latency that is hard to predict. Letting the real-time tasks run on dedicated CPUs, here referred to as an RT partition, also results in lower latency jitter.

An alternative is to use another heap allocator than the one provided by glibc, one that is better adapted to embedded and real-time requirements. Here are two examples:

TLSF - Two level segregated fit

O(1) complexity for allocation and free calls and only 4 bytes overhead per allocation. It can be slower than the glibc default allocator on average, but should have better worst case behavior. See http://www.gii.upv.es/tlsf/ for more information.

HF - Half fit

O(1) complexity for allocation and free. It uses bitmaps and segregated lists. See http://www.gii.upv.es/tlsf/alloc/others for more information.

The glibc itself uses ptmalloc (http://malloc.de/en/), which is a more SMP-friendly version of DLMalloc (http://gee.cs.oswego.edu/dl/html/malloc.html).

Since memory is slow compared to the CPU, there is usually a fast but small memory area called cache between the CPU and the memory. When an access is done to non-cached memory, the CPU needs to wait until the cache has been updated. Accessing a memory location that has not been accessed for a long time will therefore take more time than accessing a recently accessed memory location.

One obvious way to avoid the cache problem is to disable the cache, provided that the selected architecture supports this. While it would help the worst case latency, it would make the average latency horrible. You can also use CPU partitioning to let the real-time application run alone on a CPU, and make sure that this application does not access more memory than fits in the cache.

Another consequence of the per-CPU caches becomes obvious when using shared memory among tasks running on different CPUs. Such memory can be shared either by threads in the same process or by using the shm_open() function call. When such memory is updated by one task and being read by another task running on another CPU, the memory contents need to be copied. This copy usually has a minimum size, called cache line, which for many architectures is 32 bytes. A one byte write will therefore end up copying 32 bytes in this situation, which is 32 times more than might be expected.
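
The cache line size differs between architectures, so it can be worth reading it from the target instead of assuming 32 bytes. On kernels that export cache information in sysfs, it is available in the coherency_line_size file of each cache. The following sketch lists it per cache level. The function name cache_lines and its optional arguments are assumptions made for illustration, and the sketch prints nothing on systems that lack these files.

```sh
#!/bin/sh
# Sketch only: cache_lines is a hypothetical helper name.

# cache_lines [CPU] [CPU_DIR]
# Prints level, type and line size of each cache of CPU (default 0),
# as exported under CPU_DIR (default /sys/devices/system/cpu).
cache_lines() (
    cpu_dir=${2:-/sys/devices/system/cpu}
    for idx in "$cpu_dir/cpu${1:-0}"/cache/index*; do
        [ -r "$idx/coherency_line_size" ] || continue
        echo "L$(cat "$idx/level") $(cat "$idx/type"): $(cat "$idx/coherency_line_size") bytes"
    done
)

# Example: show the cache line sizes of CPU3 in the real-time partition
#   cache_lines 3
```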

Summary:

  • Use mlockall(MCL_CURRENT | MCL_FUTURE) to lower memory access latency jitter.

  • Pre-allocate all needed dynamic memory, since the time needed for memory allocation is hard to predict.

  • Avoid sparse memory accesses to better utilize hardware caches.

  • Be careful about sharing memory between tasks running on different CPUs, as each access may end up copying 32 bytes on many architectures.

  • Consider using the M_MMAP_MAX and M_TRIM_THRESHOLD options in mallopt() in case dynamic memory cannot be pre-allocated.

  • Consider using the TLSF heap allocator if lower worst-case latency for malloc() is needed.
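The first, second and fifth items in the summary above can be sketched in C as follows. This is an illustration under stated assumptions, not a complete start-up routine: the function names configure_heap(), lock_memory() and preallocate_pool() are hypothetical, and the pool size is up to the application. Note that mlockall() can fail, for example if the RLIMIT_MEMLOCK limit is too low, so the return value should be checked.

```c
#include <assert.h>
#include <malloc.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Keep freed memory inside the process and avoid mmap() for large
 * requests, so that later malloc()/free() calls are less likely to
 * involve the kernel. Returns 0 on success, -1 if glibc rejected an
 * option. */
int configure_heap(void)
{
    /* mallopt() returns 1 on success and 0 on error. */
    if (mallopt(M_TRIM_THRESHOLD, -1) != 1)
        return -1;
    if (mallopt(M_MMAP_MAX, 0) != 1)
        return -1;
    return 0;
}

/* Lock current and future pages into RAM. Returns 0 on success and -1
 * on failure, with errno set by mlockall(). */
int lock_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Allocate a pool and touch one byte per page, so that every page is
 * mapped before the time-critical code starts. Returns NULL on failure. */
unsigned char *preallocate_pool(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *pool;
    size_t i;

    assert(size > 0);
    if (page <= 0)
        return NULL;
    pool = malloc(size);
    if (pool == NULL)
        return NULL;
    for (i = 0; i < size; i += (size_t)page)
        pool[i] = 0;
    return pool;
}
```

A typical order at start-up would be configure_heap(), then lock_memory(), then preallocate_pool(), all before the real-time part of the application begins.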

4.2 Multi-Threaded Applications

There are two driving forces to make applications multi-threaded:

  1. Easier to design event driven applications.

  2. Make use of concurrency offered by multicore systems.

A rule of thumb when designing a multi-threaded application is to assign one thread for each source of event, output destination and state machine for controller logic. This will usually lead to many threads, which can result in the application spending a lot of time doing task switches between those threads. A user-space scheduler can solve that problem, but then the threads cannot run on different cores. You can also merge threads so that work with higher real-time requirements is kept in separate threads from work with lower real-time requirements.

To make the application truly concurrent, a good choice is to use POSIX threads, pthreads. Each pthread can run on its own core, and all pthreads within the same process are sharing memory, making it easy to communicate between the threads. Be careful however that if the threads need a lot of synchronization, the application will no longer be concurrent despite the fact that several CPUs are used. In this case the application might even become slower than having it as a single thread.

Asynchronous message passing can solve some of the synchronization problems, especially in use cases where a thread with high real-time requirements can delegate work to a thread with lower real-time requirements. An example of such a mechanism is the pair of POSIX functions mq_send() and mq_receive(). One challenge with message passing is how to handle flow control, i.e. what to do when the queue is full.

Message passing as well as other synchronization mechanisms, e.g. the POSIX pthread_mutex_lock(), can suffer from priority inversion where a high priority task is forced to wait for a low priority task. For pthread_mutex_lock() it is possible to enable priority inheritance to avoid this problem, while for message passing this needs to be considered when designing the messaging protocol.

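Example of how to set priority inheritance for a pthread mutex:

#include <pthread.h>

pthread_mutex_t mutex;
pthread_mutexattr_t attr;

/* Create a mutex attribute object and select priority inheritance. */
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
/* The mutex copies the attributes, so attr can be destroyed afterwards. */
pthread_mutex_init(&mutex, &attr);
pthread_mutexattr_destroy(&attr);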

When using gcc, you need to compile the code with options "-pthread -D_GNU_SOURCE".

Summary:

  • When designing an event-driven application, it is often easier to use multiple threads rather than callback functions.

  • Compared to single-threaded applications, multi-threaded applications using pthreads can scale better on multicore systems.

  • In a multi-threaded application where the threads require heavy synchronization, the threads will spend most of the time waiting for each other. This will make the application less scalable.

  • A user-space scheduler, compared to scheduling in the kernel, will allow more synchronization and more threads before efficiency goes down.

  • Properly designed message passing applications can make the synchronization less painful.

  • When synchronizing threads, beware of priority inversion. Message passing protocol needs proper design, and POSIX mutexes can use priority inheritance to avoid priority inversion.

4.3 Application I/O

I/O accesses usually end up as system calls accessing a kernel driver. This kernel driver then accesses hardware, which can have an unbound latency. The calling task will by default block during this period. It is however possible to perform asynchronous I/O to avoid being blocked. See the aio(7) man page for more information about asynchronous I/O.

The driver can add deferred work to a work queue or to soft IRQs, which can make it hard to predict latencies for a real-time application.

Furthermore, the driver might need some timeouts. If using the full dynamic ticks feature, such timeouts may cause tick interrupts to be triggered. One possible solution to this is to delegate the I/O calls to a task running on another CPU than the real-time tasks. Then the latencies caused by deferred driver work will only affect the delegated task, but not the real-time task.

Note that the I/O concerns are also applicable to text messages sent to stdout or stderr, or text read from stdin. If a device driver writes a diagnostic message from the kernel, e.g. by using the kernel function printk(), the I/O concerns apply to this message as well.

Summary:

  • Delegate I/O to a task running on another CPU than the real-time task.

  • If delegation is not possible, asynchronous I/O might help. See the aio(7) man page.
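One possible way to delegate I/O is sketched below, assuming that the real-time task only needs to hand over short text messages. The real-time task writes to a non-blocking pipe, and a separate task (which can be bound to another CPU) forwards the data to the slow device. The pipe-based design and the names log_channel_open(), log_post() and log_drain() are assumptions made for this illustration; they are not part of any Enea Linux API. When the pipe is full the message is dropped, which is one simple answer to the flow control question.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create a pipe whose write end never blocks. The real-time task keeps
 * fds[1]; the delegated I/O task reads from fds[0]. Returns 0 or -1. */
int log_channel_open(int fds[2])
{
    int flags;

    if (pipe(fds) != 0)
        return -1;
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}

/* Called from the real-time task. Returns 1 if the message was queued,
 * 0 if it was dropped because the pipe was full, -1 on other errors.
 * Messages are assumed to be at most PIPE_BUF bytes, so that each
 * write() is atomic and is either done completely or not at all. */
int log_post(int wfd, const char *msg)
{
    size_t len;
    ssize_t n;

    assert(msg != NULL);
    len = strlen(msg);
    n = write(wfd, msg, len);
    if (n == (ssize_t)len)
        return 1;
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}

/* Called from the delegated task: forward queued data to out_fd, which
 * may be slow. Blocks until data is available. Returns the number of
 * bytes forwarded, 0 at end of file, or -1 on error. */
ssize_t log_drain(int rfd, int out_fd)
{
    char buf[256];
    ssize_t n = read(rfd, buf, sizeof(buf));

    if (n > 0 && write(out_fd, buf, (size_t)n) != n)
        return -1;
    return n;
}
```

With this split, any latency caused by the driver behind out_fd is taken by the task calling log_drain(), while log_post() only copies data into the pipe buffer.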

5. Hardware Aspects

The real-time properties of a system do not only depend on the operating system. The hardware is also important. This chapter contains a brief discussion on what to consider regarding real-time capable hardware.

Real-time capable hardware requires, at the minimum, that resources can be accessed within a deterministic time interval. If there is a risk for resource congestion, the hardware must implement deterministic multiple access algorithms which guarantee that there will be no starvation. Another typical hardware requirement in a real-time system is that CPUs have access to reliable high resolution timers.

5.1 CPU

Modern CPUs use a number of techniques to speed up code execution, such as instruction pipelines, out-of-order execution, and branch prediction. They all contribute to better average speed in code execution, but will also cause latency jitter. Read more about these topics on http://en.wikipedia.org/wiki/Instruction_pipeline, http://en.wikipedia.org/wiki/Out-of-order_execution, and http://en.wikipedia.org/wiki/Branch_prediction.

5.2 Shared Resources

Designing real-time applications for SMP systems is more complex than designing for single-CPU systems. Typically, on SMP, all hardware resources such as I/O devices and memory are shared by the CPUs. Since all resources are shared, a low priority task could starve a high priority task, i.e. cause a priority inversion.

There are methods for scheduling multiple access to I/O devices, see Section 2.9, I/O Scheduling, but it is more tricky to manage shared memory. Currently there are no good tools for controlling multiple access to shared memory resources as there are for multiple access to I/O devices. Multiple access to shared memory is more hidden from a software perspective and is largely implemented in hardware.

A proper real-time design should consider how to deal with shared memory and memory congestion, how hyper-threading affects real-time performance, how to use NUMA to improve the execution time, and how to decrease impact of shared resource congestion; all of these topics are covered in this section.

For a deeper study in the area of resource sharing on multi-core devices, see http://atcproyectos.ugr.es/wicert/downloads/wicert_papers/wicert2013_submission_2.pdf.

5.2.1 Shared Memory and Memory Congestion

SMP systems typically share a system bus, cache, MMU and one or several memory channels. An application that causes heavy load on a shared resource can significantly degrade performance for other CPUs. Not only CPUs, but also other devices that support DMA, will increase the congestion.

Memory congestion can have various sources. In Chapter 4, Designing Real-Time Applications you can study the software driven congestion caused by a shared-memory programming model. This section covers the impact from hardware resource congestion due to shared memory and how it affects the worst case execution time.

A real-time application designer probably wants to test the system to find a worst case execution time. The methods below describe how to stress the shared memory and measure an approximate worst case execution time. The methods are suitable for soft and firm real-time systems.

A pessimistic indication of what impact shared memory congestion has on worst case latency can be estimated like this:

  1. Turn off caches. On some architectures it is possible to disable the caches. This gives the absolute worst impact that cache misses can have on worst case execution time. It also makes it possible to measure the impact of congestion on the memory bus and on the off-chip memory.

  2. Start memory load on each general-purpose CPU by calling the command specified below. This gives an indication of what impact the memory bus and off-chip memory congestion have on worst case execution time. This will give a good indication even if it isn't possible to disable the caches.

    The application that is used to generate load is called stress and is described in Appendix B, Benchmark Details. Start the stress application on each non-real-time CPU. Use memory stress with a stride equal to the cache line size of the target architecture. Make sure that the value passed to --vm is larger than the last level shared cache.

    taskset <GP-CPU-n> ./stress --vm <LAST_LEVEL_CACHE_SIZE> \
       --vm-stride <CACHE_LINE_SIZE>

    That will likely thrash the shared cache and cause a minimal number of cache hits for the real-time application.

  3. If the impact of MMU congestion is of interest, repeat step 2 but use a stride size that is equal to the system page size.

    taskset <GP-CPU-n> ./stress --vm <LAST_LEVEL_CACHE_SIZE> --vm-stride <PAGE_SIZE>

The impact of the generated load in the above examples will vary significantly depending on the CPU clock speed, memory clock speed, cache size, coherency algorithm and cache/memory hierarchy. Changing these hardware parameters will create different congestion thresholds. Processor architectures that cannot guarantee CPU access to a specific bus or device within a deterministic amount of time cannot be used for real-time applications.
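To illustrate why the stride matters, the sketch below shows the kind of access pattern that such a memory load generates. It is a simplified illustration and not the implementation of the stress tool; the function name stride_touch() is hypothetical. With a buffer larger than the last level cache and a stride equal to the cache line size, each access is likely to hit a line that is no longer cached; with a stride equal to the page size, each access uses a different page and therefore loads the MMU instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Write one byte every `stride` bytes of `buf`, `passes` times.
 * Returns the total number of bytes written. */
size_t stride_touch(volatile unsigned char *buf, size_t size,
                    size_t stride, unsigned passes)
{
    size_t touched = 0;
    unsigned p;
    size_t i;

    assert(buf != NULL);
    if (stride == 0)
        return 0;
    for (p = 0; p < passes; p++) {
        for (i = 0; i < size; i += stride) {
            /* volatile keeps the compiler from removing the access. */
            buf[i] = (unsigned char)(buf[i] + 1);
            touched++;
        }
    }
    return touched;
}
```

For a 1 MB buffer and a 64-byte stride, one pass performs 16384 writes, each to a different cache line.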

Note

For hard real-time systems, Linux is probably not a suitable operating system. If you choose Linux anyway, a static analysis should be done instead of using the methods above. The static analysis is needed to calculate a theoretical worst case execution time, based on the number of clock cycles for a worst case scenario, which also takes the hardware latency into account.

5.2.2 Hyper-Threading

Hyper-threading means that there are multiple virtual CPUs per physical CPU. Two or more instruction pipelines will share hardware resources. A low priority process and a high priority process can run on separate virtual CPUs belonging to the same physical CPU. This can lead to a situation where the low priority process decreases the performance of the high priority process. It is recommended to disable hyper-threading when real-time performance is required. Another approach is to make sure that each real-time task has exclusive access to a physical CPU.

5.2.3 NUMA

The negative impact on worst case execution time can largely be eliminated if the target hardware supports NUMA. By using the Linux cpuset feature it is easy to give the real-time application its own memory node. Read more about this in Section 2.6, CPU Isolation. Note that memory congestion will also occur if the real-time application runs on multiple CPUs in the real-time partition. However, that should be more manageable.

5.2.4 Decrease Impact of Shared Resource Congestion

Below is a list with suggestions on how to decrease the impact of shared resource congestion.

  1. If the platform has NUMA: Dedicate one NUMA node to the RT application. See Section 3.3.1, CPU Isolation.

  2. Disable hyper-threading. If that isn't possible, use CPU isolation with static affinity so that only one real-time task executes per physical CPU.

  3. Disable the cache if the architecture allows it. Do this to avoid possible indeterminism added by cache misses. If this is needed, it could indicate that Linux as operating system or the hardware platform is unsuitable for the application.

  4. On some architectures it might be possible to lock real-time application pages into the cache. Consult the processor manual and, if available, the hardware platform specific SDK manual.

5.3 The System Management Mode (x86)

The x86 architecture has an operating mode called System Management Mode, also known as SMM. It is "a special-purpose operating mode provided for handling system-wide functions like power management, system hardware control, or proprietary OEM designed code." The SMM is entered via an event called system management interrupt, SMI. SMM/SMI has a higher priority than the operating system and will therefore affect the latency. It cannot be disabled by the OS, and even if there might be other ways to disable it, it should probably be kept since it also handles thermal protection. Consequently, there is not much that can be done about it except for adding enough margins to tolerate it. Read more about SMM in http://en.wikipedia.org/wiki/System_Management_Mode.

Appendix A: Optimizing Example - P2041

When someone states the goal to "optimize a specific Linux target for real-time" and provides a benchmark result, it is very important to be clear on what capabilities the measured system actually has. Benchmark results may be interesting to read, but they are only valid and relevant if they are somewhat comparable with each other and if the setup is relevant for real-world use cases.

This appendix states the goal to optimize for real-time, but it actually tries to reach as far as possible regarding both throughput performance and low worst-case latency response time since the use case we focus on is an embedded control system within the communications domain, which normally has both fairly high soft real-time requirements and performance requirements.

A.1 In reality, "real-time" is not only about interrupt latency

In a real-time system, the characteristic behavior of the entire operating system is very important. To start with, a deterministic response time from an external event until the application is invoked is what we normally refer to when talking about real-time characteristics of an operating system. This implies not only the interrupt latency, but also the event chain until the application gets scheduled. Since a chain is not stronger than its weakest link, it is also important to provide a deterministic runtime environment for the entire application execution flow so that it can respond within the specified operational deadline on system level. This implies that also the task scheduler and the resource handling API in the OS must behave in a deterministic way. When designing a system for high throughput performance, the goal is to keep down the average latency, while designing a real-time system aims to keep the worst case latency under a specified limit. As a consequence, the design of a both high-performing and real-time capable system must take both average and maximum latency into account, and this is something we will strive for in this application note.

A.2 Performance and latency tuning strategy

The selected target for this exercise is the Enea Linux PowerPC kernel for the p2041rdb board. The p2041rdb kernel (and most other Enea Linux PPC kernels) is built in two different ways:

  • A high-performance kernel image that is optimized for high throughput performance and low footprint, for example intended for IP benchmarking. This image, and its corresponding kernel configuration file, is tagged "RELEASE".

  • A demo/debug kernel image that has the goal to be fully configured regarding all kinds of online and off-line debug capabilities. The demo/debug image is not tagged, and this is also the one that you can modify via the command bitbake -c menuconfig virtual/kernel, and also rebuild via the command bitbake virtual/kernel.

Both kernels are configured as Low-Latency Desktop (LLD) kernels, i.e. the most preemptive standard kernel variant (selected by CONFIG_PREEMPT).

The strategy we will follow in the tuning effort is to go through a number of steps, and for each step briefly describe the configuration level and the latency benchmark result:

  1. The first attempt is with the default "demo/debug" kernel image, for the main reason to highlight the difference to the end result caused by both debug overhead and selective tuning of important parameters.

  2. The performance-optimized "RELEASE" kernel image, which is clearly configured for speed. However, it will show here that it is not optimized for latency, and we can do additional tuning to improve both performance and worst case latency.

  3. A standard LLD kernel image highly tuned and optimized for real-time AND performance. This kernel is based on the RELEASE kernel configuration but with additional build configuration changes and boot time configuration options that give the smallest worst-case latency figure possible while not compromising the performance.

  4. Finally, we enable the PREEMPT_RT patch as an "overlay" on the previous LLD kernel in 3) in order to see our possible best result regarding worst case latency.

Note that the goal here is to optimize for performance and real-time. In the development, deployment and maintenance phases of a real-life production system, we will very likely have to add back some of the tracing and debugging features we have now removed, because otherwise the system will become unmaintainable. This has a price in overhead, but the performance/debug tradeoffs differ from case to case.

A.3 Configuring and building the Enea Linux PowerPC p2041rdb kernel(s)

The Enea Linux PowerPC kernel is built in a so-called "staged" manner: first the "RELEASE" image is configured and built, and then the normal demo/debug image is configured and built. The recipe for the kernel build can be found under poky/meta-enea-recipes-kernel/linux. The file linux-qoriq-sdk.bbappend is essential here; it describes exactly what configuration shall go into both the RELEASE kernel and the demo kernel. The kernel configuration file (.config) is built up incrementally by merging configuration fragments in a specific order from the sub-directory files/cfg according to specific variable definitions.

The configuration file for the high-performance RELEASE kernel is defined by the incremental merge of the fragments specified in KERNEL_FEATURES variable, and the resulting .config file can also be found as the config-p2041rdb-RELEASE.config file in the deployment images directory. The default demo/debug kernel has got additional configuration fragments merged to its .config file, specified by the STAGING_KERNEL_FEATURES variable, and the aggregated .config file is named config-p2041rdb.config.

A.4 Benchmark description

The worst case latency benchmark uses a combination of cyclictest ( https://rt.wiki.kernel.org/index.php/Cyclictest ) and stress ( http://people.seas.harvard.edu/~apw/stress ). The buffer sizes used with stress are chosen both to generate heavy load on the network via NFS traffic in the hdd test, and to resemble a real-life embedded application. The values are:

Table A.1 Details of stress scenarios

Test scenario   Stress
hdd             ./stress -d 4 --hdd-bytes 1M
vm              ./stress -m 4 --vm-bytes 4096
full            ./stress -c 4 -i 4 -m 4 --vm-bytes 4096 -d 4 --hdd-bytes 4096


The benchmark runs one stress instance per core in parallel with the cyclictest program:

./cyclictest -S -m -p99 -l 100000
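The principle behind this kind of measurement can be sketched in a few lines of C: sleep until an absolute point in time, and record how late the wake-up actually was. The code below is a simplified illustration under stated assumptions and not the implementation of cyclictest; it runs a single thread, does not set a real-time priority and does not lock memory. The names latency_stats and measure_wakeup_latency() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

struct latency_stats {
    int64_t min_ns;
    int64_t max_ns;
    int64_t avg_ns;
};

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Sleep until an absolute deadline `cycles` times, `interval_ns` apart,
 * and record how late each wake-up was. Returns 0 on success, -1 on
 * error. */
int measure_wakeup_latency(unsigned cycles, int64_t interval_ns,
                           struct latency_stats *out)
{
    struct timespec next, now;
    int64_t sum = 0;
    unsigned i;
    int rc;

    assert(out != NULL);
    if (cycles == 0 || interval_ns <= 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;
    out->min_ns = INT64_MAX;
    out->max_ns = 0;

    for (i = 0; i < cycles; i++) {
        int64_t deadline = ts_to_ns(&next) + interval_ns;
        int64_t late;

        next.tv_sec = (time_t)(deadline / 1000000000LL);
        next.tv_nsec = (long)(deadline % 1000000000LL);

        /* clock_nanosleep() returns the error number directly. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0 || clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        late = ts_to_ns(&now) - deadline;
        if (late < out->min_ns)
            out->min_ns = late;
        if (late > out->max_ns)
            out->max_ns = late;
        sum += late;
    }
    out->avg_ns = sum / (int64_t)cycles;
    return 0;
}
```

The min, average and max values in the tables that follow correspond to min_ns, avg_ns and max_ns here, expressed in microseconds and collected by cyclictest with one thread per core.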

A.5 The "default" LLD kernel targeting demo/debug

This kernel is configured to contain all kinds of debug features, and thus it has a lot of overhead. Below is an enumeration of the features added, briefly described by the name of the configuration fragment:

  • files/cfg/00020-debug.cfg: misc collection of numerous ftrace options, performance events, counters, stacktrace support, file system debug options.

  • files/cfg/00033-kprobes.cfg

  • files/cfg/00007-oprofile.cfg

  • files/cfg/00019-i2c.cfg

  • files/cfg/00027-lttng.cfg

  • files/cfg/00025-powertop.cfg

  • files/cfg/00004-systemtap.cfg

  • files/cfg/00014-kgdb.cfg

Benchmark results for the default build uImage-p2041rdb.bin image (prebuilt in the distribution):

Table A.2 Benchmark numbers for the "default" LLD kernel (demo/debug)

Latency [µs]   Stress type
               no stress   cpu   io    vm    hdd   full
min            12          10    19    10    11    10
average        24          17    32    18    30    25
max            382         33    90    39    388   230


The table above shows the min, average and max latency in microseconds, as printed by the cyclictest program, which runs one instance on each core in parallel with the stress program.

The result shows a fairly long and fluctuating worst case latency, as well as a significant overhead in min- and average values. The conclusion is that this is neither a suitable kernel configuration for production systems with any kind of real-time requirements, nor for systems where the performance is important. This kernel is fully-featured with its pros and cons.

A.6 The performance-optimized RELEASE kernel

As described earlier, this is a kernel configuration without the debug features found in the default demo/debug kernel. Benchmark results for the build uImage-p2041rdb-RELEASE.bin (prebuilt in the distribution):

Table A.3 Benchmark numbers for the performance-optimized RELEASE kernel

Latency [µs]   Stress type
               no stress   cpu   io    vm    hdd   full
min            4           4     5     4     4     4
average        7           5     6     4     7     6
max            67          32    18    25    126   56


The result still shows a somewhat fluctuating worst case latency, but now the min- and average values are significantly improved. The conclusion is that this is still not a suitable kernel configuration for production systems with any kind of real-time requirements. However, for systems where you want to minimize the kernel overhead in order to maximize the application performance, this is a potential base configuration from which you may add debug features deemed necessary in field.

A.7 The "rt-optimized" mainstream Low-Latency Desktop kernel

The two kernel builds previously described exist "out-of-the-box" in the Enea Linux p2041rdb distribution, so in order to be able to do further benchmarking we need to describe here how we can modify the kernel configuration for new builds. We can do this either by using the command

bitbake -c menuconfig

or we can temporarily modify the kernel recipe in the meta-enea layer. We will choose to modify the kernel recipe, mainly in order to enable reproduction, but also because we would otherwise have to do a substantial amount of "reversing" of options in the menuconfig command, since the .config file we have to work with is the demo/debug one with all debug features enabled. The modifications we have to do can be described in three steps:

  1. Add a new, latency-optimizing, fragment file 000xx-latency.cfg under files/cfg that contains:

    # CONFIG_HIGHMEM is not set
    # CONFIG_FSL_ERRATUM_A_004801 is not set
    # CONFIG_FSL_ERRATUM_A_005337 is not set
    CONFIG_HZ_1000=y
    CONFIG_HZ=1000
    CONFIG_JUMP_LABEL=y
    # CONFIG_DEBUG_PREEMPT is not set
    # CONFIG_DEBUG_SHIRQ is not set
    # CONFIG_RCU_TRACE is not set
    # CONFIG_TREE_RCU_TRACE is not set
    CONFIG_RCU_BOOST=y
    CONFIG_RCU_BOOST_PRIO=1
    CONFIG_RCU_BOOST_DELAY=500
    CONFIG_RCU_NOCB_CPU=y

    Our intended target system for the exercise is still an embedded 32-bit target and we want the tick to be at least 1 ms. As of today, most p2041 processor devices are rev 2 or later, which means that we can safely disable the HW errata workaround that disables HW MMU table walk and reverts to SW MMU table walk, which makes the kernel slower. GCC for PowerPC produces slightly better code using likely/unlikely, so we enable this. We will also remove some additional potential tracing overhead. A fundamental contribution to improving the real-time characteristics and reducing OS jitter is to enable priority boosting for RCU readers and to enable offloading of RCU callback invocation to kernel threads (from kernel version 3.7).

  2. Edit the file linux-qoriq-sdk.bbappend. Replace the content of STAGING_KERNEL_FEATURES (debug, kprobes, oprofile, i2c, lttng, powertop, systemtap, kgdb) with the one single fragment cfg/00040-latency from step 1 above.

  3. Add the argument threadirqs to the Linux kernel boot argument list using U-Boot, e.g.:

    setenv bootargs threadirqs root=/dev/nfs rw …

The actions above will further improve the worst case latency figures as much as possible for a standard LLD PowerPC Linux kernel. The RCU callbacks can be fairly heavy, and if their execution is offloaded to and balanced among preemptible kernel threads, we get lower jitter in the response to external interrupts and thus better worst case latency figures. Similarly, some ISRs (Interrupt Service Routines) can be very long, and since these normally execute with interrupts (and thus preemption) disabled, there is a high risk that such ISRs add to the worst case latency. Since kernel version 2.6.39 (May 2011, as a "spinoff" from the PREEMPT_RT patch) it is possible to give the boot time parameter threadirqs, which instructs the kernel to defer the execution of each ISR from hardirq context to a preemptible kernel thread, one per ISR. This removes much of the drivers' ISR execution time from the sources of latency jitter, and thus contributes to improving the overall determinism. Due to the added context switch it may increase the overhead slightly, but since this is a subset of the PREEMPT_RT patch, we know that the corresponding overhead is less than in the PREEMPT_RT case.

Benchmark results for the build uImage configured according to above and with boot time threadirqs:

Table A.4 Benchmark numbers for the "rt-optimized" mainstream LLD kernel

Latency [µs]   Stress type
               no stress   cpu   io    vm    hdd   full
min            3           3     4     3     3     3
average        6           3     5     3     6     4
max            31          16    19    20    45    44


The resulting figures demonstrate a significant decrease in jitter, and we end up with a fairly good worst-case latency of 45 µs even while loading the system with network traffic via the NFS file system operations in the hdd test. The minimal and average values are also very low, which indicates that the average code path executed when preemption is disabled has decreased. The conclusion here is that this kernel configuration would be a possible option for production systems with soft real-time requirements in the area of 100 µs.

A.8 The "rt-optimized" PREEMPT_RT kernel

Just as for the rt-optimized LLD kernel in the previous section, we have to modify the kernel recipe temporarily in order to configure and build this kernel. The modifications we have to do can be described as follows:

1. Repeat steps 1 & 2 in Section A.7, The "rt-optimized" mainstream Low-Latency Desktop kernel.

2. Make another copy of the STAGING_KERNEL_FEATURES_RT_p4080ds statement and call it STAGING_KERNEL_FEATURES_p2041rdb. This will generate the merge of the fragment 00018-rt on top of the LLD kernel and enable the preempt_rt patch.

3. The existence of the _RT statement triggers a third stage kernel build, named uImage-p2041rdb-rt.bin and a corresponding config file.

Benchmark results for the build uImage-p2041rdb-rt.bin:

Table A.5 Benchmark numbers for the "rt-optimized" PREEMPT_RT kernel

Latency [µs]   Stress type
               no stress   cpu   io    vm    hdd   full
min            3           3     3     3     3     3
average        6           3     7     3     7     4
max            16          12    19    13    27    18


The benchmark results show that the preempt_rt patched kernel has even better worst-case latency figures: 27 µs compared to the 45 µs we could reach with the standard LLD kernel. The other observation is that the minimal and average figures are very similar, perhaps with a slightly longer average latency, and thus overhead, for the preempt_rt kernel, but this is not significant. The conclusion here is that a preempt_rt kernel configuration would be a possible option for production systems with soft real-time requirements in the area of 50-100 µs.

A.9 Summary and Conclusion

As a short summary, the result as we have seen ranges from around 300 µs in worst case latency for the demo/debug LLD kernel down to as low as 40-45 µs with the rt-optimized LLD kernel with threadirqs and 25-30 µs when PREEMPT_RT is enabled. We have also seen a significant decrease of the minimal and average latency from around 20-25 µs down to about 3-6 µs, which implies that we also have got an overall significant throughput performance increase.

The benchmark indicates that the development of the mainline Low-Latency Desktop kernel over the last few years, with for example the threaded IRQs feature and the offloaded RCU callback feature, has made it possible to reduce the OS jitter and worst case latency to a level where it starts to be a real alternative to the preempt_rt patched kernel for an embedded Linux system with soft real-time requirements.

The benchmarks above are constructed only to indicate potential ways forward to reach soft real-time requirements. The chosen test case does not in any way guarantee that the results are valid for any type of BSP in any type of system. It is important to note that other or different versions of drivers may affect the result, as well as different versions of the kernel or a different application use case pattern.

Appendix B: Benchmark Details

The combination of cyclictest and stress has been used for the benchmarking presented in this document. Details are given in the following tables:

Table B.1 Parameters to cyclictest used in this document

Parameter    Description
-S           Alias for '-t -a -n', i.e. one thread bound to each core. Use clock_nanosleep instead of POSIX timers
-m           Lock current and future memory allocations in memory
-p99         Set worker thread priority to 99
-l 100000    Set number of cycles
-q           Quiet run. Print only a summary on exit

Table B.2 Stress scenarios

Id     Command                                         Description
cpu    ./stress -c <n>                                 <n> worker threads spinning on sqrt()
hdd    ./stress -d <n> --hdd-bytes 20M                 <n> worker threads spinning on write()/unlink(). Each worker writes 20MB.
io     ./stress -i <n>                                 <n> worker threads spinning on sync()
vm     ./stress -m <n> --vm-bytes 10M                  <n> worker threads spinning on malloc()/free(). Buffer size: 15MB
full   ./stress -c <n> -i <n> -m <n> --vm-bytes 15M    <n> worker threads each doing the cpu, io & vm stress. Buffer sizes are malloc: 256MB, write: 15MB.


The number of worker threads (<n>) is set to the number of cores for the target.

Example

$ ./stress -c 4 &
$ ./cyclictest -S -m -p99 -l 300000 -q
...
$ killall stress

Index

A

aio, Application I/O
application, Full Dynamic Ticks
applications, Designing Real-Time Applications
asynchronous message passing, Multi-Threaded Applications

B

block device write back, Optimizing a Partitioned System
branch prediction, CPU

F

full dynamic ticks, Full Dynamic Ticks

I

instruction pipelines, CPU
interrupt, Soft IRQs and Tasklets
interrupts, Interrupts
isolation, CPU Isolation, CPU Isolation
isolcpus, Full Dynamic Ticks

J

jitter, CPU Isolation, CPU

O

off-chip memory, Shared Memory and Memory Congestion
out-of-order execution, CPU

U

user space, Kernel Preemption Model
user space runtime, User Space Runtime
user-space, System calls

V

virtual memory statistics, Optimizing a Partitioned System