
Why ARM embedded systems need memory alignment

Author: Artificial intelligence learning

In embedded system software development we constantly see all kinds of alignment in code. Often we know that alignment must be done, but not why it matters or what the consequences of misalignment are. This article summarizes the main reasons for memory alignment.

CPU architecture and MMU requirements

  • Some RISC instruction sets do not support unaligned memory accesses at all, for example MIPS, PowerPC and some DSPs; an unaligned access on these CPUs raises an unaligned-access exception.
  • The ARM instruction set has supported unaligned memory access since ARMv6 (ARM11); older cores such as the ARM9 do not support it. Support for the feature has been added incrementally across ARM instruction-set versions.
  • Although the modern Cortex-A series CPUs implementing ARMv7/ARMv8 all support unaligned memory access, a modern SoC (shown in the figure below) contains several heterogeneous CPUs working together: the main ARM64 CPU that runs the Linux/Android operating system may support unaligned access, but the SoC also contains co-processor CPUs whose architecture and version are not fixed (they may be MIPS, ARM7, Cortex-R/M cores, or even 8051-class microcontroller cores).
  • These co-processors share different address ranges of physical memory with the main ARM64 CPU and run their own firmware out of that memory, so memory alignment still has to be considered when partitioning the address space, especially since these co-processors may not support unaligned access.
[Figure: heterogeneous CPU cores sharing physical memory in a modern SoC]
  • Similarly, ARM's MMU places alignment requirements on the addresses it uses for virtual-address management. The figure below shows how the ARM MMU works and how the multi-level page tables (translation tables) are indexed.
[Figure: ARM MMU operation and multi-level translation table indexing]
  • MMU requirements for ARM architecture
    • The ARM 32-bit architecture requires the L1 first-level translation table base address (the L1 Translation Table Base Address) to be aligned to a 16 KB boundary, and the L2 second-level translation table address to be aligned to a 1 KB boundary (a declaration sketch follows this list).
    • The ARM 64-bit architecture requires virtual-address bits 21-28 (VA[28:21]) to be aligned to the 64 KB granule, and bits 16-20 (VA[20:16]) to the 4 KB granule.
  • The different Memory types in ARM's memory-ordering model have different requirements for unaligned memory access. The access rules of the three Memory types are:
    • Only Normal Memory supports unaligned memory access
    • Strongly-ordered and Device memory do not support unaligned memory access.
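
As an illustration of the 32-bit page-table requirement above, here is a minimal bare-metal C sketch (the names and the bare-metal context are assumptions, not code from this article): the short-descriptor L1 table holds 4096 32-bit entries (16 KB) and must sit on a 16 KB boundary, while an L2 coarse table holds 256 entries (1 KB) and must sit on a 1 KB boundary.

```c
#include <stdint.h>

/* Hypothetical bare-metal ARM32 sketch: TTBR0 only carries the upper address
 * bits of the L1 table base, so the table itself must be 16 KB aligned. */
#define L1_ENTRIES 4096u   /* 4096 x 4 bytes = 16 KB */
#define L2_ENTRIES 256u    /* 256 x 4 bytes  = 1 KB  */

static uint32_t l1_translation_table[L1_ENTRIES]
    __attribute__((aligned(16 * 1024)));   /* 16 KB boundary for the L1 table */

static uint32_t l2_translation_table[L2_ENTRIES]
    __attribute__((aligned(1024)));        /* 1 KB boundary for an L2 table */
```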

Impact on atomic operations

Although modern ARM CPUs implementing the ARMv7/ARMv8 instruction sets support unaligned memory access, an unaligned access is not guaranteed to be atomic. The following diagram shows the memory layout of a variable when it is aligned and when it is not:

[Figure: memory layout of an aligned variable versus an unaligned variable]
  • An aligned variable can be staged in a single general-purpose CPU register, so reading or writing it is guaranteed to be a single atomic operation.
  • Accessing an unaligned variable is not atomic: reading or writing it usually requires two memory accesses, so atomicity cannot be guaranteed. A small example follows.
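
As a concrete illustration, here is a minimal C sketch (hypothetical code, not from the article) that forces a 32-bit field onto an odd offset with a packed struct and contrasts it with a naturally aligned variable. A store to the aligned variable is a single word access; a store to the packed field sits at offset 1 and may be split into several narrower accesses, so another core could observe a half-written value.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Packing pushes 'value' to offset 1, so it straddles two aligned words. */
struct __attribute__((packed)) unaligned_holder {
    uint8_t  pad;
    uint32_t value;
};

static _Alignas(4) uint32_t aligned_value;   /* natural 4-byte alignment */

int main(void)
{
    struct unaligned_holder h = { .pad = 0, .value = 0 };

    aligned_value = 0xAABBCCDDu;   /* single aligned word store: atomic      */
    h.value       = 0x11223344u;   /* unaligned store: may take two accesses */

    printf("offset of h.value = %zu\n", offsetof(struct unaligned_holder, value));
    return 0;
}
```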

ARM NEON

Modern ARM CPUs generally include a NEON coprocessor, which is used for SIMD parallel vector acceleration, for example in floating-point computation. The following diagram illustrates the basic principle of NEON SIMD parallel vector computation:

[Figure: NEON SIMD parallel vector computation]
  • NEON natively supports unaligned memory access
  • However, a NEON load or store from unaligned memory generally takes two instruction cycles.
  • In general, to make full use of NEON's parallelism, the variables used in SIMD vector operations should be aligned to the lane width of the NEON registers: 8-bit alignment for 8-bit lanes, 16-bit alignment for 16-bit lanes, and so on. A short intrinsics sketch follows this list.
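
Below is a minimal NEON intrinsics sketch, assuming an ARMv7/ARMv8 target where <arm_neon.h> is available (the buffer names and sizes are illustrative). The buffers are aligned to the 16-byte NEON register width so every vector load and store lands on an aligned address.

```c
#include <arm_neon.h>
#include <stdint.h>

#define N 64   /* multiple of 16 so every 16-byte vector access stays in bounds */

/* Align the buffers to the 128-bit (16-byte) NEON register width. */
static uint8_t src_a[N] __attribute__((aligned(16)));
static uint8_t src_b[N] __attribute__((aligned(16)));
static uint8_t dst[N]   __attribute__((aligned(16)));

/* Add two byte arrays 16 lanes at a time. */
void add_u8_vectors(void)
{
    for (int i = 0; i < N; i += 16) {
        uint8x16_t va = vld1q_u8(&src_a[i]);   /* aligned 16-byte load  */
        uint8x16_t vb = vld1q_u8(&src_b[i]);
        vst1q_u8(&dst[i], vaddq_u8(va, vb));   /* aligned 16-byte store */
    }
}
```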

Impact on performance (perf)

  • In general, accessing unaligned memory addresses on ARM can cause a significant performance drop even though modern ARM CPUs support unaligned access, because an unaligned access may need extra load/store operations, which increases the instruction cycles the program spends.
  • The perf tool provides an alignment-faults event that can be used to count the unaligned memory accesses a program makes. A small test program is sketched below.
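
As a rough illustration (hypothetical code and binary name, not from the article), the program below keeps reading 32-bit values from odd offsets in a byte buffer; it can be run under `perf stat -e alignment-faults ./unaligned_test` to watch the event. Note that on cores that handle unaligned Normal-memory access in hardware the kernel never has to fix anything up, so the counter may stay at zero even though the accesses are unaligned.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t buffer[1024];

int main(void)
{
    volatile uint32_t sum = 0;

    for (long iter = 0; iter < 1000 * 1000L; iter++) {
        /* Offsets 1, 5, 9, ... are never 4-byte aligned. */
        for (size_t off = 1; off + sizeof(uint32_t) <= sizeof(buffer); off += 4) {
            uint32_t v;
            /* memcpy is the portable way to express an unaligned read; the
             * compiler lowers it to an unaligned load where that is legal. */
            memcpy(&v, &buffer[off], sizeof(v));
            sum += v;
        }
    }
    printf("sum = %u\n", (unsigned)sum);
    return 0;
}
```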

Cache line alignment

In addition to aligning to the width of the CPU's memory accesses, program optimization should also take the cache into account and align the variables you access to the cache line length.

  • The cache line is the smallest unit of data transfer between the cache and memory: the cache generally reads and writes the mapped memory address one whole cache line at a time (the cache/cache-line structure diagrams are quoted from Cenalulu's article).
  • Different ARM CPU models have different cache line lengths, so even when porting an optimized program from ARM platform A to ARM platform B, check whether the two CPUs' cache line sizes match and whether the cache-line-alignment optimizations need to be re-tuned.
  • The data sheets of several reference ("public-version") ARMv7 CPUs list different cache line sizes; the current reference ARMv8 64-bit CPUs (A53, A57, A72, A73) all use a 64-byte cache line, but CPUs customized from the ARM reference designs may differ from vendor to vendor, so always consult the relevant TRM before adjusting, aligning and optimizing.
  • The figure below shows an example, quoted from cenalulu, of memory read/write performance jitter without cache line alignment. The idea of the test program is to perform 100 million read/write operations on arrays of different sizes and record the time each array size takes; a sketch of this kind of test appears after the figure.
  • The results show that while the array is smaller than the cache line the read/write time barely changes, but as soon as the array size exceeds the cache line size the time jitters sharply.
  • This is because array elements beyond the cache line size may not have been prefetched into the cache line; once the elements in the cache line have been consumed, the remaining data must be fetched from memory again and the cache line refilled, which causes the jitter. The lesson is to make full use of the cache: align your data to the cache line and keep the data a piece of code accesses locally within one cache line to improve system performance.
[Figure: read/write time jitter as the array size crosses the cache line size]
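
Below is a minimal sketch of the kind of timing test described above (this is not the original code from cenalulu's article; only the idea of timing 100 million accesses over arrays of different sizes is taken from the description):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS (100 * 1000 * 1000L)   /* 100 million accesses, as described */

static int64_t arr[64];

/* Time ITERATIONS sequential read-modify-writes over the first 'len' elements. */
static double time_rw(size_t len)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERATIONS; i++)
        arr[i % len] += 1;
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    /* With a 64-byte cache line, 8 x int64_t fit in one line; timings tend to
     * step up once the working set spills over a line boundary. */
    for (size_t len = 1; len <= 32; len *= 2)
        printf("array of %2zu int64_t: %.3f s\n", len, time_rw(len));
    return 0;
}
```
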
  • If a variable is not aligned so that it falls within a single cache line, an operation that crosses cache lines is not atomic on a multi-core SMP system and the value risks being torn. (The example is quoted from kongfy.) The test program works as follows: on a CPU whose cache line is 64 bytes, it defines a 68-byte struct consisting of a 60-byte pad[15] array followed by an 8-byte variable v; the struct therefore exceeds 64 bytes, v cannot fall entirely within one cache line, and the struct as a whole is not cache-line aligned. A sketch appears after this list.
  • The global variable value.v starts at 0. The program starts multiple threads, each of which repeatedly applies a bitwise NOT (~) to value.v. Intuitively, the bits of the final value.v should be either all 0 or all 1, but in practice the result ends up half 1s and half 0s. Because the access crosses cache lines it is not atomic: while one thread inverts the half of value.v that lies in the first cache line, another thread can invert the half that lies in the second cache line, and when the first thread then inverts the second half, the two halves end up inconsistent with intuition.
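
Below is a minimal sketch of the kind of test described above (a hypothetical reconstruction, not kongfy's original code), assuming a 64-byte cache line:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* 60 bytes of padding followed by an 8-byte v: with the struct base aligned to
 * a 64-byte cache line, v occupies bytes 60..67 and straddles two lines. */
struct __attribute__((packed)) straddle {
    int32_t pad[15];          /* 60 bytes */
    volatile uint64_t v;      /* 8 bytes crossing the cache line boundary */
};

static struct straddle value __attribute__((aligned(64)));  /* value.v starts at 0 */

static void *flip(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100 * 1000 * 1000L; i++)
        value.v = ~value.v;   /* not atomic when v spans two cache lines */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, flip, NULL);
    pthread_create(&t2, NULL, flip, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected all 0s or all 1s; a mixed bit pattern shows the update was torn. */
    printf("value.v = 0x%016llx\n", (unsigned long long)value.v);
    return 0;
}
```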
