
Memory Alignment: Principles, Effects, Examples, and Rules (with English and Chinese explanations; applies to sizeof on structs)


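Contents
  Preface
  1. Memory access granularity: alignment explained from the memory's point of view
     Alignment fundamentals
     Lazy processors
  2. Speed (the basic principle of memory alignment)
     Code explanation
     Chinese-language code and memory-layout walkthrough
  3. Possible consequences of not understanding memory alignment
  4. Memory alignment rules
     Why memory is aligned
     Alignment rules
     Experiments
  5. Authors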

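Preface

This article is a synthesis of four blog posts and is not original work. It explains why memory is aligned and what alignment does (drawing on English and Chinese sources), along with the alignment rules. It is especially relevant to taking sizeof of a struct.

It first explains the principle of memory alignment, then its effects, and finally works through examples. The Chinese-language authors explain alignment in detail with worked examples. One point to note: in struct { char a; int b; char c; } A;, a occupies one byte of a four-byte unit, leaving three. But b needs four bytes, which the remaining three clearly cannot hold, so b has to be placed in the next four-byte unit. I think the most important point made by the second Chinese author (in the order of the blog URLs listed at the end) is that drawing a diagram is a very good method.

The fourth and last blog I cite explains the alignment rules through detailed code.

In system-level or driver-level development, and in hard-real-time or high-security programs, memory layout remains the foundation of a stable, safe, and efficient program. Memory alignment is therefore worth mastering, at least well enough to hold your own on CSDN!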

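The article may seem disorganized. If the English parts are hard to follow, skip straight to the sections taken from the Chinese blogs, although few people read my blog anyway.                                                                 --QQ124045670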

1. Memory access granularity: alignment explained from the memory's point of view

Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java has its byte[] type to represent raw memory.

Figure 1. How programmers see memory


However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16- or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.

Figure 2. How processors see memory


The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.

If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:

  • Your software will run slower.
  • Your application will lock up.
  • Your operating system will crash.
  • Your software will silently fail, yielding incorrect results.

Alignment fundamentals

To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.

First examine what would happen on a processor with a one-byte memory access granularity:

Figure 3. Single-byte memory access granularity


This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:

Figure 4. Double-byte memory access granularity


When reading from address 0, a processor with two-byte granularity takes half the number of memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.

However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.

Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC 601:

Figure 5. Quad-byte memory access granularity


A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.
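The access counts in Figures 3 to 5 can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a simplified model in which memory is split into granules of a fixed size and every granule touched costs exactly one access; the function name count_accesses is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of granule-sized memory accesses needed to read `size` bytes
 * starting at `addr`, on a processor whose memory access granularity is
 * `granularity` bytes. Hypothetical helper, simplified model only. */
unsigned count_accesses(uint32_t addr, uint32_t size, uint32_t granularity)
{
    assert(size > 0 && granularity > 0);
    uint32_t first = addr / granularity;              /* granule holding the first byte */
    uint32_t last  = (addr + size - 1) / granularity; /* granule holding the last byte  */
    return last - first + 1;
}
```

Under this model, reading four bytes from address 0 costs 4, 2, and 1 accesses at granularities of 1, 2, and 4 bytes, while reading from address 1 costs 4, 3, and 2.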

Now that you understand the fundamentals behind aligned data access, you can explore some of the issues related to alignment.

Lazy processors

A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:

Figure 6. How processors handle unaligned memory access


The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
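The read, shift, and merge steps can be imitated in software. This is only a sketch of the idea, not how any particular processor implements it: memory is modelled as a byte array, and chunks are assembled big-endian (as on the 68K and PowerPC) so the result does not depend on the host machine. The function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one aligned four-byte chunk from the modelled memory, assembling
 * the bytes in big-endian order. */
uint32_t read_aligned_chunk(const uint8_t *mem, size_t addr)
{
    assert(addr % 4 == 0);
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* Read four bytes from any address using only aligned chunk reads. */
uint32_t read_unaligned(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)3;            /* round down to a chunk boundary   */
    unsigned shift = (unsigned)(addr & 3) * 8;  /* unwanted leading bits in chunk 1 */
    uint32_t first = read_aligned_chunk(mem, base);
    if (shift == 0)
        return first;                           /* aligned: one access is enough    */
    uint32_t second = read_aligned_chunk(mem, base + 4);
    /* Shift the unwanted bytes out of each chunk, then merge the two. */
    return (first << shift) | (second >> (32 - shift));
}
```

With memory holding the bytes 00 11 22 33 44 55 66 77, a read from address 1 takes two chunk reads and yields 0x11223344.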

Some processors just aren't willing to do all of that work for you.

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.

All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.

An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.

The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.

On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.

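2. Speed (the basic principle of memory alignment)

One benefit of memory alignment is faster memory access. Data structures occupy memory, and many systems require allocations to be aligned. The code below shows why alignment improves access speed.

Code explanation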

Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:

  1. The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move onto two-, four- and eight-bytes at a time.
  2. The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.
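The original test harness is not shown, so the following is only a guess at how the staggering might be set up. It assumes a C11 compiler (for _Alignas), a small stand-in buffer size, and helper names invented for this sketch: storage is over-allocated on a 16-byte boundary so that offset k gives a pointer misaligned by exactly k bytes.

```c
#include <assert.h>
#include <stdint.h>

enum { BUFFER_SIZE = 1024, MAX_STAGGER = 8 };  /* sizes chosen for illustration */

/* Backing storage, over-allocated by MAX_STAGGER bytes and forced onto a
 * 16-byte boundary so that offset k is misaligned by exactly k bytes. */
static _Alignas(16) uint8_t storage[BUFFER_SIZE + MAX_STAGGER];

/* Return a BUFFER_SIZE-byte buffer that starts `offset` bytes past a
 * 16-byte boundary. */
uint8_t *staggered_buffer(unsigned offset)
{
    assert(offset < MAX_STAGGER);
    return storage + offset;
}

/* Misalignment of a pointer relative to an access size (a power of two). */
unsigned misalignment(const void *p, unsigned access_size)
{
    return (unsigned)((uintptr_t)p & (access_size - 1));
}
```

A harness would then call each Munge function on staggered_buffer(0) through staggered_buffer(7) and time every run separately.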

These tests were performed on an 800 MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:

Listing 1. Munging data one byte at a time

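#include <stdint.h>

void Munge8( void *data, uint32_t size ) {
    uint8_t *data8 = (uint8_t*) data;
    uint8_t *data8End = data8 + size;
    
    while( data8 != data8End ) {
        /* Negate in place, then advance; doing both in one expression
           would modify and read data8 without sequencing. */
        *data8 = -*data8;
        data8++;
    }
}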
      

It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time -- which will halve the number of memory accesses:

Listing 2. Munging data two bytes at a time

void Munge16( void *data, uint32_t size ) {
    uint16_t *data16 = (uint16_t*) data;
    uint16_t *data16End = data16 + (size >> 1); /* Divide size by 2. */
    uint8_t *data8 = (uint8_t*) data16End;
    uint8_t *data8End = data8 + (size & 0x00000001); /* Strip upper 31 bits. */
    
    while( data16 != data16End ) {
        *data16 = -*data16;
        data16++;
    }
    /* Handle the odd trailing byte, if any. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}

This function took 48,765 microseconds to process the same ten-megabyte buffer -- 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds -- about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:

Figure 7. Single-byte access versus double-byte access


The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.

Now up the ante, and process the buffer four bytes at a time:

Listing 3. Munging data four bytes at a time

void Munge32( void *data, uint32_t size ) {
    uint32_t *data32 = (uint32_t*) data;
    uint32_t *data32End = data32 + (size >> 2); /* Divide size by 4. */
    uint8_t *data8 = (uint8_t*) data32End;
    uint8_t *data8End = data8 + (size & 0x00000003); /* Strip upper 30 bits. */
    
    while( data32 != data32End ) {
        *data32 = -*data32;
        data32++;
    }
    /* Handle up to three trailing bytes. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}

This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds, respectively. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:

Figure 8. Single- versus double- versus quad-byte access

Now for the horror story: processing the buffer eight bytes at a time.

Listing 4. Munging data eight bytes at a time

void Munge64( void *data, uint32_t size ) {
    double *data64 = (double*) data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*) data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */
    
    while( data64 != data64End ) {
        *data64 = -*data64;
        data64++;
    }
    /* Handle up to seven trailing bytes. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}
      

Munge64 processes an aligned buffer in 39,085 microseconds -- about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds -- two orders of magnitude slower than aligned access, an outstanding 4,610% performance penalty!

What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:

Figure 9. Multiple-byte access comparison


The penalties for one-, two- and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Maybe this chart, removing the top (and thus the tremendous gulf between the two numbers), will be clearer:

Figure 10. Multiple-byte access comparison #2


There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:

Figure 11. Multiple-byte access comparison #3


Notice that accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.

Atomicity

All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible -- that's why they're so handy for synchronization: they can't be preempted.

It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.

If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.

Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption they were preempted. These two circumstances combine to where your program will go into an infinite loop if you attempt to atomically store to an unaligned address. Oops.

Altivec

Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.

Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.

There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
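The bit-stripping described above amounts to masking off the low four bits of the address. Here is a small sketch of that arithmetic in plain C; it does not use any Altivec intrinsics, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Address that would actually be used once the low four bits are
 * ignored: the input rounded down to a sixteen-byte boundary. */
uintptr_t altivec_effective_address(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xF;
}

/* Nonzero if addr is already sixteen-byte aligned, so nothing is stripped. */
int is_vector_aligned(uintptr_t addr)
{
    return (addr & 0xF) == 0;
}
```

For instance, an address of 0x1007 would silently become 0x1000, which is why checking is_vector_aligned before handing over a pointer is a cheap safeguard.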

This is not to say Altivec can't process unaligned memory. You can find detailed instructions on how to do so in the Altivec Programming Environments Manual (see Resources). It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.

Structure alignment

Examine the following structure:

Listing 5. An innocent structure

typedef struct {
    char    a;
    long    b;
    char    c;
}   Struct;

What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
long         b            1              4            5
char         c            5              1            6
Total Size in Bytes: 6

However, if you were to ask your compiler to sizeof( Struct ), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.

First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!

So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
padding                   1              1            2
long         b            2              4            6
char         c            6              1            7
padding                   7              1            8
Total Size in Bytes: 8

Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.

The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
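Rather than guessing at the layout, you can ask the compiler for it. The sketch below assumes a C11 compiler (for _Alignof) and uses the standard offsetof macro; print_struct_layout is a name made up for this example, and the numbers it prints depend on the platform and compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char a;
    long b;
    char c;
} Struct;

/* Print the offset of each field and the total size as laid out by the
 * compiler in use. The values differ between platforms. */
void print_struct_layout(void)
{
    printf("a: offset %zu, size %zu\n", offsetof(Struct, a), sizeof(char));
    printf("b: offset %zu, size %zu\n", offsetof(Struct, b), sizeof(long));
    printf("c: offset %zu, size %zu\n", offsetof(Struct, c), sizeof(char));
    printf("sizeof(Struct) = %zu, alignment = %zu\n",
           sizeof(Struct), (size_t)_Alignof(Struct));
}
```

Whatever the platform, b should land on a multiple of the alignment of long, and the total size should be a multiple of the structure's own alignment.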

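Chinese-language code and memory-layout walkthrough

The key to understanding memory alignment is to draw it out! The examples below, taken from the Chinese blogs, show how.

If the English explanation above is hard to follow, the explanation from http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html may help.

We start with a program: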

// Environment: VC6 + Windows SP2
// Program 1
#include <iostream>

using namespace std;

struct st1
{
    char a ;
    int  b ;
    short c ;
};

struct st2
{
    short c ;
    char  a ;
    int   b ;
};

int main()
{
    cout<<"sizeof(st1) is "<<sizeof(st1)<<endl;
    cout<<"sizeof(st2) is "<<sizeof(st2)<<endl;
    return 0 ;
}

The program prints:

sizeof(st1) is 12
sizeof(st2) is 8

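So here is the question: these two structs have the same members, so why does sizeof report different sizes?

The main purpose of this section is to answer that question.

The answer is memory alignment: it is alignment that makes the results differ.

For most programmers, memory alignment is largely transparent. It is the compiler's job: the compiler places each data item in the program at a suitable position, and as a result structs with the same members declared in a different order can have different sizes.

So why does the compiler align memory? By naive reckoning, sizeof(st1) and sizeof(st2) in Program 1 should both be 7: 4 (int) + 2 (short) + 1 (char) = 7. After alignment, the structs actually take up more space.

Before explaining what alignment is for, look at the alignment rules:

1. The first member of a struct is at offset 0. The offset of every subsequent member must be a multiple of min(the value given to #pragma pack(), the size of that member).

2. After the members have each been aligned, the struct (or union) itself is aligned: its total size is padded to a multiple of the smaller of the #pragma pack value and the size of the largest member of the struct (or union).

#pragma pack(n) sets n-byte alignment. VC6 defaults to 8-byte alignment.

Taking Program 1 as an example:

st1: The char takes one byte and starts at offset 0. The int takes 4 bytes, and min(#pragma pack value, size of this member) = 4 (VC6 defaults to 8-byte alignment), so the int is 4-byte aligned and its offset must be a multiple of 4. It therefore starts at offset 4, and the compiler inserts 3 padding bytes after the char that hold no data. The short takes 2 bytes and is 2-byte aligned; its offset is 8, already a multiple of 2, so no padding is needed. That completes member alignment under rule 1, and memory now looks like this:

oxxx|oooo|oo

0123 4567 89 (address)

(x marks an added padding byte)

The struct takes 10 bytes so far. It must still be aligned as a whole, to the smaller of the #pragma pack value and the size of its largest member. The largest member of st1 is the int at 4 bytes, and the default #pragma pack value is 8, so the struct is 4-byte aligned and its total size must be a multiple of 4. Two padding bytes are added, bringing the total to 12. Memory now looks like this:

oxxx|oooo|ooxx

0123 4567 89ab  (address)

Alignment is now complete. st1 occupies 12 bytes rather than 7.

st2 is aligned the same way; working it out is left to the reader.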
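The two rules can be turned into a small calculator, which also settles the st2 case. This is a sketch under a simplifying assumption: every member is a scalar whose natural alignment equals its size, which matches char, short, and int here. The names packed_struct_size and min_size are made up for this illustration and do not correspond to any compiler API.

```c
#include <assert.h>
#include <stddef.h>

size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Size of a struct whose members have the given sizes, declared in that
 * order, under #pragma pack(pack). Assumes each member is a scalar whose
 * natural alignment equals its size. */
size_t packed_struct_size(const size_t *member_sizes, size_t count, size_t pack)
{
    assert(count > 0 && pack > 0);
    size_t offset = 0;     /* next free byte                       */
    size_t max_align = 1;  /* largest member alignment seen so far */
    for (size_t i = 0; i < count; i++) {
        size_t align = min_size(pack, member_sizes[i]);  /* rule 1 */
        offset = (offset + align - 1) / align * align;   /* pad up to a multiple of align */
        offset += member_sizes[i];
        if (align > max_align)
            max_align = align;
    }
    /* Rule 2: pad the whole struct to a multiple of
     * min(pack, size of the largest member). */
    return (offset + max_align - 1) / max_align * max_align;
}
```

For st1 (sizes 1, 4, 2) with a pack value of 8 this gives 12. For st2 (sizes 2, 1, 4) the short sits at offset 0, the char at 2, one padding byte follows, and the int starts at 4, giving 8.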

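Here is another example, from http://www.cppblog.com/cc/archive/2006/08/01/10765.html

Memory alignment

In our programs, data structures and variables all occupy memory. Many systems require memory to be aligned when it is allocated, and the benefit is faster memory access.

Let's start with a simple program: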

Program 1

#include <iostream>
using namespace std;

struct X1
{
  int i;    // 4 bytes
  char c1;  // 1 byte
  char c2;  // 1 byte
};

struct X2
{
  char c1;  // 1 byte
  int i;    // 4 bytes
  char c2;  // 1 byte
};

struct X3
{
  char c1;  // 1 byte
  char c2;  // 1 byte
  int i;    // 4 bytes
};

int main()
{
    cout<<"long "<<sizeof(long)<<"\n";
    cout<<"float "<<sizeof(float)<<"\n";
    cout<<"int "<<sizeof(int)<<"\n";
    cout<<"char "<<sizeof(char)<<"\n";

    X1 x1;
    X2 x2;
    X3 x3;
    cout<<"size of x1 "<<sizeof(x1)<<"\n";
    cout<<"size of x2 "<<sizeof(x2)<<"\n";
    cout<<"size of x3 "<<sizeof(x3)<<"\n";
    return 0;
}

The program is simple. It defines three structs, X1, X2, and X3, which differ only in the order of their members. It prints the number of bytes taken by several basic types and by the three structs.

The program prints:

long 4
float 4
int 4
char 1
size of x1 8
size of x2 12
size of x3 8

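The first four lines are unsurprising, but the last three show that the three structs take up different amounts of space. The cause is the order of the members. How can that be?

This is where memory alignment comes in.

Memory is one contiguous block, which we can picture as a row of cells grouped into alignment units of 4 bytes each:

Figure 1

Now look at how the three structs are laid out in memory.

First X1 (figure: layout of X1):

The first member of X1 is an int. It takes 4 bytes and so fills the first four cells. The second member is a char, which takes only one byte, so it occupies the first cell of the second 4-byte unit. The third is also a char, taking one byte, and it goes into the second cell of the second unit. The two chars together fit within one unit, so that is how the three members are arranged. Because memory is aligned in units, the resulting size is 8 rather than 6: the last two cells count as used as well.

Next X2 (figure: layout of X2):

The first member of X2 is a char. It takes one byte and goes into the first cell of the first unit. The second member is an int, which takes 4 bytes. The first unit has already lost one cell and has 3 left, which cannot hold the int, so because of alignment the int has to go into the second unit. The third member is a char and is handled like the first, in a unit of its own. Because memory is aligned in units, the struct takes 12 cells instead of 8.

Then X3 (figure: layout of X3):

X3 is much like X1, except that the two one-byte members come first. After the two cases above it should be easy to follow.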
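The effect of member order can be measured directly as padding: the struct's size minus the sizes of its members. Below is a rough C sketch; the macro PADDING_BYTES and the function print_padding are invented for this example, and the exact numbers depend on the compiler and platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct X1 { int i;   char c1; char c2; };
struct X2 { char c1; int i;   char c2; };
struct X3 { char c1; char c2; int i;   };

/* Bytes of padding in one of the structs above: its total size minus the
 * combined size of its three members. */
#define PADDING_BYTES(type) (sizeof(type) - (sizeof(int) + 2 * sizeof(char)))

void print_padding(void)
{
    printf("X1: size %zu, padding %zu\n", sizeof(struct X1), PADDING_BYTES(struct X1));
    printf("X2: size %zu, padding %zu\n", sizeof(struct X2), PADDING_BYTES(struct X2));
    printf("X3: size %zu, padding %zu\n", sizeof(struct X3), PADDING_BYTES(struct X3));
}
```

On a typical platform with a 4-byte int, X1 and X3 carry 2 bytes of padding and X2 carries 6, which is why keeping small members next to each other tends to save space.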


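Many people know that memory alignment is the cause, but few explain the basic principle behind it. The authors above have done so.

3. Possible consequences of not understanding memory alignment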

  • Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.
  • Your application may attempt to atomically store to an unaligned address, causing your application to lock up.
  • Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.

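4. Memory alignment rules

I. Why memory is aligned

Most references say the following:

1. Platform (portability) reasons: not every hardware platform can access arbitrary data at arbitrary addresses. Some platforms can fetch certain types of data only at certain addresses and raise a hardware exception otherwise.

2. Performance reasons: data structures (especially the stack) should be aligned on natural boundaries wherever possible. Accessing unaligned memory requires the processor to make two memory accesses, whereas an aligned access needs only one.

II. Alignment rules

Every compiler on a given platform has its own default "alignment factor" (also called the alignment modulus). Programmers can change it with the directive #pragma pack(n), n = 1, 2, 4, 8, 16, where n is the alignment factor you want.

Rules:

1. Member alignment: for the data members of a struct (or union), the first member is placed at offset 0, and each subsequent member is aligned to the smaller of the #pragma pack value and that member's own size.

2. Overall alignment: after the members have each been aligned, the struct (or union) itself is aligned to the smaller of the #pragma pack value and the size of its largest member.

3. It follows from 1 and 2 that when the n given to #pragma pack equals or exceeds the size of every data member, the value of n has no effect.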

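III. Experiments

The following examples demonstrate these rules in detail.

Compilers: GCC 3.4.2, VC6.0

Platform: Windows XP

Typical struct alignment

struct definition:

#pragma pack(n)
struct test_t {
 int a;
 char b;
 short c;
 char d;
};
#pragma pack()

First confirm the size of each type on the test platform. Both compilers report:

sizeof(char) = 1
sizeof(short) = 2
sizeof(int) = 4

The procedure is: change the alignment factor with #pragma pack(n), then check the value of sizeof(struct test_t).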

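1. 1-byte alignment (#pragma pack(1))

Result: sizeof(struct test_t) = 8 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(1)
struct test_t {
 int a;   /* offsets 0-3 */
 char b;  /* offset 4 */
 short c; /* offsets 5-6, aligned to min(1, 2) = 1 */
 char d;  /* offset 7 */
};
#pragma pack()

Total size of members = 8

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 1) = 1

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 8 [Note 1]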

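2. 2-byte alignment (#pragma pack(2))

Result: sizeof(struct test_t) = 10 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(2)
struct test_t {
 int a;   /* offsets 0-3 */
 char b;  /* offset 4 */
 short c; /* offsets 6-7, aligned to min(2, 2) = 2 */
 char d;  /* offset 8 */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 2) = 2

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 10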

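3. 4-byte alignment (#pragma pack(4))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(4)
struct test_t {
 int a;   /* offsets 0-3 */
 char b;  /* offset 4 */
 short c; /* offsets 6-7, aligned to min(4, 2) = 2 */
 char d;  /* offset 8 */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 4) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12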

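4. 8-byte alignment (#pragma pack(8))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(8)
struct test_t {
 int a;   /* offsets 0-3 */
 char b;  /* offset 4 */
 short c; /* offsets 6-7, aligned to min(8, 2) = 2 */
 char d;  /* offset 8 */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 8) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12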

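5. 16-byte alignment (#pragma pack(16))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(16)
struct test_t {
 int a;   /* offsets 0-3 */
 char b;  /* offset 4 */
 short c; /* offsets 6-7, aligned to min(16, 2) = 2 */
 char d;  /* offset 8 */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 16) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12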

The 8-byte and 16-byte experiments confirm rule 3: "when the n given to #pragma pack equals or exceeds the size of every data member, the value of n has no effect."
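The experiments can be repeated on a current compiler with compile-time checks. This sketch makes a few assumptions: it uses the push/pop form of #pragma pack, a compiler extension accepted by GCC, Clang, and MSVC; it swaps int and short for int32_t and int16_t so the member sizes are fixed; and the struct names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push, 1)
struct test_pack1 { int32_t a; char b; int16_t c; char d; };
#pragma pack(pop)

#pragma pack(push, 2)
struct test_pack2 { int32_t a; char b; int16_t c; char d; };
#pragma pack(pop)

#pragma pack(push, 4)
struct test_pack4 { int32_t a; char b; int16_t c; char d; };
#pragma pack(pop)

#pragma pack(push, 8)
struct test_pack8 { int32_t a; char b; int16_t c; char d; };
#pragma pack(pop)

/* Compilation fails if the compiler disagrees with the experiments. */
_Static_assert(sizeof(struct test_pack1) == 8,  "pack(1) should give 8 bytes");
_Static_assert(sizeof(struct test_pack2) == 10, "pack(2) should give 10 bytes");
_Static_assert(sizeof(struct test_pack4) == 12, "pack(4) should give 12 bytes");
_Static_assert(sizeof(struct test_pack8) == 12, "pack(8) should give 12 bytes");
```

Using push and pop restores whatever packing was in effect beforehand, which is safer in headers than a bare #pragma pack(n).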

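Memory allocation and alignment are complicated. They depend heavily on the implementation, and the rules differ across operating systems, compilers, and hardware platforms. Most current systems and languages manage and allocate memory automatically and hide the low-level details, which makes application programming much simpler, so programmers no longer need to think about memory layout in detail. In system-level or driver-level development, however, and in hard-real-time or high-security programs, memory layout remains the foundation of a stable, safe, and efficient program.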

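[Note 1]

What is "rounding up"?

For example, in the overall-alignment step of the 8-byte experiment above, overall size = 9 rounded up to a multiple of 4 = 12.

The process: start at 9 and add one at a time, checking whether the value is divisible by 4. 9, 10, and 11 are not; 12 is, so rounding stops there.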
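The rounding procedure in the note translates directly into code. The sketch below shows the step-by-step version alongside the closed-form expression commonly used instead; both function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align by adding one at a time,
 * exactly as the note describes. */
size_t round_up_stepwise(size_t size, size_t align)
{
    assert(align > 0);
    while (size % align != 0)
        size++;
    return size;
}

/* Closed-form equivalent: same result without the loop. */
size_t round_up(size_t size, size_t align)
{
    assert(align > 0);
    return (size + align - 1) / align * align;
}
```

Both return 12 for a size of 9 and an alignment of 4, and leave a value that is already a multiple unchanged.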

5. Authors and sources

Jonathan Rentzsch, http://www.ibm.com/developerworks/library/pa-dalign/
http://www.cppblog.com/cc/archive/2006/08/01/10765.html (an excellent explanation in Chinese)
http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html (a digest of the English article, worth reading)
http://blogold.chinaunix.net/u3/118340/showart_2615855.html