Problem: How to coordinate activities of thousands of cores

Research problems in 1000-core architecture and OS

Ying Li

This problem is macro in scope and challenging to solve, even with careful analysis and sufficient insight into both parallel architectures and system software. I try to analyze the problem from several aspects, but so many stochastic parameters affect the direction of the trend that some of my predictions may fail in the future. As this essay grew larger and larger, I had to divide it into a series of passages so that potential readers can plan their reading.

Outline

Passage 1 introduces the topic from the perspective of synchronization and communication methods;

Passage 2 introduces the hardware architecture revolution and the incoming problems for operating systems;

Passage 3 analyzes Problem 1 (P1) philosophically and introduces the philosophical framework that aims to address it;

Passage 4 analyzes traditional time-sharing OSes under the philosophical framework and introduces the 2OCN principle;

Passage 5 analyzes two CC-convergent monolithic-family OS instances from UC Berkeley: ROS and Tessellation;

Passage 6 analyzes a Non-CC-convergent microkernel-family OS from ETH Zurich: Barrelfish; discusses coherence;

Passage 7 analyzes another Non-CC-convergent microkernel-family OS from MIT: fos; discusses the 1000-core chip;

Passage 8 (to come) analyzes an experimental hardware platform from Intel: SCC, the Single-chip Cloud Computer;

Passage 9 (to come) analyzes another experimental hardware platform from MIT: ATAC;

Passage A (to come) analyzes a hardware platform product from Tilera: TILE-Gx100;

Passage B (to come) analyzes a parallel programming model, MPI, and investigates message passing in a 1000-core chip;

Passage C (to come) analyzes a parallel programming model, OpenMP, and investigates shared memory in a 1K-core chip;

Passage D (to come) analyzes a parallel programming model, PVM, which I regard highly as a potential target;

Passage E (to come) discusses what system software researchers can do in the architecture revolution in general;

Passage F (to come) will open a gate to a DYNAMIC series of fine-grained problems to RESEARCH in detail;

Passage G (to come) is the END of this series of passages.

Passage 1

Classic concurrent and parallel programs involve two basic forms of processor/core synchronization: mutual exclusion and producer-consumer. Both are conventionally implemented with locks that spin-wait on a value stored in shared memory and accessed through private caches. As we know, caches between the cores and DRAM reduce access latency and mitigate bandwidth contention, but they introduce the coherence problem. The access path stated above, together with the spinning nature of the locks, inherently causes coherence traffic and wastes power. So traditional locks depend on cache coherence, and as the number of cores increases, coherence traffic makes memory bandwidth contention more severe. Other methods, such as transactional memory and message passing, could be used to solve mutual exclusion and producer-consumer synchronization separately. But traditional OS kernels rely heavily on shared memory (a shared address space); scaling them to harness many cores with a more complicated, structured cache/memory hierarchy, without major redevelopment, will be difficult, perhaps impossible, although some researchers at MIT found they could remove some kernel bottlenecks by modifying the kernel and applications slightly, supporting up to 48 cores in a shared address space.
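To make the coherence cost of spinning concrete, here is a minimal sketch (my own illustration, not taken from any of the systems discussed) of a test-and-test-and-set spinlock in C11. The inner read-only loop spins inside the waiter's private cache; each atomic exchange is a coherence transaction that steals the cache line from every other core, which is exactly the traffic described above.

```c
/*
 * A test-and-test-and-set spinlock built on C11 atomics (illustrative
 * sketch). The inner loop of plain loads spins in the waiter's private
 * cache; each atomic_exchange invalidates the line in all other caches.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { atomic_bool held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Test-and-set: acquires the line exclusively (coherence traffic). */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return;
        /* Spin on plain loads: wait in our own cache until the holder's
         * unlock invalidates the line. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}

static spinlock_t lock;
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);
        counter++;                       /* the protected critical section */
        spin_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* expect 400000 */
    return 0;
}
```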

Consider the two basic communication methods between cores/processors provided by parallel hardware architectures, including SMPs, ccNUMAs, and custom clusters: shared memory and message passing. According to John L. Hennessy and David A. Patterson, implementing message passing on top of shared memory is classic, but the reverse direction is hard. Yet hardware designers now tend to abandon support for a cache-coherent shared address space, arguing that they hit a coherence wall in OCN-based many-core architectures. We could implement software-managed cache coherence protocols to support a globally coherent shared address space, just as we did in DSM architectures, but limitations of hardware support make this challenging. Some hardware researchers even propose giving up coherence, or avoiding coherence-reliant programming paradigms, through message passing or other forms of sharing that are not strictly coherent. For traditional general-purpose application programmers, the lack of cache-coherent shared memory makes life hard (at the least, it reduces programming productivity) if they are not ready to write parallel programs using an MPI-like message passing library, which originally aimed mainly at high-performance computing.
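To illustrate the classic direction Hennessy and Patterson describe, here is a hedged sketch of message passing implemented on top of shared memory: a single-producer/single-consumer ring buffer in C11 whose slots and indices live in ordinary coherent shared memory. The buffer size and names are my own choices for the example.

```c
/*
 * Message passing built on shared memory (illustrative sketch): a lock-free
 * single-producer/single-consumer ring buffer. The payload array and the
 * two indices are plain shared, cache-coherent memory.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64   /* chosen for the example; any power of two works */

typedef struct {
    uint64_t payload[RING_SLOTS];
    atomic_size_t head;              /* next slot the consumer reads  */
    atomic_size_t tail;              /* next slot the producer writes */
} ring_t;

/* Producer side: returns false when the ring is full. */
static bool ring_send(ring_t *r, uint64_t msg) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false;                               /* full */
    r->payload[tail % RING_SLOTS] = msg;
    /* Release ordering publishes the payload before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_recv(ring_t *r, uint64_t *msg) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                               /* empty */
    *msg = r->payload[head % RING_SLOTS];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

int main(void) {
    static ring_t ring;              /* zero-initialized: empty ring */
    uint64_t out = 0;
    ring_send(&ring, 42);
    return (ring_recv(&ring, &out) && out == 42) ? 0 : 1;
}
```

The reverse direction, recovering a coherent shared address space on top of raw messages, requires a software coherence protocol, which is why it is the hard direction.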

Passage 2

There are many reasons why traditional OSes cannot scale up to a 1000-core architecture, even with a coherent cache/memory access path; the shift itself is a revolution in hardware architecture. This revolution may impact the whole software stack, up to the top-layer applications that interact directly with end users, such as graphical financial analytics, 3D games, and tools like IDEs; even basic programming paradigms may switch from concurrency (time-sharing, sometimes) to truly parallel manners, which are common in HPC and scientific computing.

System software plays a very important role in this architecture and programming-paradigm revolution, because it is the bridge between low-level hardware and high-level applications. This bridging role also makes the work challenging, due to its bidirectional nature: researchers need to harness the more complex, inherently parallel hardware while providing simple, easy-to-program interfaces and a stable run-time environment to applications. The problem becomes severe because neither of the two ends is mature or clear. These two tasks naturally fall to operating systems. I restate them below:

P1: Harness the new parallel architecture to coordinate 1000 cores, which is actually an MPP embedded in an OCN.

P2: Provide an efficient run-time environment and effective programming support.

Notes: MPP: Massively Parallel Processor; OCN: On-Chip Network; efficient: high performance per core; effective: easy to use.

Passage 3

A THOUSAND is a relatively large number in human minds, not in dollars but in management, from a psychological and philosophical perspective. To harness such an amount of activity, the method I can think of is divide-and-conquer. But dividing is a process from at most 1000 down to at least 1, so what is critical is when to stop dividing and how to conquer the resulting structural dividing tree. The philosophical question is whether we need a central coordinator, or government, to control the behavior of the whole tree. In traditional OSes with a monolithic kernel, the governing role is played by the scheduler, which coordinates, or schedules, all application processes by distributing time slices to them according to clever strategies that balance needs of different forms. This scenario resembles republic countries; the second form, granting freedom to fine-grained individuals, resembles united countries. In the computing world, distributed systems built on networks of different sizes share this second nature; the Internet, for example, really is a self-governed system. Research pioneers took advantage of this idea by introducing the concept of the microkernel into the OS design space: the services that a traditional OS provides as system calls, which are based on context switches, are separated into standalone servers/daemons, which are inherently parallel concepts. In the single-processor/multiprocessor era, servers within an OS-scope box naturally differ from servers on a network, because they execute concurrently, which is not a true parallel form. Maybe this is also the reason why microkernel OSes did not prevail in the industrial OS market. But the prime time for the microkernel seems to be coming, due to the trend toward 1000-core architectures. I have to summarize the track of analysis here in order to talk further. The philosophical framework of OS design:

The philosophical method of coordinating the behavior of 1000 cores is Divide-And-Conquer; the critical techniques are H2D (how to divide) and H2C (how to conquer). How-to-conquer depends on how-to-divide, and the critical decision is whether a CC (central coordinator) is involved. A CC is the key tool of traditional monolithic OSes. Non-CC/Loose-CC is one of the key philosophies of microkernel OSes.

Notes: the concepts here are the personal view of the author. Non-CC for microkernels may not be accurate; a deeper investigation is needed.

Passage 4

We can view this philosophy from a historical perspective. In the incoming many-core era, the most basic computing resource, the processor/core, will no longer be scarce; the ratio of processors to users (P/U) will be increased greatly by the arrival of the 1000-core chip. But in the past single-processor era, the processor was unique while the demands of users and applications were many in number, so the problem of how to divide (H2D) seemed hard: how do you divide one processor? Of course we cannot divide one processor in quantity, but we can divide the computing resource in time, because the processor is in effect the program counter, mathematically a discrete sequence of numbers, here memory addresses in the von Neumann architecture. Time-sharing opened the era of concurrent programming / multiprogramming; a novel module in UNIX, the scheduler, played the role of the CC and also accomplished the hard H2D task of that time. In the current multiprocessor era, UNIX-like and other traditional OSes based on monolithic kernels scale to several cores easily, because the change from one core to four or six is not severe in essence. In addition, the networking revolution did not disturb the pattern, because interconnection happens outside the chips and has no reason to destroy the working order of OS kernels; actually, what we did was combine OS boxes, not divide them. What we have to notice is the resemblance between traditional networks (or off-chip networks, whose abbreviation is also OCN) and on-chip networks (OCN). A phenomenon has emerged: this resemblance gives OS researchers strong motivation to carry results from off-chip-network distributed systems into their on-chip-network OS designs. Actually, I think the two OCNs are, at some level, the same problem. To emphasize this resemblance and the phenomenon, I will call it the 2OCN resemblance, the 2OCN principle, or just 2OCN in later discussions. Note that the microkernel was also motivated by 2OCN.
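As a toy illustration of dividing in time (my own sketch, not from the essay's sources), here the CC is a round-robin loop that grants one fixed time slice per turn to each runnable task; the task names and tick counts are invented.

```c
/*
 * Dividing the single processor in time (illustrative sketch): a central
 * coordinator hands out one time slice per turn, round-robin, until every
 * task finishes.
 */
#include <stdio.h>

typedef struct {
    const char *name;
    int remaining_ticks;      /* work left, measured in time slices */
} task_t;

int main(void) {
    task_t tasks[] = { {"editor", 3}, {"compiler", 5}, {"shell", 2} };
    const int ntasks = (int)(sizeof tasks / sizeof tasks[0]);
    int alive = ntasks;

    /* The CC: one processor divided in time, one slice per runnable task. */
    while (alive > 0) {
        for (int i = 0; i < ntasks; i++) {
            if (tasks[i].remaining_ticks == 0)
                continue;                           /* task already done */
            printf("slice -> %s\n", tasks[i].name); /* a "context switch" */
            if (--tasks[i].remaining_ticks == 0)
                alive--;
        }
    }
    return 0;
}
```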

Passage 5

Before we entered the many-core era, OS researchers had already been considering an OS revolution. As we can predict from the philosophy stated above, two kinds of views exist, although perhaps not so closely convergent to the base scenarios of the philosophy. But that is sufficient to help us draw the big picture of current research progress and figure out the underlying philosophy behind different design decisions.

In the CC-convergent genre, excluding Linux and other traditional OSes (including attempts to scale them to 1000 cores by modifying them slightly), ROS (akaros) and Tessellation are representatives. Both come from UC Berkeley, and Tessellation was first implemented upon ROS. Their ideas resemble each other, but Tessellation may be the more sophisticated design, so I discuss Tessellation below and take it to cover the other. To answer H2D, Tessellation introduces a novel concept called the Cell, which is its dividing result; its H2D methodology is STP, Space-Time Partitioning. To answer H2C, it introduces a method called Two-Level Scheduling, meaning scheduling outside and inside the Cells. At this point, I am happy to have abstracted their design motivations under my philosophical framework successfully (this sentence will be removed in serious publications).
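To show how I read the Cell and Two-Level Scheduling ideas, here is a hedged sketch under my own assumptions: an outer level partitions cores in space among Cells (the H2D step), and each Cell runs its own scheduler over the cores it owns (the H2C step). All names here (cell_t, assign_cores, inner_schedule) are hypothetical, not Tessellation's actual API.

```c
/*
 * Two-Level Scheduling as I read it (hypothetical sketch, not
 * Tessellation's API): an outer spatial partition among Cells, then an
 * inner per-Cell scheduler.
 */
#include <stdio.h>

#define NCORES 16

typedef struct {
    const char *name;
    int cores[NCORES];        /* core ids granted to this Cell */
    int ncores;
} cell_t;

/* First level: a spatial partition of the machine among Cells (H2D). */
static void assign_cores(cell_t *cells, int ncells) {
    int per_cell = NCORES / ncells;
    for (int c = 0; c < ncells; c++) {
        cells[c].ncores = per_cell;
        for (int i = 0; i < per_cell; i++)
            cells[c].cores[i] = c * per_cell + i;
    }
}

/* Second level: each Cell schedules inside its own partition (H2C). */
static void inner_schedule(const cell_t *cell) {
    for (int i = 0; i < cell->ncores; i++)
        printf("%s: run user-level work on core %d\n",
               cell->name, cell->cores[i]);
}

int main(void) {
    cell_t cells[2] = { { .name = "audio" }, { .name = "analytics" } };
    assign_cores(cells, 2);
    for (int c = 0; c < 2; c++)
        inner_schedule(&cells[c]);
    return 0;
}
```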

Passage 6

In the Non-CC/Loose-CC-convergent genre, Barrelfish and fos are representatives. Barrelfish comes from ETH Zurich and fos comes from MIT. Both employ the idea of the microkernel, and both naturally try to utilize the 2OCN principle.

Barrelfish introduces the concept of the multikernel to cope with possibly diverse (or, in my own understanding, asymmetric) architectures, including memory hierarchies, interconnects, and even instruction sets. I think this probability is really small, because the complexity of integrating so many asymmetric things on a small chip would go beyond human control. Hardware researchers tend to give up coherence support precisely because of the complexity of connecting too many coherent caches and cores together. A controlled diversity may occur, but not one so severe, on true OCN platforms.

Barrelfish also over-emphasizes the 2OCN principle: it gives up shared memory at the lowest level, using pure message passing instead. There is no doubt that hardware features will shape the design space of system software, especially the OS. The Barrelfish designers seem to try their best to get out from under the limitations of hardware, especially its most variable part, cache-coherent shared memory. They chose the ultimate method of escaping the impact of its coming absence or variance, believing that coherence is neither necessary nor sufficient. But I have to say this kind of design is not friendly to application programmers; it just leaves the problem to upper-layer programs or other programmers.

As for cache coherence, maybe we will have to give it up in the future, whether hardware-managed or software-managed, unless we can find an efficient communication channel with sufficient bandwidth, or a clever method to separate and balance the bandwidth contention. The nature of the problem does not change in the software-managed form: it is still contention for channel bandwidth between cache/buffered storage and main memory. No method can escape this obstacle as cores increase greatly in number. The Barrelfish multikernel seems to be more portable than scalable, although scalability is what its researchers expected.
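A back-of-envelope calculation, with numbers I assume purely for illustration, shows the wall: if each of 1000 cores running at 1 GHz misses to memory only once every 1000 cycles, each core generates 10^6 misses per second; at 64 bytes per cache line, that is 64 MB/s per core, or 64 GB/s of aggregate sustained traffic before any coherence overhead, already beyond what a typical DRAM interface of the day can deliver.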

Passage 7

However, Barrelfish is a little more sophisticated than fos. According to the prediction of the fos authors, this year (2012) we would have 2000+ cores on a single chip; this is not the situation at all. The problem of scaling a chip to 1000 cores is not just the problem of putting more transistors on smaller chips under Moore's Law from Intel. The problem is how to connect cores to cores, and cores to memory, through high-bandwidth channels. Hardware designers see this as an optimization problem: how best to connect the cores. Maybe this is why they give up coherence in order to interconnect more and more cores; they trade off! High-bandwidth channels and connection methods may not be implementable in silicon, and maybe this is the time for physics to perform: researchers at MIT use optics as the communication medium to address the problem in a project called ATAC, and the result seems promising.

Back to the design of fos, which stands for factored operating system: its researchers use space-sharing in place of time-sharing in order to scale to many cores. Fleets of servers or applications bound to cores communicate through message passing across fleets and through shared memory within fleets. This is a neutral design in communication manners, which is a little impractical once hardware implementations are taken into account; naturally, they simulate their system in QEMU, a virtual environment.

For designs classified into the microkernel family and featuring Non-CC/Loose-CC, the H2D strategy, or criterion, is dividing, or grouping, by function, that is, by service type. Self-controlled servers, dynamically or statically bound to cores and interacting little with the microkernel, provide different classes of services to applications through message passing. For H2C, the case is really Non-CC/Loose-CC style: the service providers do not have much to communicate to each other, and seldom talk to the microkernel, as the sketch below illustrates.
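Here is a minimal sketch, under my own assumptions, of H2D by function: servers are grouped into fleets by service type and bound to cores, and an application's request is a message routed to a member of the matching fleet. The service names, fleet layout, and round-robin routing policy are all invented for the illustration; this is not fos's actual interface.

```c
/*
 * H2D by function (illustrative sketch): fleets of servers grouped by
 * service type and bound to cores; a request is routed to the matching
 * fleet. Names and routing policy are invented.
 */
#include <stdio.h>

enum service { SVC_FS, SVC_NET, SVC_MEM, SVC_COUNT };

typedef struct {
    enum service type;
    int core;                 /* the core this server is bound to */
} server_t;

/* A static fleet layout: a few servers per service class (space-sharing). */
static const server_t fleet[] = {
    { SVC_FS,  1 }, { SVC_FS,  2 },
    { SVC_NET, 3 }, { SVC_NET, 4 },
    { SVC_MEM, 5 },
};
static const int nservers = (int)(sizeof fleet / sizeof fleet[0]);

/* Route a request to some member of the right fleet (round-robin here). */
static int route(enum service type) {
    static int next[SVC_COUNT];       /* per-fleet round-robin cursor */
    int members[8], n = 0;
    for (int i = 0; i < nservers; i++)
        if (fleet[i].type == type)
            members[n++] = fleet[i].core;
    if (n == 0)
        return -1;                    /* no fleet offers this service */
    return members[next[type]++ % n];
}

int main(void) {
    printf("fs request  -> server on core %d\n", route(SVC_FS));
    printf("fs request  -> server on core %d\n", route(SVC_FS));
    printf("net request -> server on core %d\n", route(SVC_NET));
    return 0;
}
```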

The end

More passages will come soon. Thank you for reading this far! For any suggestions, please contact me at [email protected] or [email protected]

All rights reserved.
