Multicore Software Sensing Δ 8th of September 2014 Ω 11:37 AM

ΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞ
yourDragonXi~ Multicore Association
yourDragonXi~ Shift to multi-core DSP Solutions
yourDragonXi~ sense for Ξ

ξ
ξ
ξ
«Software Sensing
Θ

Θ
ΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞΞ
































































~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
yourDragonXi ~ Multicore Association

»Multicore Association

MULTICORE COMMUNICATIONS API WORKING GROUP

Objective
ξ provides a standardized API for communication and synchronization between closely distributed cores and/or processors in embedded systems

Overview
ξ the purpose of MCAPI, which is a message-passing API,
ξ is to capture the basic elements of communication and synchronization
ξ that are required for closely distributed (multiple cores on a chip and/or chips on a board) embedded systems
ξ the target systems for such an API will span multiple dimensions of heterogeneity e.g.
ξ core, interconnect, memory, operating system, software toolchain and programming language heterogeneity

ξ many industry standards have primarily been focused on the needs of widely distributed systems,
ξ SMP systems, or specific application domains

ξ the Multicore Communications API has similar, but more highly constrained, goals than these existing standards with respect to
ξ scalability and fault tolerance, yet has more generality with respect to application domains
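The kind of low-overhead message passing that MCAPI standardizes can be illustrated with a self-contained sketch. This is not the MCAPI API itself (no `mcapi_*` calls are used); it emulates two "cores" with POSIX threads and a one-slot mailbox, and all names here are ours:

```c
/* Minimal sketch of core-to-core message passing, emulated with two
 * POSIX threads and a one-slot mailbox protected by a mutex. */
#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             full;    /* 1 when a message is waiting */
    uint32_t        payload; /* the "message" itself        */
} mailbox_t;

static mailbox_t mbox = { PTHREAD_MUTEX_INITIALIZER,
                          PTHREAD_COND_INITIALIZER, 0, 0 };

static void mbox_send(uint32_t msg)
{
    pthread_mutex_lock(&mbox.lock);
    mbox.payload = msg;
    mbox.full = 1;
    pthread_cond_signal(&mbox.ready);
    pthread_mutex_unlock(&mbox.lock);
}

static uint32_t mbox_recv(void)
{
    pthread_mutex_lock(&mbox.lock);
    while (!mbox.full)                   /* block until a message arrives */
        pthread_cond_wait(&mbox.ready, &mbox.lock);
    uint32_t msg = mbox.payload;
    mbox.full = 0;
    pthread_mutex_unlock(&mbox.lock);
    return msg;
}

static void *producer(void *arg)
{
    (void)arg;
    mbox_send(0xC0DEu);                  /* "core 0" posts one message */
    return NULL;
}

/* Spawn a producer thread and receive its message on this thread. */
uint32_t message_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    uint32_t msg = mbox_recv();          /* "core 1" waits for the post */
    pthread_join(t, NULL);
    return msg;
}
```

A real MCAPI implementation replaces the mutex/condvar pair with endpoint objects and transport chosen for the target interconnect, but the send/blocking-receive shape is the same.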



select: ~[Σ] ~[Δ]!































































~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
yourDragonXi ~ Shift to multi-core DSP Solutions

»

Shared resources
ξ multi-core DSPs address the perceived challenges inherent in new architectures,
ξ but a multi-core system adds complexities to system development compared with a single-core, multiple-device system
ξ designers must:
ξ measure total bandwidth requirements to guarantee that the multi-core DSP has sufficient I/O capability to handle the system's requirements
ξ share resources between cores
ξ ensure that one core does not corrupt the operation of another core
ξ communicate between cores
ξ consider device-level parameters, such as internal bandwidth and memory, in addition to board-level parameters

Memory access
ξ L2 memory becomes an internal system-level memory shared between the cores inside the device
ξ sharing code images and data tables between the cores using high-frequency internal buses is possible,
ξ thereby reducing system level memory requirements and access latencies
ξ L2 memory in this case must be multi-ported so that I/O activities do not interfere with core accesses to the memory

ξ there will still be cases when more than one core wants to access L2 simultaneously
ξ in this case, the lower-priority core will be blocked
ξ because the memory is operating at the core frequency, the stall is short
ξ the multi-core device architect must ensure that each core has equal access to L2 memory and
ξ will not lock out other cores from accessing the resource

ξ the inclusion of a cache prefetch capability and a buffer for core writes to L2 memory
ξ can minimize the number of direct accesses to L2 memory by each core
ξ although there are four cores, there may not be 4 x the amount of on-chip memory in a multi-core device
ξ this means that the size of compiled code and the storage required for data is important
ξ in some cases, the sharing of code images between the four cores means that a system designer can avoid using external SDRAM
ξ however, with emerging telecommunications standards defining the size of data channels,
ξ using external SDRAM/DDR memory as L3 usually remains a requirement
ξ designers must decide whether the additional latency for L3 accesses dictates
ξ that data or instruction information be stored externally to the device
ξ code with less stringent latency requirements, such as initialization code, can reside in L3 memory
ξ frequently-accessed data should be kept in L1, close to the core, to minimize latency delays and maximize performance
ξ data buffers may move from L3 to L1 for low latency processing
ξ system designers should analyze the data and code to determine
ξ what the cores can share to reduce overall system requirements and meet latency requirements for performance.
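The L1/L3 placement decisions above are ultimately expressed to the toolchain. A hedged sketch: the attribute syntax below is GCC's, but the section names `.l1data` and `.l3data` are placeholders of ours — the real names come from the DSP toolchain's linker command file (TI compilers, for example, use `#pragma DATA_SECTION` instead):

```c
/* Steering data to memory levels with linker sections. Section names
 * ".l1data"/".l3data" are hypothetical; map them in your linker file. */
#include <stdint.h>

/* Hot, per-channel working buffer: place in fast L1 RAM. */
__attribute__((section(".l1data")))
int16_t channel_buf[256];

/* Large table touched only at start-up: L3 (external SDRAM/DDR) is fine. */
__attribute__((section(".l3data")))
int16_t init_table[8192];

/* Stage one channel's data close to the core before processing it. */
void stage_channel(int idx)
{
    for (int i = 0; i < 256; i++)
        channel_buf[i] = init_table[idx * 256 + i];
}
```

On a hosted build the sections simply land in ordinary RAM; on the target, the linker file decides which physical memory each section occupies.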

Data routing
ξ designers should carefully plan the data routing between I/O interfaces and destination memories
ξ it is important to complete code development and performance estimates to understand a given application’s processing capability per core
ξ based on this information, the system designer then knows how much data each core can process and
ξ plans the interface data routing accordingly
ξ if instruction cache is used, L2 memory must be large enough to store the program code and
ξ the cache overhead must be considered in the core processing time
ξ ideally, channel data, stack, and tables are in L1 memory to benefit from low access times
ξ L1 memory must have enough space to store the channel data, stack, and tables
ξ for maximum performance, the TDM interface must be able to route the channel data to the correct core’s memory
ξ without the need for core intervention
ξ system designers should determine the worst-case situation(s) for
ξ maximum memory use, maximum bandwidth, and minimum latency requirements (maximum core processing) as well as
ξ follow the estimation process above for these cases
ξ task switching must be taken into consideration
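The worst-case estimation step above is simple arithmetic, sketched below. All numbers are illustrative assumptions, not from any datasheet; the point is only the shape of the check — total per-core data demand, with headroom, against interface capacity:

```c
/* Worst-case I/O budget sketch; the structure mirrors the planning
 * step above. All figures are example assumptions. */
typedef struct {
    int    cores;              /* processing cores on the device        */
    int    channels_per_core;  /* worst-case channels each core handles */
    double bytes_per_channel;  /* bytes moved per channel per second    */
} load_t;

/* Required bandwidth in bytes/second for the worst case. */
double required_bw(load_t l)
{
    return (double)l.cores * l.channels_per_core * l.bytes_per_channel;
}

/* 1 if the interface (bytes/s) covers the worst case with headroom
 * (e.g. headroom = 1.2 leaves 20% margin for cache traffic and DMA). */
int interface_ok(load_t l, double iface_bw, double headroom)
{
    return required_bw(l) * headroom <= iface_bw;
}
```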

The importance of bandwidth and cache utilization
ξ DMA channel scheduling may be required to move data from temporary storage locations to processing buffers and then to an interface for output
ξ bandwidth capabilities of the I/O interfaces must be high enough to support all available cores
ξ any bandwidth shortfall means that cores sit idle and the performance benefit of the multiple cores is diminished
ξ the interfaces must be able to interrupt cores individually so that
ξ only the core requesting data can receive processing notification, while
ξ other cores can continue processing their own data without unnecessary interruption
ξ system designers must consider that cache misses will occur for multiple cores within the device
ξ these cache accesses to shared memory must be factored into bandwidth calculations
ξ in a multi-core device, cache performance is critical as multiple caches are vying for bandwidth to L2 memory
ξ the cache architecture must be more flexible and have better performance than what may be offered in a single-core device
ξ this means providing architectural capabilities such as associativity, prefetch, and variable fetch lengths
ξ in a direct-mapped cache, each memory line has one mapped location in the cache based on its index value
ξ when a cache fetch occurs in a direct-mapped architecture, the fetched information has only one location to go to in the cache
ξ that location may currently hold the most recently used information,
ξ but the cached information is still overwritten because there is nowhere else for the fetched line to reside
ξ in an associative cache architecture, a memory line can have multiple locations to map to in the cache
ξ this allows the fetched information to reside in the least recently used cache location and
ξ keep recently used cached information in the cache
ξ this decreases the number of cache misses
ξ prefetching is based on the assumption that if the application experiences a cache miss and fetches a block of information from L2,
ξ it will continue to process the data sequentially, so the next fetch will come from the sequential locations in L2
ξ the cache can be programmed to “prefetch” and retrieve the next block in memory so that
ξ it is ready when the core needs it, thereby avoiding another cache miss
ξ because there may be differences in the size of the sequential blocks and
ξ bandwidth utilization of the fetch unit accessing higher-level memory,
ξ it is useful to provide the device programmer the option of varying the fetch length to maximize prefetch amounts or minimize them
ξ use of prefetch to decrease cache misses has been shown to reduce cache overhead by 6.5 percent in packet telephony applications
ξ such as Session Initiation Protocol (SIP)
ξ software developers should measure the cache performance and plan code storage allocation to meet software performance requirements
ξ this is especially important if a direct-mapped cache is used so that frequently-used cached information is not overwritten by newly fetched information
ξ designers may need to understand cache bandwidth utilization, in addition to the data routing considerations mentioned previously,
ξ to meet strict processing latency requirements
ξ cacheable information placement parameters are fed to the linker at build time
ξ so that information that will be accessed sequentially is placed sequentially in memory to take advantage of prefetching
ξ simulators can measure cache hits and misses
ξ multi-core DSP devices are now available that include hardware mechanisms to count cache hits or misses
ξ while an application is operating in real time
ξ this information can be displayed graphically in an easy-to-understand format by cache performance tools
ξ this helps the programmer understand where there are opportunities for improving an application’s performance in regard to cache usage
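The direct-mapped conflict described above can be made concrete with a few lines of index arithmetic. The geometry here (64-byte lines, 256 sets) is an example of ours, not a property of any particular device:

```c
/* Why associativity helps: in a direct-mapped cache every address has
 * exactly one slot, so two hot lines whose index bits match evict
 * each other. Example geometry: 64-byte lines, 256 sets. */
#include <stdint.h>

#define LINE_BYTES 64u
#define NUM_SETS   256u

static uint32_t cache_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;   /* set selected by index bits */
}

static uint32_t cache_tag(uint32_t addr)
{
    return addr / (LINE_BYTES * NUM_SETS);   /* remaining high bits */
}

/* Two addresses conflict in a direct-mapped cache iff their index
 * bits match while their tags differ. */
int direct_mapped_conflict(uint32_t a, uint32_t b)
{
    return cache_index(a) == cache_index(b) && cache_tag(a) != cache_tag(b);
}
```

With this geometry, addresses 16 KB apart (64 × 256 bytes) always collide — exactly the pattern an associative cache absorbs by giving the line a second place to live.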

Reentrancy and resource sharing
ξ code for a multi-core device must follow the rules for reentrancy:
* Storing local data on the stack (dynamically declared)
* Passing input and output pointers
* Never writing to statically-declared memory
* Programming routines that are robust in a preemptive situation

ξ by following these rules, programmers ensure that one task does not corrupt the data of another task and,
ξ in a multi-core device, that one core's processing does not corrupt the data of another core's processing
ξ to ensure that a shared routine calls the correct core-specific function,
ξ the calling core must pass a pointer to its own function instead of an absolute address

The following code is not reentrant:

    int S;                     // static memory
    int sub(void)              // shared subroutine
    {
        int I;                 // dynamically declared variable, on stack
        for (I = 0; I < 3; I++)
            S++;
        return S;
    }
ξ the variable S is a global (statically declared) variable accessible by other functions
ξ if the sub() function is preempted and another thread modifies the value of S or calls sub() again,
ξ then the original sub() call will return an incorrect value for S when it completes its processing
ξ instead, sub() can declare S locally on the stack
ξ then even if another function preempts and calls a second instance of sub()
ξ the second one will allocate its own S variable on the stack and will not modify the first instance’s S variable
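A reentrant rewrite following the stack-allocation rule above might look like this (the name `sub_reentrant` and the output-pointer signature are ours, chosen to match the "passing input and output pointers" rule):

```c
/* Reentrant rewrite of sub(): the accumulator lives on the stack and
 * the result is returned through a caller-supplied pointer, so a
 * preempting task, or another core, running the same routine cannot
 * corrupt this call's state. */
void sub_reentrant(int *out)
{
    int s = 0;                 /* local: each call gets its own copy */
    for (int i = 0; i < 3; i++)
        s++;
    *out = s;
}
```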
ξ the rules of reentrancy dictate that global variables should be read-only
ξ there may be a case where multiple cores use a global variable, such as S
ξ in this case, the code should ensure that semaphores protect the variable
ξ correct resource sharing can be guaranteed by utilizing mechanisms such as semaphores and spin locks
ξ semaphores are flags that may be checked to determine whether a resource is currently in use by another task or core
ξ in a spin-lock situation, the requesting task or core checks to see if a resource is available
ξ instead of returning to another task if the resource is unavailable, the requestor waits, or “spins,” until the resource is available
ξ similar to reentrancy, semaphores and spin locks are common single-core-system multitasking concepts
ξ that are essential in a multi-core system as well
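A spin lock of the kind described can be sketched with C11 atomics; on a real multi-core DSP this role is usually played by a hardware semaphore peripheral, so the code below is an illustration of the concept, not a device-specific mechanism:

```c
/* Minimal spin lock over a shared global, using C11 atomic_flag: the
 * requester "spins" until the current holder releases the flag. */
#include <stdatomic.h>

static atomic_flag s_lock = ATOMIC_FLAG_INIT;
static int shared_s = 0;       /* the global the lock protects */

void protected_increment(void)
{
    while (atomic_flag_test_and_set(&s_lock))
        ;                      /* spin until the holder clears the flag */
    shared_s++;                /* critical section */
    atomic_flag_clear(&s_lock);
}

int read_shared(void) { return shared_s; }
```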

System architecture options
ξ it is possible to utilize a multi-core device in different ways, such as a DSP farm where the cores operate independently
ξ the interfaces route data to and from each core with no interaction between cores
ξ in other cases the cores work together to complete a task
ξ while efficient inter-core communication may be necessary in either situation,
ξ it is especially important for the case when the cores are working together to complete a task
ξ a multi-core device must provide a mechanism for communicating between cores when the need arises
ξ this mechanism should minimize the amount of time required for both the sending and receiving core to handle the message passing
ξ this message passing stays within the device and operates at core frequencies
ξ because of this, it is a low-latency communication method
ξ ideally, the message can be located in internal memory
ξ if it is in L1 memory then a zero-wait-state read access (one core cycle) can retrieve the message
ξ writing the message at runtime should require a single write
ξ although some previous initialization of the messaging parameters is acceptable
ξ as long as it is not required each time a message is communicated
ξ with this capability, it may be possible for the system designer to remove the host controller from the DSP farm application and
ξ utilize one of the DSP cores to manage necessary control functions
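The single-write, low-latency scheme described above can be sketched as a one-word mailbox: the slot location is agreed at initialization, so posting a message at run time is literally one write, and a read from internal memory retrieves it. Shared on-chip RAM is emulated with a plain array here:

```c
/* One-word mailbox: zero means "empty", any nonzero word is both the
 * flag and the payload, so sending is a single write. The array
 * stands in for on-chip L1/L2 RAM shared between cores. */
#include <stdint.h>

static volatile uint32_t mailbox;   /* slot agreed at init time */

void post_message(uint32_t msg)     /* sender: one write; msg != 0 */
{
    mailbox = msg;
}

uint32_t poll_message(void)         /* receiver: read, then clear */
{
    uint32_t m;
    while ((m = mailbox) == 0)
        ;                           /* spin; an interrupt avoids this */
    mailbox = 0;
    return m;
}
```

On real silicon the post would typically also raise an inter-core interrupt so the receiver need not poll.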

Flexibility is key
ξ one of the most important aspects of multi-core DSP architectures is flexibility
ξ this is especially true for the programmable DSP, which is able to accommodate a wide range of signal processing applications
ξ to be flexible, the architecture must:
* be based on a core with efficient C compilability
* have a multi-level memory architecture to provide options for partitioning the system based on application timing requirements
* provide powerful industry-standard I/O interfaces to keep the multiple cores fed with data

ξ if code can be compiled efficiently then changes in application programs can be implemented quickly
ξ multi-level memory hierarchy allows functions and data to be moved depending on changes in latency requirements
ξ if a new service is called for that requires lower-latency,
ξ then a multi-level hierarchy provides flexibility for other functions to be moved in the system to accommodate the new service
ξ using industry-standard interfaces simplifies board design and parts searches
ξ familiarity of standard interfaces also makes it easier to estimate changes in data throughput requirements
ξ to determine whether the interface will be adequate to meet bandwidth demands
ξ the architecture must provide efficient messaging mechanisms for communication between cores
ξ DSP device architects should take care in providing this functionality to the customer
ξ system designers must then be diligent in understanding the architectural support provided for the multiple cores and
ξ plan their data processing to take advantage of the architectural features
ξ as integration increases, the system level design moves from a focus at the board level
ξ to include additional considerations at the device level
ξ this may seem like an increase in complexity but is, in fact, a natural evolution of the technology to meet market demands
ξ delivering such an architectural solution results in equipment
ξ that meets performance targets and is flexible enough for future optimization,
ξ intellectual property inclusions, and feature additions,
ξ while also drastically reducing power consumption, cost, and size
ξ the challenge is finding the right DSP vendor with the experience, and
ξ peripheral and tools support to provide such a multi-core solution



select: ~[Σ] ~[Ω] ~[Δ]!































































~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Small & Smart Inc reserves rights to change this document without any notice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~