-- Stefan Hacker, Analog Devices GmbH

Need for Speed:

Today, more and more complex applications demand higher computational performance to run in real time. They range from simple control algorithms in consumer goods to highly sophisticated applications in today's and tomorrow's communications industry. Modern RAS modems, xDSL technologies and mobile communications, such as the coming 3G UMTS standard, in particular need high-performance signal computers to meet the application's requirements. Furthermore, there is a steady trend toward smaller appliances with a higher number of channels or ports per system, which calls for more powerful and more specialized processors.
To meet this demand, chip designers can pursue two different approaches: speed up the existing device, or develop an improved device based on a new architecture. The first approach offers only limited possibilities: optimizing the production process and shrinking the geometry of the part yields performance gains of roughly 5 to 8 times. The second approach defines a new processor from scratch that is tailored to the targeted applications, optimized for the user's chosen programming language, and starts where the old processor reached its limit.
Analog Devices has gone through this procedure. The first 32-bit fixed- and floating-point processor core, the ADSP-21020, originally delivered 30 MFLOPS. It has been improved and enhanced so that the ADSP-21065L now delivers a peak performance of 198 MFLOPS, a speed improvement by a factor of 6.6. To maintain code compatibility and to protect the customer's code investment, Analog Devices applied further optimizations and added architectural features, so that the first version of the coming ADSP-2116x processor family is aimed at a 100 MHz core clock, delivering 600 MFLOPS.
As coming applications like 3G mobile communications, xDSL technologies and RAS modem servers require a multiple of 600 MFLOPS in performance, the new TigerSHARC architecture was introduced.
Chip designers and system engineers carefully analyzed the targeted applications and concluded that a new processor has to combine features from several different architectures and designs. It must take advantage of existing DSP technologies, such as fast and deterministic execution cycles, a highly responsive interrupt model and an excellent peripheral interface to support high computational core rates. To achieve excellent core performance, features from RISC computers are introduced, such as the load/store architecture, a deeply pipelined sequencer with branch prediction and large interlocked register files. Finally, feeding all units on the chip with instructions requires clever management of the instruction word, so elements of very long instruction word (VLIW) designs, in which instruction parallelism is determined prior to runtime, are adopted as well.

TigerSHARC Architecture:

Before the architectural components of the 32-bit TigerSHARC are described in detail, the resulting implementation is shown in Figure 1.

Figure 1: TigerSHARC Architecture

An analysis of the needs of upcoming applications soon reveals parallelism: many computations use the same instructions and differ only in their data. Especially in multichannel applications, or where data is arranged orthogonally, performance can be doubled by adding a second set of computational units. Processors in which one instruction drives more than one computational unit are referred to as single-instruction multiple-data (SIMD) architectures. The TigerSHARC allows a single instruction to be issued that processes data in both computational units.
Furthermore, the second computational unit can be operated independently of the first; the term multiple data paths applies to this model. To operate the units independently, additional space must usually be reserved in the instruction word for each unit's code, which results in very long instruction words (VLIW). VLIWs can quickly use up the on-chip program memory, which is in most cases very limited, because a "no operation" instruction (NOP) must be issued to every unit that is not used in the current processor cycle.
To avoid placing NOPs in the code, a major disadvantage of currently available VLIW designs, the large instruction words are broken into separate small instructions, which can be issued to each unit of the TigerSHARC independently. Up to four of these instructions can be carried out simultaneously. As dependency checking is enabled for each instruction, all pipelining effects and delays are tracked and resolved by the program sequencer. Handling multiple instructions for independent units concurrently is a key feature of superscalar processor architectures. Because processor resource coordination and resource allocation are checked in the coding phase, the TigerSHARC features a static superscalar design. This greatly reduces security concerns by providing deterministic code execution.

TigerSHARC Processing Elements

Each of the two computational units, referred to as Processing Element X (PEX) and Processing Element Y (PEY), contains a large, fully interlocked register file of 32 entries, each 32 bits wide. Every computation carried out by the ALU, MAC or Shifter sources its data from this register file and writes its results back to it, the key characteristic of a load/store architecture. The large number of registers for processing or storing data eases the use of a high-level programming language. To maintain high internal bandwidth, each register file connects to the three internal 128-bit wide busses through two 128-bit wide busses. Both busses can be used concurrently for memory reads, and one of them can be used for write operations. This bus structure matches typical mathematical instructions, which require two inputs and compute one output.
When considering the targeted applications and markets, the programmer will find that most application data is a mix of 8-bit, 16-bit, 32-bit and 64-bit words. In addition to the pure data value, some headroom is required to provide enough dynamic range for processing. The TigerSHARC accounts for these different data types by providing native support for byte (8-bit), short (16-bit), normal (32-bit) and long (64-bit) fixed-point words. Every data type may be signed or unsigned. In addition, the TigerSHARC offers a 32-bit floating-point and a 40-bit extended floating-point data type, as in the current ADSP-2106x processor family.
To take advantage of the different data types and to boost performance, the native 32-bit fixed-point data words may be broken down into several smaller data types, e.g. into two 16-bit values or four 8-bit values. The data is then arranged in the register in packed mode. The Arithmetic Logic Unit (ALU) of the TigerSHARC, shown in Figure 2, computes on multiple data fractions in parallel. Furthermore, the ALU does not only work on a single 32-bit word; it can take two 32-bit words as input. This results in up to 8 concurrent additions, subtractions or complementary add/subtracts in a single processor cycle, for a single processing element alone. All ALU instructions are computed in a single cycle, with the result available in the register file one cycle later. The ALU also supports promotion and demotion between different data types, allowing an easy exchange of data in different formats. An additional architectural enhancement is a set of special registers that calculate sideways sums of data fractions in registers or track a history of comparisons, which is much needed in error correction modules such as the Viterbi algorithm, widely used in today's telecommunications applications.

Figure 2: TigerSHARC ALU Data Types

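The packed-mode addition described above can be pictured with a small software model. The following Python sketch is only an illustration of the idea, not TigerSHARC code: the function name `packed_add8` and the wrap-around (non-saturating) behaviour per lane are assumptions made for this example.

```python
def packed_add8(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed into two 32-bit words.

    Hypothetical helper for illustration; each lane wraps around on
    overflow and never carries into its neighbour.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF          # extract lane from first operand
        y = (b >> shift) & 0xFF          # extract lane from second operand
        result |= ((x + y) & 0xFF) << shift
    return result


a = 0x01FF1020
b = 0x01023040
packed = packed_add8(a, b)               # four independent 8-bit additions
plain = (a + b) & 0xFFFFFFFF             # one ordinary 32-bit addition
print(hex(packed), hex(plain))
```

The two results differ only in the lane where `0xFF + 0x02` overflows: in the packed case the carry is discarded instead of rippling into the next byte. Applying the same operation to two 32-bit words per operand gives the eight lane additions mentioned in the text.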
The Multiplying Accumulator (MAC) of the TigerSHARC, presented in Figure 3, has the same multiple data support for concurrent operations built in, boosting performance to 4 MAC instructions per cycle in a single Processing Element. Again, all computations take a single cycle, with a one-cycle delay before the data is available again from the register file. The destination of a computation can be either the register file or one of the four special-function MAC registers.

Figure 3: TigerSHARC MAC Data Types

When using fixed-point mathematics, the user may soon encounter overflows when performing successive MAC instructions. To prevent an early loss of valid data, the amount of overflow space for the MAC operation is adjusted to the magnitude of the input data types and ranges from 4 bits to 16 bits. This allows the programmer to verify the result and, in case of overflow, to scale the output accordingly or saturate the MAC.
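The role of the overflow space can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names, the 32-bit result width and the 8 guard bits are illustrative choices, not the actual register widths of the MAC unit.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a signed integer to the range of a two's-complement word of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac_with_guard(xs, ys, result_bits=32, guard_bits=8):
    """Accumulate products in an accumulator that is `guard_bits` wider than the result.

    Returns (result, overflowed): the sum saturated to `result_bits`, and a flag
    telling whether the accumulated value exceeded that range.
    Widths are assumptions chosen for illustration.
    """
    acc_bits = result_bits + guard_bits
    acc = 0
    for x, y in zip(xs, ys):
        acc = saturate(acc + x * y, acc_bits)   # guard bits absorb intermediate growth
    result = saturate(acc, result_bits)
    return result, result != acc


print(mac_with_guard([100, 200], [3, 4]))       # small values, no overflow
print(mac_with_guard([32767] * 4, [32767] * 4)) # sum exceeds 32 bits, result saturates
```

Because the intermediate sum is kept in the wider accumulator, the programmer can detect the overflow after the loop and then decide whether to scale the output or accept the saturated value.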
Besides mathematical operations on real data, the TigerSHARC is equipped with complex data type support in the MAC unit, allowing complex numbers to be handled directly. This feature considerably improves execution times for the FFT and iFFT algorithms needed in xDSL applications. To give an example: a 16-bit, 256-point complex FFT computes in as little as 1100 processor cycles, or 4.4 microseconds when running at 250 MHz.
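To make the benefit of native complex support concrete, the sketch below spells out the real-valued operations behind one complex multiplication and reproduces the timing figure quoted above. The function names are hypothetical and the code is plain Python arithmetic, not a model of the MAC hardware.

```python
def complex_mul(a_re, a_im, b_re, b_im):
    """One complex multiply written out as four real multiplies and two add/subtracts."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count into execution time in microseconds."""
    return cycles / clock_hz * 1e6


print(complex_mul(1, 2, 3, 4))                       # (1 + 2j) * (3 + 4j)
print(round(cycles_to_microseconds(1100, 250e6), 3)) # FFT example from the text
```

Without complex support, each of these multiplies and additions would have to be issued as a separate real-valued operation, which is why FFT butterflies profit from the feature.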
By design, the TigerSHARC's ALU and MAC support fixed-point data and floating-point data equally. This is another key feature when programming in high-level languages like C. There is no need to reconfigure the processor or switch its mode when moving between data types, as the processor has no hardware modes. The data format to be used is encoded in the instruction line.
The last part of each Processing Element is the Shifter unit, implemented as a barrel shifter so that it can shift by more than a single bit per cycle. In addition, the Shifter unit offers bit field operations such as field extract and field deposit, allowing cut and paste operations on multiple bits in 32-bit words. Bit set, clear, toggle and test logic allows the data word to be altered as well. Finally, a special provision is implemented in a dedicated shifter register to simulate bit streams, which is quite helpful when generating CRC codes or PN sequences based on polynomials.
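The bit field operations and the polynomial-based bit stream generation can be sketched as follows. This is a software illustration only: the function names are hypothetical, and the 4-bit register with feedback from bits 0 and 1 (recurrence polynomial x^4 + x + 1) is an example chosen for its short period, not a sequence prescribed by any standard mentioned in the text.

```python
def field_extract(word, pos, length):
    """Return `length` bits of `word`, starting at bit position `pos`."""
    return (word >> pos) & ((1 << length) - 1)


def field_deposit(word, pos, length, value):
    """Return `word` with `length` bits at position `pos` replaced by `value`."""
    mask = ((1 << length) - 1) << pos
    return (word & ~mask & 0xFFFFFFFF) | ((value << pos) & mask)


def lfsr_bits(seed, tap_mask, nbits, count):
    """Generate `count` bits from a right-shifting linear feedback shift register.

    The feedback bit is the parity of the tapped state bits and is shifted
    into the most significant position.
    """
    state, out = seed, []
    for _ in range(count):
        out.append(state & 1)
        feedback = bin(state & tap_mask).count("1") & 1
        state = (state >> 1) | (feedback << (nbits - 1))
    return out


print(hex(field_extract(0x12345678, 8, 8)))
print(hex(field_deposit(0x12345678, 8, 8, 0xAB)))
print(lfsr_bits(seed=0b0001, tap_mask=0b0011, nbits=4, count=15))
```

With these example taps the 4-bit register cycles through all 15 non-zero states before repeating, which is the property that makes such sequences useful as pseudo-noise.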

TigerSHARC Integer ALUs

To fill the register file of each Processing Element with sufficient data, the TigerSHARC architecture provides two Integer ALUs (IALUs), called J-ALU and K-ALU, which can be used completely independently of each other in two ways.
First, the IALUs serve as data address generators for indirect addressing of internal and external memory. The address pointer may be pre- or post-modified, with the pointer modification taking place in the same processor cycle as the data access. The modification value may span the full addressable range of 32 bits. The optional, user-selectable bit-reversal mode and the circular buffer support help with addressing variables in FFT algorithms.
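Both addressing aids are easy to state in software. The sketch below shows what a bit-reversed index and a circular post-modify compute; the function names and the example buffer address are hypothetical, and the code models the arithmetic only, not the IALU registers.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (used to reorder FFT data)."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev


def circular_post_modify(pointer, modifier, base, length):
    """Advance `pointer` by `modifier`, wrapping inside [base, base + length)."""
    return base + (pointer - base + modifier) % length


# Access order for an 8-point FFT in bit-reversed addressing
print([bit_reverse(i, 3) for i in range(8)])

# Pointer stepping through an 8-word circular buffer that starts at 0x100
print(hex(circular_post_modify(0x106, 4, base=0x100, length=8)))
```

In hardware both computations come for free with the memory access, whereas in software they would cost extra instructions per data word.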
Second, the IALUs can be used for integer mathematics such as add, subtract and bit manipulation. IALU resources are typically used for loop or event counters, or for general 32-bit integer mathematics when the ALUs of the two Processing Elements are already tied up in other operations. To support high-level language programming, the IALU operates on a register file of 32 entries, each 32 bits wide.

TigerSHARC Memory Integration

The large on-chip memory is divided into three separate blocks of equal size. Each block is 128 bits wide, providing the quad-word structure with four addresses in every row. The memory can be configured to the user's needs; there is no fixed segmentation into program memory and data memory.
For data accesses, the processor can address one 32-bit word, two 32-bit words (long) or four 32-bit words (quad) and transfer them to or from a single computational unit, or both, in a single processor cycle. The user only has to ensure that the start address is a multiple of two when fetching long words and a multiple of four when fetching quad words. For applications that compute on data in a delay line whose start address does not meet these alignment requirements, or that otherwise require unaligned data fetches, a data alignment buffer (DAB) is provided. Once the DAB is filled, quad-word fetches can be made from it.
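One way to picture the alignment rule and the purpose of an alignment buffer is the following sketch. It is an assumption-laden model: memory is a Python list of 32-bit words, the function names are hypothetical, and assembling an unaligned quad word from two neighbouring aligned rows is an illustrative interpretation rather than a description of the actual DAB logic.

```python
def aligned_quad_fetch(memory, address):
    """Fetch four consecutive words; the address must be a multiple of four."""
    if address % 4 != 0:
        raise ValueError("quad-word fetch requires an address that is a multiple of 4")
    return memory[address:address + 4]


def unaligned_quad_fetch(memory, address):
    """Assemble four words from an arbitrary address out of two aligned rows."""
    base = address - address % 4
    rows = aligned_quad_fetch(memory, base) + aligned_quad_fetch(memory, base + 4)
    offset = address % 4
    return rows[offset:offset + 4]


memory = list(range(16))                 # word n simply holds the value n
print(aligned_quad_fetch(memory, 4))
print(unaligned_quad_fetch(memory, 5))   # spans the rows starting at 4 and 8
```

In a delay line the start address moves by one word per sample, so three out of four accesses would violate the alignment rule without such a mechanism.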
Besides the internal memory, the TigerSHARC can access up to four gigawords of memory. The memory map is given in Figure 4.

Figure 4: TigerSHARC Memory Map

All internal resources of the TigerSHARC show up in the memory map, allowing external bus masters or host processors to place data not only in memory but also in registers of the Processing Elements. Furthermore, all internal resources such as memories and registers appear in this memory map under an address alias corresponding to each processor's multiprocessing ID, allowing direct access from other bus masters.
Above the internal memory space and the multiprocessing space, several memory segments are mapped. These are dedicated segments for SRAM, SDRAM and peripherals. Finally, a 3.75-Gword memory segment is available through which master transfers can reach areas of the host system.

TigerSHARC Program Sequencer

One of the most complex parts of the new architecture is the program sequencer, which dispatches the instructions and ensures proper execution of the instruction word sent to each unit. The program sequencer of the TigerSHARC has to keep track of a 3-stage decode pipeline and a 5-stage execution pipeline. To reduce pipeline effects in non-linear code, the sequencer is equipped with a branch target buffer (BTB). The BTB mechanism predicts the target of a branch and stores it in a 128-entry buffer. With branch prediction, the pipeline penalty when branching can be reduced from six or three cycles to a single cycle.
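The principle of a branch target buffer can be sketched as a small lookup table. The class below is a generic illustration, not the TigerSHARC implementation: only the 128-entry size comes from the text, while the class name, the method names and the replacement policy (the entry updated longest ago is evicted) are assumptions.

```python
from collections import OrderedDict


class BranchTargetBuffer:
    """Toy model of a BTB: maps a branch address to its last known target."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = OrderedDict()

    def predict(self, branch_address):
        """Return the predicted target, or None if the branch is unknown."""
        return self.table.get(branch_address)

    def update(self, branch_address, target):
        """Record the target of a taken branch, evicting the oldest entry if full."""
        if branch_address in self.table:
            self.table.move_to_end(branch_address)
        elif len(self.table) >= self.entries:
            self.table.popitem(last=False)   # assumed policy: drop oldest entry
        self.table[branch_address] = target


btb = BranchTargetBuffer()
print(btb.predict(0x100))      # first encounter: no prediction available
btb.update(0x100, 0x200)       # branch resolved, target stored
print(hex(btb.predict(0x100))) # next encounter: target known at fetch time
```

A hit lets the sequencer start fetching from the predicted target immediately, which is where the reduction of the branch penalty comes from.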
The TigerSHARC does not use a fixed-length VLIW to drive instructions to each unit. Rather, the programmer can select a combination of up to four 32-bit instruction words to be executed together. All instructions are fetched from memory as 128-bit wide (quad) words containing packed code. The packed code is stored in the instruction alignment buffer (IAB) before execution, so that the program sequencer can determine the number of instructions to be carried out concurrently. Afterwards, the selected 32-bit chunks of the instruction alignment buffer, which holds two 128-bit words, are dispatched from the instruction slots to the units. An example of the code storage and its breakdown into instruction cycles is shown in Figure 5.

Figure 5: TigerSHARC Instruction Parallelism and Code Packing

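The memory saving of packed code over a fixed four-slot VLIW can be estimated with a short sketch. The text does not say how the end of a group of parallel instructions is marked, so the boolean end-of-line flag attached to each instruction here is purely an assumption, as are the function names and the mnemonics.

```python
def split_into_lines(stream, max_slots=4):
    """Group (name, end_of_line) pairs into lines of instructions issued together."""
    lines, current = [], []
    for name, end_of_line in stream:
        current.append(name)
        if len(current) > max_slots:
            raise ValueError("more than %d instructions in one line" % max_slots)
        if end_of_line:                       # assumed marker closing the line
            lines.append(current)
            current = []
    if current:
        raise ValueError("stream ends inside an instruction line")
    return lines


def packed_size_words(lines):
    """32-bit words needed when instructions are stored back to back."""
    return sum(len(line) for line in lines)


def fixed_vliw_size_words(lines, slots=4):
    """32-bit words needed when every line is padded with NOPs to `slots` entries."""
    return len(lines) * slots


stream = [("ALU", False), ("MAC", False), ("LOAD", True),
          ("JUMP", True),
          ("ALU", False), ("MAC", False), ("LOAD", False), ("STORE", True)]
lines = split_into_lines(stream)
print(lines)
print(packed_size_words(lines), fixed_vliw_size_words(lines))
```

In this example three instruction lines occupy eight words, that is two quad words, whereas a fixed format would need twelve words, four of them NOPs.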
As the TigerSHARC architecture is deeply pipelined and has a fixed one-cycle delay before the results of computational operations (MAC or ALU) in the Processing Elements are available from the register file again, dependency checking is provided. The dependency check automatically inserts a stall cycle to prevent a computation with invalid operands. This interlocking of the registers in a large register file allows a much easier programming style. Smart development tools from Analog Devices for the TigerSHARC architecture will help the programmer avoid stall cycles through intelligent register usage in C sources and assembly language files.
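The effect of the interlock on execution time can be illustrated with a simple cycle counter. This sketch rests on a deliberately reduced model: every instruction line is described only by the registers it writes and reads, a result becomes readable one line later, and the function and register names are hypothetical.

```python
def count_cycles(lines):
    """Count cycles for a list of (written_registers, read_registers) lines.

    One stall cycle is added whenever a line reads a register that the
    immediately preceding line has just computed.
    """
    cycles = 0
    previous_writes = set()
    for writes, reads in lines:
        if previous_writes & set(reads):
            cycles += 1                      # interlock inserts one stall
        cycles += 1                          # the line itself
        previous_writes = set(writes)
    return cycles


dependent = [({"R2"}, {"R0", "R1"}),         # R2 = R0 + R1
             ({"R3"}, {"R2", "R1"})]         # R3 = R2 * R1, needs R2 at once

scheduled = [({"R2"}, {"R0", "R1"}),         # R2 = R0 + R1
             ({"R5"}, {"R4"}),               # independent work fills the gap
             ({"R3"}, {"R2", "R1"})]         # R2 is available by now

print(count_cycles(dependent), count_cycles(scheduled))
```

Both sequences take three cycles, but the second one completes an additional useful operation in the slot that would otherwise be a stall, which is the kind of reordering a scheduling tool aims for.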
In addition, the program sequencer evaluates conditional instructions and loads the branch target buffer accordingly, which allows more flexible code, and it maintains the counter information for two nestable loop counters. Loop counters are supported by the branch target buffer as well, so the user will not encounter pipeline penalties when the last instruction of a loop is reached and the loop counter has not yet been decremented to zero.
Finally, the program sequencer keeps track of incoming interrupt requests, which may be generated by the on-chip timers, the DMA engine or external devices. Due to the deterministic architecture, the TigerSHARC is highly interruptible and responds to interrupt requests within 6 instruction cycles when the interrupt is unmasked and enabled.

Summary

The powerful architecture of the TigerSHARC, combining the best elements of RISC and DSP cores, is well suited to deliver the performance required for upcoming applications in 3G mobile communications, xDSL technologies and imaging systems. The support for 8-bit, 16-bit and 32-bit fixed-point data and for 32-bit and 40-bit floating-point data allows the programmer to handle application data efficiently and, because the processor is not bound to a single format, provides enough dynamic range for complex computations. The static superscalar architecture maintains determinism for security-sensitive applications, and the high number of internal registers allows the efficient use of a high-level language, speeding up the designers' development process. So be prepared to leap forward into the new TigerSHARC architecture!