Slide 15: Cho and Jin, MICRO'06
- Page coloring to improve proximity of data and computation
- Flexible software policies
- Has the benefits of S-NUCA: each address has a unique location and no search is required
- Has the benefits of D-NUCA: page re-mapping can help migrate data, although at a page granularity
- Easily extends to multi-core and can easily mimic the behavior of private caches
Slide 14: Page Coloring
[Figure: physical address bit fields. Cache view: Tag | Set Index | Block Offset, with the bank number taken either from the low-order set-index bits (set-interleaving) or from bits of the physical page number (page-to-bank). OS view: Physical Page Number | Page Offset.]
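The two bank-selection options in the figure can be sketched as bit extractions. The parameters below are illustrative assumptions, not from the slide: 64 B blocks (6 offset bits), 4 KB pages (12-bit page offset), and 16 banks (4 bank bits).

```python
# Illustrative parameters (assumptions): 64 B blocks, 4 KB pages, 16 banks.
BLOCK_OFFSET_BITS = 6
PAGE_OFFSET_BITS = 12
BANK_BITS = 4  # 16 banks

def bank_set_interleaving(paddr: int) -> int:
    """Bank = low-order set-index bits, just above the block offset.
    Consecutive blocks map to consecutive banks."""
    return (paddr >> BLOCK_OFFSET_BITS) & ((1 << BANK_BITS) - 1)

def bank_page_to_bank(paddr: int) -> int:
    """Bank = low-order bits of the physical page number.
    A whole page lands in one bank, so the OS can steer data toward a
    core by choosing the physical page -- i.e., page coloring."""
    return (paddr >> PAGE_OFFSET_BITS) & ((1 << BANK_BITS) - 1)
```

Under page-to-bank, every address within a page maps to the same bank, which is what gives the OS its placement lever; under set-interleaving, a page's blocks are scattered across all banks.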
Slide 13: Victim Replication, Zhang and Asanovic, ISCA'05
- Large shared L2 cache (each core has a local slice)
- On an L1 eviction, place the victim in the local L2 slice (if there are unused lines)
- The replication does not impact correctness: this core is still in the sharer list and will receive invalidations
- On an L1 miss, the local L2 slice is checked before forwarding the request to the correct slice
[Figure: 8 tiles, each a processor with a private L1 cache and a local L2 slice]
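The lookup and eviction path above can be sketched as follows. This is a minimal model, not the paper's implementation: caches are dicts keyed by block address, and names like `replicas` and `home_tile` are illustrative.

```python
NUM_TILES = 8

def home_tile(addr: int) -> int:
    """S-NUCA-style home slice: address bits pick the owning tile."""
    return addr % NUM_TILES

class Tile:
    def __init__(self, tid: int):
        self.tid = tid
        self.l1 = {}        # private L1: addr -> data
        self.l2_slice = {}  # this tile's slice of the shared L2
        self.replicas = {}  # L1 victims replicated into unused local-slice lines

    def l1_evict(self, addr, data):
        # Keep a copy of the victim in the local slice; coherence is
        # unaffected because this tile stays on the sharer list.
        self.replicas[addr] = data

    def load(self, addr, tiles):
        if addr in self.l1:
            return self.l1[addr], "L1"
        if addr in self.replicas:            # check local slice first
            return self.replicas[addr], "local replica"
        home = tiles[home_tile(addr)]
        return home.l2_slice.get(addr), "home slice"
```

A hit in the local replica store avoids the network traversal to the home slice, which is the whole point of the technique.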
Slide 12: Alternative Layout
- From Huh et al., ICS'05
- Paper also introduces the notion of sharing degree: a bank can be shared by any number of cores between N = 1 and 16
- Will need support for L2 coherence as well
Slide 11: Beckmann and Wood, MICRO'04
- Data must be placed close to the center-of-gravity of requests
[Figure: shared-cache vs. private-cache NUCA layouts; bank access latencies range from 13-17 cycles for nearby banks to 65 cycles for distant banks]
Slide 10: Static and Dynamic NUCA
- Static NUCA (S-NUCA):
  - The address index bits determine where the block is placed; sets are distributed across banks
  - Page coloring can help here to improve locality
- Dynamic NUCA (D-NUCA):
  - Ways are distributed across banks
  - Blocks are allowed to move between banks, so some search mechanism is needed
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Alternatively, every possible bank is looked up and the search propagates, either in series or in parallel (complex!)
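The placement/search contrast above can be sketched in a few lines. The bank count and the partial-tag width are illustrative assumptions.

```python
NUM_BANKS = 16

def snuca_bank(set_index: int) -> int:
    # S-NUCA: sets are statically interleaved across banks, so the set
    # index alone names the bank and no search is ever needed.
    return set_index % NUM_BANKS

def dnuca_candidate_banks(partial_tags, addr_tag: int):
    # D-NUCA: a block may live in any bank holding one of its ways, so a
    # per-core partial-tag structure narrows the search. Matching only a
    # few tag bits (8 assumed here) can give false positives, but never
    # false negatives -- the block, if present, is in one of these banks.
    partial = addr_tag & 0xFF
    return [b for b, tags in enumerate(partial_tags) if partial in tags]
```

The candidate list is what the core would then probe, serially or in parallel; an empty list means the request can go straight to memory or the directory.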
Slide 9: Innovations for Shared Caches: NUCA
- Issues to be addressed for Non-Uniform Cache Access: mapping, migration, search, replication
[Figure: CPU adjacent to a banked NUCA cache array]
Slide 8: Dynamic Spill-Receive, Qureshi, HPCA'09
- Instead of forcing a block upon a sibling, designate caches as Spillers and Receivers; all cooperation is between Spillers and Receivers
- Every cache designates a few of its sets as Spiller sets and a few as Receiver sets (each cache picks different sets for this profiling)
- Each private cache independently tracks the global miss rate on its S/R sets (either by watching the bus or at the directory)
- The sets with the winning policy determine the policy for the rest of that private cache; this is referred to as set-dueling
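Set-dueling can be sketched with a saturating counter. The counter width and the set-sampling rule below are illustrative assumptions, not the paper's exact parameters: misses in the Spiller-sample sets push the counter up, misses in the Receiver-sample sets push it down, and its value picks the policy for all remaining ("follower") sets.

```python
NUM_SETS = 1024
PSEL_MAX = 1023           # assumed 10-bit saturating counter
PSEL_INIT = PSEL_MAX // 2

class SetDueling:
    def __init__(self):
        self.psel = PSEL_INIT

    @staticmethod
    def sample(set_idx: int) -> str:
        # Assumed sampling rule: a handful of sets duel, the rest follow.
        if set_idx % 64 == 0:
            return "spiller"
        if set_idx % 64 == 33:
            return "receiver"
        return "follower"

    def record_miss(self, set_idx: int):
        kind = self.sample(set_idx)
        if kind == "spiller":
            self.psel = min(PSEL_MAX, self.psel + 1)
        elif kind == "receiver":
            self.psel = max(0, self.psel - 1)

    def policy(self) -> str:
        # Fewer misses in the spiller sample => act as a Spiller everywhere.
        return "spill" if self.psel < PSEL_INIT else "receive"
```

Because each cache samples different sets, the miss counts reflect how the whole workload behaves under each policy, at the cost of only a few dueling sets behaving sub-optimally.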
Slide 7: Innovations for Private Caches: Cooperation
- Cooperative Caching, Chang and Sohi, ISCA'06
- Prioritize replicated blocks for eviction, with a given probability; the directory must track and communicate a block's replica status
- Singlet blocks are sent to sibling caches upon eviction (probabilistic one-chance forwarding); blocks are placed in the LRU position of the sibling
[Figure: 8 cores with private caches connected to a central directory D]
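The two mechanisms above can be sketched as follows. The probability value and the list-based cache model are illustrative assumptions; blocks are dicts with a `replica` flag supplied by the directory.

```python
import random

SPILL_PROB = 0.5  # assumed value for the slide's "given probability"

def pick_victim(blocks):
    """Replication-aware replacement: prefer evicting a replicated block;
    otherwise fall back to the LRU block (index 0)."""
    replicated = [b for b in blocks if b["replica"]]
    return replicated[0] if replicated else blocks[0]

def on_evict(victim, sibling_lru_list, rng=random.random):
    """Probabilistic one-chance forwarding: a singlet (the sole on-chip
    copy) is sent to a sibling with probability SPILL_PROB and enters at
    the sibling's LRU position (index 0 here), so it is evicted next
    unless re-referenced. Replicated victims are simply dropped."""
    if not victim["replica"] and rng() < SPILL_PROB:
        sibling_lru_list.insert(0, victim)
        return True
    return False
```

Placing the forwarded block at LRU gives it exactly one more chance to be useful without displacing the sibling's hot data.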
Slide 6: Shared vs. Private Caches
Advantages of shared (SHR) LLCs:
- No replication of blocks
- Dynamic allocation of space among cores
- Low latency for shared data in the LLC (no indirection through a directory)
- No interconnect traffic or tag replication to maintain directories
Advantages of private (PVT) LLCs:
- More isolation and better quality-of-service
- Lower wire traversal when accessing LLC hits, on average
- Lower contention when accessing some shared data
- No need for software support to maintain data proximity
Slide 5: Multi-Core Cache Organizations
- Private L1 caches
- Private L2 caches
- Scalable network
- Directory-based coherence between L2s (through a separate directory)
[Figure: 8 processor/cache tiles on a scalable network with a directory D]
Slide 4: Multi-Core Cache Organizations
- Private L1 caches
- Shared L2 cache, but physically distributed
- Scalable network
- Directory-based coherence between L1s
[Figure: 8 processor/cache tiles on a scalable network]
Slide 3: Multi-Core Cache Organizations
- Private L1 caches
- Shared L2 cache, but physically distributed
- Bus connecting the four L1s and four L2 banks
- Snooping-based coherence between L1s
[Figure: 4 processor/cache tiles on a bus with four L2 banks]
Slide 2: Multi-Core Cache Organizations
- Private L1 caches
- Shared L2 cache
- Bus between the L1s and a single L2 cache controller
- Snooping-based coherence between L1s
[Figure: 8 processor/L1 pairs on a bus to one monolithic shared L2]
Slide 1: Lecture 8: Large Cache Design I
Topics: shared vs. private caches, NUCA