[NPU Design Journey] Shared Memory Design and Technical Challenges for Capacity-to-Area Optimization


1. NPU Memory Architecture: Why Scratchpad over Cache?

The very first architectural decision encountered in NPU (Neural Processing Unit) design is “how to manage data.” At this juncture, I opted for an NPU-specific memory architecture rather than the conventional CPU approach.

CPU vs. NPU: The Difference in Deterministic Data Flow

  • Conventional CPUs: With complex execution paths and difficult branch prediction, data access patterns are highly irregular. Therefore, an L1/L2 Cache structure, where hardware manages and replaces data in real-time, is essential.
  • NPUs: Deep learning operations, such as matrix multiplication, are repetitive, and the data flow is highly deterministic. In other words, the compiler knows exactly what data will be needed and when.
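To make that determinism concrete, here is a minimal Python sketch (illustrative only, not our compiler's actual output format) showing that for a tiled matrix multiply, every tile access can be enumerated before execution even begins:

```python
# Minimal illustrative sketch: for a tiled matrix multiply, the complete
# list of tile accesses can be enumerated before execution starts, so no
# runtime cache-replacement decisions are ever needed.

def tile_schedule(M, N, K, tile):
    """Statically enumerate which A/B/C tiles each step touches."""
    schedule = []
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                schedule.append((("A", i, k), ("B", k, j), ("C", i, j)))
    return schedule

# The entire data-movement plan for a 4x4x4 multiply with 2x2 tiles:
sched = tile_schedule(M=4, N=4, K=4, tile=2)
print(len(sched))  # 8 tile-level steps, all known at compile time
```

Because this plan is fully known up front, software can issue exact prefetches into an SPM instead of relying on a cache to guess.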

The Strategic Advantage of Scratchpad Memory (SPM)

Thanks to this deterministic nature, software-controlled Scratchpad Memory (SPM) is vastly superior in NPUs compared to hardware-managed caches.

  • Area Optimization: The hardware Tag logic and data comparison logic, which are essential for caches, can be eliminated. The area saved here is utilized to secure more compute units or larger memory capacity.
  • Maximum Efficiency: This is the core reason why leading NPU companies like Google (TPU) and Tenstorrent prefer the SPM approach. By having software directly manage memory allocation, unnecessary data replacement is prevented, and power efficiency is maximized.
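As a rough illustration of the tag overhead an SPM avoids, the sketch below estimates tag-array bits for an assumed cache geometry. The capacity, line size, associativity, and state bits are illustrative numbers, not our design's, and the comparator and replacement logic are not even counted:

```python
import math

def tag_overhead_bits(capacity_bytes, line_bytes, ways, addr_bits=32):
    """Estimate tag-array bits for a set-associative cache of the given
    geometry; a scratchpad of equal capacity needs none of these bits."""
    lines = capacity_bytes // line_bytes
    sets = lines // ways
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(line_bytes))
    tag_bits = addr_bits - index_bits - offset_bits
    state_bits = 2                       # valid + dirty per line (illustrative)
    return lines * (tag_bits + state_bits)

# Illustrative 256 KiB, 64 B line, 8-way cache:
overhead = tag_overhead_bits(256 * 1024, 64, 8)
data_bits = 256 * 1024 * 8
print(f"{100 * overhead / data_bits:.1f}%")  # 3.7% extra bits, before comparators
```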

Our Design’s Hierarchical Structure: Scratchpad + Shared Memory

To increase data reusability, we subdivided the storage space into two distinct layers in our design.

  • Scratchpad (Local): Local storage used independently by each compute core to support high-speed operations.
  • Shared Memory (Shared): A shared space where multiple compute cores share and reuse repeatedly needed data, thereby minimizing external memory (DRAM) access.

This dual structure optimizes the data flow inside the NPU and serves as a robust foundation for boosting overall system performance.
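The division of labor between the two layers can be sketched as a toy traffic model (hypothetical counters, nothing from our RTL): a tile fetched once from DRAM into Shared Memory is reused by every core, while each core still fills its own scratchpad.

```python
# Toy model of the dual structure: a tile fetched once from DRAM into
# Shared Memory is reused by every core; each core still copies it into
# its local scratchpad. Counters show the DRAM traffic the shared layer saves.

def simulate_reuse(num_cores, tiles):
    shared_mem = {}
    dram_reads = 0   # external DRAM fetches (expensive)
    spm_fills = 0    # shared-memory -> scratchpad copies (cheap, on-chip)
    for _core in range(num_cores):
        for t in tiles:
            if t not in shared_mem:      # only the first request pays DRAM
                shared_mem[t] = True
                dram_reads += 1
            spm_fills += 1               # every core loads its scratchpad
    return dram_reads, spm_fills

dram, spm = simulate_reuse(num_cores=4, tiles=["W0", "W1", "W2"])
print(dram, spm)  # 3 12 -> 3 DRAM reads instead of 12
```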


2. The Core of NPU Internal Storage Design: Balancing Area and Performance

An NPU is structured so that numerous compute cores simultaneously access massive amounts of data to perform operations. Because of this, internal storage takes up a significant portion of the total chip area, and “maximizing capacity within a limited area” directly translates to the chip’s competitiveness.

However, simply increasing capacity is not viable. In an environment where multiple cores repeatedly send data requests, guaranteeing optimal performance requires extremely fast memory access speeds. During this project, I undertook optimization efforts to simultaneously satisfy both area and performance requirements while designing the Shared Memory, the NPU’s core storage space.

Particular attention was given to implementing the Read Cache and designing the Test Interface (Shared Bus). To overcome the speed discrepancy between the SRAM’s operating frequency and the surrounding logic, a Read Cache was introduced. Above all, I focused on minimizing the adverse impact of the test logic on the timing of the actual functional paths.


3. Technical Challenges: Limitations of Previous Designs and STA Timing Penalty Analysis

Ensuring stable testing while densely placing SRAMs within the die is a highly demanding task. In a previous project, we followed standard ASIC design practices by placing MBIST (Memory Built-In Self Test) logic directly in front of the SRAM. However, this approach revealed severe issues during the Physical Design stage.

Figure 0: Simple concept diagram of MBIST

Figure 0 illustrates the basic operating principle of Memory BIST. A BIST Controller exists outside the SRAM memory. In Test Mode, it supplies Test Patterns (Address, Data, Control) to the SRAM via a Mux and judges pass/fail by comparing the output data with expected values. The critical point is that this Mux is located directly on the actual functional path.
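The structure in Figure 0 can be modeled in a few lines of Python. The march sequence below is a simplified illustration (real MBIST controllers run full algorithms such as March C-), and the mux function shows exactly where the test path intrudes on the functional path:

```python
# Illustrative model of Figure 0: a mux in front of the SRAM selects either
# functional or BIST stimulus; the BIST controller drives a simplified
# march sequence. None of this is the actual controller's implementation.

class SramModel:
    def __init__(self, depth):
        self.mem = [None] * depth
    def write(self, addr, data):
        self.mem[addr] = data
    def read(self, addr):
        return self.mem[addr]

def sram_input_mux(test_mode, functional_req, bist_req):
    # This mux sits directly on the functional path: the root of the
    # timing problem analyzed below.
    return bist_req if test_mode else functional_req

def mbist_march(sram, depth):
    """Pass/fail for a simple march: write 0s up, read-0/write-1 up, read-1 down."""
    for a in range(depth):
        sram.write(a, 0)
    for a in range(depth):
        if sram.read(a) != 0:
            return False
        sram.write(a, 1)
    for a in reversed(range(depth)):
        if sram.read(a) != 1:
            return False
    return True

print(mbist_march(SramModel(16), 16))  # True for a fault-free memory model
```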

Figure 1: SRAM Macro and BIST Logic placement in the previous ATOM project

Looking at the actual layout of the previous ATOM project in Figure 1, you can see BIST logic (Controllers and Collars) scattered among numerous SRAM Macros. When BIST logic is located near the SRAM like this, Muxes are added to the paths where actual functional signals must travel, causing Static Timing Analysis (STA) results to deteriorate sharply.

Figure 2: SRAM Read Path and Test Logic/Mux placement issues

Figure 2 clearly shows the physical reason for this timing penalty. The spaces (Channels) between SRAM Macros are too narrow to accommodate the test logic. Consequently, the test logic and Muxes are forced to be placed somewhat far from the SRAM. To transmit signals from there to the SRAM, numerous buffers and inverter chains must inevitably be added. These ‘unnecessary tails’ cause massive timing delays.

Figure 3: STA Report analysis of the previous project

As analyzed in the actual STA Report in Figure 3, signals had to pass through a staggering 14 stages of buffers even after the test Mux. This introduced a tremendous timing penalty to the Functional Read Path and became a major bottleneck in increasing the overall system operating frequency.
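A back-of-envelope calculation shows why 14 stages matter. The per-stage delay and clock period below are assumed figures chosen only for illustration; the stage count is the only number taken from the report.

```python
# Assumed, illustrative numbers: only the 14-stage count comes from the report.
buffer_delay_ps = 25     # assumed delay per buffer stage
stages = 14              # buffer stages after the test mux (from the STA report)
clock_period_ps = 1000   # assumed 1 GHz target

chain_delay_ps = stages * buffer_delay_ps
print(chain_delay_ps)    # 350 ps: over a third of the cycle at these assumptions
```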


4. The Solution: Shared Bus-Based Design Modification and Area Optimization

To fundamentally resolve this issue, I introduced Mentor's (Siemens) Shared Bus interface. The core idea is to move the test logic forward: not right in front of the SRAM, but ahead of the Regslice (Pipeline Stage), where there is more timing margin. This is the result of applying my previous experience from Samsung CPU Hardening projects to ASIC design.
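The retiming argument behind this change can be sketched numerically. Every delay below is an assumed figure used only to show the mechanism, not a value from our timing reports:

```python
# Assumed delays (ps), purely illustrative. The point: moving the test mux
# ahead of the Regslice shifts its cost into a stage that has slack, leaving
# the SRAM-adjacent stage free of test logic.
MUX = 30           # assumed test-mux delay
CHAIN = 350        # assumed buffer-chain delay (the 14 stages of Section 3)
SRAM_ACCESS = 700  # assumed SRAM access + wire delay
PERIOD = 1000      # assumed clock period

old_critical = SRAM_ACCESS + MUX + CHAIN  # everything crammed into one stage
new_stage_pre = MUX + 100                 # mux now in a slack-rich pre-Regslice stage
new_stage_sram = SRAM_ACCESS              # SRAM stage with no test logic left

print(old_critical > PERIOD)                         # True: old path misses timing
print(max(new_stage_pre, new_stage_sram) <= PERIOD)  # True: both new stages fit
```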

Figure 4: DFT Logic distribution in the SHM area

Checking the new DFT (Scan, BIST) logic distribution in the Shared Memory area where the Shared Bus was applied, you can see almost no logic exists in the areas between the SRAM Macros. This was possible because timing-critical Muxes and buffers near the SRAM were removed. As a result, we could pack the SRAM Macros much more densely, integrating more SRAM within the same area.


5. Results and Comparative Analysis: Optimization Effects Proven by Numbers

The results of the design changes manifested dramatically during the synthesis and layout stages.

Previous Method (Scratchpad) vs. Improved Method (SHM with Shared Bus):

  • Mux Count: 27,000 -> 14,000 (approx. 48% reduction)
  • Timing: Critical due to buffer chains -> Margin secured by advancing the Regslice
  • Area Density: Logic dispersion widened SRAM gaps -> Dense SRAM placement possible (Max Capacity)
  • IR Drop/Power: IR Drop worsened by logic crowding in narrow spaces -> Power issues improved through uniform distribution

Figure 5: Physical characteristic comparison between Scratchpad and Shared Memory (the redder, the higher the density)

  • a, d: Cell Density Map
  • b, e: DFT Logic Location (Yellow: Scan, Red: BIST)
  • c, f: DVD (Dynamic Voltage Drop) Map

As shown in the figure, the Scratchpad area (without the Shared Bus) has numerous blocks of BIST logic between the SRAM Macros, resulting in abnormally high Cell Density. The logic crammed into these narrow spaces even triggers power issues (DVD drop).

In contrast, in the Shared Memory area where I applied the Shared Bus, the space between the SRAMs is very clean: the Cell Density is evenly distributed, and the DVD power issues have also been significantly mitigated.


6. Conclusion: The Value of Hardening Experience in ASIC Design

This optimization project was a prime example showing how critical architectural design is when considering physical layout and timing slack, going far beyond merely implementing DFT functions. By directly applying high-end hardening techniques used in ARM CPU design to this ASIC design, I achieved an astonishing area optimization, reducing the Mux count from 27K to 14K.

Moving forward, our team will continue to experiment with various architectural approaches that consider physical design to maximize NPU performance within a limited chip area. Thank you.