Publications & Articles

📄 [Paper] CTDM: Resource-Efficient FPGA-Accelerated Simulation of Large-Scale NPU Designs

Role: First Author | Venue: ICCAD 2025

Abstract: This paper proposes Chain-based Time-Division Multiplexing (CTDM), together with an automatic compiler, to accelerate simulation of large Neural Processing Unit (NPU) designs on FPGAs.

  • Key Innovation: CTDM replaces each repeated logic pattern with a single instance of that pattern plus register chains, which map onto the FPGA's predefined shift-register primitives. This minimizes logic overhead and routing congestion, reducing FPGA resource utilization more effectively than conventional multiplexer-based TDM.
  • Scalability & Compatibility: The automated compiler supports designs written in Verilog, VHDL, HLS, or Chisel, and targets hardware ranging from single boards to server-grade emulators such as Synopsys ZeBu. It also introduces a block interleaving technique to hide inter-FPGA link latency.
  • Results: Applied to NVIDIA's NVDLA, CTDM reduced LUT usage by 66% and FF usage by 82%, allowing the full design to fit on a single AMD U250 FPGA. Simulation ran 3,653x faster than CPU-based VCS.
  • Real-World Application: Used to verify a proprietary 4-die, 1024 TFLOPS chiplet design on 144 FPGAs of a ZeBu Server 5.
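
The idea behind the Key Innovation bullet can be illustrated with a small behavioral model. This is only a sketch under assumptions, not the paper's compiler or its generated hardware: the replicated logic is a hypothetical 8-bit accumulator (`step`), and the function names are made up for illustration. It compares n independent copies of the logic against one shared copy whose n states rotate through a chain, the software analogue of a shift register.

```python
from collections import deque

def step(state, x):
    """Hypothetical replicated logic: an 8-bit accumulator (illustrative only)."""
    return (state + x) & 0xFF

def simulate_replicated(inputs_per_cycle, n):
    """Reference model: n independent copies, each with its own state register."""
    states = [0] * n
    for inputs in inputs_per_cycle:
        states = [step(s, x) for s, x in zip(states, inputs)]
    return states

def simulate_chained(inputs_per_cycle, n):
    """One shared copy of the logic; the n states rotate through a chain."""
    chain = deque([0] * n)
    for inputs in inputs_per_cycle:
        # One original clock cycle becomes n fast ticks.
        for x in inputs:
            s = chain.popleft()       # state of the instance being served
            chain.append(step(s, x))  # updated state re-enters the chain
    # After n ticks the chain is back in instance order.
    return list(chain)

stimulus = [[1, 2, 3], [4, 5, 6]]
print(simulate_replicated(stimulus, 3))  # [5, 7, 9]
print(simulate_chained(stimulus, 3))     # [5, 7, 9]
```

In this model, selecting the active instance needs no multiplexer: the chain's head is always the next state to serve, which is the intuition for why a chain can be cheaper than multiplexer-based TDM. The cost is that each original cycle takes n ticks of the shared logic.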

📄 [Paper] A Quad-Chiplet AI SoC with Full-Chip Scalable Mesh Over 16Gb/s UCIe-Advanced Die-to-Die Interface for Large-Scale AI Inferencing

Role: Co-author | Venue: ISSCC 2026

  • Source: 2026 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA

Abstract: This paper presents a 4nm quad-chiplet LLM accelerator achieving 56.8TPS on LLaMA v3.3 70B. The architecture integrates low-latency UCIe-Advanced die-to-die interfaces, unified mixed-precision compute, and HBM3E with advanced power schemes to sustain the bandwidth and thermal stability required for large-scale AI inferencing.


📄 [Paper] ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications

Role: Co-author | Venue: ISSCC 2024

  • Source: 2024 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA

Abstract: ATOMUS is a 5nm AI accelerator optimized for latency-critical applications such as high-frequency trading and SLO-based AI services. It delivers 32TFLOPS/128TOPS with outstanding single-stream responsiveness and low TDP, enabling efficient scale-out for both edge and datacenter/cloud platforms.


📰 [Article] Technical Reliability Issues in the Student Council Mobile Voting System

Role: Reporter (Korea University Newspaper) | Investigative Report | 2015.11.23

Summary: Wrote an investigative article critiquing the mobile voting system used by the university student council. The report exposed significant security vulnerabilities and a lack of technical reliability in the system, raising concerns about potential election fraud and the integrity of the digital voting process.