Introduction to Knowledge Distillation

Knowledge Distillation (KD) is a model compression technique popularized by Hinton et al. in their 2015 paper “Distilling the Knowledge in a Neural Network”. The core idea is to transfer (or “distill”) the knowledge of a large, complex model (the “teacher”) into a smaller, simpler model (the “student”), enabling efficient deployment without a significant loss in performance. Today we will discuss the background and motivation of KD, then walk through the methodology, and finally take a closer look at the term “temperature”: we know that distillation should be done at a high temperature, so what does “temperature” represent, and how do we pick the most appropriate value?
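To make the teacher-student setup concrete, here is a minimal sketch of a distillation loss with temperature scaling, written in PyTorch. The names (`kd_loss`, `T`, `alpha`, and the random tensors) are illustrative assumptions, not code from this post; the structure follows the standard formulation from Hinton et al. (2015).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine a soft-target distillation loss with the usual hard-label loss.

    Hypothetical helper for illustration; T is the temperature, alpha weights
    the distillation term against the standard cross-entropy term.
    """
    # Soften both distributions with temperature T before comparing them.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened teacher and student outputs; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard

# Example usage with random tensors (batch of 8, 10 classes):
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
```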
