Summary:
UCLA researchers in the Department of Electrical and Computer Engineering have developed a method called the Collaborative Memory ECC Technique (COMET), which efficiently co-designs two error correcting codes (ECC) to eliminate silent data corruption (SDC) when double-bit errors occur within DRAMs.
Background:
Technological abundance has been the prime driver behind manufacturers increasing the capacity of Dynamic Random-Access Memory (DRAM) modules for electronics such as computers, phones, and cars. DRAM manufacturers have begun to adopt on-die error correcting codes (ECC) to deal with increasing error rates during usage. However, on-die ECC can miscorrect double-bit errors, turning them into triple-bit errors more than 45% of the time. These triple-bit errors then go undetected or are miscorrected by the memory controller more than 55% of the time, resulting in silent data corruption (SDC). Therefore, there is a need for a method that handles double-bit errors without silent data corruption, avoiding further damage to data stored in the DRAM of electronics.
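The miscorrection mechanism described above can be illustrated with a small sketch. It assumes a toy Hamming(7,4) single-error-correcting code rather than the much longer codes used for on-die ECC in real DRAM, and all function names are hypothetical. Because this toy code is "perfect", every double-bit error is miscorrected into a triple-bit error, so the sketch shows the mechanism only and does not reproduce the 45% figure quoted above.

```python
from itertools import combinations

N = 7  # codeword length of the toy Hamming(7,4) code (an assumption for illustration)

def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 means the word looks valid."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def sec_decode(word):
    """Single-error-correcting decoder: flip the bit the syndrome points at."""
    fixed = list(word)
    s = syndrome(fixed)
    if s != 0:
        fixed[s - 1] ^= 1
    return fixed

def hamming_distance(a, b):
    """Number of positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

def double_error_outcomes(codeword):
    """For every 2-bit error pattern, count the wrong bits left after decoding."""
    outcomes = {}
    for i, j in combinations(range(N), 2):
        corrupted = list(codeword)
        corrupted[i] ^= 1
        corrupted[j] ^= 1
        decoded = sec_decode(corrupted)
        outcomes[(i, j)] = hamming_distance(codeword, decoded)
    return outcomes

codeword = [0] * N  # the all-zero word is a valid codeword of any linear code
outcomes = double_error_outcomes(codeword)
miscorrected = sum(1 for d in outcomes.values() if d == 3)
print(f"{miscorrected} of {len(outcomes)} double-bit errors became triple-bit errors")
# prints "21 of 21 double-bit errors became triple-bit errors"
```

The decoder cannot tell a double-bit error from a single-bit error at a third position, so it flips that third bit and hands a word with three wrong bits to the next stage.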
Innovation:
To solve this issue, UCLA researchers in the Department of Electrical and Computer Engineering have developed the Collaborative Memory ECC Technique (COMET). This method efficiently co-designs the two error correcting codes (ECC), on-die and in-controller, to guarantee no silent data corruption when a double-bit error occurs within the dynamic random-access memory (DRAM). Moreover, the method enables collaboration between the on-die and in-controller ECC decoders to correct most double-bit errors without adding any redundancy bits to either of the two codes. COMET eliminated all double-bit-error-induced silent data corruption and corrected ~99.9997% of double-bit errors with negligible data, power, and performance impact. COMET can be implemented by manufacturers to improve the overall reliability of DRAM components and ultimately reduce data storage and transmission errors.
Potential Applications:
• Gaming designs
• Computer capacity and performance
• Phone memory performance
• Programming design and memory performance
Advantages:
• Elimination of silent data corruption caused by double-bit errors
• Correction of ~99.9997% of double-bit errors
• No additional redundancy bits in either code
• Negligible power and performance impact
• High performance
Development to Date:
First description of the complete invention
Related Papers:
M. Patel, J. S. Kim, T. Shahroodi, H. Hassan, and O. Mutlu, “Bit-exact ECC recovery (BEER): Determining DRAM on-die ECC functions by exploiting DRAM data retention characteristics,” 2020.
P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing on-die error detection information for strong memory reliability,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 341–353.