From 8e32ecc0a77082f1e232a3e6d12e2f163f9667a4 Mon Sep 17 00:00:00 2001
From: Matthew Dillon
Date: Sun, 25 Dec 2011 13:47:39 -0800
Subject: [PATCH] kernel - Add workaround support for a probable AMD cpu bug
 related to cc1

* Add supporting inlines and a #define.  See the followup commit to the
  gcc-4.4 code in the DFly codebase.

* This bit of code is used to add a single NOP instruction just prior to
  the pop/ret sequence in cc1's fill_sons_in_loop(), which works around
  what we believe to be a very difficult to reproduce AMD cpu bug.  The
  bug appears to be present on contemporary AMD cpus and was replicated
  on a Phenom(tm) II X4 820 Processor (Origin = "AuthenticAMD"
  Id = 0x100f42 Stepping = 2) and on a 12-core AMD Opteron(tm) Processor
  6168 (Origin = "AuthenticAMD" Id = 0x100f91 Stepping = 1).

* The bug is extremely sensitive to %rip and %rsp values as well as to
  stack memory use patterns, and appears to cause either %rip or %rsp to
  become corrupt during the multi-register-pop/ret sequence at the end of
  fill_sons_in_loop() in the GCC 4.4.7 codebase.  This procedure is
  called as part of a deep tree recursion which exercises both the AMD
  RAS (Return Address Stack) hardware circuitry and probably also the
  write combining circuitry.

* I have so far only been able to reproduce the bug on DragonFly but have,
  to the best of my ability, eliminated the OS as a possible source of
  the problem over the last few months.  I am currently attempting to
  reproduce the bug running FreeBSD on the same hardware, but it's
  virtually impossible to replicate the exact environment without adding
  DragonFly binary emulation to FreeBSD (which I just might have to do to
  truly verify that the bug is not a DragonFly OS bug).

* Bug reproducibility: DragonFly utilizes a 0-1023 (~16 byte aligned)
  random stack gap.  Under normal buildworld -j 25 or similar conditions
  it can take anywhere up to 2 days to cause a failure.
  Using a fixed stack gap of 904 (sysctl kern.stackgap_random=-904) on a
  particular cc1 command line during the compilation of gcc-4.4 using
  gcc-4.4 (compiling gcc/mcf.c), with a carefully constructed environment
  and command path (to replicate a precise starting %rsp for main() of
  0x7fffffffe818), I was able to replicate the bug in around a 60-second
  time frame, with approximately one out of every 16 compiles hitting the
  bug and failing.

* Changing the stackgap and/or modifying the code in any way (e.g.
  causing a shift in the %rip values) changes the characteristics of the
  bug, sometimes causing it to stop appearing entirely.  It was found that
  adjusting the stackgap in 32768-byte increments, starting at the gap
  known to fail, also reproduces the bug with the same consistency as the
  original stackgap value.

* Only the fill_sons_in_loop() function in cc1, in a few particular
  cases, appears to be able to trigger the bug, across all the compiles
  we've done over a year.
---
 sys/cpu/i386/include/cpufunc.h   | 32 ++++++++++++++++++++++++++++++++
 sys/cpu/x86_64/include/cpufunc.h | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/sys/cpu/i386/include/cpufunc.h b/sys/cpu/i386/include/cpufunc.h
index 31ba2b5c75..d51d4746bb 100644
--- a/sys/cpu/i386/include/cpufunc.h
+++ b/sys/cpu/i386/include/cpufunc.h
@@ -251,6 +251,38 @@ cpu_ccfence(void)
 	__asm __volatile("" : : : "memory");
 }
 
+/*
+ * This is a horrible, horrible hack that might have to be put at the
+ * end of certain procedures (on a case by case basis), just before it
+ * returns to avoid what we believe to be an unreported AMD cpu bug.
+ * Found to occur on both a Phenom II X4 820 (two of them), as well
+ * as a 48-core system built around an Opteron 6168 (Id = 0x100f91
+ * Stepping = 1).  The problem does not appear to occur w/Intel cpus.
+ *
+ * The bug is likely related to either a write combining issue or the
+ * Return Address Stack (RAS) hardware cache.
+ *
+ * In particular, we had to do this for GCC's fill_sons_in_loop() routine
+ * which due to its deep recursion and stack flow appears to be able to
+ * tickle the amd cpu bug (w/ gcc-4.4.7).  Adding a single 'nop' to the
+ * end of the routine just before it returns works around the bug.
+ *
+ * The bug appears to be extremely sensitive to %rip and %rsp values, to
+ * the point where even just inserting an instruction in an unrelated
+ * procedure (shifting the entire code base being run) affects the outcome.
+ * DragonFly is probably able to more readily reproduce the bug due to
+ * the stackgap randomization code.  We would expect OpenBSD (where we got
+ * the stackgap randomization code from) to also be able to reproduce the
+ * issue.  To date we have only reproduced the issue in DragonFly.
+ */
+#define __AMDCPUBUG_DFLY01_AVAILABLE__
+
+static __inline void
+cpu_amdcpubug_dfly01(void)
+{
+	__asm __volatile("nop" : : : "memory");
+}
+
 #ifdef _KERNEL
 
 #define HAVE_INLINE_FFS
diff --git a/sys/cpu/x86_64/include/cpufunc.h b/sys/cpu/x86_64/include/cpufunc.h
index 7b979795a2..2e2af76927 100644
--- a/sys/cpu/x86_64/include/cpufunc.h
+++ b/sys/cpu/x86_64/include/cpufunc.h
@@ -226,6 +226,38 @@ cpu_ccfence(void)
 	__asm __volatile("" : : : "memory");
 }
 
+/*
+ * This is a horrible, horrible hack that might have to be put at the
+ * end of certain procedures (on a case by case basis), just before it
+ * returns to avoid what we believe to be an unreported AMD cpu bug.
+ * Found to occur on both a Phenom II X4 820 (two of them), as well
+ * as a 48-core system built around an Opteron 6168 (Id = 0x100f91
+ * Stepping = 1).  The problem does not appear to occur w/Intel cpus.
+ *
+ * The bug is likely related to either a write combining issue or the
+ * Return Address Stack (RAS) hardware cache.
+ *
+ * In particular, we had to do this for GCC's fill_sons_in_loop() routine
+ * which due to its deep recursion and stack flow appears to be able to
+ * tickle the amd cpu bug (w/ gcc-4.4.7).  Adding a single 'nop' to the
+ * end of the routine just before it returns works around the bug.
+ *
+ * The bug appears to be extremely sensitive to %rip and %rsp values, to
+ * the point where even just inserting an instruction in an unrelated
+ * procedure (shifting the entire code base being run) affects the outcome.
+ * DragonFly is probably able to more readily reproduce the bug due to
+ * the stackgap randomization code.  We would expect OpenBSD (where we got
+ * the stackgap randomization code from) to also be able to reproduce the
+ * issue.  To date we have only reproduced the issue in DragonFly.
+ */
+#define __AMDCPUBUG_DFLY01_AVAILABLE__
+
+static __inline void
+cpu_amdcpubug_dfly01(void)
+{
+	__asm __volatile("nop" : : : "memory");
+}
+
 #ifdef _KERNEL
 
 #define HAVE_INLINE_FFS
-- 
2.41.0
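A minimal sketch of how a consumer such as cc1 might apply the workaround, assuming the definitions this patch adds are in scope. The recursive walk_tree() below is a hypothetical stand-in for fill_sons_in_loop() (it is not GCC code). The inline is copied from the patch only so the example compiles on its own; on DragonFly it would come from the patched cpufunc.h instead. Guarding the call with #ifdef __AMDCPUBUG_DFLY01_AVAILABLE__ lets the same source build unchanged against headers that lack the hack.

```c
/*
 * Copy of the definitions this patch adds to cpufunc.h, pasted here only
 * so the sketch builds standalone.  If the real header has already been
 * included, the macro is defined and this copy is skipped.
 */
#ifndef __AMDCPUBUG_DFLY01_AVAILABLE__
#define __AMDCPUBUG_DFLY01_AVAILABLE__

static __inline void
cpu_amdcpubug_dfly01(void)
{
	__asm __volatile("nop" : : : "memory");
}
#endif

/*
 * Hypothetical deeply recursive routine standing in for cc1's
 * fill_sons_in_loop().  Returns the number of calls made with
 * depth > 0, which is 2^depth - 1 for a full binary recursion.
 */
static int
walk_tree(int depth)
{
	int count = 0;

	if (depth > 0)
		count = 1 + walk_tree(depth - 1) + walk_tree(depth - 1);
#ifdef __AMDCPUBUG_DFLY01_AVAILABLE__
	/* Workaround: emit a single nop in the source just before returning. */
	cpu_amdcpubug_dfly01();
#endif
	return count;
}
```

The compiler decides where the nop actually lands relative to the epilogue's register pops and ret. Disassembling the object (e.g. with objdump -d) is the way to confirm that it sits immediately before the return sequence.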