x86, mce: Xeon75xx specific interface to get corrected memory error information
commit
c773f70fd6b53ee646727f871833e53649907264 upstream (linux-2.6-tip)
Xeon 75xx doesn't log physical addresses on corrected machine check
events in the standard architectural MSRs. Instead the address has to
be retrieved in a model specific way. This makes it impossible to do
predictive failure analysis.
Implement cpu model specific code to do this in mce-xeon75xx.c using a
new hook that is called from the generic poll code. The code retrieves
the physical address/DIMM of the last corrected error from the
platform and makes the address look like a standard architectural MCA
address for further processing.
In addition the DIMM information is retrieved and put into two new
aux0/aux1 fields in struct mce. These fields are specific to a given
CPU. These fields can then be decoded by mcelog into specific DIMM
information. The latest mcelog version has support for this.
Longer term this will be likely in a different output format, but
short term that seemed like the least intrusive solution. Older mcelog
can deal with an extended record.
There's no code to print this information on a panic because this only
works for corrected errors, and corrected errors do not usually result
in panics.
The act of retrieving the DIMM/PA information can take some time, so
this code has a rate limit to avoid taking too much CPU time on a
error flood.
The whole thing can be loaded as a module and has suitable PCI-IDs so
that it can be auto-loaded by a distribution. The code also checks
explicitely for the expected CPU model number to make sure this code
doesn't run anywhere else.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <
20100121221711.GA8242@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>