Thursday, March 12, 2009

looking for data biasing in 2008 blue sky data

Last year, a report by consultant Steven Q. Andrews highlighted apparent data biasing in Beijing's API data, particularly in the years 2006-2007. My take on his report (from last October) is here.

One of Mr. Andrews' core findings is that there were statistical anomalies in the frequency of Beijing's reported pollutant concentrations around the "Blue Sky Day" cut-off point. Specifically, there were too many reported values just below the cut-off and too few just above it, suggesting data manipulation to meet targets for the number of Blue Sky Days. For reference, here is the excellent Figure 2 from Mr. Andrews' report:

Mr. Andrews' graph shows frequency of reported values vs. PM10 concentration, for which the Blue Sky Day cut-off value is 150 ug/m3.

With 2008 behind us, I decided to take a look back to see if a similar phenomenon existed last year. After querying Beijing's API data from MEP's datacenter, I sorted each year's daily values into bins 5 API points wide and counted the days in each bin. In other words, I counted the number of days with API from 0 to 5, 6 to 10, 11 to 15, etc., all the way to 500. Rather than use PM10 concentration, I looked directly at API, for which the Blue Sky Day cut-off is 100. The results for 2006, 2007, and 2008 are graphed here:
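For anyone who wants to repeat the binning, here is a minimal sketch in Python. The function names and the sample values are made up for illustration; the only thing taken from the post is the bin layout (0 to 5, 6 to 10, 11 to 15, and so on).

```python
import math
from collections import Counter

def bin_label(api, width=5):
    """Return the (low, high) bin for one daily API value.

    Bins follow the layout in the post: 0-5, 6-10, 11-15, ...
    so a value of exactly 100 lands in (96, 100).
    """
    upper = max(width, math.ceil(api / width) * width)
    lower = 0 if upper == width else upper - width + 1
    return (lower, upper)

def api_frequencies(daily_api, width=5):
    """Count how many days fall into each bin."""
    return Counter(bin_label(v, width) for v in daily_api)

# Hypothetical daily values, NOT real Beijing data.
sample = [98, 100, 96, 101, 87, 100, 112]
freq = api_frequencies(sample)

for (low, high), days in sorted(freq.items()):
    print(f"{low}-{high}: {days}")
```

Keying the counts by the (low, high) pair makes it easy to look up the two bins on either side of the cut-off, `freq[(96, 100)]` and `freq[(101, 105)]`.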

Notes and Conclusions:

1) As expected, the 2006-2007 data biasing identified by Mr. Andrews is clearly visible here. In 2006, there were 50 days with API from 96-100, but only 2 days with API from 101-105. In 2007, there were 56 days with API from 96-100, and only 5 days with API from 101-105.

2) As for 2008, to be honest, I'm not sure how to interpret the data. Although there is clearly no dramatic spike in the frequency of reported API values just below 100 (a good sign), there are only 3 days with API from 101-105. I do not know enough about statistics to know whether or not this is significant. (For reference, there were 16 days in the range 96-100, 9 days in the range 106-110, and 12 days in the range 111-115.) Anyone have any insights?
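One rough way to put a number on the 2008 dip, offered as a sketch rather than a proper test: assume the count in each bin behaves like a Poisson variable, and assume the "expected" count for 101-105 is the average of its two neighbors (16 and 9 days). Both assumptions are mine, and the second is shaky, since the 96-100 bin could itself be inflated. The code then asks how likely it would be to see 3 or fewer days by chance.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

# 2008 counts quoted in the post.
days_96_100 = 16
days_101_105 = 3
days_106_110 = 9

# Assumption: expected count is the mean of the two adjacent bins.
expected = (days_96_100 + days_106_110) / 2

p = poisson_cdf(days_101_105, expected)
print(f"expected about {expected} days, observed {days_101_105}, P(X <= 3) = {p:.4f}")
```

Under these assumptions the probability comes out well below 1%, which would suggest the dip is unlikely to be pure chance. A more careful analysis would fit a smooth distribution to the whole year's data instead of leaning on two neighboring bins.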

ebalkan said...


Afraid I'm ill-prepared to comment without having the data in front of me to look at the distribution and run some statistical tests. If you want to send me the data xls I can try to run it, using Stata.

Can you tell me what API measures?


needigest said...

Never mind (re: API). Wiki informed me of this measurement unit with Chinese characteristics.

Anonymous said...

I'd guess they are taking the days from just above 100 and spreading them more evenly below 100. A natural strategy given the uncovering of the previous strategy by Mr. Andrews.

