“If you don’t like the weather, just wait 15 minutes” – Anonymous
Every afternoon, each National Weather Service (NWS) office issues temperature forecasts for the next seven days. One of the products issued for cities within each office's area of responsibility is the Point Forecast Matrix (PFM). These are auto-generated forecasts, drawn from the Digital Forecast Database, for cities with long-term weather stations, which are usually sited at airports. The forecast is interpolated from grids generated locally at each NWS office. Point forecasts are generated for approximately 2,000 cities around the U.S.
The portion of the point forecast we focus on is the minimum and maximum temperatures. For the overnight hours (7 p.m. to 8 a.m.), the point forecast gives a low temperature forecast; for the 7 a.m. to 7 p.m. period, it gives a high temperature forecast (the time spans differ for Alaska). An important caveat: the low temperature does not always fall in the overnight hours, and the high temperature does not always occur during the daytime hours. More on this momentarily.
For each of the 2,000 stations, we need to determine whether climate data are available. Since the point forecast is specific to a portion of the day, a simple daily min/max temperature is not sufficient. Instead, we look at the hourly observations and identify the lowest hourly temperature between 7 p.m. and 8 a.m., and then the highest hourly temperature between 7 a.m. and 7 p.m. This matches the time periods of the auto-generated PFM forecast. Astute readers know that temperatures jump up and down between the hourly observations; unfortunately, that level of detail requires analysis of 1-minute data and must wait for another study.
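To make the matching concrete, here is a minimal sketch of the window extraction, assuming the hourly observations live in a pandas DataFrame indexed by local standard time with a hypothetical temp_f column (the names are illustrative, not from any particular dataset):

```python
import pandas as pd

def pfm_observed_extremes(hourly: pd.DataFrame, date: pd.Timestamp):
    """Pull the observed low and high that match the PFM windows.

    `hourly` is assumed to have a DatetimeIndex in local standard time
    and a 'temp_f' column; `date` is local midnight of the forecast day.
    """
    # Overnight window: 7 p.m. the prior evening through 8 a.m.
    obs_low = hourly.loc[date - pd.Timedelta(hours=5):
                         date + pd.Timedelta(hours=8), "temp_f"].min()
    # Daytime window: 7 a.m. through 7 p.m.
    obs_high = hourly.loc[date + pd.Timedelta(hours=7):
                          date + pd.Timedelta(hours=19), "temp_f"].max()
    return obs_low, obs_high
```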
Once the forecast and the observed temperatures are matched up, we can evaluate how well the forecast compares with the observed temperature. At this point, a significant issue arises. A forecast that is off by 5°F in Dallas, TX, is a much bigger deal than a forecast that is off by 5°F in Fargo, ND, because Dallas has lower daily temperature variability than Fargo. To accommodate these regional variations, the difference between the forecast and the observed temperature is divided by the daily normal standard deviation published by the National Centers for Environmental Information (NCEI). That 5°F forecast error for Dallas, if it occurred on May 1st, is 0.80 standard deviations (a z-score) from the observed temperature. For Fargo, the same 5°F forecast error is only 0.55 standard deviations from the observed temperature. The Fargo forecast therefore earns a lower error score than the Dallas forecast (0.55 std. dev. versus 0.80 std. dev.).
Special consideration is required for locations with low daily variation. For example, a forecast for Honolulu that is off by 2°F scores very poorly because the daily standard deviation there is so low (2.2°F on May 1). Therefore, for all stations, if the forecast was within 3°F of the observed temperature, it is assigned a perfect score of 0.0. [Special thanks to Rick Thoman for this suggestion.]
Formula: score = |observed − forecast| / daily std. dev. (if |observed − forecast| ≤ 3°F, the score is set to 0.0)
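In code, the scoring rule is a few lines. The sketch below back-calculates the May 1 standard deviations implied by the 0.80 and 0.55 scores quoted above (roughly 6.25°F for Dallas and 9.1°F for Fargo); the observed/forecast pairs are made up for illustration:

```python
def forecast_score(observed: float, forecast: float, daily_std: float) -> float:
    """Normalized forecast error: |observed - forecast| / daily std. dev.

    A forecast within 3°F of the observation counts as perfect (0.0).
    """
    error = abs(observed - forecast)
    if error <= 3.0:
        return 0.0
    return error / daily_std

# A 5°F miss at each city, using the standard deviations implied above.
print(f"{forecast_score(75, 80, 6.25):.2f}")  # 0.80 (Dallas, May 1)
print(f"{forecast_score(55, 60, 9.1):.2f}")   # 0.55 (Fargo, May 1)
# Honolulu's 2°F miss falls inside the 3°F window, so it scores 0.0.
print(forecast_score(80, 82, 2.2))            # 0.0
```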
The number of stations that have a) an auto-generated point forecast, b) hourly climate data, and c) published daily standard deviations during the 2017 study period is 689. This is a sufficient sample to draw spatial conclusions.
Why call a forecast within ±3°F a perfect score? It turns out that if you just compute the average normalized forecast error, the worst scores are associated with the stations that have the lowest variability; in fact, Hawai’i stations sit at the bottom of the list. Why? Recall that a 2°F forecast “bust” at a station with a 2°F standard deviation means the forecast was off by a full standard deviation. By that measure, it is a “worse” forecast than Fairbanks, AK, being off by 12°F during March!
When using the raw error computation, a full 45% of the station-to-station variation in forecast error is explained by a place's internal temperature variability. If you call a forecast within ±3°F a perfect forecast, that share drops to under 1%! This allows us to better describe which areas are inherently more difficult to forecast. It also lets us treat Hawai’i the same as Kansas.
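The article does not spell out how the 45% figure was computed; one plausible reading is a squared correlation between per-station variability and per-station mean error, which would look something like this sketch (the function and array names are mine):

```python
import numpy as np

def share_explained(station_std: np.ndarray, mean_error: np.ndarray) -> float:
    """Share of station-to-station variation in average forecast error
    explained by each station's own temperature variability (r squared)."""
    r = np.corrcoef(station_std, mean_error)[0, 1]
    return r ** 2

# Per the article: ~0.45 with raw errors, under 0.01 once forecasts
# within ±3°F are scored as perfect before averaging.
```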
My working hypothesis is that NWS forecasters across the country have the same skill level. Forecasters generally have the same set of tools, follow the same procedures, and frequently move from office to office around the country. It is unlikely that forecasters in one city are all very good while forecasters in another city are all very bad. Of course, forecasters are not robots, and they bring varying levels of expertise to their jobs. On balance, though, each office is assumed to be equally capable.
An implication of this hypothesis is that differences across the country reflect variability in the difficulty of forecasting temperatures – not variability in forecaster ability. Humans are naturally competitive, though. People in offices that score poorly will probably not view this as favorably as people in offices that score well. Remember, this is an exercise in showing the spatial variability of temperature forecasting difficulty, not forecaster ability.
Each Point Forecast Matrix product contains a 7-day forecast. The map animation below shows the average forecast error (with the ±3°F perfect-score rule applied) for each of the seven forecast days at all 689 locations between January 1 and December 31, 2017.
Before looking at the big picture, we should note that some stations appear out of sync with neighboring stations. This is more than likely an issue with the hourly temperature reporting (e.g., Des Moines, IA). A future study will look at station observation issues.
The stations on each map are divided into quintiles; that is, the fifth of stations with the lowest forecast errors is shown in cyan, the fifth with the highest forecast errors is shown in red, and so on.
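The binning itself is one call in pandas. The station IDs and score values below are hypothetical, and only the cyan (best fifth) and red (worst fifth) labels come from the maps; the middle labels are placeholders:

```python
import pandas as pd

# Hypothetical mean scores keyed by station ID.
scores = pd.Series({"KDFW": 0.21, "KFAR": 0.48, "KHNL": 0.05,
                    "KFAI": 0.63, "KDSM": 0.37})
# Five equal-count bins, lowest errors first.
quintiles = pd.qcut(scores, q=5, labels=["cyan", "q2", "q3", "q4", "red"])
print(quintiles)
```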
Areas from the Gulf Coast to the Ohio River Valley score well in both the short range and the long range (cyan), indicating a high level of forecast predictability at all forecast time scales. North of the Ohio River, forecast predictability gradually diminishes with latitude. Areas west of 100°W longitude (including Alaska) do poorly at all time scales – as does the Northeast. Also, note that there is not a lot of geographic shifting from Day 1 to Day 7.
In most cases, the forecast error appears to correlate with average annual dew point: high dew point regions score well for forecast predictability, and low dew point areas do not. The exception to this rule is the Northeast. The scatter plot below shows the relationship between average annual dew point and the average daily forecast score (with the ±3°F rule applied) for forecast Day 7. The relationship at Day 7 clearly demonstrates that high dew point regions have better forecast outcomes.
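Quantifying that relationship is a one-line correlation, assuming per-station arrays of annual mean dew point and Day-7 mean score (the names are mine):

```python
import numpy as np

def dew_point_correlation(annual_dew_point: np.ndarray,
                          day7_score: np.ndarray) -> float:
    """Pearson correlation between station-average annual dew point and
    station-average Day-7 forecast score."""
    return float(np.corrcoef(annual_dew_point, day7_score)[0, 1])

# A strongly negative value would match the pattern described above:
# higher dew points go with lower (better) forecast scores.
```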
The low dew point areas of the West also have significant local topographic considerations. Mountains and valleys present unique challenges with regard to cloud cover, cold air drainage, and many other phenomena. The combined effect of these factors makes for challenging temperature forecasts.
When looking at the 4-, 5-, 6-, and 7-day forecasts, you should have confidence that the forecasters know what they are doing. I do. But in regions with a lot of red dots on the maps shown above, don’t be surprised to see a forecast “bust” fairly regularly. The atmospheric system is complex, and that complexity is harder to tame in some areas.