3 min read

Weather-Transaction Correlation

Weather can have a significant impact on user behavior. Many businesses think about the impact a hot summer day can have on their conversion rates. But overlapping effects from seasonality and weather require us to provide exact, session-based weather data. One approach is to extend the web analytics implementation and use a third-party weather service to fetch the information while the user is still on the site. However, these third-party services are not free, it requires us to get the user's location, and it also is an additional load for the site.

Wouldn't it be great if we could just use a public data set and some query magic to get per-session weather data historically and with no additional tracking? Guess what: We can! I prepared a little query that I want to share. But before you expect some full-on statistical analysis, be told that I am no data scientist. I'll get you the data, but someone with more stats skills has to get to the real insights.

The weather data we are using is part of our beloved bigquery-public-data project. The National Oceanic and Atmospheric Administration (NOAA) makes a dataset available in this project that goes back to 1929 and is updated daily! The weather data is coming from 9000+ weather stations globally.

The GSOD dataset has one table with all the stations that ever recorded a weather measurement and annual tables with the individual measurements. The stations table has the latitude and longitude information that we'll need later. To find the closest station for every user session, we have to cross join all stations and tables and calculate the geo distance. Then we can join the measurement of that day on to the session and station.

The cross join here is the queries Achilles heel. If we are not filtering the stations to a reasonable amount, the cross join creates a gigantic intermediate result. For me, it worked very well just filtering to a specific country for both the sessions and the stations.

WITH station AS (

SELECT
  wban,
  usaf,
  lon,
  lat
FROM
  `bigquery-public-data.noaa_gsod.stations`

WHERE
  lon IS NOT NULL
  AND lat IS NOT NULL
  AND country = "GM"

), data AS (

SELECT
  session.fullvisitorid,
  session.visitId,
  session.totals.totalTransactionRevenue totalTransactionRevenue,
  (measure.temp - 32) * (5/9) temp_cel,
  ST_DISTANCE(
    ST_GEOGPOINT(
      SAFE_CAST(session.geoNetwork.longitude  AS FLOAT64),
      SAFE_CAST(session.geoNetwork.latitude AS FLOAT64)
    ),
    ST_GEOGPOINT(station.lon, station.lat)
  ) dist
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_2017*` session,
  station
JOIN
  `bigquery-public-data.noaa_gsod.gsod2017` measure
ON
  station.wban = measure.wban
  AND station.usaf = measure.stn
  AND session.geoNetwork.country = "Germany"
  AND CONCAT(measure.year, measure.mo, measure.da) = session.date

)

SELECT
  fullvisitorId,
  visitId,
  totalTransactionRevenue,
  ARRAY_AGG(temp_cel ORDER BY dist LIMIT 1)[SAFE_OFFSET(0)] temp_cel
FROM
  data
GROUP BY
  1,2,3

Let's go through this and explain what is going on.

  • 26-37: Prepare the station data. The NOAA is using some unusual two-letter country codes. Make sure to adapt it to your country of analysis.
  • 54-55: This is the mentioned cross join. Don't let the implicit comma-cross-join confuse you.
  • 56-62: We are using a regular join to bring in the measurements and run a couple of filters.
  • 45: The temperature data is in Fahrenheit, but I prefer Celcius.
  • 46-52: The ST_DISTANCE functions returns the distance between points on earth in meters. The Google Analytics schema represents the latitude and longitude as strings, se have to get a little more verbose here when creating the ST_GEOGPOINTs.
  • 70: To avoid a correlated query, I had to get a little creative. The use of ARRAY_AGG allows us to order by dist but return the temp_cel value. We limit to one to get an array with only one member and use [SAFE_OFFSET(0)] to pull out the one member.

And there we have it: One row per session with the session and user identifier, the temperature in Celcius, and the total revenue. I'm sure you can come up with some ideas on how to use this?