Tristan Lalor
Preface
I was tasked with creating a system for assessing temperature by zip code, for every zip code in the US, based on the last several years of temperature patterns. The approach I determined to be most efficient was a database that lets the user query average temperatures for a given region over a given timeframe. Based on the availability of data, I designed a schema for the database; the rest of this write-up walks through how I built and populated it.
The first step of this process was finding data to use. After much research into the best sources for US temperature data, I ended up using bulk NOAA datasets from the Global Historical Climatology Network Daily (GHCN-Daily). The data was only accessible as hundreds of separate million-line CSVs containing all types of weather data from all around the world.
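The raw files arrive in roughly the following shape. They carry no header row, so the column names in this sketch are my own labels, inferred partly from how the fields are used in the processing scripts later in this write-up and partly from the published GHCN-Daily documentation; treat them as an assumption rather than official field names.
import pandas as pd

# Assumed layout of one of the raw yearly CSVs; the files themselves carry no
# header, so these column names are illustrative labels only.
ghcn_columns = [
    "ID",          # station identifier; the first two characters are the country code
    "DATE",        # observation date as YYYYMMDD
    "ELEMENT",     # observation type, e.g. TMAX, TMIN, TOBS, TAVG, PRCP
    "DATA_VALUE",  # observed value for that element
    "M_FLAG",      # measurement flag
    "Q_FLAG",      # quality flag
    "S_FLAG",      # source flag
    "OBS_TIME",    # observation time, frequently blank
]

sample = pd.read_csv("Files/2018/Raw/xaa.csv", header=None, names=ghcn_columns, dtype=str)
print(sample.head())
Python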
Not ideal, but it would work.
The next thing I had to do was find which zip code corresponded to each weather station. I was able to access a separate file containing the coordinates of each weather station, which I could then cross-reference with the coordinates of each zip code. As it turns out, there is not a weather station in every zip code, so I had to write a Python algorithm to match each zip code to its closest weather station.
Pairing Algorithm
import os.path
import math as m
import numpy as np
import pandas as pd
from progressbar import ProgressBar

pbar = ProgressBar()

def readcsv(filename):
    data = pd.read_csv(filename, dtype=str)
    return np.array(data)

# Obtain tables
Stationslist = readcsv("Datatables/Active_Weather_Stations.csv")
Zip_Codes = readcsv("Datatables/Zip_Codes_Full.csv")

# Establish a stations dictionary with values of [lat, long] lists
station_location_dict = dict()
for row in Stationslist:
    station_location_dict[row[0]] = [row[1], row[2]]

# Establish a zip codes dictionary with values of [lat, long] lists
zip_code_location_dict = dict()
for row in Zip_Codes:
    zip_code_location_dict[row[0]] = [row[3], row[4]]

# Pair EACH zip code with the station id that minimizes the coordinate
# discrepancy sqrt((actual_lat - trial_lat)**2 + (actual_lon - trial_lon)**2)
for zip_code in pbar(list(zip_code_location_dict.keys())):
    discrepancy = 1000000
    closest_station = 'null'
    actual_lat = float(zip_code_location_dict.get(zip_code)[0])
    actual_lon = float(zip_code_location_dict.get(zip_code)[1])
    for station_id in station_location_dict.keys():
        trial_lat = float(station_location_dict.get(station_id)[0])
        trial_lon = float(station_location_dict.get(station_id)[1])
        trial_discrepancy = m.sqrt((actual_lat - trial_lat)**2 + (actual_lon - trial_lon)**2)
        if trial_discrepancy < discrepancy:
            discrepancy = trial_discrepancy
            closest_station = station_id
    new_data = zip_code_location_dict.get(zip_code)
    new_data.append(closest_station)
    zip_code_location_dict[zip_code] = new_data

# Append the matched station id as a new column on the zip code table
new_column = []
for row in Zip_Codes:
    new_column.append(zip_code_location_dict.get(row[0])[2])
print(len(set(new_column)))  # number of distinct stations actually paired
new_column = np.array(new_column).reshape(-1, 1)
Zip_Codes = np.hstack((Zip_Codes, new_column))
Headers = np.array(['Zip_Code', 'City', 'State', 'Latitude', 'Longitude', 'Timezone', 'Daylight_Savings_Time_Flag', 'Station_Id'])
Zip_Codes = np.vstack((Headers, Zip_Codes))
np.savetxt("Datatables/Full_Zip_Codes_w_FKeys.csv", Zip_Codes.astype(str), fmt='%s', delimiter=",")
Python
Each zip code was paired with its nearest weather station by finding the minimum distance between the coordinates of the zip code and the coordinates of the weather station.
The weather stations being paired to the zip codes are a selection of only the active stations for which we have data.
In order to find optimal values in situations like this, it is common to use a discrepancy tolerance (2 degrees, for instance) and return the first value that falls within the tolerance. This method is useful because it works quickly and will not pair any values that are outside the set tolerance. For our purposes, however, I found the absolute minimum on each iteration, which, while more CPU intensive, yields more accurate results and ensures that every value has a pair.
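For comparison, here is a minimal sketch of that tolerance-based shortcut, assuming the same station dictionary built in the script above; the 2-degree tolerance and the early return are illustrative, not the approach I actually ran.
import math as m

def first_station_within_tolerance(actual_lat, actual_lon, station_location_dict, tolerance=2.0):
    # Return the first station whose coordinate discrepancy falls within the
    # tolerance (in degrees); return None if no station qualifies.
    for station_id, (lat, lon) in station_location_dict.items():
        discrepancy = m.sqrt((actual_lat - float(lat))**2 + (actual_lon - float(lon))**2)
        if discrepancy <= tolerance:
            return station_id
    return None
Python
The early return is what makes this fast, and the None case is exactly why I preferred the exhaustive minimum: every zip code is guaranteed a pairing.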
For each of the 43,192 zip codes, the program ran through all active weather stations and paired the zip code with the station that minimizes coordinate discrepancy.
Discrepancy is calculated as:
discrepancy = √((actual_lat − trial_lat)² + (actual_lon − trial_lon)²)
where the actuals are the coordinates of the zip code and the trials are the coordinates of the trial weather station.
Therefore, these roughly 7,895 daily data points are extrapolated to all 43,192 zip codes when outer joined in SQL, with each weather station representing 5.47 zip codes on average.
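As a quick sanity check of that ratio, the paired zip code table can be grouped by station; this is a pandas sketch of the idea, while the actual fan-out happened in SQL after loading the tables into SSMS.
import pandas as pd

# Zip code table written by the pairing script above, with Station_Id as a foreign key.
zip_codes = pd.read_csv("Datatables/Full_Zip_Codes_w_FKeys.csv", dtype=str)

# Each station's daily readings stand in for every zip code mapped to it,
# so counting zip codes per station shows how far each reading is fanned out.
zips_per_station = zip_codes.groupby("Station_Id")["Zip_Code"].count()
print(zips_per_station.mean())  # average number of zip codes each paired station represents
Python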
Now to take a closer look at how I dealt with the mass of unorganized weather data. The Python script below shows how I collected the data from each file, filtered out irrelevant station IDs (non-US), extracted only the relevant fields (temperature max, min, and average), calculated week numbers from the dates, and restructured the data to be normalized and consistent with the schema I designed.
Data Structuring
from progressbar import ProgressBar
import os.path
import pickle
import datetime
import numpy as np
import pandas as pd
from string import ascii_uppercase
from string import ascii_lowercase

pbar = ProgressBar()

def readcsv(filename):
    data = pd.read_csv(filename)
    return np.array(data)

def generate_filenames():
    # The raw NOAA data arrived split into files named a.csv ... z.csv,
    # XAA.csv ... XBZ.csv, and xaa.csv ... xbz.csv for each year.
    ups = []
    for letter in ascii_uppercase:
        ups.append("X" + "A" + letter)
    for letter in ascii_uppercase:
        ups.append("X" + "B" + letter)
    lows = []
    for letter in ascii_lowercase:
        lows.append("x" + "a" + letter)
    for letter in ascii_lowercase:
        lows.append("x" + "b" + letter)
    other = []
    for letter in ascii_lowercase:
        other.append(letter)
    return other + ups + lows

# Concatenate the first five columns of every raw file for 2015-2018 into one master array
y = generate_filenames()
count = 0
for year in (2015, 2016, 2017, 2018):
    for xxx in y:
        filename = "Files/" + str(year) + "/Raw/" + xxx + ".csv"
        if os.path.isfile(filename):
            Secondary_Array = readcsv(filename)
            if count == 0:
                Master_Array = Secondary_Array[:, :5]
            else:
                Master_Array = np.concatenate((Master_Array, Secondary_Array[:, :5]))
            count += 1

# Save progress
with open("Master_Array.txt", "wb") as fp:  # Pickling
    pickle.dump(Master_Array, fp)
with open("Master_Array.txt", "rb") as fp:  # Unpickling
    Master_Array = pickle.load(fp)

# The first two characters of each station id are the country code;
# append them as a new column and keep only US stations
new_column = Master_Array[:, 0]
array_form = []
for i in new_column:
    array_form.append([i[:2]])
array_form = np.asarray(array_form)
Master_Array = np.append(Master_Array, array_form, axis=1)
US_Array = np.empty((1, 6))
US_Array = np.vstack((US_Array, Master_Array[Master_Array[:, 5] == "US"]))  # this is the most powerful line of code ever written

# Keep only temperature elements (TMAX, TMIN, TOBS, TAVG all start with "T")
Temperature_column = US_Array[:, 2]
array_form = []
for i in Temperature_column:
    array_form.append([str(i)[:1]])
array_form = np.asarray(array_form)
US_Array = np.append(US_Array, array_form, axis=1)
SORTED = np.empty((1, 7))
SORTED = np.vstack((SORTED, US_Array[US_Array[:, 6] == "T"]))

# Save progress
np.savetxt("Datatables/Temperature_Table.csv", SORTED.astype(str), fmt='%s', delimiter=",")
Temperature_Table = readcsv("Datatables/Temperature_Table.csv")

# First run-through: one entry per (station, date) pair
tempdict = dict()
for row in Temperature_Table[1:, :]:
    tempdict[(row[0], row[1])] = [None, None, None, None]
# Second run-through: fill in each temperature element for its (station, date) pair
for row in Temperature_Table[1:, :]:
    if row[2] == 'TMAX':
        tempdict[(row[0], row[1])][0] = row[3]
    elif row[2] == 'TMIN':
        tempdict[(row[0], row[1])][1] = row[3]
    elif row[2] == 'TOBS':
        tempdict[(row[0], row[1])][2] = row[3]
    elif row[2] == 'TAVG':
        tempdict[(row[0], row[1])][3] = row[3]

# Calculate an average temperature from TMAX and TMIN where both exist
for key in tempdict.keys():
    if tempdict[key][0] is not None and tempdict[key][1] is not None:
        tempdict[key].append((float(tempdict[key][0]) + float(tempdict[key][1])) / 2)
    else:
        tempdict[key].append(None)

# Flatten the dictionary into one normalized row per (station, date)
Temperature_Table_Normal = [['Station Id', 'Date', 'TMAX', 'TMIN', 'TOBS', 'TAVG - Reported', 'TAVG - Calculated']]
for key in pbar(list(tempdict.keys())):
    Temperature_Table_Normal.append([key[0], key[1], tempdict[key][0], tempdict[key][1], tempdict[key][2], tempdict[key][3], tempdict[key][4]])

# Derive the ISO week number from each YYYYMMDD date and append it as a column
new_column = ['Week']
for row in Temperature_Table_Normal:
    if row[0] != 'Station Id':
        week = datetime.date(int(float(str(row[1])[:4])), int(float(str(row[1])[4:6])), int(float(str(row[1])[6:]))).isocalendar()[1]
        new_column.append(week)
Temperature_Table_Normal = np.array(Temperature_Table_Normal)
new_column = np.array(new_column).reshape(-1, 1)
Temperature_Table_Normal = np.hstack((Temperature_Table_Normal, new_column))
np.savetxt("Datatables/Temperature_Table_Normal(From a List).csv", Temperature_Table_Normal.astype(str), fmt='%s', delimiter=",")
Python
There are a total of 8,519,230 unique temperature data points across 1,079 days, with an average of 7,895 data points per day. After completion of all the scripts, the data was output into 3 final data tables to be loaded into SSMS.
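To close the loop on the goal from the preface, here is a hedged sketch of the kind of query the final tables support, written against the CSV outputs with pandas rather than against SSMS; the state and week range are just example parameters.
import pandas as pd

# Final tables produced by the scripts above (a pandas stand-in for the SQL query).
zip_codes = pd.read_csv("Datatables/Full_Zip_Codes_w_FKeys.csv", dtype=str)
temps = pd.read_csv("Datatables/Temperature_Table_Normal(From a List).csv")

# Example: average calculated temperature across Massachusetts zip codes in weeks 1-4.
joined = zip_codes.merge(temps, left_on="Station_Id", right_on="Station Id")
subset = joined[(joined["State"] == "MA") & (joined["Week"].astype(int).between(1, 4))]
print(pd.to_numeric(subset["TAVG - Calculated"], errors="coerce").mean())
Python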
This project took a long time from start to finish, but I'm glad I could walk away with a finished and usable product. Reach out if you have any questions!
Tristan