SQL Projects
Design and Creation of a Database of National Weather Data
Tristan Lalor
Preface
I was tasked with creating a system for assessing temperature by zip code for every zip code in the US, based on the last several years of temperature patterns. The most efficient approach I found was to build a database that lets the user query average temperatures for given regions over a given timeframe. Based on the availability of data, this is the schema I designed for the database:
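As a rough T-SQL sketch of that design (the column names follow the CSV headers that appear later in this writeup; the data types and table names are illustrative, not the actual database objects):

-- Sketch only: names and types are illustrative
CREATE TABLE Weather_Stations (
    Station_Id  VARCHAR(11) NOT NULL PRIMARY KEY,
    Latitude    FLOAT,
    Longitude   FLOAT
);

CREATE TABLE Zip_Codes (
    Zip_Code    CHAR(5) NOT NULL PRIMARY KEY,
    City        VARCHAR(50),
    State       CHAR(2),
    Latitude    FLOAT,
    Longitude   FLOAT,
    Timezone    INT,
    Daylight_Savings_Time_Flag  CHAR(1),
    Station_Id  VARCHAR(11) REFERENCES Weather_Stations(Station_Id)
);

CREATE TABLE Temperatures (
    Station_Id       VARCHAR(11) NOT NULL REFERENCES Weather_Stations(Station_Id),
    [Date]           DATE NOT NULL,
    TMAX             FLOAT,
    TMIN             FLOAT,
    TOBS             FLOAT,
    TAVG_Reported    FLOAT,
    TAVG_Calculated  FLOAT,
    [Week]           INT,
    PRIMARY KEY (Station_Id, [Date])
);
SQL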
The first step of this process was finding data to use. After much research into the best sources for US temperature data, I ended up using bulk NOAA datasets from the Global Historical Climatology Network - Daily (GHCN-Daily). The data was only accessible as hundreds of separate million-line CSVs containing every type of weather observation from around the world, in the following format:
Not ideal, but it would work.
The next thing I had to do was find which zip code corresponded with each weather station. I was able to access a separate file containing the coordinates of each weather station, which I could then cross-reference with the coordinates of each zip code. As it turns out, not every zip code has a weather station, so I wrote a Python algorithm to match each zip code to its closest weather station.
Pairing Algorithm
from progressbar import ProgressBar
import os.path
import pandas as pd
import numpy as np
import math as m

pbar = ProgressBar()

def readcsv(filename):
    data = pd.read_csv(filename, dtype=str)
    return np.array(data)

# Obtain tables
Stationslist = readcsv("Datatables/Active_Weather_Stations.csv")
Zip_Codes = readcsv("Datatables/Zip_Codes_Full.csv")

# Establish a stations dictionary with values of [lat, long] lists
station_location_dict = dict()
for row in Stationslist:
    station_location_dict[row[0]] = [row[1], row[2]]

# Establish a zip codes dictionary with values of [lat, long] lists
zip_code_location_dict = dict()
for row in Zip_Codes:
    zip_code_location_dict[row[0]] = [row[3], row[4]]

# Pair EACH zip code with the station id that minimizes the Euclidean distance
# sqrt((actual_lat - trial_lat)^2 + (actual_lon - trial_lon)^2)
for zip_code in pbar(list(zip_code_location_dict.keys())):
    discrepancy = 1000000
    closest_station = 'null'
    actual_lat = float(zip_code_location_dict.get(zip_code)[0])
    actual_lon = float(zip_code_location_dict.get(zip_code)[1])
    for station_id in station_location_dict.keys():
        trial_lat = float(station_location_dict.get(station_id)[0])
        trial_lon = float(station_location_dict.get(station_id)[1])
        trial_discrepancy = m.sqrt((actual_lat - trial_lat)**2 + (actual_lon - trial_lon)**2)
        if trial_discrepancy < discrepancy:
            discrepancy = trial_discrepancy
            closest_station = station_id
    new_data = zip_code_location_dict.get(zip_code)
    new_data.append(closest_station)
    zip_code_location_dict[zip_code] = new_data

# Append the matched station id as a new column on the zip code table
new_column = []
for row in Zip_Codes:
    new_column.append(zip_code_location_dict.get(row[0])[2])
print(len(set(new_column)))  # number of distinct stations actually matched
new_column = np.array(new_column).reshape(-1, 1)
Zip_Codes = np.hstack((Zip_Codes, new_column))

Headers = np.array(['Zip_Code', 'City', 'State', 'Latitude', 'Longitude',
                    'Timezone', 'Daylight_Savings_Time_Flag', 'Station_Id'])
Zip_Codes = np.vstack((Headers, Zip_Codes))
np.savetxt("Datatables/Full_Zip_Codes_w_FKeys.csv", Zip_Codes.astype(str), fmt='%s', delimiter=",")
Python
Each zip code was paired with its nearest weather station by finding the minimum distance between the coordinates of the zip code and the coordinates of the weather station.
The weather stations being paired to the zip codes are a selection of only the active stations for which we have data.
In situations like this, a common way to find a workable match is to set a discrepancy tolerance (2 degrees, for instance) and return the first value within that tolerance. This method is fast and will never pair values that fall outside the tolerance. For our purposes, however, I found the absolute minimum for each zip code, which, while more CPU-intensive, yields more accurate results and guarantees that every zip code gets a pair.
For each of the 43,192 zip codes, the program ran through all active weather stations and paired the zip code with the station that minimizes coordinate discrepancy.
Discrepancy is calculated as:

discrepancy = sqrt((actual_lat - trial_lat)^2 + (actual_lon - trial_lon)^2)

where the actuals are the coordinates of the zip code and the trials are the coordinates of the candidate weather station.
Therefore, these 7,895 or so daily data points are extrapolated to all of the 43,192 zip codes when outer joined with SQL, with each weather station representing 5.47 zip codes on average.
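In T-SQL terms, that outer join looks roughly like the following sketch (table names follow the schema sketch in the preface; the real object names may differ):

-- Sketch: every zip code picks up the daily readings of its paired station,
-- so one station's rows fan out to several zip codes
SELECT z.Zip_Code,
       t.[Date],
       t.TAVG_Calculated
FROM Zip_Codes AS z
LEFT OUTER JOIN Temperatures AS t
    ON t.Station_Id = z.Station_Id;
SQL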
Now for a closer look at how I dealt with the mass of unorganized weather data. The attached Python script shows how I collected the data from each file, filtered out irrelevant (non-US) station IDs, extracted only the relevant fields (temperature max, min, and average), calculated week numbers from the dates, and restructured the data into a normalized form consistent with the schema I designed.
Data Structuring
from progressbar import ProgressBar
import os.path
import pickle
import datetime
import numpy as np
import pandas as pd
from string import ascii_uppercase, ascii_lowercase

pbar = ProgressBar()

def readcsv(filename):
    data = pd.read_csv(filename)
    return np.array(data)

# Build the list of file name stems used for the split raw downloads:
# a-z, XAA-XBZ, and xaa-xbz
def generate_filenames():
    ups = []
    for letter in ascii_uppercase:
        ups.append("X" + "A" + letter)
    for letter in ascii_uppercase:
        ups.append("X" + "B" + letter)
    lows = []
    for letter in ascii_lowercase:
        lows.append("x" + "a" + letter)
    for letter in ascii_lowercase:
        lows.append("x" + "b" + letter)
    other = []
    for letter in ascii_lowercase:
        other.append(letter)
    return other + ups + lows

y = generate_filenames()

# Concatenate the first five columns of every raw file into one master array
count = 0
for year in (2015, 2016, 2017, 2018):
    for xxx in y:
        filename = "Files/" + str(year) + "/Raw/" + xxx + ".csv"
        if os.path.isfile(filename):
            Secondary_Array = readcsv(filename)
            if count == 0:
                Master_Array = Secondary_Array[:, :5]
            else:
                Master_Array = np.concatenate((Master_Array, Secondary_Array[:, :5]))
            count += 1

# Save progress
with open("Master_Array.txt", "wb") as fp:  # Pickling
    pickle.dump(Master_Array, fp)
with open("Master_Array.txt", "rb") as fp:  # Unpickling
    Master_Array = pickle.load(fp)

# Keep only US stations (station ids beginning with "US")
new_column = Master_Array[:, 0]
array_form = []
for i in new_column:
    array_form.append([i[:2]])
array_form = np.asarray(array_form)
Master_Array = np.append(Master_Array, array_form, axis=1)
US_Array = np.empty((1, 6))
US_Array = np.vstack((US_Array, Master_Array[Master_Array[:, 5] == "US"]))  # this is the most powerful line of code ever written

# Keep only temperature elements (element codes beginning with "T")
Temperature_column = US_Array[:, 2]
array_form = []
for i in Temperature_column:
    array_form.append([str(i)[:1]])
array_form = np.asarray(array_form)
US_Array = np.append(US_Array, array_form, axis=1)
SORTED = np.empty((1, 7))
SORTED = np.vstack((SORTED, US_Array[US_Array[:, 6] == "T"]))

# Save progress
np.savetxt("Datatables/Temperature_Table.csv", SORTED.astype(str), fmt='%s', delimiter=",")
Temperature_Table = readcsv("Datatables/Temperature_Table.csv")

# First run-through: one entry per (station id, date) pair
tempdict = dict()
for row in Temperature_Table[1:, :]:
    tempdict[(row[0], row[1])] = [None, None, None, None]

# Second run-through: pivot TMAX / TMIN / TOBS / TAVG into columns
for row in Temperature_Table[1:, :]:
    if row[2] == 'TMAX':
        tempdict[(row[0], row[1])][0] = row[3]
    elif row[2] == 'TMIN':
        tempdict[(row[0], row[1])][1] = row[3]
    elif row[2] == 'TOBS':
        tempdict[(row[0], row[1])][2] = row[3]
    elif row[2] == 'TAVG':
        tempdict[(row[0], row[1])][3] = row[3]

# Calculate an average wherever both TMAX and TMIN are available
for key in tempdict.keys():
    if tempdict[key][0] is not None and tempdict[key][1] is not None:
        tempdict[key].append((float(tempdict[key][0]) + float(tempdict[key][1])) / 2)
    else:
        tempdict[key].append(None)

Temperature_Table_Normal = [['Station Id', 'Date', 'TMAX', 'TMIN', 'TOBS',
                             'TAVG - Reported', 'TAVG - Calculated']]
for key in pbar(list(tempdict.keys())):
    Temperature_Table_Normal.append([key[0], key[1], tempdict[key][0], tempdict[key][1],
                                     tempdict[key][2], tempdict[key][3], tempdict[key][4]])

# Add an ISO week number column derived from the YYYYMMDD date
new_column = ['Week']
for row in Temperature_Table_Normal:
    if row[0] != 'Station Id':
        week = datetime.date(int(float(str(row[1])[:4])),
                             int(float(str(row[1])[4:6])),
                             int(float(str(row[1])[6:]))).isocalendar()[1]
        new_column.append(week)
Temperature_Table_Normal = np.array(Temperature_Table_Normal)
new_column = np.array(new_column).reshape(-1, 1)
Temperature_Table_Normal = np.hstack((Temperature_Table_Normal, new_column))
np.savetxt("Datatables/Temperature_Table_Normal(From a List).csv",
           Temperature_Table_Normal.astype(str), fmt='%s', delimiter=",")
Python
There are a total of 8,519,230 unique temperature data points across 1,079 days, an average of 7,895 data points per day. After all of the scripts completed, the data was output into three final data tables to be loaded into SQL Server Management Studio (SSMS).
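As an illustration of the end goal, a query against the finished database might look something like the following sketch (again using the illustrative table names from the preface; the state and week range here are arbitrary):

-- Sketch: average temperature for one state over a range of weeks
SELECT z.State,
       AVG(t.TAVG_Calculated) AS Avg_Temperature
FROM Zip_Codes AS z
JOIN Temperatures AS t
    ON t.Station_Id = z.Station_Id
WHERE z.State = 'MA'
  AND t.[Week] BETWEEN 23 AND 35
GROUP BY z.State;
SQL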
This project took a long time from start to finish but I'm glad I could walk away with a finished and usable product. Reach out if you have any questions!
Tristan