Calculate average and normalize GPS data in Python

I have a dataset in json with gps coordinates:
"utc_date_and_time":"2021-06-05 13:54:34", # timestamp
"hdg":"018.0", # heading
"sog":"000.0", # speed
"lat":"5905.3262N", # latitude
"lon":"00554.2433E" # longitude
This data will be imported into a database, with one entry every second for every "vessel".
As you can imagine this is a huge amount of data that provides a level of accuracy I do not need.
My goal:
Create a new entry in the database for every X seconds
If I set X to 60 (a minute) and 10 entries are missing within that period, the remaining 50 entries should be used. Data can be missing for certain periods, and I do not want this to create bogus positions.
Use timestamp from last entry in period.
Use the heading (hdg) that is appearing the most times within this period.
Calculate average speed within this period.
Latitude and longitude could use the last entry, but I have seen "spikes" that need to be filtered out; alternatively, use an average and discard values that differ too much.
My script is now pushing all the data to the database via a for loop with different data-checks inside it, and this is working.
I am new to python and still learning every day through reading and youtube videos, but it would be great if anyone could point me in the right direction for how to achieve the above goal.
As of now the data is imported into a dictionary. And I am wondering if creating a dictionary where the timestamp is the key is the way to go, but I am a little lost.
Code:
import os
import json
from pathlib import Path
from datetime import datetime, timedelta, date

def generator(data):
    for entry in data:
        yield entry

data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
gps_count = len(data)
start_time = None
new_gps = list()
tempdata = list()
seconds = 60
i = 0

for entry in generator(data):
    i = i + 1
    if start_time == None:
        start_time = datetime.fromisoformat(entry['utc_date_and_time'])
    # TODO: Filter out values with too much deviation
    tempdata.append(entry)
    elapsed = (datetime.fromisoformat(entry['utc_date_and_time']) - start_time).total_seconds()
    if (elapsed >= seconds) or (i == gps_count):
        # TODO: Calculate average values etc. instead of using last
        new_gps.append(tempdata)
        tempdata = []
        start_time = None

print("GPS count before:" + str(gps_count))
print("GPS count after:" + str(len(new_gps)))
Output:
GPS count before:1186
GPS count after:20
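For the two TODOs, one possible direction is a small helper that turns each completed tempdata bucket into a single summary entry. This is only a sketch, assuming the field layout shown above; the spike filter compares the numeric part of the lat/lon strings (minus the hemisphere letter) against the window median, and the 0.5 threshold is an arbitrary illustrative choice:

from collections import Counter
from statistics import mean, median

def summarize_bucket(bucket):
    # bucket is the list of entries collected within one X-second window (tempdata)
    headings = [e['hdg'] for e in bucket]
    speeds = [float(e['sog']) for e in bucket]

    # crude spike filter on the numeric part of the lat/lon strings
    lats = [float(e['lat'][:-1]) for e in bucket]   # strip trailing N/S
    lons = [float(e['lon'][:-1]) for e in bucket]   # strip trailing E/W
    lat_med, lon_med = median(lats), median(lons)
    keep = [e for e, la, lo in zip(bucket, lats, lons)
            if abs(la - lat_med) < 0.5 and abs(lo - lon_med) < 0.5]
    chosen = (keep or bucket)[-1]                   # last non-spiky position, falling back to the last entry

    return {
        'utc_date_and_time': bucket[-1]['utc_date_and_time'],  # timestamp from the last entry in the period
        'hdg': Counter(headings).most_common(1)[0][0],         # heading appearing the most times
        'sog': round(mean(speeds), 1),                         # average speed within the period
        'lat': chosen['lat'],
        'lon': chosen['lon'],
    }

With something like that in place, the loop above could call new_gps.append(summarize_bucket(tempdata)) instead of appending the raw list. If you still prefer a dictionary keyed by time, truncating the timestamp string to the minute (entry['utc_date_and_time'][:16]) would give one key per minute to group on.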

Related

Efficient time series sliding window function

I am trying to create a sliding window for a time series. So far I have a function that I managed to get working that lets you take a given series, set a window size in seconds and then create a rolling sample. My issue is that it is taking very long to run and seems like an inefficient approach.
# ========== create dataset =========================== #
import numpy as np
import pandas as pd
from datetime import timedelta, datetime

timestamp_list = ["2022-02-07 11:38:08.625",
                  "2022-02-07 11:38:09.676",
                  "2022-02-07 11:38:10.084",
                  "2022-02-07 11:38:10.10000",
                  "2022-02-07 11:38:11.2320"]

bid_price_list = [1.14338,
                  1.14341,
                  1.14340,
                  1.1434334,
                  1.1534334]

df = pd.DataFrame.from_dict(zip(timestamp_list, bid_price_list))
df.columns = ['timestamp', 'value']
# make datetime objects
df.timestamp = [datetime.strptime(time_i, "%Y-%m-%d %H:%M:%S.%f") for time_i in df.timestamp]
df.head(3)
                timestamp    value    timestamp_to_sec
0 2022-02-07 11:38:08.625  1.14338 2022-02-07 11:38:08
1 2022-02-07 11:38:09.676  1.14341 2022-02-07 11:38:09
2 2022-02-07 11:38:10.084  1.14340 2022-02-07 11:38:10
# ========== create rolling time-series function ====== #
# get the floor of time (second value)
df["timestamp_to_sec"] = df["timestamp"].dt.floor('s')
# set rolling window length in seconds
window_dt = pd.Timedelta(seconds=2)
# containers for rolling sample statistics
n_list = []
mean_list = []
std_list = []
# add dt (window) seconds to the original time which was floored to the second
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"] + window_dt
# get unique end times
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)
# remove end times that are greater than the last actual time, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]
# loop running the sliding window (time_i is the end time of each window)
for time_i in time_unique_endlist:
    # start time of each rolling window
    start_time = time_i - window_dt
    # sample for each time period of the sliding window
    rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]
    # calculate the sample statistics
    n_list.append(len(rolling_sample))        # store n observation count
    mean_list.append(rolling_sample.mean())   # store rolling sample mean
    std_list.append(rolling_sample.std())     # store rolling sample standard deviation
    # plot histogram for each sample of the rolling sample
    # plt.hist(rolling_sample.value, bins=10)
# tested and n_list brought back the correct values
>>> n_list
[2, 3]
Is there a more efficient way of doing this, a way I could improve my implementation, or an open-source package that allows me to run a rolling window like this? I know pandas has .rolling(), but that rolls on the values. I want something I can use on unevenly spaced data, using time to define the fixed rolling window.
This seems to give the best performance; hope it helps anyone else.
# set rolling window length in seconds
window_dt = pd.Timedelta(seconds=2)
# add dt seconds to the original timestep
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"] + window_dt
# unique end times
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)
# remove end values that are greater than the last actual value, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]
# containers for rolling sample statistics
mydic = {}
counter = 0
# loop running the rolling window
for time_i in time_unique_endlist:
    start_time = time_i - window_dt
    # sample for each time period of the sliding window
    rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]
    # calculate the sample statistics
    mydic[counter] = {
        "sample_size": len(rolling_sample),
        "sample_mean": rolling_sample["value"].mean(),
        "sample_std": rolling_sample["value"].std()
    }
    counter = counter + 1
# results in a DataFrame
results = pd.DataFrame.from_dict(mydic).T
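For what it's worth, pandas' .rolling() also accepts a time offset when the frame has a DatetimeIndex, which works directly on unevenly spaced timestamps. A minimal sketch on the df built above (note the windows here are anchored to each row's own timestamp rather than to the floored whole seconds, so the grouping is slightly different from the loop above):

# time-based rolling window: each row's window covers the preceding 2 seconds
rolled = (df.set_index('timestamp')
            .sort_index()['value']
            .rolling('2s')
            .agg(['count', 'mean', 'std']))
print(rolled)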

How do I create a Mean Annual Rainfall table for various durations from a NetCDF4 using Python?

I have downloaded a NetCDF4 file of total hourly precipitation across Sierra Leone from 1974 to Present, and have started to create a code to analyze it.
I'm trying to form a table in Python that will display my average annual rainfall for different rainfall durations, rather like this one below:
I'm wondering if anyone has done anything similar to this before and could possibly help me out as I'm very new to programming?
Here is the script I've written so far that records the hourly data for each year. From here I need to find a way to store this information onto a table, then to change the duration to say, 2 hours, and repeat until I have a complete table:
import glob
import numpy as np
from netCDF4 import Dataset
import pandas as pd
import xarray as xr

all_years = []

for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[11:16]
    all_years.append(year)

year_start = '01-01-1979'
year_end = '31-12-2021'
date_range = pd.date_range(start = str(year_start),
                           end = str(year_end),
                           freq = 'H')
df = pd.DataFrame(0.0, columns = ['tp'], index = date_range)

lat_freetown = 8.4657
lon_freetown = 13.2317

all_years.sort()

for yr in range(1979, 2021):
    data = Dataset('era5_year' + str(yr) + '.nc', 'r')
    lat = data.variables['latitude'][:]
    lon = data.variables['longitude'][:]
    sq_diff_lat = (lat - lat_freetown)**2
    sq_diff_lon = (lon - lon_freetown)**2
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    tp = data.variables['tp']
    start = str(yr) + '-01-01'
    end = str(yr) + '-12-31'
    d_range = pd.date_range(start = start,
                            end = end,
                            freq = 'H')
    for t_index in np.arange(0, len(d_range)):
        print('Recording the value for: ' + str(d_range[t_index]) + str(tp[t_index, min_index_lat, min_index_lon]))
        df.loc[d_range[t_index]]['tp'] = tp[t_index, min_index_lat, min_index_lon]
I gave this a try, I hope it helps.
I downloaded two years of coarse US precip data here:
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2000.nc
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2001.nc
import xarray as xr
import pandas as pd

# Read two datasets and append them so there are multiple years of hourly data
precip_full1 = xr.open_dataset('precip.hour.2000.nc') * 25.4
precip_full2 = xr.open_dataset('precip.hour.2001.nc') * 25.4
precip_full = xr.concat([precip_full1, precip_full2], dim='time')

# Select only the Western half of the US
precip = precip_full.where(precip_full.lon < 257, drop=True)

# Initialize output
output = []

# Select number of hours to sum
# This assumes that the data is hourly
intervals = [1, 2, 6, 12, 24]

# Loop through each desired interval
for interval in intervals:
    # Take rolling sum
    # This means the value at any time is the sum of the preceding times
    # So when interval is 6, it's the sum of the previous six values
    roll = precip.rolling(time=interval, center=False).sum()
    # Take the annual mean and average over all space
    annual = roll.groupby('time.year').mean('time').mean(['lat', 'lon'])
    # Convert output to a pandas dataframe
    # and rename the column to correspond to the interval length
    tab = annual.to_dataframe().rename(columns={'precip': str(interval)})
    # Keep track of the output by appending it to the output list
    output.append(tab)

# Combine the dataframes into one, with one column per interval
output = pd.concat(output, axis=1)
The output looks like this:
             1         2         6        12        24
year
2000  0.014972  0.029947  0.089856  0.179747  0.359576
2001  0.015610  0.031219  0.093653  0.187290  0.374229
Again, this assumes that the data is already hourly. It also averages over every (for example) 6-hour window, so it's not just 00:00-06:00, 06:00-12:00, etc., it's 00:00-06:00, 01:00-07:00, etc., and then the annual mean. If you wanted the former, you could use xarray's resample function after taking the rolling sum.
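As a rough sketch of that alternative (same precip dataset as above, assumed hourly), resampling into fixed, non-overlapping 6-hour blocks and then averaging would look something like:

# non-overlapping 6-hour totals, then the annual mean over time and space
six_hour = precip.resample(time='6h').sum()
annual_6h = six_hour.groupby('time.year').mean('time').mean(['lat', 'lon'])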

Python web scraping, only collects 80 to 90% of intended data rows. Is there something wrong with my loop?

I'm trying to collect the 150 rows of data from the text that appears at the bottom of a given Showbuzzdaily.com web page (example), but my script only collects 132 rows.
I'm new to Python. Is there something I need to add to my loop to ensure all records are collected as intended?
To troubleshoot, I created a list (program_count) to verify this is happening in the code before the CSV is generated, which shows there are only 132 items in the list, rather than 150. Interestingly, the final row (#132) ends up being duplicated at the end of the CSV for some reason.
I experience similar issues scraping Google Trends (using pytrends), where only about 80% of the data I try to scrape ends up in the CSV. So I suspect there is something wrong with my code, or that I'm overwhelming my target with requests.
Adding time.sleep(0.1) to the for and while loops in this code didn't produce different results.
import time
import requests
import datetime
from bs4 import BeautifulSoup
import pandas as pd # import pandas module
from datetime import date, timedelta

# creates empty 'records' list
records = []

start_date = date(2021, 4, 12)
orig_start_date = start_date # Used for naming the CSV
end_date = date(2021, 4, 12)
delta = timedelta(days=1) # Defines delta as +1 day

print(str(start_date) + ' to ' + str(end_date)) # Visual reassurance

# begins while loop that will continue for each daily viewership report until end_date is reached
while start_date <= end_date:
    start_weekday = start_date.strftime("%A") # define weekday name
    start_month_num = int(start_date.strftime("%m")) # define month number
    start_month_num = str(start_month_num) # convert to string so it is ready to be put into address
    start_month_day_num = int(start_date.strftime("%d")) # define day of the month
    start_month_day_num = str(start_month_day_num) # convert to string so it is ready to be put into address
    start_year = int(start_date.strftime("%Y")) # define year
    start_year = str(start_year) # convert to string so it is ready to be put into address

    # define address (URL)
    address = 'http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-'+start_weekday.lower()+'-cable-originals-network-finals-'+start_month_num+'-'+start_month_day_num+'-'+start_year+'.html'
    print(address) # print for visual reassurance

    # read the web page at the defined address (URL)
    r = requests.get(address)
    soup = BeautifulSoup(r.text, 'html.parser')

    # we're going to deal with results that appear within <td> tags
    results = soup.find_all('td')

    # reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
    date_line = results[0].text.split(": ",1)[1] # reads the text after the colon and space (': '), which is where the date information is located
    weekday_name = date_line.split(' ')[0] # stores the weekday name
    month_name = date_line.split(' ',2)[1] # stores the month name
    day_month_num = date_line.split(' ',1)[1].split(' ')[1].split(',')[0] # stores the day of the month
    year = date_line.split(', ',1)[1] # stores the year

    # concatenates and stores the full date value
    mmmmm_d_yyyy = month_name+' '+day_month_num+', '+year

    del results[:10] # deletes the first 10 results, which contained the date information and column headers

    program_count = [] # empty list for program counting

    # (within the while loop) begins a for loop that appends data for each program in a daily viewership report
    for result in results:
        rank = results[0].text # stores P18-49 rank
        program = results[1].text # stores program name
        network = results[2].text # stores network name
        start_time = results[3].text # stores program's start time
        mins = results[4].text # stores program's duration in minutes
        p18_49 = results[5].text # stores program's P18-49 rating
        p2 = results[6].text # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
        program_count.append(program) # adds each program name to the list.
        del results[:7] # deletes the first 7 results remaining, which contained the data for 1 row (1 program) which was just stored in 'records'

    print(len(program_count)) # Troubleshooting: prints to screen the number of programs counted. Should be 150.
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
    print(str(start_date)+' collected...') # Visual reassurance one page/day is finished being collected
    start_date += delta # at the end of while loop, advance one day

df = pd.DataFrame(records, columns=['Date','Weekday','P18-49 Rank','Program','Network','Start time','Mins','P18-49','P2+']) # Creates DataFrame using the columns listed
df.to_csv('showbuzz '+ str(orig_start_date) + ' to '+ str(end_date) + '.csv', index=False, encoding='utf-8') # generates the CSV file, using start and end dates in filename
It seems like you're making debugging a lot tougher on yourself by pulling all the table data (<td>) individually like that. After stepping through the code and making a couple of changes, my best guess is the bug is coming from the fact that you're deleting entries from results while iterating over it, which gets messy. As a side note, you're also never using result from the loop which would make the declaration pointless. Something like this ends up a little cleaner, and gets you your 150 results:
results = soup.find_all('tr')

# reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
date_line = results[0].select_one('td').text.split(": ", 1)[1] # Selects the first td it finds under the first tr
weekday_name = date_line.split(' ')[0]
month_name = date_line.split(' ', 2)[1]
day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]
year = date_line.split(', ', 1)[1]
mmmmm_d_yyyy = month_name + ' ' + day_month_num + ', ' + year

program_count = [] # empty list for program counting

for result in results[2:]:
    children = result.find_all('td')
    rank = children[0].text # stores P18-49 rank
    program = children[1].text # stores program name
    network = children[2].text # stores network name
    start_time = children[3].text # stores program's start time
    mins = children[4].text # stores program's duration in minutes
    p18_49 = children[5].text # stores program's P18-49 rating
    p2 = children[6].text # stores program's P2+ viewer count (in thousands)
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))
    program_count.append(program) # adds each program name to the list.
You also shouldn't need to use a second list to get the number of programs you've retrieved (appending programs to program_count). It ends up the same amount in both lists no matter what since you're appending a program name from every result. So instead of print(len(program_count)) you could've instead used print(len(records)). I'm assuming it was just for debugging purposes though.
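As a small standalone illustration (a toy list, not from the original page) of why del results[:7] inside the for loop loses rows: the loop's internal index keeps advancing while the list shrinks, so they meet before every row has been visited.

cells = list(range(56))    # pretend these are 8 rows of 7 <td> cells
rows = 0
for cell in cells:
    rows += 1
    del cells[:7]          # shrinking the list mid-iteration lets the loop's index catch up early
print(rows)                # prints 7, not 8: the last "row" is never processed

With 150 rows of 7 cells (1,050 entries), the same arithmetic stops the loop after 132 iterations, which matches the 132 rows you were seeing.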

Pandas- locate a value based on logical statements

I am using this dataset for a project.
I am trying to find the total yield for each inverter over the 34-day duration of the dataset (basically use the final and initial value available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True, drop=True)
        initial = initial.at[0, "TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0, "TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed.
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarry:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this; I'm getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
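Assuming TOTAL_YIELD is a cumulative counter (which is what makes final minus initial meaningful), max - min gives the same answer as final - initial. If you would rather take the literal first and last readings per inverter, a sorted groupby is a small variation on the above (just a sketch, assuming DATE_TIME strings like those in the question can be parsed with dayfirst=True):

# parse the timestamps so the sort is chronological rather than lexicographic
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], dayfirst=True)
g = df.sort_values('DATE_TIME').groupby('INVERTER_ID')['TOTAL_YIELD']
delta = g.last() - g.first()
print(delta)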
So if I'm understanding this right, what you want is the TOTAL_YIELD for each inverter between the beginning of the time period, 15-05-2020 02:00, and the end, 17-06-2020 23:45. Try this:
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # filter the data to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[df['DATE_TIME'] >= pd.to_datetime('15-05-2020 02:00:00')]
    inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= pd.to_datetime('17-06-2020 23:45:00')]
    inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"] == code]
    # sort by date
    inverter_df.sort_values(by='DATE_TIME', inplace=True)
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial

Python image file manipulation

Python beginner here. I am trying to make use of some data stored in a dictionary.
I have some .npy files in a folder. It is my intention to build a dictionary that encapsulates the following: the map read with np.load; the year, month, and day of the current map (as integers); the fractional time in years (assuming a 30-day month, which does not affect my calculations afterwards); the number of pixels; and the number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0':'array(from np.load)', 'year', 'month', 'day', 'fractional_time', 'pixels'
'map1':'....}
What I managed until now is the following:
import glob
import numpy as np

file_list = glob.glob('*.npy')

def only_numbers(seq): # for getting rid of '.npy' or any other non-digit characters
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

maps = {}
numbers = {}
for i in range(0, len(file_list)-1):
    maps[i] = np.load(file_list[i])
    numbers[i] = list(only_numbers(file_list[i]))
I have no idea how to get a dictionary to hold more values generated inside the for loop. I can only manage to generate a new dictionary, or a list (e.g. numbers), for every task. For the numbers dictionary, I have no idea how to manipulate the date in the YYYYMMDD format to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: Managed to split the dates and make them integers using
date = list(numbers.values())
year = int(date[i][0:4])
month = int(date[i][4:6])
day = int(date[i][6:8])
print(year, month, day)
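As a side note, a slightly sturdier way to pull those integers out of a YYYYMMDD file name is datetime.strptime (just a sketch, using the '20100620.npy' example from above):

from datetime import datetime
from pathlib import Path

stem = Path('20100620.npy').stem          # '20100620'
d = datetime.strptime(stem, '%Y%m%d')
year, month, day = d.year, d.month, d.day
print(year, month, day)                   # 2010 6 20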
If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed to do further manipulation more easily. I did the following:
file_list = glob.glob('data/...') # files named YYYYMMDD.npy
file_list.sort()

def only_numbers(seq): # make sure all non-digit characters and symbols are removed from the file name
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

numbers = {}
time = []
np_above_value = []

for i in range(0, len(file_list) - 1):
    maps = np.load(file_list[i])
    maps[np.isnan(maps)] = 0 # had some NaNs that were causing errors
    numbers[i] = only_numbers(file_list[i]) # dictionary of file names reduced to just the dates, via the function defined earlier
    date = list(numbers.values()) # the file names (only the numbers) as a list
    year = int(date[i][0:4]) # first 4 characters (YYYY) as an integer, as required
    month = int(date[i][4:6]) # next 2 characters (MM)
    day = int(date[i][6:8]) # next 2 characters (DD)
    time.append(year + ((month - 1) * 30 + day) / 360) # fractional time
    print('Total pixel count for map ' + str(i) + ':', maps.size) # total number of pixels for the current map
    c = (maps > value).astype(int)
    np_above_value.append(np.count_nonzero(c)) # number of pixels above value
    print('Pixels with concentration >value% for map ' + str(i) + ':', np.count_nonzero(c)) # number of pixels above value for the current map

plt.plot(time, np_above_value) # pixels with concentration above value as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)
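For completeness, here is a sketch of the single dictionary originally described, one entry per map with the array, date parts, fractional time, and pixel counts together (the threshold of 50 stands in for the unspecified cutoff, and file names are assumed to be YYYYMMDD.npy as above):

import glob
import numpy as np
from datetime import datetime
from pathlib import Path

threshold = 50  # placeholder cutoff, like the ">50%" example earlier
maps_info = {}

for path in sorted(glob.glob('*.npy')):
    arr = np.load(path)
    arr[np.isnan(arr)] = 0                           # same NaN cleanup as above
    d = datetime.strptime(Path(path).stem, '%Y%m%d')
    maps_info[Path(path).stem] = {
        'map': arr,
        'year': d.year,
        'month': d.month,
        'day': d.day,
        'fractional_time': d.year + ((d.month - 1) * 30 + d.day) / 360,  # 30-day months, as in the loop above
        'pixels': arr.size,
        'pixels_above': int(np.count_nonzero(arr > threshold)),
    }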
