Rolling Average to calculate rainfall intensity - python

I have some real rainfall data recorded as a date and time, plus the accumulated number of tips on a tipping-bucket rain gauge. Each tip represents 0.5mm of rainfall.
I want to cycle through the file and determine the variation in intensity (rainfall/time).
So I need a rolling average over multiple fixed time frames:
I want to accumulate rainfall until 5 minutes of rain is recorded, then determine the intensity in mm/hour. So if 3mm is recorded in 5 minutes, that equals 3/5*60 = 36mm/hr.
The same rainfall over 10 minutes would be 18mm/hr...
So if I have rainfall over several hours I may need to review it at several standard intervals, say 5, 10, 15, 20, 25, 30, 45, 60 minutes, etc.
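In code, the conversion I'm describing (0.5 mm per tip, with the averaging window in minutes) is just:
def intensity_mm_per_hr(tips, window_minutes):
    return tips * 0.5 / window_minutes * 60

intensity_mm_per_hr(6, 5)    # 6 tips = 3mm in 5 min  -> 36.0 mm/hr
intensity_mm_per_hr(6, 10)   # 6 tips = 3mm in 10 min -> 18.0 mm/hr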
Also, the data is recorded in reverse order in the raw file, so the earliest time is at the end of the file and the latest time step appears first, just after a header.
It looks like this (here 975 - 961 = 14 tips = 7mm of rainfall over the roughly five hours shown, an average intensity of about 1.4mm/hr).
But between 16:27 and 16:34, 967 - 961 = 6 tips = 3mm in 7 minutes = 25.71mm/hour:
7424 Figtree (O'Briens Rd)
DATE :hh:mm Accum Tips
8/11/2011 20:33 975
8/11/2011 20:14 974
8/11/2011 20:04 973
8/11/2011 20:00 972
8/11/2011 19:35 971
8/11/2011 18:29 969
8/11/2011 16:44 968
8/11/2011 16:34 967
8/11/2011 16:33 966
8/11/2011 16:32 965
8/11/2011 16:28 963
8/11/2011 16:27 962
8/11/2011 15:30 961
Any suggestions?

I am not entirely sure what it is that you have a question about.
Do you know how to read in the file? You can do something like:
data = []  # empty list of (minutes, count) tuples
# Skip the header
lines = [line.strip() for line in open('data.txt')][2:]
for line in lines:
    print(line)
    date, hour, count = line.split()
    h, m = hour.split(':')
    t = int(h) * 60 + int(m)  # compute total minutes
    data.append((t, int(count)))  # append as a tuple
data.reverse()
Since your data is cumulative, you need to subtract consecutive entries; this is where
Python's list comprehensions are really nice.
data = [(t1, d2 - d1) for ((t1, d1), (t2, d2)) in zip(data, data[1:])]
print(data)
Now we need to loop through and see how many entries are within the last x minutes.
timewindow = 10
for i, (t, count) in enumerate(data):
    # Find the entries that happened within the last `timewindow` minutes
    # (bounded on both sides so later entries don't sneak in)
    withinwindow = [x for x in data if t - timewindow < x[0] <= t]
    # now you can print any kind of stats about the "within window" entries
    print(sum(c for (_, c) in withinwindow))
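Inside that same loop, the window's tip count converts to an intensity the way the question describes (0.5 mm per tip is my assumption from the question):
    mm = sum(c for (_, c) in withinwindow) * 0.5   # tips -> millimetres
    print(mm / timewindow * 60)                    # -> mm/hr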

Since the time stamps do not come at regular intervals, you should interpolate to get the most accurate results. This will make the rolling average easier too. I'm using an Interpolate class (linked from the original answer) in the code below.
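Since the link isn't reproduced here, a minimal linear-interpolation class along those lines might look like this (a sketch, not necessarily the exact class referenced):
from bisect import bisect_right

class Interpolate:
    """Piecewise-linear interpolation over sorted x values, clamped at the ends."""
    def __init__(self, x_list, y_list):
        self.x_list, self.y_list = x_list, y_list
    def __getitem__(self, x):
        if x <= self.x_list[0]:
            return self.y_list[0]
        if x >= self.x_list[-1]:
            return self.y_list[-1]
        i = bisect_right(self.x_list, x) - 1
        x0, x1 = self.x_list[i], self.x_list[i + 1]
        y0, y1 = self.y_list[i], self.y_list[i + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)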
from time import strptime, mktime

totime = lambda x: int(mktime(strptime(x, "%d/%m/%Y %H:%M")))

with open("my_file.txt", "r") as myfile:
    # Skip the header
    for line in myfile:
        if line.startswith("DATE"):
            break
    times = []
    values = []
    for line in myfile:
        date, time, value = line.split()
        times.append(totime(" ".join((date, time))))
        values.append(int(value))
times.reverse()
values.reverse()
i = Interpolate(times, values)
Now it's just a matter of choosing your intervals and computing the difference between the endpoints of each interval. Let's create a generator function for that:
def rolling_avg(cumulative_lookup, start, stop, step_size, window_size):
    for t in range(start + window_size, stop, step_size):
        total = cumulative_lookup[t] - cumulative_lookup[t - window_size]
        yield total / window_size
Below I'm printing the number of tips per hour over the previous hour, at 10-minute intervals:
start = totime("8/11/2011 15:30")
stop = totime("8/11/2011 20:33")
for avg in rolling_avg(i, start, stop, 600, 3600):
    print(avg * 3600)
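Each tip is 0.5mm (my assumption from the question), so the same loop reports mm/hr if you also scale by 0.5:
for avg in rolling_avg(i, start, stop, 600, 3600):
    print(avg * 3600 * 0.5)  # tips/hr -> mm/hr at 0.5 mm per tip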
EDIT: Made totime return an int and created the rolling_avg generator.

Related

Working with Two Different Input Files --- example: Hourly Data and Daily Data (with different lengths)

I'm working on some code to manipulate hourly and daily data for a year and am a little confused about how to combine data from the two files. What I am doing is using the hourly pattern of Data Set B but scaling it with the daily Data Set A. In essence (using the example below) I will take the daily average from Data Set A of 93 cfs and multiply it by the 24 hrs in a day, which equals 2232. I'll then sum the hourly cfs values for all 24 hrs of each day from Data Set B, which for 1/1/2021 equals 2596. Normally manipulating a rate this way doesn't make sense, but in this case it doesn't matter because the units cancel out. I'd then divide the two values, 2232/2596 = 0.8597, and apply that factor to the hourly cfs values for all 24 hrs of each day in Data Set B to produce a new "scaled" dataset (to be Data Set C).
My problem is that I have never coded in Python using two different input datasets (I am a complete newbie). I started experimenting with the code, but I can't seem to integrate the two datasets. If anyone can point me in the direction of how to integrate two separate input files, I'd be most appreciative. Beneath the datasets is my attempt at the code (please note the reverse order of the code: it works first with the hourly data, Data Set B, and then the daily data, Data Set A). My printout of the final scaling factor (SF) only gives me one value, not all 8,760, because the print happens outside the loops... but how can I be inside the loops over both input files at the same time???
Data Set A (Daily) -- 365 lines of data:
1/1/2021 93 cfs
1/2/2021 0 cfs
1/3/2021 70 cfs
1/4/2021 70 cfs
Data Set B (Hourly) -- 8,760 lines of data:
1/1/2021 0:00 150 cfs
1/1/2021 1:00 0 cfs
1/1/2021 2:00 255 cfs
(where summation of all 24 hrs of 1/1/2021 = 2596 cfs)
etc.
Sorry if this is a ridiculously easy question... I am very new to coding.
Here is the code that I've written so far... what I need is 8,760 lines of SF that I can then use to multiply with the original Data Set B. The final product, Data Set C, will be Date - Time - rescaled hourly data. I actually have to do this for three pumping units in total, to give me a matrix of 5 columns by 8,760 rows, but I think I'll be able to figure the unit part out. My problem now is how to integrate the two data sets. Thank you for reading!
print('Solving the Temperature Model programming problem')
fhand1 = open('Interpolate_CY21_short.txt')
fhand2 = open('WSE_Daily_CY21_short.txt')
# Hourly Interpolated Pardee PowerHouse Data
for line1 in fhand1:
    line1 = line1.rstrip()
    words1 = line1.split()
    # Hourly interpolated data - parsed down (cfs)
    x = float(words1[7])
    if x < 100:
        x = 0
    # print(x)
# WSE Daily Average PowerHouse Data
for line2 in fhand2:
    line2 = line2.rstrip()
    words2 = line2.split()
    # Daily cfs average x 24 hrs
    aa = float(words2[2]) * 24
    # print(aa)
SF = x * aa
print(SF)
This is how you would get the data into two lists:
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
daily_average = fhand1.readlines()
daily = fhand2.readlines()
# This is roughly what the two lists would look like;
# each line is a separate string:
daily_average = ["1/1/2021 93 cfs", "1/2/2021 0 cfs"]
daily = ["1/1/2021 0:00 150 cfs", "1/1/2021 1:00 0 cfs", "1/2/2021 1:00 0 cfs"]
Then, to process the lists, you could use a doubly nested for loop:
for average_line in daily_average:
    average_line = average_line.rstrip()
    average_date, average_count, average_symbol = average_line.split()
    for daily_line in daily:
        daily_line = daily_line.rstrip()
        date, hour, count, symbol = daily_line.split()
        if average_date == date:
            print(f"date={date}, average_count={average_count} count={count}")
Or a dictionary
# populate the data into dictionaries
daily_average_data = dict()
for line in daily_average:
    line = line.rstrip()
    day, count, symbol = line.split()
    daily_average_data[day] = (day, count, symbol)

daily_data = dict()
for line in daily:
    line = line.rstrip()
    day, hour, count, symbol = line.split()
    if day not in daily_data:
        daily_data[day] = list()
    daily_data[day].append((day, hour, count, symbol))
# now you can access daily_average_data and daily_data as
# dictionaries instead of files

# process the data
result = list()
for date in daily_data.keys():
    print(date)
    print(daily_average_data[date])
    print(daily_data[date])
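From there, the per-day scale factor the question describes is a small extension. A sketch (it assumes the dictionary layout above; a guard against a zero hourly sum is omitted):
scaled = dict()
for date, hourly_rows in daily_data.items():
    day_avg = float(daily_average_data[date][1])                # daily cfs from Data Set A
    hourly_sum = sum(float(count) for (_, _, count, _) in hourly_rows)
    sf = day_avg * 24 / hourly_sum                              # e.g. 2232 / 2596 = 0.8597
    scaled[date] = [(hour, float(count) * sf) for (_, hour, count, _) in hourly_rows]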
If the data items corresponded with one another line by line, you could use https://realpython.com/python-zip-function/
here is an example:
for data1, data2 in zip(daily_average, daily):
    print(f"{data1} {data2}")
Similar to what @oasispolo described, the solution is to make a single loop and process both lists in it. I'm personally not fond of the "zip" function. (It's a purely stylistic objection; lots of other people like it, and that's fine.)
Here's a solution with syntax that I find more intuitive:
print('Solving the Temperature Model programming problem')

fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')

# Convert each file into a list of lines. You're doing this
# implicitly, but I like to be explicit about it.
lines1 = fhand1.readlines()
lines2 = fhand2.readlines()

if len(lines1) != len(lines2):
    raise ValueError("The two files have different lengths!")

# Initialize an output array. You could also construct it
# one item at a time, but that can be slow for large arrays.
# It is more efficient to initialize the entire array at
# once if possible.
sf_list = [0] * len(lines1)

for position in range(len(lines1)):
    # range(L) generates numbers 0...L-1
    line1 = lines1[position].rstrip()
    words1 = line1.split()
    x = float(words1[7])
    if x < 100:
        x = 0
    line2 = lines2[position].rstrip()
    words2 = line2.split()
    aa = float(words2[2]) * 24
    sf_list[position] = x * aa

print(sf_list)

Efficient way to count/sum rows in dataframe based on conditions

I'm working with a large flight delay dataset, trying to predict flight delays from several new features. Based on a plane's tail number, I want to count the number of flights and sum the total airtime the plane has accumulated in the past X (to be specified) hours/days, to create a new "usage" variable.
Example of the data:
ID tail_num deptimestamp dep_delay distance air_time
2018-11-13-1659_UA2379 N14118 13/11/2018 16:59 -3 2425 334
2018-11-09-180_UA275 N13138 09/11/2018 18:00 -3 2454 326
2018-06-04-1420_9E3289 N304PQ 04/06/2018 14:20 -2 866 119
2018-09-29-1355_WN3583 N8557Q 29/09/2018 13:55 -5 762 108
2018-05-03-815_DL2324 N817DN 03/05/2018 08:15 0 1069 138
2018-01-12-1850_NK347 N635NK 12/01/2018 18:50 100 563 95
2018-09-16-1340_OO4721 N242SY 16/09/2018 13:40 -3 335 61
2018-06-06-1458_DL2935 N351NB 06/06/2018 14:58 1 187 34
2018-06-25-1030_B61 N965JB 25/06/2018 10:30 48 1069 143
2018-12-06-1215_MQ3617 N812AE 06/12/2018 12:15 -9 427 76
Example output for give = 'all' (not based on example data):
2018-12-31-2240_B61443 (1, 152.0, 1076.0, 18.0)
I've written a function, applied to each row, that filters the dataframe for flights with the same tail number within the specified time frame, and then gives back either the number of flights / total airtime or a dataframe containing the flights in question. It works, but it takes a long time (around 3 hours for a subset of 400k flights, filtering against the entire dataset of over 7m rows each time). Is there a way to speed this up?
import datetime
import numpy as np

def flightsbefore(ID,
                  give='number',
                  direction='before',
                  seconds=0,
                  minutes=0,
                  hours=0,
                  days=0,
                  weeks=0,
                  months=0,   # note: months/years are accepted but never used below;
                  years=0):   # datetime.timedelta has no month/year units
    """Takes the ID of a flight and a time window and returns the flights of that plane within that window."""
    tail_num = dfallcities.loc[ID, 'tail_num']
    date = dfallcities.loc[ID].deptimestamp
    # dfallcities1 = dfallcities[(dfallcities.a != -1) & (dfallcities.b != -1)]
    delta = datetime.timedelta(seconds=seconds, minutes=minutes, hours=hours,
                               days=days, weeks=weeks)
    if direction == 'before':
        timeframe = date - delta
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp >= timeframe) &
                             (dfallcities.deptimestamp < date)]
    else:
        timeframe = date + delta
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp <= timeframe) &   # was depTimestamp (typo)
                             (dfallcities.deptimestamp >= date)]
    if give == 'number':
        return output.shape[0]
    elif give == 'all':
        if output.empty:
            prev_delay = 0
        else:
            prev_delay = np.max((output['dep_delay'].iloc[-1], 0))
        return (output.shape[0], output['air_time'].sum(), output['distance'].sum(), prev_delay)
    elif give == 'flights':
        return output.sort_values('deptimestamp')
    else:
        raise ValueError("give must be one of [number, all, flights]")
No errors but simply very slow
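For what it's worth, the usual way to avoid filtering the whole frame once per row is a single sort plus a grouped rolling window. A sketch (assuming pandas, deptimestamp already parsed as datetime64, and a fixed 24-hour window rather than the flexible arguments above):
import pandas as pd

# Sort once so each tail number's timestamps are monotonic,
# then index by time so a time-based window can be used.
df = dfallcities.sort_values('deptimestamp').set_index('deptimestamp')

# Per-plane rolling statistics over the past 24 hours.
# Note: a time-based rolling window includes the current flight itself.
rolled = df.groupby('tail_num')['air_time'].rolling('24h')
num_flights = rolled.count()    # flights in the window
total_airtime = rolled.sum()    # summed air time in the window
The results come back indexed by (tail_num, deptimestamp), so they can be joined back onto the original frame; subtracting the current flight's own row recovers the strictly-before behaviour of flightsbefore.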

Converting a huge txt file

I have a huge csv file (524 MB; Notepad takes 4 minutes to open it) whose formatting I need to change. Right now it looks like this:
1315922016 5.800000000000 1.000000000000
1315922024 5.830000000000 3.000000000000
1315922029 5.900000000000 1.000000000000
1315922034 6.000000000000 20.000000000000
1315924373 5.950000000000 12.452100000000
The lines are divided by newline characters; when I paste the file into Excel it splits into rows. I would have done this with Excel functions, but the file is too big to open.
The first value is the number of seconds since 01-01-1970, the second is the price, and the third is the volume.
I need it to be like this:
01-01-2009 13:55:59 5.800000000000 1.000000000000 01-01-2009 13:56:00 5.830000000000 3.000000000000
etc.
Records need to be divided by a space. Sometimes there are multiple price values within the same second, like this:
1328031552 6.100000000000 2.000000000000
1328031553 6.110000000000 0.342951630000
1328031553 6.110000000000 0.527604200000
1328031553 6.110000000000 0.876088370000
1328031553 6.110000000000 0.971026920000
1328031553 6.100000000000 0.965781090000
1328031589 6.150000000000 0.918752490000
1328031589 6.150000000000 0.940974100000
When this happens, I need the code to take the average price for that second and save just one record for it.
These are bitcoin transactions, which didn't happen every second when BTC started.
When there is no record for some second, a new record needs to be created for that second, with the price and volume copied from the last known record.
Then save everything to a new txt file.
I can't seem to do it; I've been trying to write a converter in Python for hours. Please help.
shlex is a lexical parser. We use it to pick the numbers from the input one at a time. The generator function Records groups these into lists of three, where the first element is an integer timestamp and the other two are floats (price and volume).
The main loop reads the results of Records and averages duplicate timestamps as necessary. It also prints two records per output line.
from shlex import shlex
import time

lexer = shlex(instream=open('temp.txt'), posix=False)
lexer.wordchars = r'0123456789.\n'
lexer.whitespace = ' \n'
lexer.whitespace_split = True

def Records():
    record = []
    while True:
        token = lexer.get_token()
        if token:
            token = token.strip()
            if token:
                record.append(token)
                if len(record) == 3:
                    record[0] = int(record[0])
                    record[1] = float(record[1])
                    record[2] = float(record[2])
                    yield record
                    record = []
            else:
                break
        else:
            break

def conv_time(t):
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(t))

records = Records()
pos = 1
current_date, price, volume = next(records)
price_sum = price
volume_sum = volume
count = 1
for raw_date, price, volume in records:
    if raw_date == current_date:
        price_sum += price
        volume_sum += volume
        count += 1
    else:
        print(conv_time(current_date), price_sum/count, volume_sum/count, end=' ' if pos else '\n')
        pos = (pos + 1) % 2
        current_date = raw_date
        price_sum = price
        volume_sum = volume
        count = 1
print(conv_time(current_date), price_sum/count, volume_sum/count, end=' ' if pos else '\n')
Here are the results. You might need to do something about the number of digits to the right of the decimal point.
2011-09-13 09:53:36 5.8 1.0 2011-09-13 09:53:44 5.83 3.0
2011-09-13 09:53:49 5.9 1.0 2011-09-13 09:53:54 6.0 20.0
2011-09-13 10:32:53 5.95 12.4521 2012-01-31 12:39:12 6.1 2.0
2012-01-31 12:39:13 6.108 0.736690442 2012-01-31 12:39:49 6.15 0.9298632950000001
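Note that this averages duplicate seconds but does not create records for the missing seconds, as the question also asks. A minimal sketch of that forward-fill, assuming an iterable of (timestamp, price, volume) tuples that have already been averaged per second:
def fill_gaps(rows):
    # Repeat the last known price/volume for every second with no record.
    prev = None
    for row in rows:
        if prev is not None:
            for t in range(prev[0] + 1, row[0]):
                yield (t, prev[1], prev[2])
        yield row
        prev = row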
1) Read the file one line at a time and split out the values in groups of three:
data = {}
with open(<path to file>) as fh:
    while True:
        line = fh.readline()[:-1]
        if not line:
            break
        values = line.split(' ')
        for n in range(0, len(values), 3):
            # convert to numbers here so the averaging in step 4 works
            dt, price, volumen = int(values[n]), float(values[n+1]), float(values[n+2])
2) Check whether it's the next second after the last record's. If so, add the price and volume values to running totals and increase a counter, for later use in calculating the average.
3) If the second is not the next second, copy the values of the last price and volume.
if dt not in data:
    data[dt] = []
data[dt].append((price, volumen))
4) Split timestamps like "1328031552" into seconds, minutes, hours, days, months, and years, somehow taking care of leap years. Then calculate the averages:
for dt in data:
    # seconds, minutes, hours, days, months, years = datetime(dt)
    p_sum, v_sum = 0, 0
    for p, v in data[dt]:
        p_sum += p
        v_sum += v
    n = len(data[dt])
    price = p_sum / n
    volumen = v_sum / n
5) Arrange values in the 01-01-2009 13:55:59 1586.12 220000 order
6) Add the record to the end of the new database file.
print(datetime, price, volumen)
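For step 4, the standard library already handles month lengths and leap years, so a sketch (my addition, not part of the original outline):
from datetime import datetime

def split_timestamp(ts):
    d = datetime.fromtimestamp(ts)  # local time, like conv_time in the first answer
    return d.year, d.month, d.day, d.hour, d.minute, d.second

print(split_timestamp(1328031552))  # -> (2012, 1, 31, ...) in the poster's locale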

Sorting dates in Python

I am saving the day and time period of some scheduled tasks in a txt file in this format:
Monday,10:50-11:32
Friday,18:33-18:45
Sunday,17:10-17:31
Sunday,14:10-15:11
Friday,21:10-23:11
I am opening the txt file and reading the contents into a list.
How can I sort the list to get the days and the time periods in order?
Like this:
Monday,10:50-11:32
Friday,18:33-18:45
Friday,21:10-23:11
Sunday,14:10-15:11
Sunday,17:10-17:31
OK, let's say you only have the day of the week and the timestamps. One alternative is to calculate the number of minutes each item represents (Monday 00:00 = 0 minutes and Sunday 23:59 = the maximum) and sort with that as the key.
The example below sorts by the first timestamp value. A commenter pointed out that this does not take the second timestamp (the end time) into account. To include it we can add a fractional part: the end time's minutes divided by the number of minutes in a day.
(int(h2)*60 + int(m2)) / (24*60)  # minutes divided by the maximum minutes per day gives a decimal number
However, the key here is the following code:
weekday[day]*24*60 + int(h1)*60 + int(m1)  # total minutes since Monday 00:00 -- we sort by this!
And of course the sort itself, with a join (double line break). When you pass a function as the key to sorted(), the sorting is based on that function's return values (here, the number of minutes).
'\n\n'.join(sorted(list_, key=get_min))
Enough text... let's jump to a full, updated example:
import io

file = """Monday,10:50-11:32
Friday,18:33-18:45
Sunday,17:10-17:31
Sunday,17:10-15:11
Friday,21:10-23:11"""

list_ = [i.strip('\n') for i in io.StringIO(file).readlines() if i != "\n"]
weekday = dict(zip(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
                   [0, 1, 2, 3, 4, 5, 6]))

def get_min(time_str):
    day, time = time_str.split(",")
    h1, m1 = time.split('-')[0].split(":")
    h2, m2 = time.split('-')[1].split(":")
    # whole minutes since Monday 00:00, plus a fraction of a day for the end time
    return weekday[day]*24*60 + int(h1)*60 + int(m1) + (int(h2)*60 + int(m2))/(24*60)

with open("output.txt", "w") as outfile:
    outfile.write('\n\n'.join(sorted(list_, key=get_min)))
print('\n\n'.join(sorted(list_, key=get_min)))
Creates "output.txt" with:
Monday,10:50-11:32
Friday,18:33-18:45
Friday,21:10-23:11
Sunday,17:10-15:11
Sunday,17:10-17:31
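An alternative that avoids the fractional trick entirely is to return a tuple, since Python compares tuples element by element. A sketch reusing the weekday mapping above:
def get_key(line):
    day, times = line.split(",")
    start, end = times.split("-")
    h1, m1 = map(int, start.split(":"))
    h2, m2 = map(int, end.split(":"))
    # compare weekday first, then start time, then end time
    return (weekday[day], h1 * 60 + m1, h2 * 60 + m2)

print('\n\n'.join(sorted(list_, key=get_key)))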

python hadoop code for max/min temperature on a particular data set

I am trying to write a mapper/reducer program to calculate the max/min temperature from a dataset. The mapper runs fine on its own, but the reducer doesn't, given the changes I made in the mapper.
My sample code:
mapper.py
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[14:18], val[25:30], val[31:32])
    if temp != "9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)
reducer.py
import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
    print "%s\t%s" % (last_key, max_val)
A sample line from the file:
690190,13910, 20120101, 42.9,18, 29.4,18, 1033.3,18, 968.7,18, 10.0,18, 8.7,18, 15.0, 999.9, 52.5, 31.6, 0.00I,999.9, 000000,
I need the year (the 2012 in 20120101) and the mean temperature (42.9). Any ideas?
This is my output if I run the mapper as a plain script:
root@ubuntu:/home/hduser/files# python maxtemp-map.py
2012 42.9
2012 50.0
2012 47.0
2012 52.0
2012 43.4
2012 52.6
2012 51.1
2012 50.9
2012 57.8
2012 50.7
2012 44.6
2012 46.7
2012 52.1
2012 48.4
2012 47.1
2012 51.8
2012 50.6
2012 53.4
2012 62.9
2012 62.6
The file contains data from several years. I have to calculate the min, max, and average for each year.
FIELD   POSITION  TYPE  DESCRIPTION
STN---  1-6       Int.  Station number (WMO/DATSAV3 number) for the location.
WBAN    8-12      Int.  WBAN number where applicable--this is the historical
YEAR    15-18     Int.  The year.
MODA    19-22     Int.  The month and day.
TEMP    25-30     Real  Mean temperature. Missing = 9999.9
Count   32-33     Int.  Number of observations in mean temperature.
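For reference, those 1-indexed, inclusive column positions translate to Python slices as position p-q -> line[p-1:q]. A hypothetical helper built from the table above:
FIELDS = {
    'stn':   (1, 6),
    'wban':  (8, 12),
    'year':  (15, 18),
    'moda':  (19, 22),
    'temp':  (25, 30),
    'count': (32, 33),
}

def parse_fields(line):
    # 1-indexed, inclusive positions -> 0-indexed Python slices
    return {name: line[p - 1:q].strip() for name, (p, q) in FIELDS.items()}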
I am having trouble parsing your question, but I think it reduces to this:
You have a dataset where each line represents different quantities related to a single time point, and you would like to extract the max/min of one of these quantities from the entire dataset.
If this is the case, I'd do something like this:
temps = []
with open(file_name, 'r') as infile:
    for line in infile:
        line = line.strip().split(',')
        year = int(line[2].strip()[:4])   # fields keep their leading spaces after the split
        temp = float(line[3])             # temperatures like 42.9 are floats, not ints
        temps.append((temp, year))
temps = sorted(temps)
min_temp, min_year = temps[0]
max_temp, max_year = temps[-1]
EDIT:
Farley, I think what you are doing with mapper/reducer may be overkill for what you want from your data. Here are some additional questions about your initial file structure.
What are the contents of each line (be specific) in the dataset? For example: date, time, temp, pressure, ....
Which piece of data from each line do you want to extract? Temperature? At what position in the line is that piece of data?
Does each file only contain data from one year?
For example, if your dataset looked like
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
then the simplest thing to do is to loop through each line and extract the relevant information. It appears you only want the year and the temperature. In this example, these are located at positions 0 and 3 in each line. Therefore, we will have a loop that looks like
from collections import defaultdict

data = defaultdict(list)
with open(file_name, 'r') as infile:
    for line in infile:
        line = line.strip().split(', ')
        year = line[0]
        temp = line[3]
        data[year].append(temp)
See, we extracted the year and temp from each line in the file and stored them in a special dictionary object. Printed out, it would look like:
year1: [temp1, temp2, temp3, temp4]
year2: [temp5, temp6, temp7, temp8]
year3: [temp9, temp10, temp11, temp12]
year4: [temp13, temp14, temp15, temp16]
Now, this makes it very convenient for us to do statistics on all the temperatures of a given year. For example, to compute the maximum, minimum, and average temperature, we could do
import numpy as np

for year in data:
    temps = np.array(data[year], dtype=float)  # the stored temps are strings
    output = (year, temps.mean(), temps.min(), temps.max())
    print('Year: {0} Avg: {1} Min: {2} Max: {3}'.format(*output))
I'm more than willing to help you sort out your problem, but I need you to be more specific about what exactly your data looks like, and what you want to extract.
If you have something like the store name and that store's total sales as the intermediate result from the mapper, you can use the following as the reducer to find out the maximum sales and which store has them. Similarly, it will find the minimum sales and the corresponding store.
The following reducer code example assumes that you have the sales total against each store as an input file.
#!/usr/bin/python
import sys

mydict = {}
salesTotal = 0
oldKey = None
for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    thisKey, thisSale = data
    if oldKey and oldKey != thisKey:
        mydict[oldKey] = float(salesTotal)
        salesTotal = 0
    oldKey = thisKey
    salesTotal += float(thisSale)
if oldKey is not None:
    mydict[oldKey] = float(salesTotal)

maximum = max(mydict, key=mydict.get)
print(maximum, mydict[maximum])
minimum = min(mydict, key=mydict.get)
print(minimum, mydict[minimum])
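As a sanity check outside Hadoop, streaming scripts like these can be smoke-tested locally by piping a sample through them, e.g. cat sample.txt | python mapper.py | sort | python reducer.py, where sort stands in for the shuffle phase.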
