Climatology frequencies and duration - python

I have a 10-year climatological dataset as follows:
dt          T    P
01-01-2010  3    0
02-01-2010  5    11
03-01-2010  10   50
...
31-12-2020  -1   0
I want to estimate the total number of days in each month where T and P both stayed greater than 0 for three or more consecutive days.
I would want these columns as an output:
month | number of days (duration T & P > 0) | T | P
I have never really used loops in Python; I can write a simple loop but nothing beyond that, and here the data has to be grouped by month and year first before applying the condition. I would really appreciate any hints on how to construct the loop.
A = dataset
A['dt'] = pd.to_datetime(A['dt'], format='%Y-%m-%d')
for column in A[['P', 'T']]:
    for i in range(len('P')):
        if i > 0:
            P.value_counts()
            print(i)
    for j in range(len('T')):
        if i > 0:
            T.value_counts()
            print(j)

Here is a really naive way you could set it up by simply iterating over the rows:
df['valid'] = (df['T'] > 0) & (df['P'] > 0)

def count_total_days(df):
    i = 0
    total = 0
    for idx, row in df.iterrows():
        if row.valid:
            i += 1
        else:
            if i >= 3:
                total += i
            i = 0
    # don't forget a qualifying run that reaches the end of the frame
    if i >= 3:
        total += i
    return total
Since you want it per month, you would first have to create new month and year columns to group by, and collect the result of each call:
df['month'] = df['dt'].dt.month
df['year'] = df['dt'].dt.year

results = {}
for (month, year), df_subset in df.groupby(['month', 'year']):
    results[(year, month)] = count_total_days(df_subset)

You can use resample and sum to count the days in each month where the condition is true.
import pandas as pd

dt = ["01-01-2010", "01-02-2010", "01-03-2010", "01-04-2010", "03-01-2010", "12-31-2020"]
t = [3, 66, 100, 5, 10, -1]
P = [0, 77, 200, 11, 50, 0]
A = pd.DataFrame(list(zip(dt, t, P)), columns=['dtx', 'T', 'P'])
A['dtx'] = pd.to_datetime(A['dtx'], format='%m-%d-%Y')

# label runs of consecutive calendar days
A['Mask'] = A.dtx.diff().dt.days.ne(1).cumsum()
# keep only runs that last 3 days or more
dict_freq = A['Mask'].value_counts().to_dict()
newdict = dict((k, v) for k, v in dict_freq.items() if v >= 3)
A = A[A['Mask'].isin(list(newdict.keys()))]
# now flag the days where both T and P are positive
A['Mask'] = (A['T'] >= 1) & (A['P'] >= 1)
df_summary = A.query('Mask').resample(rule='M', on='dtx')['Mask'].sum()
Which produces:
2010-01-31    3

Related

Select dates in index

I have a start date, an end date, and a dataframe with daily observations. The problem is that I can't find a way to select dates with a periodicity of 3 months,
for example:
2003-01-03 + 3 months = 2003-04-03, and so on
The output should consist of 20 rows, because that is 5 years at a 3-month periodicity, including the start and end dates.
EDIT: The old solution didn't work for all cases, so here is a new one:
start, end = returns.index[0], returns.index[-1]
length = (end.year - start.year) * 12 + (end.month - start.month)
if length % 3 == 0 and end.day >= start.day:
    length += 3
new_index = []
for m in range(3, length, 3):
    # shift the month by m, carrying over into the year
    # (months are 1-based, hence the -1/+1 around divmod)
    ydelta, month = divmod(start.month + m - 1, 12)
    day = pd.Timestamp(year=start.year + ydelta, month=month + 1, day=1)
    # clamp the day of month so e.g. Jan 31 + 3 months lands on Apr 30
    day += pd.Timedelta(f'{min(start.day, day.days_in_month) - 1}d')
    new_index.append(day)
new_index = pd.DatetimeIndex(new_index)
returns = returns.loc[new_index]
Another version which has some slight inaccuracies around the month ends but is more compact:
add_3_months = pd.tseries.offsets.DateOffset(months=3)
new_index = pd.date_range(returns.index[0] + add_3_months,
                          returns.index[-1],
                          freq=add_3_months)
returns = returns.loc[new_index]
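One caveat worth adding (my note, not part of the original answer): returns.loc[new_index] raises a KeyError if any generated date is missing from the index, e.g. a weekend in daily trading data. Reindexing with nearest matching sidesteps that:

# snap each target date to the closest date actually present in the index
quarterly = returns.reindex(new_index, method='nearest')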

How can I optimize code that uses xarray for better performance?

I'm trying to extract climate data from various .nc files, but the process is taking extremely long. I suspect this is because I'm extracting the data for every day of June, July and August for the next 79 years. I'm a novice programmer, though, and I realize there may be a few oversights on my part (efficiency-wise) that are hurting performance.
This is the snippet:
def calculateTemp(coords, year, model):
    """
    Takes all coordinates of a line between two grid stations and the year,
    converts the year into dates, averages the temperature over each day of
    June, July and August for each coordinate, and then averages over all
    coordinates to find the average temperature for that line.
    """
    print(year)
    # coords is a list of coordinate pairs between two grid stations
    temp3 = 0  # sum of the averages over all coordinates
    for i in range(0, len(coords)):
        temp2 = 0
        counter = 0
        # extract 15 years of data for each coordinate pair
        # and take the average of those 15 years
        for p in range(0, 15):
            temp1 = 0  # sum of temps for one coordinate over all days of June, July, August
            if year + p < 100:
                # the months of June, July and August
                for j in range(6, 9):
                    # 30 days of each month
                    for k in range(1, 31):
                        # this if-else builds a date string
                        if k < 10:
                            date = '20' + str(year + p) + '-0' + str(j) + '-0' + str(k)
                        else:
                            date = '20' + str(year + p) + '-0' + str(j) + '-' + str(k)
                        # there are 3 variants of the climate model:
                        # for years up to 2040, between 2041 and 2070,
                        # and between 2071 and 2099, hence this if-else block
                        if year + p < 41:
                            temp1 += model[0]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                        elif year + p >= 41 and year + p < 71:
                            temp1 += model[1]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                        else:
                            temp1 += model[2]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                counter += 1
                avg = temp1 / (len(range(0, 30)) * len(range(6, 9)))
                temp2 += avg
        temp3 += temp2 / counter
    Tamb = temp3 / len(coords)
    return Tamb
Is there any way I can increase the performance of this code and optimize it?
I just replaced the innermost loops, k in range(1, 31) and j in range(6, 9), with a dict comprehension that generates all the dates and the corresponding values from your model, then averaged those values for every value of p and then for every coord in coords.
Give this a shot. Dicts should make the processing faster. Also check that the averages come out exactly as you calculate them in your function.
def build_date(year, p, j, k):
    return '20'+str(year+p)+'-0'+str(j)+'-0'+str(k) if k < 10 else '20'+str(year+p)+'-0'+str(j)+'-'+str(k)

def calculateTemp(coords, year, model):
    func2 = lambda x, date: model[x]['tasmax'].sel(lon=coords[i][1],
                                                   lat=coords[i][0],
                                                   time=date,
                                                   method='nearest').data[0]
    print(year)
    out = {}
    for i in range(len(coords)):
        inner = {}
        for p in range(0, 15):
            if year + p < 100:
                dates = {build_date(year, p, j, k): func2(0, build_date(year, p, j, k)) if year + p < 41
                         else func2(1, build_date(year, p, j, k)) if (year + p >= 41 and year + p < 71)
                         else func2(2, build_date(year, p, j, k))
                         for j in range(6, 9)
                         for k in range(1, 31)}
                inner[p] = sum([v for k, v in dates.items()]) / len(dates)
        out[i] = inner
    coord_averages = {k: sum(v.values()) / len(v) for k, v in out.items()}
    Tamb = sum([v for k, v in coord_averages.items()]) / len(coord_averages)
    return Tamb
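As a further note (a sketch of my own, not the answer's code): with xarray the big win usually comes from selecting the grid point and the whole time window once, rather than issuing one .sel call per day. Assuming each element of model is a Dataset with a daily tasmax variable, and with illustrative names throughout:

def mean_jja_tasmax(ds, lat, lon, first_year, last_year):
    # select the grid cell once, instead of one .sel call per day
    point = ds['tasmax'].sel(lat=lat, lon=lon, method='nearest')
    # slice the whole multi-year window in one labelled selection
    window = point.sel(time=slice(f'{first_year}-01-01', f'{last_year}-12-31'))
    # keep only June, July and August, then average
    jja = window.where(window.time.dt.month.isin([6, 7, 8]), drop=True)
    return float(jja.mean())

Each of model[0], model[1] and model[2] would then be queried once per coordinate and period instead of roughly 90 times per year.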

Removing value from a DataFrame column which repeats over 15 times

I'm working on forex data like this:
      0        1                       2       3
1     AUD/JPY  20040101 00:01:00.000  80.598  80.598
2     AUD/JPY  20040101 00:02:00.000  80.595  80.595
3     AUD/JPY  20040101 00:03:00.000  80.562  80.562
4     AUD/JPY  20040101 00:04:00.000  80.585  80.585
5     AUD/JPY  20040101 00:05:00.000  80.585  80.585
I want to go through columns 2 and 3 and remove the rows in which the value is repeated more than 15 times in a row. So far I have managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1

price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1

print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately, when I check the output file there are some mistakes: some weekends have not been removed by the program. How should I build my algorithm so that it removes the duplicated values correctly?
The first 250k rows of my initial dataset are available here: https://ufile.io/omg5h
The output of this program for that sample data is available here: https://ufile.io/2gc3d
You can see that in the output file the rows from 6931 onward were not successfully removed.
The problem with your algorithm is that you are not holding separate counter values for the row values, but rather incrementing one counter through the loop, which I believe causes the incorrect result. Also, the comparison r.iloc[2] != price does not quite work, because you change the value of price on every iteration, so if there are other elements between the duplicates this check does not serve its intended purpose. I wrote a small piece of code to copy the behavior you asked for:
df = pd.DataFrame([[0, 0.5, 2.5], [0, 1, 2], [0, 1.5, 2.5], [0, 2, 3], [0, 2, 3],
                   [0, 3, 4], [0, 4, 5]], columns=['A', 'B', 'C'])
df_new = df
dict = {}
print('Initial DF')
print(df)
print()
for i, r in df.iterrows():
    counter = dict.get(r.iloc[1])
    if counter == None:
        counter = 0
    dict[r.iloc[1]] = counter + 1
    if dict[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]
print('2nd col. deleted DF')
print(df_new)
print()

df_fin = df_new
dict2 = {}
for i, r in df_new.iterrows():
    counter = dict2.get(r.iloc[2])
    if counter == None:
        counter = 0
    dict2[r.iloc[2]] = counter + 1
    if dict2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]
print('3rd col. deleted DF')
print(df_fin)
Here, I hold a counter value for each unique value in the rows of columns 2 and 3. Then, according to the threshold (which is 2 in this case), I remove the rows which exceed the threshold. I first eliminate values according to the 2nd column, then forward the modified frame to the next loop and eliminate values according to the 3rd column to finish the process. Note that this counts repetitions anywhere in the column, not only consecutive ones; see the sketch below for the consecutive case.
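Since the question asks about values repeated consecutively, here is a vectorized run-length sketch of my own (shift/cumsum, with the threshold of 15 from the question):

def drop_long_runs(df, col, threshold=15):
    # id each run of consecutive identical values in `col`
    run_id = (df[col] != df[col].shift()).cumsum()
    # length of the run each row belongs to
    run_len = df.groupby(run_id)[col].transform('size')
    # drop rows belonging to runs of `threshold` or more repeats
    return df[run_len < threshold]

# apply to both price columns (positions 2 and 3 in the question's data)
df = drop_long_runs(df, df.columns[2])
df = drop_long_runs(df, df.columns[3])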

Adding a simple moving average as an additional column to a pandas DataFrame

I have sales data in sales_training.csv that looks like this:
time_period sales
1 127
2 253
3 123
4 253
5 157
6 105
7 244
8 157
9 130
10 221
11 132
12 265
I want to add a third column that contains the moving average. My code:
import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
for i in periods:
    if i < period:
        df['forecast'][i] = i
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df['forecast'][i] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
This is giving KeyError: 'forecast' at the line df['forecast'][i] = i. What is the issue?
One simple solution for it: just add df['forecast'] = df['sales'] first.
import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
df['forecast'] = df['sales']  # add one line
for i in periods:
    if i < period:
        df['forecast'][i] = i
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df['forecast'][i] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
Your code is giving the KeyError because of the incorrect way of referencing the 'forecast' column. The first time your code runs, the 'forecast' column has not yet been created, so referencing df['forecast'] in the first iteration raises the KeyError.
Here, our task is to update values in a dynamically created new column called 'forecast'. Therefore, instead of df['forecast'][i] you can write df.at[i, 'forecast'].
There is another issue in the code: when the value of i is less than period, you assign i to the forecast, which to my understanding is not correct. It should not display anything in that case.
Here is my version of the corrected code:
import pandas as pd

df = pd.read_csv("./sales.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
for i in periods:
    print(i)
    if i < period:
        df.at[i, 'forecast'] = ''
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df.at[i, 'forecast'] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
For example, entering period=2 fills the forecast column with the 2-period moving average. Hope this helps.
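For reference, pandas can also compute the moving average directly with rolling, which avoids the manual loop entirely; a minimal sketch assuming the same CSV layout as the question:

import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
period = 3  # any window length
# rolling(period).mean() leaves NaN for the first period-1 rows,
# i.e. before a full window of sales values is available
df['forecast'] = df['sales'].rolling(period).mean()
print(df)
df.to_csv("./forecast_mannual.csv")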

Counting a particular item in a defaultdict list

So here is the structure of my defaultdict:
# x = {lead_id: [[month, pid, year]]}
x = {'123': [[1, 9, 2015], [2, 9, 2015]], '345': [[2, 10, 2015], [2, 13, 2014]], '159': [[1, 3, 2015]], ...}
I have more than 1000 lead_ids in this dictionary, each with a varying number of lists. In other words, the same lead_id has duplicates, but with a different month, pid or year. Now I want to count all the lead_ids in January 2015, counting a lead_id twice (or more) according to its occurrences. Can anyone please help me figure out how to write code that checks both the length of each list and the number of times that month occurred with the same year?
For example:
x = {'123': [[1, 3, 2015], [2, 5, 2014], [1, 5, 2015]], '987': [[3, 55, 2014]], '456': [[1, 37, 2015]]}
count of jan 2015 = 3
You can also use this:
sum(1 for i in x for j in x[i] if j[0] == 1 and j[2] == 2015)
You can use conditionals on the index values: date[0] is the month (1 for January) and date[2] is the year (2015).
#!/usr/bin/python
x = {'123': [[1, 3, 2015], [2, 5, 2014], [1, 5, 2015]], '987': [[3, 55, 2014]], '456': [[1, 37, 2015]]}

# set the query dates
query_month = 1    # January
query_year = 2015  # year

# set a counter
jan_counts = 0
for list_of_dates in x.values():
    for date in list_of_dates:
        if (date[0] == query_month) and (date[2] == query_year):
            jan_counts += 1

print(jan_counts)
# 3
This should give your result:
>>> month = 1
>>> year = 2015
>>> x = {'123': [[1, 3, 2015], [2, 5, 2014], [1, 5, 2015]], '987': [[3, 55, 2014]], '456': [[1, 37, 2015]]}
>>> sum([1 for k, v in x.items() for i in v if i[0] == month and i[2] == year])
3
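One more option (my sketch, not from the answers above): collections.Counter tallies every (month, year) pair in one pass, which helps if you need counts for several months:

from collections import Counter

x = {'123': [[1, 3, 2015], [2, 5, 2014], [1, 5, 2015]],
     '987': [[3, 55, 2014]], '456': [[1, 37, 2015]]}

# count every (month, year) pair across all lead_ids
counts = Counter((month, year)
                 for entries in x.values()
                 for month, pid, year in entries)
print(counts[(1, 2015)])  # 3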
