Here is the structure of my defaultdict:
#x = {lead_id: [[month, pid, year]]}
x={'123':[[1,9,2015],[2,9,2015]],'345':[[2,10,2015],[2,13,2014]],'159':[[1,3,2015]].....}
I have more than 1000 lead_ids in this dictionary, and each one has a varying number of lists. In other words, the same lead_id can appear several times with a different month, pid, or year. I want to count all the entries for January 2015, counting a lead_id once per occurrence (twice if it has two January 2015 entries, and so on). How can I write code that goes through every list for each lead_id and counts how many times a given month and year occur?
For example:
x={'123':[[1,3,2015],[2,5,2014],[1,5,2015]],'987':[[3,55,2014]],'456':[[1,37,2015]]}
count of jan 2015 = 3
You can also use this:
sum(1 for i in x for j in x[i] if j[0] == 1 and j[2] == 2015)
You can test the index values directly: date[0] is the month (1 for January) and date[2] is the year (2015).
#!/usr/bin/python
x={'123':[[1,3,2015],[2,5,2014],[1,5,2015]],'987':[[3,55,2014]],'456':[[1,37,2015]]}
#Set query dates
query_month = 1 #jan
query_year = 2015 #year
#Set a counter
jan_counts = 0
for list_of_dates in x.values():
    for date in list_of_dates:
        if (date[0] == query_month) and (date[2] == query_year):
            jan_counts += 1
print(jan_counts)
#3
This should give your result:
>>> month = 1
>>> year = 2015
>>> x = {'123':[[1,3,2015],[2,5,2014],[1,5,2015]],'987':[[3,55,2014]],'456':[[1,37,2015]]}
>>> sum([1 for k, v in x.items() for i in v if i[0] == month and i[2] == year])
3
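If counts for several months are needed, one option is to tally every (month, year) pair in a single pass. This is a minimal sketch using collections.Counter on the example data from the question; the variable name counts is only illustrative.

```python
from collections import Counter

x = {'123': [[1, 3, 2015], [2, 5, 2014], [1, 5, 2015]],
     '987': [[3, 55, 2014]],
     '456': [[1, 37, 2015]]}

# Tally each (month, year) pair across all lead_ids; pid is ignored.
counts = Counter((month, year)
                 for dates in x.values()
                 for month, _pid, year in dates)

print(counts[(1, 2015)])  # 3
```

A pair that never occurs simply returns 0, so no key check is needed before the lookup.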
I have a 10-year climatological dataset as follows.
dt T P
01-01-2010 3 0
02-01-2010 5 11
03-01-2010 10 50
....
31-12-2020 -1 0
I want to estimate the total number of days in each month where T and P both stayed continuously greater than 0 for three days or more.
I would like these columns as output:
month Number of days/DurationT&P>0 T P
I have never used loops in Python. I can write a simple loop, but nothing beyond that when the data first has to be grouped by month and year before the condition is applied. I would really appreciate any hints on how to construct the loop.
A= dataset
A['dt'] = pd.to_datetime(A['dt'], format='%Y-%m-%d')
for column in A [['P', 'T']]:
    for i in range (len('P')):
        if i > 0:
            P.value_counts()
            print(i)
    for j in range (len ('T')):
        if i > 0:
            T.value_counts()
            print(j)
Here is a really naive way you could set it up by simply iterating over the rows:
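df['valid'] = (df['T'] > 0) & (df['P'] > 0)
def count_total_days(df):
    i = 0      # length of the current run of valid days
    total = 0
    for idx, row in df.iterrows():
        if row.valid:
            i += 1
        else:
            if i >= 3:
                total += i
            i = 0
    # count a run that is still open when the data ends
    if i >= 3:
        total += i
    return total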
Since you want it per month, you would first have to create new month and year columns to group by:
df['month'] = df['dt'].dt.month
df['year'] = df['dt'].dt.year
for date, df_subset in df.groupby(['month', 'year']):
    print(date, count_total_days(df_subset))
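The same run-counting idea can be sketched without pandas using itertools.groupby. This is only an illustration under stated assumptions: the rows are sorted by date with no missing days, runs are split at month boundaries (as in the groupby above), and run_days_per_month, min_run, and the sample values are hypothetical, not taken from the dataset.

```python
from datetime import date, timedelta
from itertools import groupby

def run_days_per_month(records, min_run=3):
    """records: (date, T, P) tuples sorted by date, one row per day.
    Returns {(year, month): days inside runs of >= min_run valid days}."""
    totals = {}
    by_month = groupby(records, key=lambda r: (r[0].year, r[0].month))
    for year_month, month_rows in by_month:
        total = 0
        # Split the month into runs of consecutive valid / invalid days.
        for valid, run in groupby(month_rows, key=lambda r: r[1] > 0 and r[2] > 0):
            length = sum(1 for _ in run)
            if valid and length >= min_run:
                total += length
        totals[year_month] = total
    return totals

# Made-up sample: validity pattern is F, T, T, T, F, T, T.
values = [(3, 0), (5, 11), (10, 50), (4, 2), (-1, 0), (2, 1), (3, 1)]
rows = [(date(2010, 1, 1) + timedelta(days=i), t, p)
        for i, (t, p) in enumerate(values)]
print(run_days_per_month(rows))  # {(2010, 1): 3}
```

Only the three-day run is counted; the two-day run at the end is too short.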
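You can use resample and sum to count, per month, the days on which the condition is true. The Mask column first labels blocks of consecutive dates, so that blocks shorter than three days can be dropped.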
import pandas as pd
dt = ["01-01-2010", "01-02-2010","01-03-2010","01-04-2010", "03-01-2010",'12-31-2020']
t=[3,66,100,5,10,-1]
P=[0,77,200,11,50,0]
A=pd.DataFrame(list(zip(dt, t,P)),
columns =['dtx', 'T','P'])
A['dtx'] = pd.to_datetime(A['dtx'], format='%m-%d-%Y')
A['Mask']=A.dtx.diff().dt.days.ne(1).cumsum()
dict_freq=A['Mask'].value_counts().to_dict()
newdict = dict((k, v) for k, v in dict_freq.items() if v >= 3)
A=A[A['Mask'].isin(list(newdict.keys()))]
A['Mask']=(A['T'] >= 1) & (A['P'] >= 1)
df_summary=A.query('Mask').resample(rule='M',on='dtx')['Mask'].sum()
which produces:
2010-01-31 3
I'm trying to extract climate data from various .nc files, but the process is taking extremely long. I suspect it has something to do with the fact that I'm extracting the data for every day of June, July, and August for the next 79 years. I'm a novice programmer, though, and I realize I may have overlooked some inefficiencies that, if fixed, could improve performance.
This is the snippet:
def calculateTemp(coords, year, model):
    """
    takes in all coordinates of a line between two grid stations and the year
    converts the year into date
    takes average of temperature of each day of the month of June for each
    coordinate and then takes average of all coordinates to find average temp
    for that line for the month of June
    """
    print(year)
    # coords represents a list of different sets of coordinates between two grids
    temp3 = 0 # sum of all temps of all coordinates
    for i in range(0, len(coords)):
        temp2 = 0
        counter = 0
        # this loop represents that the 15 years data is being extracted for
        # each coordinate set and average of those 15 years is being taken
        for p in range(0, 15):
            temp1 = 0 # sum of all temps for one coordinate in all days of June, July, August
            if year+ p < 100:
                # this loop represents the months of jun, jul, aug
                for j in range(6, 9):
                    # 30 days of each month
                    for k in range(1, 31):
                        if k < 10:
                            # this if-else makes a string of date
                            date = '20'+str(year+p)+'-0'+str(j)+'-0'+str(k)
                        else:
                            date = '20'+str(year+p)+'-0'+str(j)+'-'+str(k)
                        # there are 3 variants of the climate model
                        # for years up to 2040, between 2041-2070
                        # and between 2071 and 2099
                        # hence this if else block
                        if year+p < 41:
                            temp1 += model[0]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                        elif year+p >= 41 and year+p <71:
                            temp1 += model[1]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                        else:
                            temp1 += model[2]['tasmax'].sel(
                                lon=coords[i][1], lat=coords[i][0], time=date, method='nearest').data[0]
                counter += 1
                avg = temp1/(len(range(0,30))*len(range(6,9)))
                temp2 += avg
        temp3 += temp2/counter
    Tamb = temp3/len(coords)
    return Tamb
Is there any way I can increase the performance of this code and optimize it?
I replaced the two innermost loops, k in range(1,31) and j in range(6,9), with a dict comprehension that generates all the dates and the corresponding values from your model. I then average these values for every value of p, and then across every coord in coords.
Give this a shot. Dicts may make the processing faster. Also check that the averages match the way you calculate them in your function.
def build_date(year,p,j,k):
    return '20'+str(year+p)+'-0'+str(j)+'-0'+str(k) if k<10 else '20'+str(year+p)+'-0'+str(j)+'-'+str(k)
def calculateTemp(coords, year, model):
    func2 = lambda x,date:model[x]['tasmax'].sel(lon=coords[i][1],
                                                  lat=coords[i][0],
                                                  time=date,
                                                  method='nearest').data[0]
    print(year)
    out = {}
    for i in range(len(coords)):
        inner = {}
        for p in range(0,15):
            if year + p < 100:
                dates = {build_date(year,p,j,k):func2(0,build_date(year,p,j,k)) if year+p<41 \
                         else func2(1,build_date(year,p,j,k)) if (year+p >= 41 and year+p <71) \
                         else func2(2,build_date(year,p,j,k))
                         for j in range(6,9) \
                         for k in range(1,31) }
                inner[p] = sum([v for k,v in dates.items()])/len(dates)
        out[i] = inner
    coord_averages = {k : sum(v.values())/len(v) for k,v in out.items() }
    Tamb = sum([v for k,v in coord_averages.items()])/len(coord_averages)
    return Tamb
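One detail worth checking in the date strings: concatenating '20' + str(year+p) yields '205-06-01' when year+p is below 10, and range(1,31) stops at day 30, so the 31st of July and August is never requested. Whether skipping those days is intended is not stated. A minimal sketch of generating the dates with the standard library instead, where summer_dates is a hypothetical helper name and the two-digit year convention is carried over from the question:

```python
from datetime import date, timedelta

def summer_dates(two_digit_year):
    """ISO date strings for every day of June, July and August of 20YY."""
    year = 2000 + two_digit_year
    day = date(year, 6, 1)
    end = date(year, 9, 1)  # exclusive upper bound
    out = []
    while day < end:
        out.append(day.isoformat())  # zero-padded 'YYYY-MM-DD'
        day += timedelta(days=1)
    return out

dates = summer_dates(25)
print(dates[0], dates[-1], len(dates))  # 2025-06-01 2025-08-31 92
```

Note that dividing by 92 rather than 90 would then be needed when averaging over these dates.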
I'm working on forex data like this:
0 1 2 3
1 AUD/JPY 20040101 00:01:00.000 80.598 80.598
2 AUD/JPY 20040101 00:02:00.000 80.595 80.595
3 AUD/JPY 20040101 00:03:00.000 80.562 80.562
4 AUD/JPY 20040101 00:04:00.000 80.585 80.585
5 AUD/JPY 20040101 00:05:00.000 80.585 80.585
I want to go through columns 2 and 3 and remove the rows in which the value is repeated more than 15 times in a row. So far I have managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1
price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1
print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately, when I check the output file there are some mistakes: some weekends have not been removed by the program. How should I build my algorithm so that it removes the duplicated values correctly?
The first 250k rows of my initial dataset are available here: https://ufile.io/omg5h
The output of this program for that sample data is available here:
https://ufile.io/2gc3d
You can see that in the output file the rows 6931+ were not successfully removed.
The problem with your algorithm is that you are not keeping a separate counter for each value; you just increment one counter as you go through the loop, which I believe makes the result wrong. The comparison r.iloc[2] != price also does not help, because price is overwritten as the loop runs, so if other values occur between the duplicates the check no longer does its job. I wrote a small piece of code that reproduces the behavior you asked for.
import pandas as pd
df = pd.DataFrame([[0,0.5, 2.5],[0,1, 2],[0,1.5,2.5 ],[0,2, 3],[0,2, 3],[0,3, 4],
                   [0,4, 5]],columns = ['A','B','C'])
df_new = df
counts = {}  # occurrences seen so far for each value in column B
print('Initial DF')
print(df)
print()
for i, r in df.iterrows():
    counter = counts.get(r.iloc[1])
    if counter is None:
        counter = 0
    counts[r.iloc[1]] = counter + 1
    if counts[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]
print('2nd col. deleted DF')
print(df_new)
print()
df_fin = df_new
counts2 = {}  # occurrences seen so far for each value in column C
for i, r in df_new.iterrows():
    counter = counts2.get(r.iloc[2])
    if counter is None:
        counter = 0
    counts2[r.iloc[2]] = counter + 1
    if counts2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]
print('3rd col. deleted DF')
print(df_fin)
Here, I keep a counter for each unique value in columns 2 and 3. Then, according to the threshold (2 in this case), I remove the rows whose value reaches or exceeds the threshold. I first eliminate rows based on the 2nd column, then pass the filtered DataFrame to the next loop, eliminate rows based on the 3rd column, and finish the process.
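If "repeated more than 15 times in a row" is read as consecutive repeats, which the weekend example in the question suggests, a run-length pass is another option. The sketch below is an alternative to the per-value counters above, not the same method: it works on a plain list of prices, and drop_long_runs and max_run are hypothetical names.

```python
from itertools import groupby

def drop_long_runs(values, max_run=15):
    """Return the positions to keep: everything except runs of identical
    consecutive values that are longer than max_run."""
    keep = []
    pos = 0
    for _value, run in groupby(values):
        length = sum(1 for _ in run)
        if length <= max_run:
            keep.extend(range(pos, pos + length))
        pos += length
    return keep

# 3 distinct-run prices, a flat stretch of 20, then 2 more prices.
prices = [1.0] * 3 + [2.0] * 20 + [1.0] * 2
print(drop_long_runs(prices))  # [0, 1, 2, 23, 24]
```

With a DataFrame, the returned positions could be passed to df.iloc[...] for one price column at a time; a value that recurs later in a short run is kept, unlike with a global per-value count.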
So I have this code, which is supposed to return a list with the closest leap year for each year in a list.
For example: calling the function with [1995 1750 2018] should return
1996 1748 2016
Which it does for that set of numbers.
The problem I am having is that when a leap year is in the input, for example 2008, it does not give me back the closest leap year to 2008. I get back 2008.
Any suggestions as to how I can modify the code to make it work?
Code:
def is_leap(year):
    leap = False
    if year % 4 == 0:
        if year % 100 != 0 or year % 400 == 0:
            leap = True
    return leap
major_b = []
major_f = []
newLst = []
def year_forward(yearBounds):
    for item in yearBounds:
        counter = 0
        while not is_leap(item):
            item = item + 1
            counter += 1
        major_f.append([item, counter])
    return major_f
def year_backward(yearBounds):
    for item in yearBounds:
        counter = 0
        while not is_leap(item):
            item = item - 1
            counter -= 1
        major_b.append([item,counter])
    return major_b
def findLastLeapYears(yearBounds):
    forward = year_forward(yearBounds)
    backward = year_backward(yearBounds)
    counter = 0
    for item in forward:
        if abs(item[1]) < abs(backward[counter][1]):
            newLst.append (str(item[0]))
            counter+=1
        elif abs(item[1]) == abs(backward[counter][1]):
            if item[0] < backward[counter][0]:
                newLst.append (str(item[0]))
                counter += 1
            else:
                newLst.append (str(backward[counter][0]))
                counter += 1
        else:
            newLst.append (str(backward[counter][0]))
            counter+=1
    return newLst
I'd avoid trying to roll your own leap year detection code. Use calendar.isleap to determine whether a year is a leap year or not.
Then go in a loop, like this:
import calendar
def find_nearest_leap(year):
    offset = 1
    while True:
        if calendar.isleap(year - offset):
            return year - offset
        if calendar.isleap(year + offset):
            return year + offset
        offset += 1
To find the list of nearest leap years for a list of values, do this:
nearest_leap_years = [find_nearest_leap(year) for year in years]
Where years is the list of years you are interested in.
I'm also assuming the nearest leap year isn't the year itself, which seems to be a constraint of the problem...
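As a quick check, here is the function applied to the years from the question, plus 2008. The block repeats the definition so that it runs on its own. Because year - offset is tested first, a tie such as 1750 (two years from both 1748 and 1752) resolves to the earlier year.

```python
import calendar

def find_nearest_leap(year):
    # Search outward from the year, never returning the year itself.
    offset = 1
    while True:
        if calendar.isleap(year - offset):
            return year - offset
        if calendar.isleap(year + offset):
            return year + offset
        offset += 1

years = [1995, 1750, 2018, 2008]
nearest_leap_years = [find_nearest_leap(year) for year in years]
print(nearest_leap_years)  # [1996, 1748, 2016, 2004]
```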
I'm trying to build a function that receives a date and adds days to it, updating the month and year if they change. So far I've come up with this:
def addnewDate(date, numberOfDays):
    date = date.split(":")
    day = int(date[0])
    month = int(date[1])
    year = int(date[2])
    new_days = 0
    l = 0
    l1 = 28
    l2 = 30
    l3 = 31
    #l's are the accordingly days of the month
    while numberOfDays > l:
        numberOfDays = numberOfDays - l
        if month != 12:
            month += 1
        else:
            month = 1
            year += 1
        if month in [1, 3, 5, 7, 8, 10, 12]:
            l = l3
        elif month in [4, 6, 9, 11]:
            l = l2
        else:
            l = l1
    return str(day) + ':' + str(month) + ':' + str(year) #i'll deal
    #with fact that it doesn't put the 0's in the < 10 digits later
Desired output:
addnewDate('29:12:2016', 5):
'03:01:2017'
I think the problem is with either the variables or where I'm using them, but I'm kind of lost.
Thanks in advance!
P.S. I can't use Python built-in functions :)
Since you cannot use standard library, here's my attempt. I hope I did not forget anything.
- define a table for month lengths
- tweak it if a leap year is detected (every 4 years, but with special cases)
- work on zero-indexed days & months, much easier
- add the number of days. If the result is less than the current month's number of days, stop; else subtract the current month's number of days and retry (while loop)
- when the last month is passed, increase the year
- add 1 to day and month at the end
code:
def addnewDate(date, numberOfDays):
    month_days = [31,28,31,30,31,30,31,31,30,31,30,31]
    date = date.split(":")
    day = int(date[0])-1
    month = int(date[1])-1
    year = int(date[2])
    day += numberOfDays
    done = False # since you don't want to use break, let's create a flag
    while not done:
        nb_days_month = month_days[month]
        # February has 29 days in a leap year (checked for the current year)
        if month==1 and year%4==0 and (year%100!=0 or year%400==0):
            nb_days_month += 1
        if day < nb_days_month:
            done = True
        else:
            day -= nb_days_month
            month += 1
            if month==12:
                year += 1
                month = 0
    return "{:02}:{:02}:{:04}".format(day+1,month+1,year)
test (may not be exhaustive):
for i in ("28:02:2000","28:02:2004","28:02:2005","31:12:2012","03:02:2015"):
    print(addnewDate(i,2))
    print(addnewDate(i,31))
result:
01:03:2000
30:03:2000
01:03:2004
30:03:2004
02:03:2005
31:03:2005
02:01:2013
31:01:2013
05:02:2015
06:03:2015
Of course, this is just for fun. Otherwise, use the time or datetime modules!
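For comparison, when the standard library is allowed, the same operation is a few lines with datetime. This is a minimal sketch assuming the 'DD:MM:YYYY' format from the question; add_days is a hypothetical name.

```python
from datetime import datetime, timedelta

def add_days(date_string, number_of_days):
    """Add days to a 'DD:MM:YYYY' string and return the same format."""
    parsed = datetime.strptime(date_string, "%d:%m:%Y")
    shifted = parsed + timedelta(days=number_of_days)
    return shifted.strftime("%d:%m:%Y")

print(add_days('29:12:2016', 5))  # 03:01:2017
```

It can also serve as a reference for checking a hand-written version against many inputs.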