Pandas: filter by date proximity - python

I have a frame like:
     id         title        date
0  1211  jingle bells  2019-01-15
1  1212  jingle bells  2019-01-15
2  1225      tom boat  2019-06-15
3  2112      tom boat  2019-06-15
4  3122      tom boat  2017-03-15
5  1762      tom boat  2017-03-15
An item is defined as a group of ids with the same title whose dates fall within 70 days of the first date in the group. I need a dictionary of ids grouped into such items. The expected outcome here is:
d = {0: [1211,1212], 1: [1225,2112], 2: [3122,1762]}
Any given title can produce any number of dictionary entries, or just one. Each id belongs to exactly one title. At the moment, I do something like:
itemlist = []
for i in list(df.title):
    dates = list(df.loc[df.title == i, 'date'])
    if (max(dates) - min(dates)).days > 70:
        items = []
        while len(dates) > 0:
            extract = [x for x in dates if (x - min(dates)).days < 70]
            items.append(list(df.loc[(df.title == i) & (df.date.isin(extract)), 'id']))
            dates = [x for x in dates if x not in extract]
    else:
        items = [list(df.loc[df.title == i, 'id'])]
    itemlist += items
d = {j: i for i in range(len(itemlist)) for j in itemlist[i]}
It doesn't quite work yet, I'm bugfixing. That said, I feel like this is a lot of iteration - any ideas on how to do this better?
Another acceptable output would be a list of dataframes, one per item.

I think sorting your dataframe can help you solve the problem much more efficiently.
df = df.sort_values(['title', 'date'])

itemlist = []
counter = 0  # row position, so we can fetch ids in constant time
for title in df['title'].unique():  # unique() keeps the sorted order, so counter stays aligned with the rows
    dates = df.loc[df['title'] == title].date.tolist()
    item = []
    min_date = dates[0]
    for date in dates:
        if (date - min_date).days > 70:   # we need a new item
            itemlist.append(item)         # store the finished item
            item = [df.iloc[counter, 0]]  # start a new item
            min_date = date
        else:
            item.append(df.iloc[counter, 0])
        counter += 1
    itemlist.append(item)
d = {i: j for i, j in enumerate(itemlist)}
print(d)
Even though the code became a bit long, there are only two loops (plus the final one that converts the list into a dict), and together they visit each row only once.
The counter exists so we can use df.iloc, which indexes by position (instead of by label or boolean condition like df.loc) and is therefore an O(1) lookup.
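For reference, here is a hedged sketch of the same grouping expressed with groupby plus transform. The label_windows helper and window column are made-up names, and the rule assumed is that a window restarts from the first date that falls more than 70 days after the current window's start:
import pandas as pd

def label_windows(dates, window_days=70):
    # assign a window number within one title: start a new window when a date
    # falls more than `window_days` after the current window's first date
    anchor = dates.iloc[0]
    labels, current = [], 0
    for d in dates:
        if (d - anchor).days > window_days:
            current += 1
            anchor = d
        labels.append(current)
    return labels

df['date'] = pd.to_datetime(df['date'])   # make sure the column is datetime
df = df.sort_values(['title', 'date'])
df['window'] = df.groupby('title')['date'].transform(label_windows)
d = {i: grp['id'].tolist()
     for i, (_, grp) in enumerate(df.groupby(['title', 'window'], sort=False))}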

Related

I have multiple dictionaries in a Python list. I want to find what they share in values and then compare them, while looping through the list

OK, so here's an example dataset:
returntime = '9:00'
data1 = {'Name': 'jim', 'cardriven': '20123', 'time': '7:30'}
data2 = {'Name': 'bob', 'cardriven': '20123', 'time': '10:30'}
data3 = {'Name': 'jim', 'cardriven': '201111', 'time': '8:30'}
data4 = {'Name': 'bob', 'cardriven': '201314', 'time': '9:30'}
My problem is that I need to be able to loop over these dictionaries, find the car that both of them have driven, and then compare the times they drove it to see who returned the car closest to 9:00.
I have tried many loops and created lists etc., but I know there's got to be a simple way to just say...
for [data1, data2, ...]: who returned the car closest to the time, and here is the info from that record.
Thanks in advance
This will iterate through the data you offered and put the cars in a dictionary that keeps, for each car, the record whose time is closest to the goal.
import datetime

returntime = "09:00"

data = [
    dict(name="Jim", cardriven="20123", time="7:30"),
    dict(name="Bob", cardriven="20123", time="10:30"),
    dict(name="Jim", cardriven="201111", time="8:30"),
    dict(name="Bob", cardriven="201314", time="9:30"),
]

def parsedelta(s):
    # parse "H:MM" into a timedelta from midnight
    t = datetime.datetime.strptime(s, "%H:%M")
    return datetime.timedelta(hours=t.hour, minutes=t.minute)

deltareturn = parsedelta(returntime)

def diffreturn(s):
    return abs(deltareturn.seconds - parsedelta(s).seconds)

cars = {}
for datum in data:
    car = datum["cardriven"]
    if car not in cars:
        cars[car] = datum
        continue
    if diffreturn(datum["time"]) < diffreturn(cars[car]["time"]):
        cars[car] = datum

print(cars)
Since we want to find a car both of them drove, we can create a dictionary where each key is the car driven and each value is a list of name-time pairs, plus a list of the cars both drove. Then we compare the times and see who returned the car closest to returntime.
from datetime import datetime

temp = {}
both_drove = []
for data in [data1, data2, data3, data4]:
    if data['cardriven'] in temp:
        temp[data['cardriven']].append((data['Name'], data['time']))
        both_drove.append(data['cardriven'])
    else:
        temp[data['cardriven']] = [(data['Name'], data['time'])]

returntime = datetime.strptime(returntime, '%H:%M')
for car in both_drove:
    p1, p2 = temp[car]
    if abs(datetime.strptime(p1[1], '%H:%M') - returntime) > abs(datetime.strptime(p2[1], '%H:%M') - returntime):
        print(p2)
    else:
        print(p1)
Output:
('jim', '7:30')
N.B. It's not clear which is closer to returntime, 10:30 or 7:30.
The test data is a bit funky for the question. You are basically looking for a groupby-and-sort approach, but two of the three groups in your test data have only a single entry. Furthermore, for car 20123 the two times are equidistant (delta_min in my answer below) from the returntime, so the sort_values step below won't affect their order. If you know how equidistant entries should be ranked, that is a next step you can work on.
Nevertheless, I think the best course of action is to convert it into a pandas dataframe and create a pipeline. For this data
data1 = {"Name":'jim', "cardriven": '20123', "time":'7:30'}
data2 = {"Name":'bob', "cardriven": '20123', "time":'10:30'}
data3 = {"Name":'jim', "cardriven": '201111', "time":'8:30'}
data4 = {"Name":'bob', "cardriven": '201314', "time":'9:30'}
We can design a pipeline that uses a modified version of the excellent parsedelta function proposed in ljmc's answer.
import datetime
import pandas as pd

data = pd.DataFrame([data1, data2, data3, data4])
#   Name cardriven   time
# 0  jim     20123   7:30
# 1  bob     20123  10:30
# 2  jim    201111   8:30
# 3  bob    201314   9:30

def timedelta(time):
    t = datetime.datetime.strptime(time, "%H:%M")
    return datetime.timedelta(hours=t.hour, minutes=t.minute).seconds / 60

returntime = '9:00'

latest_entries = (
    data
    .assign(delta_min=lambda d: abs(d["time"].apply(timedelta) - timedelta(returntime)))
    .sort_values("delta_min")
    .drop("delta_min", axis=1)  # comment this out if you want the minute difference
    .drop_duplicates(subset="cardriven")
)
print(latest_entries)
Which gives us
Name cardriven time
2 jim 201111 8:30
0 jim 20123 7:30
3 bob 201314 9:30
Going further, we could simplify the pipeline by passing the timedelta function directly as the key parameter in the sort_values step. We also split the timedelta function.
def _timedelta(tm):
    t = datetime.datetime.strptime(tm, "%H:%M")
    return datetime.timedelta(hours=t.hour, minutes=t.minute).seconds / 60

def timedelta(time, rtrn_time):
    return abs(_timedelta(time) - _timedelta(rtrn_time))

returntime = '9:00'

latest_entries = (
    data
    .sort_values("time", key=lambda d: d.apply(timedelta, rtrn_time=returntime))
    .drop_duplicates(subset="cardriven")
)
print(latest_entries)
Name cardriven time
2 jim 201111 8:30
0 jim 20123 7:30
3 bob 201314 9:30
Maybe you can try using only one dict, where each entry is itself a dict and the key is, say, the driver's name or an ID code.
Then you can loop over that dict and find out which entries drove the same car.
Here's a simplified example of what I mean
returntime = '9:00'

data1 = {'Name': 'jim', 'cardriven': '20123', 'time': "7:30"}
data2 = {'Name': 'bob', 'cardriven': '20123', 'time': "10:30"}
data3 = {'Name': 'jim', 'cardriven': '201111', 'time': "8:30"}

records = {}  # a plain dict (renamed from `dict` to avoid shadowing the built-in)
records[0] = data1
records[1] = data2
records[2] = data3

for i in range(len(records)):
    if records[i]["cardriven"] == '20123':
        print(records[i]["Name"])
Output:
jim
bob
Also, a pro tip: you can store the time in the dict as a datetime object, and that would help you greatly in comparing the times.
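As a rough illustration of that tip (the data values here are just for the example), parsing each time once up front lets the records compare directly without re-parsing strings:
from datetime import datetime

returntime = datetime.strptime('9:00', '%H:%M')

# parse each record's time string once, storing a datetime instead of a raw string
records = [
    {'Name': 'jim', 'cardriven': '20123', 'time': datetime.strptime('8:30', '%H:%M')},
    {'Name': 'bob', 'cardriven': '20123', 'time': datetime.strptime('10:30', '%H:%M')},
]

# datetime objects subtract directly, so "closest to returntime" becomes a simple min()
closest = min(records, key=lambda r: abs(r['time'] - returntime))
print(closest['Name'])  # jim -- 8:30 is 30 minutes from 9:00, 10:30 is 90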

Splitting DataFrame into Multiple Frames by Dates in Python

I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if (start != '30'):
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()

def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: DataFrame, but no frame actually came through to the function. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you
Let us assume you have a data frame like this:
import numpy as np
import pandas as pd

date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns=["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
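From there, running the question's calcForDateRange over each segment is just a loop over the dict's items; note that iterating a dict directly yields only keys, which is why the earlier attempt saw dates but no frames:
for period_start, frame in df_dict.items():
    calcForDateRange(frame)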
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}

Pandas: How to map a list of dictionaries in a column to new rows

The dataframe in the format below has to be converted into "op_df":
ip_df=pd.DataFrame({'class':['I','II','III'],'details':[[{'sec':'A','assigned_to':'tom'},{'sec':'B','assigned_to':'sam'}],[{'sec':'B','assigned_to':'joe'}],[]]})
ip_df:
class details
0 I [{'sec':'A','assigned_to':'tom'},{'sec':'B','assigned_to':'sam'}]
1 II [{'sec':'B','assigned_to':'joe'}]
2 III []
The required output dataframe is supposed to be:
op_df:
class sec assigned_to
0 I A tom
1 I B sam
2 II B joe
3 III NaN NaN
How can I turn each dictionary in the "details" column into a new row, with the dictionary keys as column names and the dictionary values as the corresponding column values?
I have tried
ip_df.join(ip_df['details'].apply(pd.Series))
but I am unable to produce a frame like "op_df".
I am sure there are better ways to do it, but I had to deconstruct your details list and create your dataframe as follows:
dict_values = {'class': ['I', 'II', 'III'],
               'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                           [{'sec': 'B', 'assigned_to': 'joe'}],
                           []]}

all_values = []
for cl, detail in zip(dict_values['class'], dict_values['details']):
    if len(detail) > 0:
        for innerdict in detail:
            row = {'class': cl}
            for innerkey in innerdict.keys():
                row[innerkey] = innerdict[innerkey]
            all_values.append(row)
    else:
        row = {'class': cl}
        all_values.append(row)

op_df = pd.DataFrame(all_values)
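For reference, a more pandas-native sketch (assuming pandas 0.25+ for Series.explode; an empty details list explodes to NaN, which matches the class III row of op_df):
import pandas as pd

ip_df = pd.DataFrame({'class': ['I', 'II', 'III'],
                      'details': [[{'sec': 'A', 'assigned_to': 'tom'},
                                   {'sec': 'B', 'assigned_to': 'sam'}],
                                  [{'sec': 'B', 'assigned_to': 'joe'}],
                                  []]})

# one row per inner dict; an empty list becomes a single row holding NaN
exploded = ip_df.explode('details').reset_index(drop=True)

# expand the surviving dicts into columns, then join back on the row index
detail_cols = pd.DataFrame(exploded['details'].dropna().tolist(),
                           index=exploded['details'].dropna().index)
op_df = exploded[['class']].join(detail_cols)
print(op_df)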

How to append a list of elements into a single feature of a dataframe?

I have two dataframes, a df of actors who have a feature that is a list of movie identifier numbers for films that they've worked on. I also have a list of movies that have an identifier number that will show up in the actor's list if the actor was in that movie.
I've attempted to iterate through the movies dataframe, which does produce results but is too slow.
It seems like iterating through the list of movies from the actors dataframe would result in less looping, but I've been unable to save results.
Here is the actors dataframe:
print(actors[['primaryName', 'knownForTitles']].head())
primaryName knownForTitles
0 Rowan Atkinson tt0109831,tt0118689,tt0110357,tt0274166
1 Bill Paxton tt0112384,tt0117998,tt0264616,tt0090605
2 Juliette Binoche tt1219827,tt0108394,tt0116209,tt0241303
3 Linda Fiorentino tt0110308,tt0119654,tt0088680,tt0120655
4 Richard Linklater tt0243017,tt1065073,tt2209418,tt0405296
And the movies dataframe:
print(movies[['tconst', 'primaryTitle']].head())
tconst primaryTitle
0 tt0001604 The Fatal Wedding
1 tt0002467 Romani, the Brigand
2 tt0003037 Fantomas: The Man in Black
3 tt0003593 Across America by Motor Car
4 tt0003830 Detective Craig's Coup
As you can see, the movies['tconst'] identifier shows up in a list in the actors dataframe.
My very slow iteration through the movie dataframe is as follows:
def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    # create an empty feature
    results['cast'] = ""
    # iterate through the movie identifiers
    for index, value in results['tconst'].iteritems():
        # create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        # check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            # set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        # delete cast df to free up memory
        del cast
    return results
This generates some results but is not fast enough to be useful. One observation is that the new dataframe of all the actors whose knownForTitles contain the movie identifier gives a list of names that can be placed into a single feature of the movies dataframe.
In my attempt to loop through the actors dataframe below, however, I don't seem to be able to append anything into the movies dataframe:
def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    # create an empty feature
    results['cast'] = ""
    # iterate through all actors
    for index, value in actor_df['knownForTitles'].iteritems():
        # skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        # generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        # iterate through every movie the actor has been in
        for movie in cinemetography:
            # pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            # continue if empty
            if len(movie_info) == 0:
                continue
            # set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            # delete the df to save space ?maybe
            del movie_info
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
So if I run the above code, I get a very fast result, but the 'cast' field remains empty.
I figured out the problem I was having with the def actors_loop(movie_df, actor_df) function. The problem is that
results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
is setting the value on a copy of the results dataframe, not on results itself. It would be better to use the df.set_value() method or the df.at[] method.
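A minimal sketch of that fix using .at (matches is just an illustrative name; index here is the actor row label from the outer loop):
# inside the `for movie in cinemetography:` loop, instead of chained indexing:
matches = results.index[results['tconst'] == movie]
for row_label in matches:
    # .at writes into results itself rather than into a temporary copy
    results.at[row_label, 'cast'] = actor_df['primaryName'].loc[index]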
I also figured out a much faster solution to the problem: rather than iterating through two dataframes with nested loops, it is better to iterate over each once. So I created a list of tuples:
def actor_tuples(actor_df):
    tuples = []
    for index, value in actor_df['knownForTitles'].iteritems():
        cinemetography = [x.strip() for x in value.split(',')]
        for movie in cinemetography:
            tuples.append((actor_df['primaryName'].loc[index], movie))
    return tuples
This created a list of tuples of the following form:
[('Fred Astaire', 'tt0043044'),
('Lauren Bacall', 'tt0117057')]
I then created a mapping of movie identifier numbers to index positions (from the movie dataframe), which took this form:
{'tt0000009': 0,
'tt0000147': 1,
'tt0000335': 2,
'tt0000502': 3,
'tt0000574': 4,
'tt0000615': 5,
'tt0000630': 6,
'tt0000675': 7,
'tt0000676': 8,
'tt0000679': 9}
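The construction of that mapping is not shown above; a one-line sketch that could build it (using the Mtuples name from the function below, and assuming the movie dataframe keeps its original 0..n-1 index):
# map each tconst to its row index in the movie dataframe
Mtuples = {tconst: idx for idx, tconst in movie_df['tconst'].items()}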
I then used the function below to iterate through the actor tuples, using each movie identifier as the key into the movie dictionary; this returns the correct movie index, which I used to add the actor name to the target dataframe:
def add_cast(movie_df, Atuples, Mtuples):
    results_df = movie_df.copy()
    results_df['cast'] = ''
    counter = 0
    total = len(Atuples)
    for tup in Atuples:
        # this passes the movie ID into the movie dict (Mtuples), which returns an index
        try:
            movie_index = Mtuples[tup[1]]
            if results_df.at[movie_index, 'cast'] == '':
                results_df.at[movie_index, 'cast'] += tup[0]
            else:
                results_df.at[movie_index, 'cast'] += ',' + tup[0]
        except KeyError:
            pass
        # logging
        counter += 1
        if counter % 1000000 == 0:
            logging.warning(f'Index {counter} out of {total}, {100 * counter / total:.1f}% finished')
    return results_df
It ran in 10 minutes (making 2 sets of tuples, then the adding function) for 16.5 million actor tuples. The results are below:
0 tt0000009 Miss Jerry 1894 Romance
1 tt0000147 The Corbett-Fitzsimmons Fight 1897 Documentary,News,Sport
2 tt0000335 Soldiers of the Cross 1900 Biography,Drama
3 tt0000502 Bohemios 1905 \N
4 tt0000574 The Story of the Kelly Gang 1906 Biography,Crime,Drama
cast
0 Blanche Bayliss,Alexander Black,William Courte...
1 Bob Fitzsimmons,Enoch J. Rector,John L. Sulliv...
2 Herbert Booth,Joseph Perry,Orrie Perry,Reg Per...
3 Antonio del Pozo,El Mochuelo,Guillermo Perrín,...
4 Bella Cola,Sam Crewes,W.A. Gibson,Millard John...
Thank you stack overflow!

Pandas: Loop through dataframe without a counter

I have a data frame df which has dates in it:
df['Survey_Date'].head(4)
Out[65]:
0 1990-09-28
1 1991-07-26
2 1991-11-23
3 1992-10-15
I am interested in calculating a metric between two of the dates, using a separate data frame flow_df.
flow_df looks like:
date flow
0 1989-01-01 7480
1 1989-01-02 5070
2 1989-01-03 6410
3 1989-01-04 10900
4 1989-01-05 11700
For instance, I would like to query another data frame based on the current_date and early_date. The first time period of interest would be:
current_date = 1991-07-26
early_date = 1990-09-28
I have written a clunky for loop and it gets the job done, but I am sure there is a more elegant way:
My approach with a counter and for loop:
def find_peak(early_date, current_date, flow_df):
    mask = (flow_df['date'] >= early_date) & (flow_df['date'] < current_date)
    query = flow_df.loc[mask]
    peak_flow = np.max(query['flow']) * 0.3048**3
    return peak_flow

n = 0
for thing in df['Survey_Date'][1:]:
    early_date = df['Survey_Date'][n]
    current_date = thing
    peak_flow = find_peak(early_date, current_date, flow_df)
    n += 1
    df['Avg_Stage'][n] = peak_flow
How can I do this without a counter and for loop?
The desired output looks like:
Survey_Date Avg_Stage
0 1990-09-28
1 1991-07-26 574.831986
2 1991-11-23 526.693347
3 1992-10-15 458.732915
4 1993-04-01 855.168767
5 1993-11-17 470.059653
6 1994-04-07 419.089330
7 1994-10-21 450.237861
8 1995-04-24 498.376500
9 1995-06-23 506.871554
You can define a new variable that identifies the survey period and use pandas.DataFrame.groupby to avoid the for loop. It should be much faster when flow_df is large.
#convert both to datetime, if they are not
df['Survey_Date'] = pd.to_datetime(df['Survey_Date'])
flow_df['date'] = pd.to_datetime(flow_df['date'])
#Merge Survey_Date to flow_df. Most rows of flow_df['Survey_Date'] should be NaT
flow_df = flow_df.merge(df, left_on='date', right_on='Survey_Date', how='outer')
# In case not all Survey_Date in flow_df['date'] or data not sorted by date.
flow_df['date'].fillna(flow_df['Survey_Date'], inplace=True)
flow_df.sort_values('date', inplace=True)
#Identify survey period. In your example: [1990-09-28, 1991-07-26) is represented by 0; [1991-07-26, 1991-11-23) = 1; etc.
flow_df['survey_period'] = flow_df['Survey_Date'].notnull().cumsum()
#calc Avg_Stage in each survey_period. I did .shift(1) because you want to align period [1990-09-28, 1991-07-26) to 1991-07-26
df['Avg_Stage'] = (flow_df.groupby('survey_period')['flow'].max()*0.3048**3).shift(1)
You can use zip():
for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:]):
    # do whatever you want
Of course you can put it into a list comprehension:
[some_metric(early_date, current_date) for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:])]
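Tying that back to the question's find_peak, the comprehension could fill the whole column without a counter. A sketch (the leading None keeps the first survey row empty, matching the desired output):
df['Avg_Stage'] = [None] + [find_peak(early, current, flow_df)
                            for early, current in zip(df['Survey_Date'], df['Survey_Date'][1:])]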
