I have a dataframe:
df1 = pd.DataFrame(
[['2011-01-01','2011-01-03','A'], ['2011-04-01','2011-04-01','A'], ['2012-08-28','2012-08-30','B'], ['2015-04-03','2015-04-05','A'], ['2015-08-21','2015-08-21','B']],
columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains events A and B that occurred over the interval from d0 to d1. (There are actually more events, and they are interleaved, but their date ranges never overlap.) An interval can also be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
[['2011-01-01','A'], ['2011-01-02','A'], ['2011-01-03','A'], ['2011-04-01','A'], ['2012-08-28','B'], ['2012-08-29','B'], ['2012-08-30','B'], ['2015-04-03','A'], ['2015-04-04','A'], ['2015-04-05','A'], ['2015-08-21','B']],
columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried an approach based on resample and comparing areas where ffill equals bfill, but couldn't make it work. What is the simplest way to do this?
We can set_index to event, create a date_range per row with apply, then explode to unwind the ranges and reset_index to rebuild the DataFrame:
df2 = (
df1.set_index('event')
.apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
.explode()
.reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let us try a comprehension to create the (date, event) pairs:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
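A close variant uses itertuples instead of to_numpy, in case the star-unpacking feels too terse; it relies on the same assumption that d0 and d1 are parseable as dates:
pd.DataFrame([(d, row.event) for row in df1.itertuples(index=False)
              for d in pd.date_range(row.d0, row.d1)],
             columns=['Date', 'Event'])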
I don't know if this is the "most simple," but it's the most intuitive way I can think of. I iterate over the rows and unroll each one manually into a new dataframe: for each row, I walk over the dates between d0 and d1, build a row for each date, and collect them all into a dataframe:
import pandas as pd
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        # d0/d1 are strings in df1, so parse them to timestamps first
        start = pd.Timestamp(row['d0'])
        end = pd.Timestamp(row['d1'])
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last day is included
    return pd.DataFrame(rows)
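Called on the example frame, this should reproduce df2 from the question (a quick sanity check rather than a rigorous test):
df2 = unroll_events(df1)
print(df2)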
Related
I need to pick one value per 30-day period across the entire dataframe. For instance, if I have the following dataframe:
df:
Date Value
0 2015-09-25 e
1 2015-11-11 b
2 2015-11-24 c
3 2015-12-02 d
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
8 2016-05-25 a
9 2016-06-15 a
10 2016-06-28 a
I need to pick the first entry, filter out any entry within the next 30 days of it, and then proceed along the dataframe. For instance, indexes 0 and 1 should stay since they are at least 30 days apart, but 2 and 3 are within 30 days of 1, so they should be removed. This should continue chronologically until we have one entry per 30-day period:
Date Value
0 2015-09-25 e
1 2015-11-11 b
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
9 2016-06-15 a
The end result should have only one entry per 30-day period. Any advice or assistance would be greatly appreciated!
I have tried df.groupby(pd.Grouper(freq='M')).first() but that picks the first entry in each month rather than each entry that is at least 30 days from the previous entry.
I came up with a simple iterative solution which uses the fact that the DF is sorted, but it's fairly slow:
index = df.index.values
dates = df['Date'].tolist()
index_to_keep = []
curr_date = None
for i in range(len(dates)):
    if not curr_date or (dates[i] - curr_date).days > 30:
        index_to_keep.append(index[i])
        curr_date = dates[i]
df_out = df.loc[index_to_keep, :]
Any ideas on how to speed it up?
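For clarity, the loop above amounts to the following self-contained function (a sketch, assuming df['Date'] is already datetime64 and the frame is sorted by Date). It still has to walk the rows one by one, because whether a row is kept depends on the previously kept row:
def keep_one_per_30_days(df):
    # days since epoch as plain integers, cheap to compare inside the loop
    days = df['Date'].to_numpy().astype('datetime64[D]').astype('int64')
    keep = []
    last = None
    for pos, d in enumerate(days):
        if last is None or d - last > 30:
            keep.append(pos)
            last = d
    return df.iloc[keep]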
I think this should be what you are looking for.
You need to convert your Date column to a datetime dtype so it isn't interpreted as a string.
Here is what it looks like:
df = pd.DataFrame({'Date': ['2015-09-25', '2015-11-11','2015-11-24', '2015-12-02','2015-12-14'],
'Value' : ['e', 'b', 'c','d','a']})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df = df.groupby(pd.Grouper(freq='30D')).nth(0)
and here is the result
Value
Date
2015-09-25 e
2015-10-25 b
2015-11-24 c
So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different IDs there are per group within the last year. My desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period
Looking at your data structure, I am guessing you have MANY duplicates, so start by dropping them; drop_duplicates tends to be fast.
I am assuming that the df['Period'] column is of dtype datetime64[ns].
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()
results = dict()
for start in df['Period'].drop_duplicates():
    end = start - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # keep one year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # unique (group, ID) pairs in that year
    x = singles.groupby('group').count()
    results[start] = x
results = pd.concat(results, axis=0)
results
ID
group
2013-01-01 A 2
B 1
2013-02-01 A 2
B 2
2013-03-01 A 2
B 2
2013-04-01 A 2
B 3
2014-01-01 A 2
B 3
2014-03-01 A 2
B 1
2014-04-01 A 3
B 2
2014-05-01 A 3
B 2
is that any faster?
p.s. if df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'],format='%Y%m%d', errors='ignore')
Here is a solution using groupby and rolling. Note: your desired output counts a year from YYYY0101 to the next year's YYYY0101 inclusive, so you need a rolling window of 366D instead of 365D.
import numpy as np

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')
df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)
              .reset_index(name='ID_count')
              .drop_duplicates(['group', 'Period'], keep='last'))
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes; I'd expect it to take about 30 minutes.
from dateutil.relativedelta import relativedelta
df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df
df1 = (df.groupby(['Period','group'])['ID']
.apply(lambda x: list(x))
.reset_index())
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(
        df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1))
                & (df1['Period'] <= x['Period'])
                & (df1['group'] == x['group'])].ID.apply(pd.Series).stack())),
    axis=1)
df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)
import pandas as pd
mydate = ["01/01/2018","19/01/2018","24/01/2018" ,
"27/01/2018","29/01/2018","30/01/2018" ,
"22/02/2018","23/03/2018"]
mydate = pd.to_datetime(mydate)
events = ["a" , "b" , "c" , "d" , "e" , "f" ,"g" , "h"]
df = pd.DataFrame({"date" :mydate,"events" :events})
df
date events
0 2018-01-01 a
1 2018-01-19 b
2 2018-01-24 c
3 2018-01-27 d
4 2018-01-29 e
5 2018-01-30 f
6 2018-02-22 g
7 2018-03-23 h
I want to slice the data into 20-day chunks and store each chunk in a separate data frame. I have looked at groupby, date_range, and other functionality but could not find a solution to my problem. I can do this with a typical for loop, but I am looking to do it with some pandas functionality.
Expected result:
df = [df1, df2, df3, df4]
where df1 contains rows 0 and 1,
df2 contains rows 2, 3, 4 and 5,
df3 contains row 6,
df4 contains row 7.
You can use pd.Grouper with freq='20d':
In [8]: final_list = [e for _, e in df.groupby(pd.Grouper(key='date', freq='20d')) if not e.empty]
In [9]: for e in final_list: print(e)
date events
0 2018-01-01 a
1 2018-01-19 b
date events
2 2018-01-24 c
3 2018-01-27 d
4 2018-01-29 e
5 2018-01-30 f
date events
6 2018-02-22 g
date events
7 2018-03-23 h
Here's a solution, though it does use a simple loop:
import pandas as pd
from datetime import timedelta

df = ...  # your dataframe, with a datetime64 'date' column

dfs = []
start = df.date.min()
delta = df.date.max() - start
for i in range(0, delta.days + 1, 20):
    mask = (df['date'] >= start + timedelta(days=i)) & (df['date'] < start + timedelta(days=i + 20))
    dfs.append(df.loc[mask])
I tried this:
import datetime
import numpy as np

minimum = df['date'].min()
df['diff'] = (df['date'] - minimum) / datetime.timedelta(days=1)
# bin the day offsets into 20-day buckets (integer bucket labels)
df['s'] = pd.cut(df['diff'], np.arange(-0.000001, df['diff'].max() + 20, 20), labels=False)
for u, v in df.groupby('s'):
    del v['s']
    print(v)
Output
date events diff
0 2018-01-01 a 0.0
1 2018-01-19 b 18.0
date events diff
2 2018-01-24 c 23.0
3 2018-01-27 d 26.0
4 2018-01-29 e 28.0
5 2018-01-30 f 29.0
date events diff
6 2018-02-22 g 52.0
date events diff
7 2018-03-23 h 81.0
I've got a function that returns two values (call them Site and Date). I'm trying to use df.apply to create two new columns, each holding one of the returned values. I don't want to apply this function two or more times because it will take ages, so I need some way to set two (or more) columns from the two (or more) values the function returns. Here is my code.
df1[['Site','Site Date']] = df1.apply(
lambda row: firstSite(biomass, row['lat'], row['long'], row['Date']),
axis = 1)
The input biomass is a dataframe of coordinates; 'lat', 'long' and 'Date' are all columns of df1. If I apply this function to df['Site'] alone it works perfectly, but when I try to assign the values to two columns I get this error.
ValueError: Shape of passed values is (999, 2), indices imply (999, 28)
def firstSite(biomass, lat, long, date):
    biomass['Date of Operation'] = pd.to_datetime(biomass['Date of Operation'])
    biomass = biomass[biomass['Date of Operation'] <= date]
    biomass['distance'] = biomass.apply(
        lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']),
        axis=1)
    biomass['Site Name'] = np.where((biomass['distance'] <= 2), biomass['Site Name'], "Null")
    biomass = biomass.drop_duplicates('Site Name')
    Site = biomass.loc[biomass['Date of Operation'].idxmin(), 'Site Name']
    Lat = biomass.loc[biomass['Date of Operation'].idxmin(), 'Lat']
    return Site, Lat
This function has a few tasks:
1 - It removes any rows from biomass where the date is after df1['Date'].
2 - If the distance between coordinates is more than 2, the 'Site Name' is changed to 'Null'
3 - It removes any duplicates from the site name, ensuring that there will only be one row with the value 'Null'.
4 - It returns the 'Site Name' & 'Lat' of the row whose 'Date of Operation' is earliest.
I need my code to return the first (by date) record from biomass where the distance between the coordinates from df1 & biomass is less than 2km.
Hopefully I'll be able to return the first record for many different radii, such as the first biomass site within 2km, 4km, 6km, 8km, and 10km.
I think your function needs to return a Series with 2 values:
df1 = pd.DataFrame({'A':list('abcdef'),
'lat':[4,5,4,5,5,4],
'long':[7,8,9,4,2,3],
'Date':pd.date_range('2011-01-01', periods=6),
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df1)
A Date E F lat long
0 a 2011-01-01 5 a 4 7
1 b 2011-01-02 3 a 5 8
2 c 2011-01-03 6 a 4 9
3 d 2011-01-04 9 b 5 4
4 e 2011-01-05 2 b 5 2
5 f 2011-01-06 4 b 4 3
biomass = 10
def firstSite(a,b,c,d):
return pd.Series([a + b, d])
df1[['Site','Site Date']] = df1.apply(lambda row: firstSite(biomass,
row['lat'], row['long'], row['Date']),
axis = 1)
print (df1)
A Date E F lat long Site Site Date
0 a 2011-01-01 5 a 4 7 14 2011-01-01
1 b 2011-01-02 3 a 5 8 15 2011-01-02
2 c 2011-01-03 6 a 4 9 14 2011-01-03
3 d 2011-01-04 9 b 5 4 15 2011-01-04
4 e 2011-01-05 2 b 5 2 15 2011-01-05
5 f 2011-01-06 4 b 4 3 14 2011-01-06
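Building on that pattern, the multi-radius goal from the question could be sketched roughly as below. Everything here is hypothetical: the tiny biomass table, the distance_km stand-in for distanceBetweenCm, and the toy df1 exist only so the example runs end to end.
import numpy as np
import pandas as pd

# hypothetical stand-in data
biomass = pd.DataFrame({
    'Site Name': ['S1', 'S2', 'S3'],
    'Lat': [50.00, 50.10, 51.00],
    'Lng': [0.00, 0.10, 1.00],
    'Date of Operation': pd.to_datetime(['2014-01-01', '2015-06-01', '2013-03-01'])})
df1 = pd.DataFrame({'lat': [50.05], 'long': [0.05],
                    'Date': pd.to_datetime(['2016-01-01'])})

def distance_km(lat1, lng1, lat2, lng2):
    # crude equirectangular approximation, a placeholder for distanceBetweenCm
    dy = (lat2 - lat1) * 111.32
    dx = (lng2 - lng1) * 111.32 * np.cos(np.radians((lat1 + lat2) / 2))
    return np.hypot(dx, dy)

def first_site(biomass, lat, lng, date, radius_km):
    # sites that were already operating on `date` and lie within `radius_km`
    dist = distance_km(lat, lng, biomass['Lat'], biomass['Lng'])
    cand = biomass[(biomass['Date of Operation'] <= date) & (dist <= radius_km)]
    if cand.empty:
        return pd.Series(['Null', pd.NaT])
    first = cand.loc[cand['Date of Operation'].idxmin()]
    return pd.Series([first['Site Name'], first['Date of Operation']])

for r in (2, 4, 6, 8, 10):
    df1[[f'Site {r}km', f'Site Date {r}km']] = df1.apply(
        lambda row: first_site(biomass, row['lat'], row['long'], row['Date'], r),
        axis=1)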
I have a pandas data frame mydf that has two columns, mydate and mytime, and both are datetime dtypes. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient, as it loops through the data frame at least three times. I would just like to know whether there's a faster or more optimal way to do this, for example using zip or merge. If I instead create one function that returns three elements, how should I implement it? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()
Here's one approach to do it using a single apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll take the lambda function out to a separate line for readability and define it like:
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
And, apply and store the result to df[['hour', 'weeknum', 'weekday']] (the column order matches the Series returned by lambdafunc):
In [66]: df[['hour', 'weeknum', 'weekday']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weeknum weekday
0 2011-01-01 2011-11-14 0 52 5
1 2011-01-02 2011-11-15 0 52 6
2 2011-01-03 2011-11-16 0 1 0
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 1 2
5 2011-01-06 2011-11-19 0 1 3
6 2011-01-07 2011-11-20 0 1 4
7 2011-01-08 2011-11-21 0 1 5
8 2011-01-09 2011-11-22 0 1 6
9 2011-01-10 2011-11-23 0 2 0
10 2011-01-11 2011-11-24 0 2 1
11 2011-01-12 2011-11-25 0 2 2
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weeknum', 'weekday']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
You can do this in a somewhat cleaner way by having the function you apply return a pd.Series with named elements:
def process(row):
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))
my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
def getWd(d):
    return d.isocalendar()[1], d.weekday()  # (week number, weekday)

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weeknum"], mydf["weekday"] = zip(*mydf["mydate"].map(getWd))