Iteratively Subset DataFrame based on Unique Times - python

Given the following example DataFrame:
>>> df
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
3 05/11/2017 08:25:20 4
4 05/11/2017 08:30:14 5
5 05/11/2017 08:30:35 6
I want to subset this DataFrame on the 'Times' column by matching a partial string up to the hour. For example, subsetting on the partial strings "05/10/2017 01:" and "05/11/2017 08:" breaks the data up into two new DataFrames:
>>> df1
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
and
>>> df2
0 05/11/2017 08:25:20 4
1 05/11/2017 08:30:14 5
2 05/11/2017 08:30:35 6
Is it possible to make this subset iterative in Pandas, for multiple dates/times that similarly have the date/hour as the common identifier?

First, cast your Times column into a datetime format, and set it as the index:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace = True)
Then use the groupby method with a time-based grouper (pd.TimeGrouper('h') in older pandas; in current versions use pd.Grouper(freq='h')):
g = df.groupby(pd.Grouper(freq='h'))
Iterating over g yields (timestamp, sub-DataFrame) pairs. If you just want the sub-DataFrames, you can do [frame for _, frame in g] (the older zip(*g)[1] only works in Python 2, where zip returns a list).
A caveat: the sub-DataFrames are indexed by the timestamp, and a frequency-based grouper only works when the times are the index. If you want to keep the timestamp as a column, you could instead do:
df['Times'] = pd.to_datetime(df['Times'])
df['time_hour'] = df['Times'].dt.floor('1h')
g = df.groupby('time_hour')
Alternatively, you could just call .reset_index() on each of the dfs from the former method, but this will probably be much slower.
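To make the subsetting iterative, as the question asks, you can loop over the groups directly and collect one sub-DataFrame per hour; a minimal sketch, assuming the time_hour column built above:
frames = {hour: sub for hour, sub in df.groupby('time_hour')}
for hour, sub in frames.items():
    print(hour, len(sub))  # e.g. 2017-05-10 01:00:00 -> 3 rows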

Convert Times to an hourly period, group by it, and then extract each group as a DataFrame.
df1, df2 = [g.drop(columns='hour') for n, g in
            df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h')).groupby('hour')]
df1
Out[874]:
Times Values
0 2017-05-10 01:01:03 1
1 2017-05-10 01:05:00 2
2 2017-05-10 01:06:10 3
df2
Out[875]:
Times Values
3 2017-05-11 08:25:20 4
4 2017-05-11 08:30:14 5
5 2017-05-11 08:30:35 6

First, make sure that the Times column is a datetime type.
Second, set the Times column as the index.
Third, use the between_time method:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace=True)
df1 = df.between_time('1:00:00', '1:59:59')
df2 = df.between_time('8:00:00', '8:59:59')
Note that between_time filters on the time of day across all dates, so it only separates these groups because each hour of the day occurs on a single date in this example.

If you use the datetime type you can extract components such as the hour and day:
times = pd.to_datetime(df['Times'])
hours = times.dt.hour
df1 = df[hours == 1]
This filters on the hour of day alone, so combine it with a date condition if the same hour can occur on more than one date.

You can use the str[] accessor to truncate the string representation of your date (you might have to cast with astype(str) if your column is a datetime), and then use groupby.groups to access the groups as a dictionary whose keys are your truncated date values:
>>> df.groupby(df.Times.astype(str).str[0:13]).groups
{'2017-05-10 01': DatetimeIndex(['2017-05-10 01:01:03', '2017-05-10 01:05:00',
'2017-05-10 01:06:10'],
dtype='datetime64[ns]', name='time', freq=None),
'2017-05-11 08': DatetimeIndex(['2017-05-11 08:25:20', '2017-05-11 08:30:14',
'2017-05-11 08:30:35'],
dtype='datetime64[ns]', name='time', freq=None)}
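If you want the sub-DataFrames themselves rather than their index labels, you can iterate over the same groupby, or pull out a single group with get_group; a small sketch assuming the truncated-string keys shown above:
grouped = df.groupby(df.Times.astype(str).str[0:13])
frames = {key: sub for key, sub in grouped}       # {'2017-05-10 01': DataFrame, ...}
first_hour = grouped.get_group('2017-05-10 01')   # one sub-DataFrame directly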

Related

How to assign random values from a list to a column in a pandas dataframe?

I am working with Python in BigQuery and have a large DataFrame df (circa 7m rows). I also have a list lst that holds some dates (say all days in a given month).
I am trying to create an additional column "random_day" in df with a random value from lst in each row.
I tried a loop and an apply function, but with such a large dataset both are proving challenging.
My first attempt was the loop solution:
df["rand_day"] = ""
for i in a["row_nr"]:
    rand_day = sample(day_list, 1)[0]
    df.loc[i, "rand_day"] = rand_day
And the apply solution, defining first my function and then calling it:
def random_day():
    rand_day = sample(day_list, 1)[0]
    return day

df["rand_day"] = df.apply(lambda row: random_day())
Any tips on this?
Thank you
Use numpy.random.choice and, if necessary, convert the dates with to_datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
})
day_list = pd.to_datetime(['2015-01-02', '2016-05-05', '2015-08-09'])
# alternative
# day_list = pd.DatetimeIndex(['2015-01-02', '2016-05-05', '2015-08-09'])
df["rand_day"] = np.random.choice(day_list, size=len(df))
print(df)
A B rand_day
0 a 4 2016-05-05
1 b 5 2016-05-05
2 c 4 2015-08-09
3 d 5 2015-01-02
4 e 5 2015-08-09
5 f 4 2015-08-09
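If you need the sampled days to be reproducible between runs, you can seed NumPy's random state before drawing (the seed value below is arbitrary):
np.random.seed(42)  # fix the seed so the same random days are drawn on every run
df["rand_day"] = np.random.choice(day_list, size=len(df))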

how to replace string at specific index in pandas dataframe

I have following dataframe in pandas
code bucket
0 08:30:00-9:00:00
1 10:00:00-11:00:00
2 12:00:00-13:00:00
I want to replace the 0 at index 7 (the 8th character) with 1; my desired dataframe is
code bucket
0 08:30:01-9:00:00
1 10:00:01-11:00:00
2 12:00:01-13:00:00
How to do it in pandas?
Use indexing with str:
df['bucket'] = df['bucket'].str[:7] + '1' + df['bucket'].str[8:]
Or list comprehension:
df['bucket'] = [x[:7] + '1' + x[8:] for x in df['bucket']]
print (df)
code bucket
0 0 08:30:01-9:00:00
1 1 10:00:01-11:00:00
2 2 12:00:01-13:00:00
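A related vectorized option, assuming the character to change always sits at string index 7, is Series.str.slice_replace, which does the slice-and-concatenate in a single call:
df['bucket'] = df['bucket'].str.slice_replace(start=7, stop=8, repl='1')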
Avoid string operations where possible
You lose a considerable amount of functionality by working with strings only. While this may be a one-off operation, you will find that repeated string manipulations will quickly become expensive in terms of time and memory efficiency.
Use pd.to_datetime instead
You can add additional series to your dataframe with datetime objects. Below is an example which, in addition, creates an object dtype series in the format you desire.
# split by '-' into 2 series
dfs = df.pop('bucket').str.split('-', expand=True)
# convert to datetime
dfs = dfs.apply(pd.to_datetime, axis=1)
# add 1s to first series
dfs[0] = dfs[0] + pd.Timedelta(seconds=1)
# create object series from 2 times
form = '%H:%M:%S'
dfs[2] = dfs[0].dt.strftime(form) + '-' + dfs[1].dt.strftime(form)
# join to original dataframe
res = df.join(dfs)
print(res)
code 0 1 2
0 0 2018-10-02 08:30:01 2018-10-02 09:00:00 08:30:01-09:00:00
1 1 2018-10-02 10:00:01 2018-10-02 11:00:00 10:00:01-11:00:00
2 2 2018-10-02 12:00:01 2018-10-02 13:00:00 12:00:01-13:00:00

How to reduce time complexity or improve the efficiency of the program finding gaps of month using python pandas

Input is like this
Data Id
201505 A
201507 A
201509 A
200001 B
200001 C
200002 C
200005 C
I am finding the date gaps with the code below, but it is taking too long to complete for large data. How can I reduce its time complexity?
#convert to datetimes
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
#resample by start of months with asfreq
mdf = month.set_index('data').groupby(['series_id','symbol'])['series_id'].resample('MS').asfreq().rename('val').reset_index()
x = mdf['val'].notnull().rename('g')
#create index by cumulative sum for unique groups for consecutive NaNs
mdf.index = x.cumsum()
#filter only NaNs row and aggregate first, last and count.
mdf = (mdf[~x.values].groupby(['series_id','symbol','g'])['data'].agg(['first','last','size']).reset_index(level=2, drop=True).reset_index())
print(mdf)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
How can I reduce the time complexity, or is there another way to find the date gaps?
The assumptions made are the following:
All values in the Data column are unique, even across groups
The data in the data column are integers
The data is sorted by group first and then by value.
Here is my algorithm (mdf is the input df):
import pandas as pd
df2 = pd.DataFrame({'Id': mdf['Id'], 'First': mdf['Data'] + 1,
                    'Last': (mdf['Data'] - 1).shift(-1)})
# drop the last row of each Id group; its 'Last' value is shifted in from the next group
df2 = df2.groupby('Id').apply(lambda g: g[g['First'] != g['First'].max()]).reset_index(drop=True)
# keep only candidate gaps whose endpoints are absent from the data
print(df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])])
Building on @RushabhMehta's idea, you can use pd.DateOffset to create the output DataFrame. Your input DataFrame is called month, with columns 'data' and 'series_id', according to your code. Here is the idea:
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id','data'])
# create mdf with the column you want
mdf = pd.DataFrame({'Id': month.series_id,
                    'first': month.data + pd.DateOffset(months=1),
                    'last': month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1)})
Note how the 'last' column is created: group by 'series_id', shift the values, and subtract a month with pd.DateOffset(months=1). Now select only the rows where the date in 'first' is on or before the one in 'last', and create the size column:
mdf = mdf.loc[mdf['first'] <= mdf['last']]
mdf['size'] = (mdf['last']- mdf['first']).astype('timedelta64[M]')+1
mdf looks like:
first Id last size
0 2015-06-01 A 2015-06-01 1.0
1 2015-08-01 A 2015-08-01 1.0
3 2000-02-01 B 2000-02-01 1.0
6 2000-03-01 C 2000-04-01 2.0
Just reorder the columns and reset_index if you want.
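If you also want the individual missing months rather than just the first/last/size summary, one possible follow-up (using the mdf produced above) is to expand each gap with pd.date_range:
missing = [(row['Id'], pd.date_range(row['first'], row['last'], freq='MS'))
           for _, row in mdf.iterrows()]
# e.g. the C gap expands to DatetimeIndex(['2000-03-01', '2000-04-01'], freq='MS')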

How to correctly perform all-VS-all row-by-row comparisons between series in two pandas dataframes?

I have two pandas dataframes, df1 and df2. Both contain time series data.
df1
Event Number Timestamp_A
A 1 7:00
A 2 8:00
A 3 9:00
df2
Event Number Timestamp_B
B 1 9:01
B 2 8:01
B 3 7:01
Basically, I want to determine the Event B which is closest to Event A, and assign this correctly.
Therefore, I need to subtract every Timestamp_B in df2 from every Timestamp_A in df1, row by row. This results in a series of values, of which I want to take the minimum and put it in a new column in df1:
Event Number Timestamp_A Closest_Timestamp_B
A 1 7:00 7:01
A 2 8:00 8:01
A 3 9:00 9:01
I am not familiar with row-by-row operations in pandas.
When I am doing:
for index, row in df1.iterrows():
    s = df1.Timestamp_A.values - df2["Timestamp_B"][:]
    Closest_Timestamp_B = s.min()
The result I get is a ValueError:
ValueError: operands could not be broadcast together with shapes(3,) (4,)
How to correctly perform row-by-row comparisons between two pandas dataframes?
There might be a better way to do this but here is one way:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': ['7:00', '8:00', '9:00']})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': ['7:01', '8:01', '9:01']})
df1['Closest_timestamp_B'] = np.zeros(len(df1.index))
for index, row in df1.iterrows():
    df1.loc[index, 'Closest_timestamp_B'] = df2.Timestamp_B.loc[
        np.argmin(np.abs(pd.to_datetime(df2.Timestamp_B) - pd.to_datetime(row.Timestamp_A)))]
df1
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
Your best bet is to use the underlying numpy data structure to create a matrix of Timestamp_A by Timestamp_B. Since you need to compare every event in A to every event in B, this is an O(N^2) calculation, well suited for a matrix.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['A', 1, '7:00'],
                    ['A', 2, '8:00'],
                    ['A', 3, '9:00']], columns=['Event', 'Number', 'Timestamp_A'])
df2 = pd.DataFrame([['B', 1, '9:01'],
                    ['B', 2, '8:01'],
                    ['B', 3, '7:01']], columns=['Event', 'Number', 'Timestamp_B'])
df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)
# create a matrix with the index of df1 as the row index, and the index
# of df2 as the column index
M = df1.Timestamp_A.values.reshape((len(df1),1)) - df2.Timestamp_B.values
# use argmin along each row to find the df2 position with the smallest absolute difference
index_of_B = np.abs(M).argmin(axis=1)
# use .values so the result is assigned positionally rather than aligned on df2's index
df1['Closest_timestamp_B'] = df2.Timestamp_B.values[index_of_B]
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 2017-07-05 07:00:00 2017-07-05 07:01:00
1 A 2 2017-07-05 08:00:00 2017-07-05 08:01:00
2 A 3 2017-07-05 09:00:00 2017-07-05 09:01:00
If you want to return to the original formatting for the timestamps, you can use:
df1.Timestamp_A = df1.Timestamp_A.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1.Closest_timestamp_B = df1.Closest_timestamp_B.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
What about using merge_asof to get the closest events?
Make sure your data types are correct:
df1.Timestamp_A = df1.Timestamp_A.apply(pd.to_datetime)
df2.Timestamp_B = df2.Timestamp_B.apply(pd.to_datetime)
Sort by the times:
df1.sort_values('Timestamp_A', inplace=True)
df2.sort_values('Timestamp_B', inplace=True)
Now you can merge the two dataframes on the closest time:
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    suffixes=('_df2', '_df1'))
#clean up the datetime formats
df3[['Timestamp_A', 'Timestamp_B']] = df3[['Timestamp_A', 'Timestamp_B']] \
    .applymap(lambda ts: ts.time())
#put df1 columns on the right
df3 = df3.iloc[:,::-1]
print(df3)
Timestamp_A Number_df1 Event_df1 Timestamp_B Number_df2 Event_df2
0 07:00:00 1 A 07:01:00 3 B
1 08:00:00 2 A 08:01:00 2 B
2 09:00:00 3 A 09:01:00 1 B
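Note that merge_asof matches backward by default (for each Timestamp_B it takes the last Timestamp_A at or before it), which happens to be what this example needs. If you want the genuinely nearest event in either direction, you can pass direction='nearest':
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    direction='nearest',
                    suffixes=('_df2', '_df1'))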
Use apply to compare Timestamp_A on each row with all Timestamp_B and get the index of the row with min diff, then extract Timestamp_B using the index.
df1['Closest_Timestamp_B'] = (
    df1.apply(lambda x: abs(pd.to_datetime(x.Timestamp_A).value -
                            df2.Timestamp_B.apply(lambda y: pd.to_datetime(y).value))
              .idxmin(), axis=1)
       .apply(lambda x: df2.Timestamp_B.loc[x])
)
df1
Out[271]:
Event Number Timestamp_A Closest_Timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01

Slice by date in pandas without re-indexing

I have a pandas dataframe where one of the columns is made up of strings representing dates, which I then convert to python timestamps by using pd.to_datetime().
How can I select the rows in my dataframe that meet conditions on the date?
I know you can use the index (like in this question) but my timestamps are not unique.
How can I select the rows where the 'Date' field is say, after 2015-03-01?
You can use a mask on the date, e.g.
df[df['date'] > '2015-03-01']
Here is a full example:
>>> df = pd.DataFrame({'date': pd.date_range('2015-02-15', periods=5, freq='W'),
'val': np.random.random(5)})
>>> df
date val
0 2015-02-15 0.638522
1 2015-02-22 0.942384
2 2015-03-01 0.133111
3 2015-03-08 0.694020
4 2015-03-15 0.273877
>>> df[df.date > '2015-03-01']
date val
3 2015-03-08 0.694020
4 2015-03-15 0.273877
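The same masking pattern extends to date ranges: combine two comparisons with &, or use Series.between, which is inclusive on both ends by default. Using the example df above:
>>> df[(df.date > '2015-03-01') & (df.date <= '2015-03-08')]  # keeps only row 3
>>> df[df.date.between('2015-03-02', '2015-03-08')]           # same row, inclusive bounds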
