Modifying a dataframe with day offset - python

I am dealing with a very large dataframe. A small sample is below:
import pandas as pd
df = pd.DataFrame({'nodes': ['A', 'B', 'C'],
                   'dept': ['20:00', '02:00', '21:00'],
                   'arrv': ['20:00', '17:00', '21:00'],
                   'dept_offset_day': [0, 1, 0],
                   'arrv_offset_day': [0, 1, 0],
                   'stop_num': [0, 1, 2]})
print(df)
  nodes   dept   arrv  dept_offset_day  arrv_offset_day  stop_num
0     A  20:00  20:00                0                0         0
1     B  02:00  17:00                1                1         1
2     C  21:00  21:00                0                0         2
I am trying to 1) add a date to the start and end times, taking the day offsets into account, and 2) break the nodes column into two columns, nodes_start and nodes_end, i.e. point to point. Something like:
nodes_start nodes_end start_datetime end_datetime
A B 2019-5-9 20:00 2019-5-10 02:00
B C 2019-5-10 17:00 2019-5-10 21:00
I tried using pd.offsets.Day() and looping through each line, but that makes the execution time very slow and I get wrong dates. Thanks for your help.

Try constructing a new data-frame, with new columns (copied columns really :D):
df2 = pd.DataFrame()
# Pair consecutive stops: legs start at stops A, B and end at stops B, C
df2['nodes_start'] = df['nodes'][:2]
df2['nodes_end'] = df['nodes'][-2:].reset_index(drop=True)
# Note: pd.to_datetime on a bare time string fills in the current date,
# so the dates in the output below correspond to the day the snippet was run (2019-05-10 here)
df2['start_datetime'] = pd.to_datetime(df['arrv'][:2])
df2['end_datetime'] = pd.to_datetime(df['dept'][-2:].reset_index(drop=True))
# Shift the first leg's start back one day to account for the overnight offset
df2['start_datetime'] = [df2['start_datetime'][0] - pd.Timedelta(days=1)] + [df2['start_datetime'][1]]
print(df2)
Output:
nodes_start nodes_end start_datetime end_datetime
0 A B 2019-05-09 20:00:00 2019-05-10 02:00:00
1 B C 2019-05-10 17:00:00 2019-05-10 21:00:00
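If the offset columns should drive the date arithmetic directly (rather than hard-coding the adjustment for this particular sample), a vectorized sketch along these lines avoids the per-row loop. The base date 2019-05-09 is an assumption, since the question does not say which calendar date day 0 corresponds to, and the leg convention (start = arrv of a stop, end = dept of the next stop) follows the answer above:
import pandas as pd

df = pd.DataFrame({'nodes': ['A', 'B', 'C'],
                   'dept': ['20:00', '02:00', '21:00'],
                   'arrv': ['20:00', '17:00', '21:00'],
                   'dept_offset_day': [0, 1, 0],
                   'arrv_offset_day': [0, 1, 0],
                   'stop_num': [0, 1, 2]})

base = pd.Timestamp('2019-05-09')  # assumed "day 0" date, not given in the question

# Build full timestamps: base date + time of day + day offset
arrv_dt = (base
           + pd.to_timedelta(df['arrv'] + ':00')
           + pd.to_timedelta(df['arrv_offset_day'], unit='D'))
dept_dt = (base
           + pd.to_timedelta(df['dept'] + ':00')
           + pd.to_timedelta(df['dept_offset_day'], unit='D'))

# Pair each stop with the next one in a single step
legs = pd.DataFrame({'nodes_start': df['nodes'].iloc[:-1].values,
                     'nodes_end': df['nodes'].iloc[1:].values,
                     'start_datetime': arrv_dt.iloc[:-1].values,
                     'end_datetime': dept_dt.iloc[1:].values})
print(legs)
With the sample offsets exactly as given (C has dept_offset_day 0), the last end_datetime comes out as 2019-05-09 21:00 rather than the 2019-05-10 shown in the question, so check whether the offsets are counted from day 0 or from the previous stop; in the latter case, take a cumulative sum of the offset columns first.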

Related

Look up a time if it falls within a time range and return the corresponding value in Pandas?

Still trying to learn Pandas. Let's assume a dataframe includes the start and end of an event for an event_type and the event's Val. Here is an example:
>>> df = pd.DataFrame({ 'start': ["11:00","13:00", "14:00"], 'end': ["12:00","14:00", "15:00"], 'event_type':[1,2,3], 'Val':['a','b','c']})
>>> df['start'] = pd.to_datetime(df['start'])
>>> df['end'] = pd.to_datetime(df['end'])
>>> df
                start                 end  event_type Val
0 2021-03-05 11:00:00 2021-03-05 12:00:00           1   a
1 2021-03-05 13:00:00 2021-03-05 14:00:00           2   b
2 2021-03-05 14:00:00 2021-03-05 15:00:00           3   c
What is the best way, for example, to find the corresponding value in the Val column for an event of event_type 1 that starts at 11:10 and ends at 11:30? For this example, since the start and end times fall within the first row of the df, it should return a.
Try pd.IntervalIndex.from_arrays
df.index = pd.IntervalIndex.from_arrays(left = df.start, right = df.end)
Output like below
df.loc['11:30']
Out[73]:
start 2021-03-05 11:00:00
end 2021-03-05 12:00:00
event_type 1
Val a
Name: (2021-03-05 11:00:00, 2021-03-05 12:00:00], dtype: object
df.loc['11:30','Val']
Out[75]: 'a'
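The lookup above checks a single point in time. To require that both the start and the end of the candidate event fall inside the same interval, and to filter on event_type as well, a sketch like the following could work (the query timestamps are parsed the same way as the start/end columns, so they all land on the current date):
import pandas as pd

df = pd.DataFrame({'start': ["11:00", "13:00", "14:00"],
                   'end': ["12:00", "14:00", "15:00"],
                   'event_type': [1, 2, 3],
                   'Val': ['a', 'b', 'c']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df.index = pd.IntervalIndex.from_arrays(left=df.start, right=df.end)

# Candidate event: starts 11:10, ends 11:30, event_type 1
q_start, q_end, q_type = pd.to_datetime('11:10'), pd.to_datetime('11:30'), 1

# Keep rows whose interval contains both endpoints and whose type matches
mask = (df.index.contains(q_start)
        & df.index.contains(q_end)
        & (df['event_type'] == q_type).to_numpy())
print(df.loc[mask, 'Val'])   # prints the matching Val ('a' here)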

Python/Pandas - TypeError when concatenating MultiIndex DataFrames

I have trouble concatenating a list of MultiIndex DataFrames with 2 levels, and adding a third one to distinguish them.
As an example, I have the following input data.
import pandas as pd
import numpy as np
# Input data
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')
midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
So df1 & df2 have data for the same tag 1h, and while they have the same column names at the Data level, they don't have the same column names at the Position level.
df1
Data Sup Inf
Position 1 2 1 2
2020-01-01 00:00 0.660795 0.538452 0.861801 0.502479
2020-01-01 01:00 0.205806 0.847124 0.474861 0.906546
2020-01-01 02:00 0.681480 0.479512 0.631771 0.961844
df2
Data Sup Inf
Position 3 3
2020-01-01 00:00 0.758533 0.672899
2020-01-01 01:00 0.096463 0.304843
2020-01-01 02:00 0.080504 0.990310
Now, df3 and df4 follow the same logic and same column names. To distinguish them from df1 & df2, I want to use a different tag, 2h for instance.
I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.
df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)
But this raises the following error.
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'
Please, any idea what the correct call for this is? Thanks for your help.
EDIT 05/05
As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
A quick fix would be to use different names in period_list and rename just after the concat. Something like:
df_list = [df1, df2, df3, df4]
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
concatenated = pd.concat(df_list,
                         keys=period_list,
                         names=('Period', 'Data', 'Position'),
                         axis=1)\
               .rename(columns={col: col.split('_')[0] for col in period_list},
                       level='Period')
print (concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
Edit: as speed is a concern, it seems that rename is slow, so you can do:
concatenated = pd.concat(df_list,
                         keys=period_list,
                         axis=1)
concatenated.columns = pd.MultiIndex.from_tuples(
    [(col[0].split('_')[0], col[1], col[2]) for col in concatenated.columns],
    names=('Period', 'Data', 'Position'))
Consider an inner concat on similar data frames then run a final concat to bind all together:
concatenated = pd.concat([pd.concat([df1, df2], axis=1),
                          pd.concat([df3, df4], axis=1)],
                         keys=['1h', '2h'],
                         names=('Period', 'Data', 'Position'),
                         axis=1)
print(concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.189802 0.675083 0.624484 0.781774 0.453101 0.224525
2020-01-01 01:00 0.249818 0.829180 0.190488 0.923107 0.495873 0.278201
2020-01-01 02:00 0.602634 0.494915 0.612672 0.903609 0.426809 0.248981
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.746499 0.385714 0.008561 0.961152 0.988231 0.897454
2020-01-01 01:00 0.643730 0.365023 0.812249 0.291733 0.045417 0.414968
2020-01-01 02:00 0.887567 0.680102 0.978388 0.018501 0.695866 0.679730

Inserting flag on occurrence of date

I have a pandas dataframe data:
Round Number Date
1 7/4/2018 20:00
1 8/4/2018 16:00
1 8/4/2018 20:00
1 9/4/2018 20:00
Now I want to create a new dataframe which has two columns
['Date' ,'flag']
The Date column will cover the range of dates in the data dataframe, expanded to whole months: in the actual data the dates run from 7/4/2018 8:00:00 PM to 27/05/2018 19:00, so the Date column in the new dataframe will have dates from 1/4/2018 to 30/05/2018 (since 7/4/2018 8:00:00 PM is in April, the whole month of April is included, and similarly, since 27/05/2018 is in May, the dates from 1/05/2018 to 30/05/2018 are included).
In the flag column, we put 1 if that particular date appears in the old dataframe.
Output(partial)-
Date Flag
1/4/2018 0
2/4/2018 0
3/4/2018 0
4/4/2018 0
5/4/2018 0
6/4/2018 0
7/4/2018 1
8/4/2018 1
and so on...
I would use np.where() to address this issue. Furthermore, the edit below improves the answer by deriving new_df's date range from old_df.
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00'],'value':[1,2]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
new_df = pd.DataFrame({'date':pd.date_range(start='4/1/2018',end='5/30/2018',freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
print(new_df.head(10))
Output:
date flag
0 2018-04-01 0
1 2018-04-02 0
2 2018-04-03 0
3 2018-04-04 0
4 2018-04-05 0
5 2018-04-06 0
6 2018-04-07 1
7 2018-04-08 1
8 2018-04-09 0
9 2018-04-10 0
Edit:
Improved version, full code:
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00','5/30/2018 20:00'],'value':[1,2,3]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
if old_df['date'].min().month < 10:
    start_date = pd.to_datetime(
        "01/0" + str(old_df['date'].min().month) + "/" + str(old_df['date'].min().year))
else:
    start_date = pd.to_datetime(
        "01/" + str(old_df['date'].min().month) + "/" + str(old_df['date'].min().year))
end_date = pd.to_datetime(old_df['date'].max())
new_df = pd.DataFrame({'date': pd.date_range(start=start_date, end=end_date, freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()), 1, 0)
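A shorter way to get the month start is to go through a monthly period; this is a sketch with the same intent as the string building above, and it also sidesteps the day/month parsing ambiguity of strings like "01/04/2018":
import pandas as pd

old_df = pd.DataFrame({'date': ['4/7/2018 20:00', '4/8/2018 20:00', '5/30/2018 20:00'],
                       'value': [1, 2, 3]})
old_df['date'] = pd.to_datetime(old_df['date'], infer_datetime_format=True)

# First day of the earliest month, and the last date present in the data
start_date = old_df['date'].min().to_period('M').to_timestamp()
end_date = old_df['date'].max().normalize()

new_df = pd.DataFrame({'date': pd.date_range(start=start_date, end=end_date, freq='d')})
new_df['flag'] = new_df['date'].isin(old_df['date'].dt.normalize()).astype(int)
print(new_df.head(10))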

Python - Select min values in dataframe

I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' value for each user on the same date?
So I want a data frame with the same structure, but only one 'Time' per 'Date' per user.
It should look like this:
Sort values by the Time column and check for duplicates in Date+User_name. However, to make sure e.g. 9:00 sorts before 10:00 (which is not the case for plain strings), we can convert the strings to datetimes first.
import pandas as pd
data = {
    'User_name': ['user1', 'user1', 'user1', 'user2'],
    'Date': ['8/29/2016', '8/29/2016', '8/31/2016', '8/31/2016'],
    'Time': ['9:07:41', '9:07:42', '9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
       .duplicated(['Date', 'User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way would be to remake the df['Time'] column as datetime, group it by Date and User_name, and get the idxmin(). This gives us our mask. (Credit to jezrael.)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
Update 1: User included in the grouping.
Not the best way, but simple:
import numpy as np
import pandas as pd

df = (pd.DataFrame(np.datetime64('2016') +
                   np.random.randint(0, 3*24, size=(7, 1)).astype('<m8[h]'),
                   columns=['DT'])
        .join(pd.Series(list('abcdefg'), name='str_val'))
        .join(pd.Series(list('UAUAUAU'), name='User')))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns=['DT'], inplace=True)
print(df)
print (df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get values
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
                str_val      Time
Date       User
2016-01-01 A         b  10:00:00
           U         a  04:00:00
2016-01-02 A         f  23:00:00
           U         e  04:00:00
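If you want the result back in the original flat column layout instead of a (Date, User) index, a small variation of the same idea (sketch, reusing the df built above):
result = (df.sort_values(['Date', 'User', 'Time'])
            .groupby(['Date', 'User'], as_index=False)
            .first())
print(result)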

Matching two datasets with different date ranges and different lengths

I have two csv-files with different date formats and lengths.
First, I load these two files:
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None)
File A has 102216 rows x 3 columns, ends at 01.07.2012 00:00. Date and Time are in one column. Head looks like this:
Date Buy Sell
0 01.08.2009 00:15 0 0
1 01.08.2009 00:30 0 0
2 01.08.2009 00:45 0 0
3 01.08.2009 01:00 0 0
4 01.08.2009 01:15 0 0
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None)
File B has 92762 rows x 4 columns, ends at 22.07.2012 00:00. Date and Time are separate. Head looks like this:
Date Time Buy Sell
0 01.08.2009 01:00 0 0
1 01.08.2009 02:00 0 0
2 01.08.2009 03:00 0 0
3 01.08.2009 04:00 0 10
4 01.08.2009 05:00 0 32
How can I match these datas like this:
Buy A Sell A Buy B Sell B
0 01.08.2009 00:15 0 0 0 0
1 01.08.2009 00:30 0 0 0 0
Both have to start and end on the same date, and the frequency has to be 15 min.
How can I get this? What should I do first?
OK, the first thing is to make sure both dfs have datetime dtypes. For the first df:
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
and for the other df:
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=[['Date','Time']])
Now I would reset the minute value of the first df like so:
In [149]:
df['Date'] = df['Date'].apply(lambda x: x.replace(minute=0))
df
Out[149]:
Date Buy Sell
index
0 2009-01-08 04:00:00 0 0
1 2009-01-08 04:00:00 0 0
2 2009-01-08 04:00:00 0 0
3 2009-01-08 05:00:00 0 0
4 2009-01-08 05:00:00 0 0
Now we can merge the dfs:
In [150]:
merged = df.merge(df1, left_on=['Date'], right_on=['Date_Time'], how='left',suffixes=[' A', ' B'])
merged
Out[150]:
Date Buy A Sell A Date_Time Buy B Sell B
0 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
1 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
2 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
3 2009-01-08 05:00:00 0 0 2009-01-08 05:00:00 0 32
4 2009-01-08 05:00:00 0 0 2009-01-08 05:00:00 0 32
Obviously, replace df and df1 with frameA and frameB in your case.
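If instead you want to keep frameA's 15-minute grid (as the question asks) rather than flooring everything to the hour, one possible sketch is to upsample the hourly frameB to 15 minutes and join on the datetime index; this assumes the files were read with the parse_dates calls above, so frameB's combined column is named 'Date_Time':
frameA = frameA.set_index('Date').sort_index()
frameB = frameB.set_index('Date_Time').sort_index()

# Repeat each hourly value of frameB across its four 15-minute slots,
# then keep only the timestamps both frames share
frameB_15 = frameB.resample('15min').ffill()
merged = frameA.join(frameB_15, lsuffix=' A', rsuffix=' B', how='inner')
print(merged.head())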
Another thing you could do is to set the date as the index.
As the answer above correctly states, the first step is to parse both files into an identical datetime format.
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=[['Date','Time']])
After loading the data as shown above, we can set the dates as the index to guide the merge:
frameA.index = frameA['Date']
frameB.index = frameB['Date_Time']  # parse_dates=[['Date','Time']] combines the two columns into 'Date_Time'
Then, they will merge on the exact same index, and since they have similar columns ('Buy', 'Sell'), we need to specify suffixes for the merger:
merge = frameA.join(frameB, lsuffix = ' A', rsuffix = ' B')
The result would look exactly like this.
Buy A Sell A Buy B Sell B
0 01.08.2009 00:15 0 0 0 0
1 01.08.2009 00:30 0 0 0 0
The advantage of this approach is that if your second data set ('Buy B', 'Sell B') is missing times present in the first one, the join will still work and you won't have data assigned to the wrong time. Say both frames had an arbitrary numerical index from 1 to 10000 and the second dataframe were missing 3 values (its index only going from 1 to 9997): that shift would assign values to the wrong rows if the numerical index were the one guiding the join.
Here, as long as the dataframe guiding the join is the longer of the two, we won't lose any data and values will never be assigned to the wrong index.
So for example:
if len(frameA.index) >= len(frameB.index):
    merge = frameA.join(frameB, lsuffix=' A', rsuffix=' B')
else:
    print('Missing Values, define your own function here')
    quit()
EDIT:
Another way to make sure all data is reported, regardless of whether it occurs in both dataframes, would be to define a new dataframe with a unique list of dates present in both dataframes, and use that to guide the merge.
For example,
unique_index = sorted(list(set(frameA.index.tolist() + frameB.index.tolist())))
This defines a unique index by concatenating both index lists, turning the result into a set, and back into a list. Sets remove all duplicate values, so you get a unique list; it is then sorted explicitly, because sets have no order.
Then, you merge the dataframes:
merge = pd.DataFrame(index = unique_index)
merge = merge.join(frameA)
merge = merge.join(frameB, lsuffix = ' A', rsuffix = ' B')
Just make sure to export it with the index ON, or redefine the index as a column (exporting to a csv or an excel sheet automatically has the index on unless you turn it off, so just be sure not to set index = False).
And then any missing data from your 'Buy A', 'Sell A' columns that is present in 'Buy B', 'Sell B' will be 'nan', as will be data missing from 'Buy B', 'Sell B' that is present in 'Buy A', 'Sell A'.
