Python dataframe import matching column and index data - python

I have a master data frame and an auxiliary data frame. Both have the same timestamp index and columns, with the master having a few more columns. I want to copy a certain column's data from aux to master.
My code:
import pandas as pd
import numpy as np

maindf = pd.DataFrame({'A': [0.0, np.nan], 'B': [10, 20], 'C': [100, 200]},
                      index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
auxdf = pd.DataFrame({'A': [1, 2], 'B': [30, 40]},
                     index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
maindf =
A B C
2020-05-04 08:00:00 0.0 10 100
2020-05-04 09:00:00 NaN 20 200
auxdf =
A B
2020-05-04 08:00:00 1 30
2020-05-04 09:00:00 2 40
Expected answer: I want to take the column A data in auxdf and copy it to maindf by matching the index.
maindf =
A B C
2020-05-04 08:00:00 1 10 100
2020-05-04 09:00:00 2 20 200
My solution:
maindf['A'] = auxdf['A']
My solution is not correct because I am copying the values directly without checking for a matching index. How do I achieve this?

You can use .update(), as follows:
maindf['A'].update(auxdf['A'])
.update() uses non-NA values from the passed Series to make updates, and it aligns on the index.
Note also that the original dtype of maindf['A'] is retained: it stays float even though auxdf['A'] is of int type.
Result:
print(maindf)
A B C
2020-05-04 08:00:00 1.0 10 100
2020-05-04 09:00:00 2.0 20 200
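If you would rather copy every value from auxdf (overwriting rather than preserving what is already in maindf), a minimal alternative sketch is an explicit index-aligned assignment; unlike .update(), this will also write NaN wherever auxdf has no value for a timestamp:
# align auxdf['A'] to maindf's index and overwrite the column wholesale
maindf['A'] = auxdf['A'].reindex(maindf.index)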

Related

Number of rows between two dates

Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the above example) with the number of rows I need to go back to find the current row's date minus 365 days (1 year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows back I have to look to find the date 2020-01-26.
If a perfect match is not available (like in the example df), I should reference the closest available date (or the closest smaller/larger date: it doesn't really matter in my case).
Any idea? Thanks
Edited to reflect the OP's original question. I created a demo dataframe and a row_count column to hold the result. Then, for each row, I build a filter that grabs all rows between the start date and 365 days later; the shape[0] of that filtered dataframe is the number of rows in the window, which is written into the appropriate field of the df.
# Import Pandas package
import pandas as pd
from datetime import datetime, timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})

# Create the column for the row count
df.insert(2, "row_count", '')

# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

for row in range(len(df['date'])):
    start_date = str(df['date'].iloc[row])
    end_date = str(df['date'].iloc[row] + timedelta(days=365))  # set the end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # Fill in row_count with the number of rows returned by the filter
    # (df.loc avoids the chained-assignment warning of df['row_count'][row] = ...)
    df.loc[row, 'row_count'] = filtered_df.shape[0]
You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
import pandas as pd
from io import StringIO

text = StringIO(
    """
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])

# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")

# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
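If the closest larger or overall nearest date is preferred instead, only the direction argument changes; a minimal sketch reusing the frames built above:
# direction="nearest" picks whichever existing date is closest to Date - 365 days;
# "forward" would pick the closest larger date
merged_nearest = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="nearest",
)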

replace dataframe values from another list with specific indexes

I have a dataframe with a date column, and I'm trying to replace some of its values with values from another list, matched by index. For example:
dirty_dates_indexes holds the indexes where the date is in the wrong format in the original dataframe df:
dirty_dates_indexes=[4,33,48,54,59,91,95,132,160,175,180,197,203,206,229,237,266,271,278,294,298,333,348,373,380,420,442]
formated_dates=['2019-04-25','2019-12-01','2019-06-16','2019-10-07','2019-08-06','2019-02-17','2019-11-20','2019-03-10','2019-10-11','2019-03-04','2019-07-31','2019-10-12','2019-09-13','2019-08-26','2019-12-29','2019-10-11','2019-11-20','2019-06-16','2019-12-12','2019-03-22','2019-01-21','2019-03-21','2019-10-15','2019-12-01','2019-03-20','2019-09-08','2019-08-19']
I'm trying to replace all values in df at the indexes in dirty_dates_indexes with the values in formated_dates.
I've tried the following code, however receiving an error:
for index in dirty_dates_indexes:
    df.loc[index].date.replace(df.loc[index].date, formated_dates(f for f in range(0, len(range(formated_dates)))))
Error:
TypeError: 'list' object cannot be interpreted as an integer
How can I solve this, or is there a better approach?
Two things may be tripping you up: you are taking the value from dirty_dates_indexes and using it to look up a position in formated_dates, and you are using loc where iloc (positional indexing) is what reaches the specific row.
Here's what I did.
dirty_dates_indexes=[4,33,48,54,
59,91,95,132,
160,175,180,197,
203,206,229,237,
266,271,278,294,
298,333,348,373,
380,420,442]
formated_dates=['2019-04-25','2019-12-01','2019-06-16','2019-10-07',
'2019-08-06','2019-02-17','2019-11-20','2019-03-10',
'2019-10-11','2019-03-04','2019-07-31','2019-10-12',
'2019-09-13','2019-08-26','2019-12-29','2019-10-11',
'2019-11-20','2019-06-16','2019-12-12','2019-03-22',
'2019-01-21','2019-03-21','2019-10-15','2019-12-01',
'2019-03-20','2019-09-08','2019-08-19']
import pandas as pd

df = pd.DataFrame()
df['dirty_dates'] = pd.date_range('2019-01-01', periods=500, freq='D')

for i, row_id in enumerate(dirty_dates_indexes):
    # positional row, explicit column position: avoids chained assignment
    df.iloc[row_id, df.columns.get_loc('dirty_dates')] = pd.to_datetime(formated_dates[i])

print(df.head(20))
The results are as follows:
dirty_dates
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-04-25 # <-- this row changed
5 2019-01-06
6 2019-01-07
7 2019-01-08
8 2019-01-09
9 2019-01-10
10 2019-01-11
11 2019-01-12
12 2019-01-13
13 2019-01-14
14 2019-01-15
15 2019-01-16
16 2019-01-17
17 2019-01-18
18 2019-01-19
19 2019-01-20
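As a hedged alternative: because this demo frame uses the default RangeIndex (labels equal positions), the whole loop can be collapsed into a single vectorised assignment:
# the 27 labels and the 27 converted dates are matched up positionally
df.loc[dirty_dates_indexes, 'dirty_dates'] = pd.to_datetime(formated_dates)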

How can I add parts of a column to a new pandas data frame?

So I have a pandas data frame of length 90, which isn't important here.
Let's say I have:
df
A date
1 2012-01-01
4 2012-02-01
5 2012-03-01
7 2012-04-01
8 2012-05-01
9 2012-06-01
2 2012-07-01
1 2012-08-01
3 2012-09-01
2 2012-10-01
5 2012-11-01
9 2012-12-01
0 2013-01-01
6 2013-02-01
and I have created a new data frame
df_copy=df.copy()
index = range(0,3)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2020-01-01' , freq='MS')-pd.offsets.MonthBegin(1)
which should create a data frame like this
A date
na 2019-10-01
na 2019-11-01
na 2019-12-01
So I use the following code to get the values of A in my new data frame
df1['A'] = df1['A'].iloc[9:12]
And I want the outcome to be this
A date
2 2019-10-01
5 2019-11-01
9 2019-12-01
so I want the last 3 values to be assigned the values at iloc positions 9-11 of the original data frame; the indexes are different and so are the dates in both data frames. Is there a way to do this? Because
df1['A'] = df1['A'].iloc[9:12]
doesn't seem to work.
As far as I know, you can solve this by generating several new data frames:
df_copy=df.copy()
index = range(0,1)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2019-11-01' , freq='MS')-pd.offsets.MonthBegin(1)
df1['A'] = df_copy['A'].iloc[9]
Then appending to your original data frame and repeating this is a bit cumbersome, but it is the only solution I could come up with.
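A shorter route, sketched here under the assumption that df1 has already been built with its three rows as above: assign the raw values so pandas does not try to align the two differing indexes.
# .values strips the index, so positions 9-11 of df land directly in df1's three rows
df1['A'] = df['A'].iloc[9:12].values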

Resampling a dataframe into a new one while doing some additional operations

I am working with a dataframe where each entry (row) comes with a start time, a duration, and other attributes. I would like to create a new dataframe from this one, transforming each entry of the original into 15-minute intervals while keeping all other attributes the same. The number of entries in the new dataframe per entry in the old one depends on the duration of the original entry.
At first I tried using pd.resample but it did not do exactly what I expected. I then wrote a function using itertuples() that works quite well, but it took about half an hour on a dataframe of around 3000 rows. Now I want to do the same for 2 million rows, so I am looking for other possibilities.
Let's say I have the following dataframe:
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
testdf.loc[:,['start']] = pd.to_datetime(testdf['start'])
print(testdf)
>>>testdf
start duration Attribute_A id
0 2018-01-05 11:48:00 22 abc 1
1 2018-05-04 09:05:00 8 def 2
2 2018-08-09 07:15:00 35 hij 3
3 2018-09-27 15:00:00 2 klm 4
And I would like my outcome to be like the following:
>>>resultdf
start duration Attribute_A id
0 2018-01-05 11:45:00 12 abc 1
1 2018-01-05 12:00:00 10 abc 1
2 2018-05-04 09:00:00 8 def 2
3 2018-08-09 07:15:00 15 hij 3
4 2018-08-09 07:30:00 15 hij 3
5 2018-08-09 07:45:00 5 hij 3
6 2018-09-27 15:00:00 2 klm 4
This is the function that I built with itertuples which produced the desired result (the one I showed just above this):
def min15_divider(df, newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15  # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15:  # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)
                else:  # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                                 'duration': final_dur, 'id': row.id}
                    newdf = newdf.append(to_append, ignore_index=True)
        else:  # when it is not an exact multiple of 15 min
            new_min = orig_min - remains  # convert to multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains  # remaining minutes in the initial interval
            while cumu_dur < row.duration:  # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15
                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # last interval, or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration  # original duration less than remaining minutes in first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15)  # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': final_dur, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf
Is there any other way to do this without using itertuples that could save me some time?
Thanks in advance.
PS. I apologize for anything that may seem a bit weird in my post as it is the first time that I have asked a question myself here in stackoverflow.
EDIT
Many entries can have the same starting time, so .groupby 'start' could be problematic. There is, however, a column with unique values for each entry called simply "id".
Using pd.resample is a good idea, but since each row contains only the starting time, you need to build the end row before you can use it.
The code below assumes that each starting time in the 'start' column is unique, so that groupby can be used in a somewhat unusual way: each group will contain exactly one row.
I use groupby because it automatically regroups the dataframes produced by the custom function used by apply.
Note also that the 'duration' column is converted to a timedelta in minutes in order to better perform some math later.
import pandas as pd

testdict = {'start': ['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
            'duration': [22, 8, 35, 2],
            'Attribute_A': ['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'T')
print(testdf)

def calcduration(df, starttime):
    if len(df) == 1:
        return
    elif len(df) == 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1] = df['duration'].iloc[1] - df['duration'].iloc[0]
    elif len(df) > 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1:-1] = pd.Timedelta(15, 'T')
        df['duration'].iloc[-1] = df['duration'].iloc[-1] - df['duration'].iloc[:-1].sum()

def expandtime(x):
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15T').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf

findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)
This code produces:
duration Attribute_A
start
0 2018-01-05 11:45:00 00:12:00 abc
2018-01-05 12:00:00 00:10:00 abc
1 2018-05-04 09:00:00 00:08:00 def
2 2018-08-09 07:15:00 00:15:00 hij
2018-08-09 07:30:00 00:15:00 hij
2018-08-09 07:45:00 00:05:00 hij
3 2018-09-27 15:00:00 00:02:00 klm
A bit of explanation
expandtime is the first custom function. It takes a one-row dataframe (because we assume that the 'start' values are unique), builds a second row whose 'start' equals the first row's 'start' plus its duration, and then uses resample to sample it into 15-minute intervals. The values of all other columns are duplicated.
calcduration is used to do some math on the 'duration' column in order to calculate the correct duration of each row.
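If a flat frame like the expected output is wanted, a small hedged follow-up (assuming findf carries the grouped MultiIndex shown above) is to drop the group level and convert the timedelta durations back to whole minutes; grouping on the unique 'id' column instead of 'start' would serve the same purpose when start times repeat, per the OP's EDIT.
# drop the outer group level, move 'start' back to a column,
# and turn the timedelta durations into integer minutes
findf = findf.reset_index(level=0, drop=True).reset_index()
findf['duration'] = (findf['duration'].dt.total_seconds() // 60).astype(int)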
So, starting with your df:
import pandas as pd
import numpy as np

testdict = {'start': ['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
            'duration': [22, 8, 35, 2],
            'Attribute_A': ['abc', 'def', 'hij', 'klm']}
df = pd.DataFrame(testdict)
df['start'] = pd.to_datetime(df['start'])
print(df)
First calculate an ending time for each row:
df['dur'] = pd.to_timedelta(df['duration'], unit='m')
df['end'] = df['start'] + df['dur']
Then create two new columns that hold the regular interval (15 minute) start and end dates:
df['start15'] = df['start'].dt.floor('15min')
df['end15'] = df['end'].dt.floor('15min')
At this point, the dataframe looks like:
Attribute_A duration start dur end start15 end15
0 abc 22 2018-01-05 11:48:00 00:22:00 2018-01-05 12:10:00 2018-01-05 11:45:00 2018-01-05 12:00:00
1 def 8 2018-05-04 09:05:00 00:08:00 2018-05-04 09:13:00 2018-05-04 09:00:00 2018-05-04 09:00:00
2 hij 35 2018-08-09 07:15:00 00:35:00 2018-08-09 07:50:00 2018-08-09 07:15:00 2018-08-09 07:45:00
3 klm 2 2018-09-27 15:00:00 00:02:00 2018-09-27 15:02:00 2018-09-27 15:00:00 2018-09-27 15:00:00
The start15 and end15 columns combine to have the right times, but you need to merge them:
df = pd.melt(df, ['dur', 'start', 'Attribute_A', 'end'], ['start15', 'end15'], value_name='start15')
df = df.drop('variable', axis=1).drop_duplicates('start15').sort_values('start15').set_index('start15')
Output:
dur start Attribute_A
start15
2018-01-05 11:45:00 00:22:00 2018-01-05 11:48:00 abc
2018-01-05 12:00:00 00:22:00 2018-01-05 11:48:00 abc
2018-05-04 09:00:00 00:08:00 2018-05-04 09:05:00 def
2018-08-09 07:15:00 00:35:00 2018-08-09 07:15:00 hij
2018-08-09 07:45:00 00:35:00 2018-08-09 07:15:00 hij
2018-09-27 15:00:00 00:02:00 2018-09-27 15:00:00 klm
Looking good, but the 2018-08-09 07:30:00 row is missing. Fill in this and any other missing rows with groupby and resample:
df = df.groupby('start').resample('15min').ffill().reset_index(0, drop=True).reset_index()
Get the end15 column back, it was dropped during the melt operation earlier:
df['end15'] = df['end'].dt.floor('15min')
Then calculate the correct durations for each row. I split this into two calculations (durations that spread across multiple timesteps, and ones that don't) to keep it readable:
df.loc[df['start15'] != df['end15'], 'duration'] = np.minimum(df['end15'] - df['start'], pd.Timedelta('15min').to_timedelta64())
df.loc[df['start15'] == df['end15'], 'duration'] = np.minimum(df['end'] - df['end15'], df['end'] - df['start'])
Then just some clean-up to make it look like you wanted:
df['duration'] = (df['duration'].dt.seconds/60).astype(int)
df = df[['start15', 'duration', 'Attribute_A']].copy()
print(df)
Result:
start15 duration Attribute_A
0 2018-01-05 11:45:00 12 abc
1 2018-01-05 12:00:00 10 abc
2 2018-05-04 09:00:00 8 def
3 2018-08-09 07:15:00 15 hij
4 2018-08-09 07:30:00 15 hij
5 2018-08-09 07:45:00 5 hij
6 2018-09-27 15:00:00 2 klm
Please note that portions of this answer were based on another Stack Overflow answer.

How to separate the data in python series based on week to plot them in cycle graph?

I have five weeks of seasonal data in a single series with date and time. How do I separate it week by week (week1, week2, ... week5) so that I can plot all the weeks on the same graph?
I tried resampling the data day-wise by taking the mean, but the data is still in a single series. I just want to split the data by week, e.g. 2019-04-02 to 2019-04-08 in one dataframe and 2019-04-08 to 2019-04-16 in a separate df.
df.open.resample('M').mean()
date pageload day
0 2019-04-02 10:48:00 -79.002023 Tue
1 2019-04-02 10:49:00 33.563679 Tue
2 2019-04-02 10:50:00 -76.448319 Tue
3 2019-04-02 10:51:00 30.974816 Tue
4 2019-04-02 10:52:00 -68.789962 Tue
5 2019-04-02 10:53:00 30.593374 Tue
21 2019-04-16 11:34:00 40.333445 Fri
I want the data frame separated week-wise, so that all the weeks can be plotted in a single graph.
I do not think you want to resample as Shijith is showing; I think you want a separate dataframe for each week, and groupby is the tool for that. The pandas groupby method splits the data in a dataframe by columns or index values and returns a groupby object, which can be used to perform operations on the groups before merging them back.
In the snippet below, I first create a column to group the data by (the "weeks" column) and then group the data by that column. The resulting groupby object contains, among other things, a dictionary whose keys are the unique values of "weeks" and whose values are the lists of dataframe indices sharing that week; you can see it by typing print(grps.groups) in the console. Then I loop over the group keys and add each week's dataframe to a dictionary by calling get_group on the groupby object.
import pandas as pd
import numpy as np

# Make sample data
index = pd.date_range(start='2014-01-01', end='2014-01-31', freq='D')
df = pd.DataFrame({"vals": np.random.randint(-5, 5, len(index))}, index=index)
df["csum"] = df.vals.cumsum()

# Add a column for weeks to enable grouping
# (on newer pandas, df.index.isocalendar().week does the same job)
df["weeks"] = df.index.week

# Group the data
grps = df.groupby("weeks")

# Split the groups into separate dataframes
df_dict = {}
for gi in grps.groups:
    df_dict[gi] = grps.get_group(gi)
I start off with something like this:
vals csum weeks
2014-01-01 4 4 1
2014-01-02 -5 -1 1
...
2014-01-30 -2 -9 5
2014-01-31 -5 -14 5
and end up with dataframes like the following:
1
vals csum weeks
2014-01-01 4 4 1
2014-01-02 -5 -1 1
2014-01-03 -4 -5 1
2014-01-04 4 -1 1
2014-01-05 -5 -6 1
2
vals csum weeks
2014-01-06 -5 -11 2
2014-01-07 2 -9 2
2014-01-08 4 -5 2
2014-01-09 -1 -6 2
2014-01-10 -1 -7 2
2014-01-11 -3 -10 2
2014-01-12 -2 -12 2
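Since the goal is to overlay the weeks on one graph, here is a hedged sketch of the plotting step (assuming matplotlib and the df_dict built above); each week is plotted against its position within the week so the lines overlap:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for week, wdf in df_dict.items():
    ax.plot(range(len(wdf)), wdf["vals"], label=f"week {week}")
ax.set_xlabel("day within week")
ax.legend()
plt.show()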
If your data frame df is indexed on date
print(df)
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27760700 198.869995
do,
weekly_summary = pd.DataFrame()
weekly_summary['Open'] = df['Open'].resample('W').first()
print(weekly_summary)
Open
Date
2019-04-07 191.639999
2019-04-14 196.419998
If it is not indexed on datetime, do:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df.sort_index(inplace=True)
weekly_summary = pd.DataFrame()
weekly_summary['Open'] = df['Open'].resample('W').first()
With the above code the result will be indexed on Sundays; if you want it to be indexed on Mondays (i.e., the starting day of the week), do as below.
weekly_summary = pd.DataFrame()
weekly_summary['Open'] = df['Open'].resample('W', loffset=pd.Timedelta(days=-6)).first()
print(weekly_summary)
Open
Date
2019-04-01 191.639999
2019-04-08 196.419998
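On newer pandas versions where loffset has been removed, a hedged equivalent is to resample as before and then shift the resulting index back six days yourself:
weekly_summary = pd.DataFrame()
weekly_summary['Open'] = df['Open'].resample('W').first()
# move each label from the Sunday week-end to the Monday week-start
weekly_summary.index = weekly_summary.index - pd.Timedelta(days=6)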
