Python: Iterate Over Year and Month in DatetimeIndex

I have two DataFrames:
df1:
A B
Date
01/01/2020 2 4
02/01/2020 6 8
df2:
A B
Date
01/01/2020 5 10
I want to get the following:
df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80
What I want is to multiply the column entries based on year and month in DatetimeIndex. But I'm not sure how to iterate over datetime.

Use to_numpy(); since df2 has a single row, NumPy broadcasting repeats it across all rows of df1:
df3 = pd.DataFrame(df1.to_numpy() * df2.to_numpy(), index=df1.index, columns=df1.columns)
output of df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80

You may need to reindex df2 onto df1's year-month:
df1.index = pd.to_datetime(df1.index, dayfirst=True)
df2.index = pd.to_datetime(df2.index, dayfirst=True)
df2.index = df2.index.strftime('%Y-%m')
df1[:] *= df2.reindex(df1.index.strftime('%Y-%m')).values
df1
df1
Out[529]:
A B
Date
2020-01-01 10 40
2020-01-02 30 80
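A similar alignment can also be sketched with a PeriodIndex instead of formatted strings (same sample data as above; the to_period('M') mapping is an alternative, not the answer's exact method):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 6], 'B': [4, 8]},
                   index=pd.to_datetime(['01/01/2020', '02/01/2020'], dayfirst=True))
df2 = pd.DataFrame({'A': [5], 'B': [10]},
                   index=pd.to_datetime(['01/01/2020'], dayfirst=True))

# Reindex df2 onto df1's year-month periods, then multiply elementwise
factors = df2.set_index(df2.index.to_period('M')).reindex(df1.index.to_period('M'))
df3 = df1 * factors.values
```

This keeps df1's original DatetimeIndex on the result while matching rows by year and month.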

Related

JOIN two DataFrames and replace Column values in Python

I have dataframe df1:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 5
3 txn vol(new) 2020-02-01 20
4 txn vol(tenu) 2020-01-01 30
5 txn vol(tenu) 2020-02-01 40
Second Dataframe df2:
Expenses Calendar Actual
0 txn vol(new) 2020-01-01 23
1 txn vol(new) 2020-02-01 32
2 txn vol(tenu) 2020-01-01 60
Now I want to read all data from df1, join with df2 on Expenses + Calendar, and then replace the Actual value in df1 with the one from df2.
Expected output is:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
I am using below code
cols_to_replace = ['Actual']
df1.loc[df1.set_index(['Calendar','Expenses']).index.isin(df2.set_index(['Calendar','Expenses']).index), cols_to_replace] = df2.loc[df2.set_index(['Calendar','Expenses']).index.isin(df1.set_index(['Calendar','Expenses']).index),cols_to_replace].values
It works when df1 has little data, but with larger data (df1 has 10K records, df2 has 150) the updates happen with the wrong values.
Could anyone please suggest how to resolve this?
Thank you
If I understand your solution correctly, it seems to assume that (1) the Calendar-Expenses combinations are unique and (2) that their occurrences in both dataframes are aligned (same order)? I suspect that (2) isn't actually the case?
Another option - .merge() is fine! - could be:
df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[list(set(df1.index).intersection(df2.index)), "Actual"] = df2["Actual"]
df2 = df2.reset_index() # If the original df2 is still needed
df1 = df1.reset_index()
Here is one way to do it, using pd.merge:
df = df1.merge(df2,
               on=['Expenses', 'Calendar'],
               how='left',
               suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
df['Actual'] = df['Actual'].astype(int)
df
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
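Another sketch, assuming each Expenses + Calendar pair is unique: DataFrame.update aligns on the index and overwrites only the rows present in df2, which avoids the row-order assumption of the boolean-mask approach:

```python
import pandas as pd

df1 = pd.DataFrame({'Expenses': ['xyz', 'xyz', 'txn vol(new)',
                                 'txn vol(new)', 'txn vol(tenu)', 'txn vol(tenu)'],
                    'Calendar': ['2020-01-01', '2020-02-01', '2020-01-01',
                                 '2020-02-01', '2020-01-01', '2020-02-01'],
                    'Actual': [10, 99, 5, 20, 30, 40]})
df2 = pd.DataFrame({'Expenses': ['txn vol(new)', 'txn vol(new)', 'txn vol(tenu)'],
                    'Calendar': ['2020-01-01', '2020-02-01', '2020-01-01'],
                    'Actual': [23, 32, 60]})

# update() aligns on the (Expenses, Calendar) MultiIndex and replaces
# only matching entries; unmatched rows keep their original Actual
df1 = df1.set_index(['Expenses', 'Calendar'])
df1.update(df2.set_index(['Expenses', 'Calendar']))
df1 = df1.reset_index()
```

Note that update() casts updated columns to float, so an astype(int) may be needed afterwards.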

I want to select duplicate rows between 2 dataframes

I want to filter rows of df1 by its date column (dtype datetime64[ns]) against df2 (same column name and dtype). I tried searching for a solution but I get errors such as:
Can only compare identically-labeled Series objects or 'Timestamp' object is not iterable.
sample df1
id        date  value
 1  2018-10-09    120
 2  2018-10-09     60
 3  2018-10-10     59
 4  2018-11-25    120
 5  2018-08-25    120
sample df2
date
2018-10-09
2018-10-10
sample result that I want
id        date  value
 1  2018-10-09    120
 2  2018-10-09     60
 3  2018-10-10     59
In fact, I want this program to run once every 7 days, counting back from the day it starts. So I want it to drop dates that are not within those past 7 days.
# create new dataframe -> df2
data = {'date': []}
df2 = pd.DataFrame(data)
# Set the date to the last 7 days.
days_use = 7  # 7 -> 1
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day
# Change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])
Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date() - pd.DateOffset(7), periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
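If the end goal is only "keep rows from the past 7 days", the helper df2 may not be needed at all; a sketch comparing the date column against a cutoff directly (column names as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'date': pd.to_datetime(['2018-10-09', '2018-10-09', '2018-10-10',
                                            '2018-11-25', '2018-08-25']),
                    'value': [120, 60, 59, 120, 120]})

today = pd.Timestamp.today().normalize()
cutoff = today - pd.Timedelta(days=7)

# Keep rows whose date is within the 7 days before today
recent = df1[(df1['date'] >= cutoff) & (df1['date'] < today)]
```

With the 2018 sample data this returns an empty frame, since none of the dates fall in the current week.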

Get data from another data frame in python

My data frame df1:
ID Date
0 90 02/01/2021
1 101 01/01/2021
2 30 12/01/2021
My data frame df2:
ID City 01/01/2021 02/01/2021 12/01/2021
0 90 A 20 14 22
1 101 B 15 10 5
2 30 C 12 9 13
I need to create a column 'New' in df1. It should contain the data from df2 matching the 'ID' and 'Date' of each df1 row. I am finding it difficult to merge the data. How could I do it?
Use DataFrame.melt with DataFrame.merge:
df22 = df2.drop('City', axis=1).melt(['ID'], var_name='Date', value_name='Val')
df = df1.merge(df22, how='left')
print (df)
ID Date Val
0 90 02/01/2021 14
1 101 01/01/2021 15
2 30 12/01/2021 13
You can melt and merge:
df1.merge(df2.melt(id_vars=['ID', 'City'], var_name='Date'), on=['ID', 'Date'])
output:
ID Date City value
0 90 02/01/2021 A 14
1 101 01/01/2021 B 15
2 30 12/01/2021 C 13
Alternative:
df1.merge(df2.melt(id_vars='ID',
                   value_vars=df2.filter(regex='/').columns,
                   var_name='Date'),
          on=['ID', 'Date'])
output:
ID Date value
0 90 02/01/2021 14
1 101 01/01/2021 15
2 30 12/01/2021 13
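A melt-free sketch that looks each (ID, Date) pair up directly, assuming every df1 ID also appears in df2:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [90, 101, 30],
                    'Date': ['02/01/2021', '01/01/2021', '12/01/2021']})
df2 = pd.DataFrame({'ID': [90, 101, 30],
                    'City': ['A', 'B', 'C'],
                    '01/01/2021': [20, 15, 12],
                    '02/01/2021': [14, 10, 9],
                    '12/01/2021': [22, 5, 13]})

# Index df2 by ID, then pick the matching date column for each df1 row
lookup = df2.set_index('ID')
df1['New'] = [lookup.at[i, d] for i, d in zip(df1['ID'], df1['Date'])]
```

The melt-based answers above scale better for large frames; this loop is mainly readable for small lookups.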

Getting complicated average values for pandas DataFrame

I have a simple DataFrame with 2 columns - date and value. I need to create another DataFrame which would contain the average value for every month of every year. For example, I have daily data in the range from 2015-01-01 till 2018-12-31.
I need averages for every month in 2015, 2016 etc.
Which is the easiest way to do that?
You can aggregate by month period with Series.dt.to_period and mean:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('m'))['col'].mean().reset_index()
Another solution with year and months in separate columns:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df1 = df.groupby(['year','month'])['col'].mean().reset_index()
Sample:
df = pd.DataFrame({'date': ['2015-01-02','2016-03-02','2015-01-23','2016-01-12','2015-03-02'],
                   'col': [1,2,5,4,6]})
print (df)
date col
0 2015-01-02 1
1 2016-03-02 2
2 2015-01-23 5
3 2016-01-12 4
4 2015-03-02 6
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('m'))['col'].mean().reset_index()
print (df1)
date col
0 2015-01 3
1 2015-03 6
2 2016-01 4
3 2016-03 2
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df2 = df.groupby(['year','month'])['col'].mean().reset_index()
print (df2)
year month col
0 2015 1 3
1 2015 3 6
2 2016 1 4
3 2016 3 2
To get the monthly average values of a DataFrame when it has daily data rows, I would:
Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
Then I'll convert the index dates into a month index: df.index.month
Finally I'll calculate the mean of the DataFrame GROUPED BY MONTH: df.groupby(df.index.month).data.mean()
I'll go slowly through each step here:
Generating a DataFrame with dates and values
You first need to import Pandas and NumPy, as well as the datetime module:
import numpy as np
import pandas as pd
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at week 'W' intervals, and a column 'data' with random values between 1-100:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
The df now has two columns, 'date' and 'data':
date data
0 2018-01-07 42
1 2018-01-14 54
2 2018-01-21 30
3 2018-01-28 43
4 2018-02-04 65
5 2018-02-11 40
6 2018-02-18 3
7 2018-02-25 55
8 2018-03-04 81
Set the 'date' column as the index of the DataFrame:
df.set_index('date', inplace=True)
df has one column 'data' and the index is 'date':
data
date
2018-01-07 42
2018-01-14 54
2018-01-21 30
2018-01-28 43
2018-02-04 65
2018-02-11 40
2018-02-18 3
2018-02-25 55
2018-03-04 81
Capture the month number from the index:
months = df.index.month
Obtain the mean value of each month, grouping by month:
monthly_avg = df.groupby(months).data.mean()
The mean of the dataset by month, 'monthly_avg', is:
date
1 42.25
2 40.75
3 81.00
Name: data, dtype: float64
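One caveat: grouping by df.index.month alone merges the same calendar month across different years. For multi-year data like the 2015-2018 range in the question, grouping by a year-month period keeps them distinct; a sketch using the same weekly random-data setup:

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame({'data': np.random.randint(0, 100, size=len(date_rng))},
                  index=date_rng)

# to_period('M') keeps year and month together, so 2018-01 and 2019-01
# would not be merged into a single group
monthly_avg = df.groupby(df.index.to_period('M'))['data'].mean()
```

This mirrors the to_period('m') groupby shown in the first answer, applied to a DatetimeIndex instead of a column.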

Shift time series with missing dates in Pandas

I have a times series with some missing entries, that looks like this:
date value
---------------
2000 5
2001 10
2003 8
2004 72
2005 12
2007 13
I would like to create a column for the "previous_value". But I only want it to show values for consecutive years. So I want it to look like this:
date value previous_value
-------------------------------
2000 5 nan
2001 10 5
2003 8 nan
2004 72 8
2005 12 72
2007 13 nan
However, just applying pandas' shift function directly to the 'value' column would give 'previous_value' = 10 for 'date' = 2003, and 'previous_value' = 12 for 'date' = 2007.
What's the most elegant way to deal with this in pandas? (I'm not sure if it's as easy as setting the 'freq' attribute).
In [588]: df = pd.DataFrame({'date': [2000, 2001, 2003, 2004, 2005, 2007],
                             'value': [5, 10, 8, 72, 12, 13]})
In [589]: df['previous_value'] = df.value.shift()[df.date == df.date.shift() + 1]
In [590]: df
Out[590]:
date value previous_value
0 2000 5 NaN
1 2001 10 5
2 2003 8 NaN
3 2004 72 8
4 2005 12 72
5 2007 13 NaN
Also see here for a time series approach using resample(): Using shift() with unevenly spaced data
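The same result can also be sketched by reindexing onto the full year range first, so that a positional shift() becomes a true one-year shift:

```python
import pandas as pd

df = pd.DataFrame({'date': [2000, 2001, 2003, 2004, 2005, 2007],
                   'value': [5, 10, 8, 72, 12, 13]})

# Reindex onto every year in the range; missing years become NaN
s = df.set_index('date')['value']
full = s.reindex(range(df['date'].min(), df['date'].max() + 1))

# shift() on the gap-free index is now a shift by exactly one year
df['previous_value'] = full.shift().loc[df['date']].values
```

Rows following a gap year pick up the NaN that the reindex inserted, giving the same output as the masked shift above.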
Your example doesn't look like real time series data with timestamps. Let's take another example with the missing date 2020-01-03:
df = pd.DataFrame({"val": [10, 20, 30, 40, 50]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))
df.drop(pd.Timestamp('2020-01-03'), inplace=True)
val
2020-01-01 10
2020-01-02 20
2020-01-04 40
2020-01-05 50
To shift by one day you can set the freq parameter to 'D':
df.shift(1, freq='D')
Output:
val
2020-01-02 10
2020-01-03 20
2020-01-05 40
2020-01-06 50
To combine original data with the shifted one you can merge both tables:
df.merge(df.shift(1, freq='D'),
         left_index=True,
         right_index=True,
         how='left',
         suffixes=('', '_previous'))
Output:
val val_previous
2020-01-01 10 NaN
2020-01-02 20 10.0
2020-01-04 40 NaN
2020-01-05 50 40.0
You can find other offset aliases in the pandas documentation.
