JOIN two DataFrames and replace column values in Python

I have dataframe df1:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 5
3 txn vol(new) 2020-02-01 20
4 txn vol(tenu) 2020-01-01 30
5 txn vol(tenu) 2020-02-01 40
Second Dataframe df2:
Expenses Calendar Actual
0 txn vol(new) 2020-01-01 23
1 txn vol(new) 2020-02-01 32
2 txn vol(tenu) 2020-01-01 60
I want to keep every row of df1, join it to df2 on Expenses + Calendar, and replace the Actual value in df1 with the one from df2 wherever a match exists.
Expected output is:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
I am using the code below:
cols_to_replace = ['Actual']
df1.loc[df1.set_index(['Calendar','Expenses']).index.isin(df2.set_index(['Calendar','Expenses']).index), cols_to_replace] = df2.loc[df2.set_index(['Calendar','Expenses']).index.isin(df1.set_index(['Calendar','Expenses']).index),cols_to_replace].values
It works when df1 is small. When df1 has about 10K records (and df2 has 150), rows are updated with the wrong values.
Could anyone suggest how to resolve this?
Thank you

If I understand your solution correctly, it assumes that (1) the Calendar-Expenses combinations are unique and (2) the matching rows appear in the same order in both dataframes, because .values drops the index and the right-hand side is assigned by position. I suspect that (2) isn't actually the case?
Another option (.merge() is fine too) could be:
df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[list(set(df1.index).intersection(df2.index)), "Actual"] = df2["Actual"]
df2 = df2.reset_index() # If the original df2 is still needed
df1 = df1.reset_index()
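A minimal sketch of why row order matters, assuming the key columns are unique in df2: the lookup below goes through the (Expenses, Calendar) key rather than the row position, so df2 can be listed in any order. The shuffled df2 and the names keys, lookup and new_vals are illustrative and not from the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
# df2 is deliberately listed in a different order than df1
df2 = pd.DataFrame({
    "Expenses": ["txn vol(tenu)", "txn vol(new)", "txn vol(new)"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01"],
    "Actual": [60, 32, 23],
})

keys = ["Expenses", "Calendar"]
# Series of replacement values, indexed by the join key
lookup = df2.set_index(keys)["Actual"]
# Look up every df1 key; keys missing from df2 come back as NaN
new_vals = lookup.reindex(pd.MultiIndex.from_frame(df1[keys])).to_numpy()
# Keep df1's own value wherever df2 had no match
df1["Actual"] = pd.Series(new_vals, index=df1.index).fillna(df1["Actual"]).astype(int)
print(df1)
```

Because the values are fetched by key, sorting or shuffling either dataframe does not change the result.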

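Here is one way to do it, using pd.merge:
# Left-merge keeps every row of df1; df1's column becomes Actual_x, df2's stays Actual
df = df1.merge(df2,
               on=['Expenses', 'Calendar'],
               how='left',
               suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
# ffill(axis=1) copied Actual_x into Actual where df2 had no match; restore the int dtype
df['Actual'] = df['Actual'].astype(int)
df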
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
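If duplicate keys in df2 are a concern (one possible reason for wrong updates on larger data), a hedged variant is to let merge check uniqueness itself through its validate parameter. This is a sketch on the question's sample data; the "_new" suffix and the name merged are illustrative choices.

```python
import pandas as pd

keys = ["Expenses", "Calendar"]
df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Expenses": ["txn vol(new)", "txn vol(new)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01"],
    "Actual": [23, 32, 60],
})

# validate="many_to_one" raises pandas.errors.MergeError if a key repeats in df2
merged = df1.merge(df2, on=keys, how="left",
                   suffixes=("", "_new"), validate="many_to_one")
# Prefer df2's value; fall back to df1's where there was no match
merged["Actual"] = merged["Actual_new"].fillna(merged["Actual"]).astype(int)
merged = merged.drop(columns="Actual_new")
print(merged)
```

With how="left" and unique keys on the right, the result has exactly as many rows as df1.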

Related

I want to select duplicate rows between 2 dataframes

I want to filter the rows of df1 by a date column of dtype datetime64[ns], keeping only the dates that also appear in df2 (same column name and dtype). I searched for a solution, but I keep getting errors such as:
Can only compare identically-labeled Series objects | 'Timestamp' object is not iterable
sample df1
id date value
1 2018-10-09 120
2 2018-10-09 60
3 2018-10-10 59
4 2018-11-25 120
5 2018-08-25 120
sample df2
date
2018-10-09
2018-10-10
sample result that I want
id date value
1 2018-10-09 120
2 2018-10-09 60
3 2018-10-10 59
In fact, I want this program to run once every 7 days, counting back from the day it runs, so it should drop every date that is not within the past 7 days.
from datetime import date, timedelta
import pandas as pd

# Create a new dataframe -> df2
data = {'date': []}
df2 = pd.DataFrame(data)
# Fill it with the last 7 days (7 days ago ... yesterday)
days_use = 7
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day
# Change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])
Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date()-pd.DateOffset(7),periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
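If the only goal is "keep the past 7 days", a second dataframe may not be needed at all. The following is a minimal sketch using a boolean mask on the date column; the helper name last_n_days and the fixed today value are assumptions made so the example is reproducible.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime(["2018-10-09", "2018-10-09", "2018-10-10",
                            "2018-11-25", "2018-08-25"]),
    "value": [120, 60, 59, 120, 120],
})

def last_n_days(frame, today, n=7):
    """Keep rows dated within the n days before `today` (today itself excluded)."""
    today = pd.Timestamp(today).normalize()
    start = today - pd.Timedelta(days=n)
    mask = (frame["date"] >= start) & (frame["date"] < today)
    return frame[mask]

# In a scheduled job, pd.Timestamp.today() would replace the fixed date
recent = last_n_days(df1, today="2018-10-11")
print(recent)
```

The window matches the loop in the question: it starts 7 days back and ends with yesterday.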

Python: Iterate Over Year and Month in DatetimeIndex

I have two DataFrames:
df1:
A B
Date
01/01/2020 2 4
02/01/2020 6 8
df2:
A B
Date
01/01/2020 5 10
I want to get the following:
df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80
I want to multiply the entries of df1 by the df2 row that has the same year and month in its DatetimeIndex, but I'm not sure how to iterate over the datetimes.
Use to_numpy(), which lets NumPy broadcast df2's single row across the rows of df1:
df3=pd.DataFrame(df1.to_numpy()*df2.to_numpy(),index=df1.index,columns=df1.columns)
output of df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80
You may need reindex to align the rows by year and month:
df1.index = pd.to_datetime(df1.index,dayfirst=True)
df2.index = pd.to_datetime(df2.index,dayfirst=True)
df2.index = df2.index.strftime('%Y-%m')
df1[:] *= df2.reindex(df1.index.strftime('%Y-%m')).values
df1
Out[529]:
A B
Date
2020-01-01 10 40
2020-01-02 30 80
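The same year-and-month alignment can also be sketched with monthly periods instead of formatted strings. This assumes, as in the answer above, that the dates are day-first (so both df1 rows fall in January 2020) and that df2 has at most one row per month; factors and aligned are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": [2, 6], "B": [4, 8]},
    index=pd.to_datetime(["01/01/2020", "02/01/2020"], dayfirst=True),
)
df2 = pd.DataFrame(
    {"A": [5], "B": [10]},
    index=pd.to_datetime(["01/01/2020"], dayfirst=True),
)

# Re-key df2 by month so each df1 row can look up its month's factors
factors = df2.copy()
factors.index = factors.index.to_period("M")
# One row of factors per df1 row; months missing from df2 would become NaN
aligned = factors.reindex(df1.index.to_period("M"))
df3 = df1 * aligned.to_numpy()
print(df3)
```

No explicit iteration over the DatetimeIndex is needed; reindex does the per-row lookup.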

How to filter by a pandas column if the number of unique value in another column is equal a given value

I want to keep only the AccountIDs that have transaction data for at least 3 months. The table below is just a small fraction of the entire dataset.
Here is what I did, but I am not sure it is right:
data = data.groupby('AccountID').apply(lambda x: x['TransactionDate'].nunique() >= 3)
I get a Series of boolean values as output, but I want a pandas DataFrame.
TransactionDate AccountID TransactionAmount
0 2020-12-01 8 400000.0
1 2020-12-01 22 25000.0
2 2020-12-02 22 551500.0
3 2020-01-01 17 116.0
4 2020-01-01 24 2000.0
5 2020-01-02 68 6000.0
6 2020-03-03 20 180000.0
7 2020-03-01 66 34000.0
8 2020-02-01 66 20000.0
9 2020-02-01 66 40600.0
The output I get:
AccountID
1 True
2 True
3 True
4 True
5 True
You are close. Use GroupBy.transform, which repeats the aggregated value for every row of its group and so returns a Series the same size as the original df; that makes filtering by boolean indexing possible:
data = data[data.groupby('AccountID')['TransactionDate'].transform('nunique') >= 3]
If the dates within a month can fall on different days, first use Series.dt.to_period to build a helper column of monthly periods:
s = data.assign(new = data['TransactionDate'].dt.to_period('m')).groupby('AccountID')['new'].transform('nunique')
data = data[s >= 3]
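A small worked example of the monthly variant may help. The data here is hypothetical (the sample in the question has no account with three distinct months): account 66 transacts in three different months, account 22 in only two.

```python
import pandas as pd

# Hypothetical data, not the question's sample
data = pd.DataFrame({
    "TransactionDate": pd.to_datetime([
        "2020-01-05", "2020-02-10", "2020-03-15",   # account 66: Jan, Feb, Mar
        "2020-01-01", "2020-01-20", "2020-02-01",   # account 22: Jan, Jan, Feb
    ]),
    "AccountID": [66, 66, 66, 22, 22, 22],
    "TransactionAmount": [100.0, 200.0, 300.0, 10.0, 20.0, 30.0],
})

# Number of distinct months per account, repeated on every row of that account
n_months = (
    data.assign(month=data["TransactionDate"].dt.to_period("M"))
        .groupby("AccountID")["month"]
        .transform("nunique")
)
result = data[n_months >= 3]
print(result)
```

The result is a DataFrame with the original columns, containing only the rows of accounts that pass the threshold.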

Counting number of entries per month pandas

I have a df in format:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
...
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
I want to count the number of entries in each month and place the result in a new df, i.e. the number of 'start' entries for Jan, Feb, etc., to give me:
Month Entries
2020-01 3
...
2020-04 2
I am currently trying something like this, but it's not what I need:
df.index = pd.to_datetime(df['start'],format='%Y-%m-%d')
df.groupby(pd.Grouper(freq='M'))
df['start'].value_counts()
Use Groupby.count with Series.dt:
In [1282]: df
Out[1282]:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
# Do this only when your `start` and `end` columns are object. If already datetime, you can ignore below 2 statements
In [1284]: df.start = pd.to_datetime(df.start)
In [1285]: df.end = pd.to_datetime(df.end)
In [1296]: df1 = df.groupby([df.start.dt.year, df.start.dt.month]).count().rename_axis(['year', 'month'])['start'].reset_index(name='Entries')
In [1297]: df1
Out[1297]:
year month Entries
0 2020 1 3
1 2020 4 2
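To get the single Month column from the question rather than separate year and month columns, one possible sketch groups by a monthly period. Only the five rows shown in the question are used here, and counts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02",
                             "2020-04-01", "2020-04-02"]),
})
df["end"] = df["start"]

# Group by the month of `start` and count the rows in each group
counts = (
    df.groupby(df["start"].dt.to_period("M"))
      .size()
      .rename_axis("Month")
      .reset_index(name="Entries")
)
print(counts)
```

Months with no entries simply do not appear in the result.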

Pandas iterate two dataframes

I have two dataframes: one holds a list of several ids, and the other holds each person's name and id.
I want to loop over them so that when an id in df1 equals an id in df2, the name from df2 is added as a new column in df1.
I tried to adapt some fuzzywuzzy code that I found, but it didn't create the column.
for key,row in df.iterrows():
    choices = str(list(df2.NAME_ID.unique()))
    names = process.extract(str(row['P1_ID']), choices, limit=2)[0][0]
    name = df2[df2['NAME_ID'] == names]['NAME']
    if not name.empty:
        df.loc[key,'Name'] = name
import pandas as pd
df = pd.read_clipboard(sep=r'\s\s+')
GAME_DATE_EST GAME_ID GAME_STATUS_TEXT P1_ID P2_ID SEASON P1_ID PTS_P1
0 2020-01-01 21900504 Final 1610612764 1610612753 2019 1610612764 10
1 2020-01-01 21900505 Final 1610612752 1610612757 2019 1610612752 9
2 2020-01-01 21900506 Final 1610612749 1610612750 2019 1610612749 10
3 2020-01-01 21900507 Final 1610612747 1610612756 2019 1610612747 8
4 2019-12-31 21900497 Final 1610612766 1610612738 2019 1610612766 9
df2
NAME_ID STANDINGSDATE NAME G W L W_PCT
0 1610612747 2020-01-01 Math 34 27 7 0.79
1 1610612743 2020-01-01 John 33 23 10 0.70
2 1610612746 2020-01-01 Elias 35 24 11 0.69
3 1610612745 2020-01-01 Alexander 34 23 11 0.68
4 1610612742 2020-01-01 Michael 33 21 12 0.64
I hope you understand and can help me
For that, you can do a simple join, with df2 indexed by NAME_ID so that it matches df's P1_ID column:
newdf = df.join(df2.set_index('NAME_ID'), on='P1_ID', how='left')
Based on your given data, you can try:
df.merge(df2[['NAME_ID','NAME']], left_on=['P1_ID'], right_on=['NAME_ID'], how='left')
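When only the name is needed, a lookup with Series.map is another option that avoids both the loop and the extra merged columns. This is a sketch on a cut-down version of the sample data (two games, two names), assuming each NAME_ID should map to a single name; name_by_id is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "GAME_ID": [21900504, 21900507],
    "P1_ID": [1610612764, 1610612747],
})
df2 = pd.DataFrame({
    "NAME_ID": [1610612747, 1610612743],
    "NAME": ["Math", "John"],
})

# Series mapping id -> name; drop_duplicates keeps the lookup index unique
name_by_id = df2.drop_duplicates("NAME_ID").set_index("NAME_ID")["NAME"]
# Ids that do not appear in df2 get NaN
df["Name"] = df["P1_ID"].map(name_by_id)
print(df)
```

An exact match on the id is enough here, so fuzzy string matching is not required.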
