Pandas iterate two dataframes - python

I have two DataFrames: df1 has a list of several IDs, and df2 has the name of a person and their ID.
I want to loop over them so that when an ID in df1 equals an ID in df2, the name from df2 is written into a new column in df1.
I tried to adapt this fuzzywuzzy code that I found, but the column wasn't created:
for key, row in df.iterrows():
    choices = str(list(df2.NAME_ID.unique()))
    names = process.extract(str(row['P1_ID']), choices, limit=2)[0][0]
    name = df2[df2['NAME_ID'] == names]['NAME']
    if not name.empty:
        df.loc[key, 'Name'] = name
import pandas as pd
df = pd.read_clipboard(sep=r'\s\s+')
df
GAME_DATE_EST GAME_ID GAME_STATUS_TEXT P1_ID P2_ID SEASON P1_ID PTS_P1
0 2020-01-01 21900504 Final 1610612764 1610612753 2019 1610612764 10
1 2020-01-01 21900505 Final 1610612752 1610612757 2019 1610612752 9
2 2020-01-01 21900506 Final 1610612749 1610612750 2019 1610612749 10
3 2020-01-01 21900507 Final 1610612747 1610612756 2019 1610612747 8
4 2019-12-31 21900497 Final 1610612766 1610612738 2019 1610612766 9
df2
NAME_ID STANDINGSDATE NAME G W L W_PCT
0 1610612747 2020-01-01 Math 34 27 7 0.79
1 1610612743 2020-01-01 John 33 23 10 0.70
2 1610612746 2020-01-01 Elias 35 24 11 0.69
3 1610612745 2020-01-01 Alexander 34 23 11 0.68
4 1610612742 2020-01-01 Michael 33 21 12 0.64
I hope that makes sense and someone can help me.

For that, you don't need a loop; a simple join will do. Note that join matches the on= column against the other frame's index, so set NAME_ID as df2's index first:
newdf = df.join(df2.set_index('NAME_ID'), on='P1_ID', how='left')
Based on your given data, you can try:
df.merge(df2[['NAME_ID','NAME']], left_on=['P1_ID'], right_on=['NAME_ID'], how='left')
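A runnable sketch of that merge on a trimmed-down version of the question's data (only the columns that matter for the join):

```python
import pandas as pd

# Two rows of df (one with a matching ID in df2, one without)
df = pd.DataFrame({'GAME_ID': [21900504, 21900507],
                   'P1_ID': [1610612764, 1610612747]})
df2 = pd.DataFrame({'NAME_ID': [1610612747, 1610612743],
                    'NAME': ['Math', 'John']})

# Left merge keeps every df row; unmatched IDs get NaN for NAME
out = df.merge(df2[['NAME_ID', 'NAME']],
               left_on='P1_ID', right_on='NAME_ID',
               how='left').drop(columns='NAME_ID')
print(out)
#     GAME_ID       P1_ID  NAME
# 0  21900504  1610612764   NaN
# 1  21900507  1610612747  Math
```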

Related

JOIN two DataFrames and replace Column values in Python

I have dataframe df1:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 5
3 txn vol(new) 2020-02-01 20
4 txn vol(tenu) 2020-01-01 30
5 txn vol(tenu) 2020-02-01 40
Second Dataframe df2:
Expenses Calendar Actual
0 txn vol(new) 2020-01-01 23
1 txn vol(new) 2020-02-01 32
2 txn vol(tenu) 2020-01-01 60
Now I want to take df1, join it to df2 on Expenses + Calendar, and replace the Actual value in df1 with the one from df2 wherever there is a match.
Expected output is:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
I am using the code below:
cols_to_replace = ['Actual']
idx1 = df1.set_index(['Calendar', 'Expenses']).index
idx2 = df2.set_index(['Calendar', 'Expenses']).index
df1.loc[idx1.isin(idx2), cols_to_replace] = df2.loc[idx2.isin(idx1), cols_to_replace].values
It works when df1 is small, but with larger data (df1 has 10K records, df2 has 150) the updates happen with wrong values.
Could anyone please suggest how to resolve this?
Thank you
If I understand your solution correctly, it seems to assume that (1) the Calendar-Expenses combinations are unique and (2) that their occurrences in both dataframes are aligned (same order)? I suspect that (2) isn't actually the case?
Another option - .merge() is fine! - could be:
df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[list(set(df1.index).intersection(df2.index)), "Actual"] = df2["Actual"]
df2 = df2.reset_index() # If the original df2 is still needed
df1 = df1.reset_index()
Here is one way to do it, using pd.merge:
df = df.merge(df2,
              on=['Expenses', 'Calendar'],
              how='left',
              suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
df['Actual'] = df['Actual'].astype(int)
df
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
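Another option, assuming the Expenses/Calendar combinations are unique, is DataFrame.update, which overwrites df1's values in place wherever df2 has a value for the same key (note it casts the updated column to float):

```python
import pandas as pd

df1 = pd.DataFrame({'Expenses': ['xyz', 'xyz', 'txn vol(new)', 'txn vol(new)',
                                 'txn vol(tenu)', 'txn vol(tenu)'],
                    'Calendar': ['2020-01-01', '2020-02-01', '2020-01-01',
                                 '2020-02-01', '2020-01-01', '2020-02-01'],
                    'Actual': [10, 99, 5, 20, 30, 40]})
df2 = pd.DataFrame({'Expenses': ['txn vol(new)', 'txn vol(new)', 'txn vol(tenu)'],
                    'Calendar': ['2020-01-01', '2020-02-01', '2020-01-01'],
                    'Actual': [23, 32, 60]})

keys = ['Expenses', 'Calendar']
out = df1.set_index(keys)
out.update(df2.set_index(keys))   # align on the key index, overwrite matches
out = out.reset_index()
print(out['Actual'].tolist())     # [10.0, 99.0, 23.0, 32.0, 60.0, 40.0]
```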

Get data from another DataFrame in Python

My data frame df1:
ID Date
0 90 02/01/2021
1 101 01/01/2021
2 30 12/01/2021
My data frame df2:
ID City 01/01/2021 02/01/2021 12/01/2021
0 90 A 20 14 22
1 101 B 15 10 5
2 30 C 12 9 13
I need to create a column 'New' in df1. It should contain the data from df2 that corresponds to the 'ID' and 'Date' of each df1 row. I am finding it difficult to merge the data. How could I do it?
Use DataFrame.melt with DataFrame.merge:
df22 = df2.drop(columns='City').melt(['ID'], var_name='Date', value_name='Val')
df = df1.merge(df22, how='left')
print (df)
ID Date Val
0 90 02/01/2021 14
1 101 01/01/2021 15
2 30 12/01/2021 13
You can melt and merge:
df1.merge(df2.melt(id_vars=['ID', 'City'], var_name='Date'), on=['ID', 'Date'])
output:
ID Date City value
0 90 02/01/2021 A 14
1 101 01/01/2021 B 15
2 30 12/01/2021 C 13
Alternative:
df1.merge(df2.melt(id_vars='ID',
                   value_vars=df2.filter(regex='/').columns,
                   var_name='Date'),
          on=['ID', 'Date'])
output:
ID Date value
0 90 02/01/2021 14
1 101 01/01/2021 15
2 30 12/01/2021 13
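If melting feels like overkill, a plain per-row lookup with DataFrame.at on an ID-indexed copy of df2 also works (fine for small frames; a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [90, 101, 30],
                    'Date': ['02/01/2021', '01/01/2021', '12/01/2021']})
df2 = pd.DataFrame({'ID': [90, 101, 30],
                    'City': ['A', 'B', 'C'],
                    '01/01/2021': [20, 15, 12],
                    '02/01/2021': [14, 10, 9],
                    '12/01/2021': [22, 5, 13]})

# Index df2 by ID, then pick each (ID, Date) cell directly
wide = df2.set_index('ID')
df1['New'] = [wide.at[i, d] for i, d in zip(df1['ID'], df1['Date'])]
print(df1['New'].tolist())  # [14, 15, 13]
```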

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method (note that query references local variables with @, not #):
df['year'] = df['index'].astype(int) // 100

def get_stavg(df, year, month):
    # average 'st' over the five years before `year`, same month
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
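A self-contained sketch of that approach (in query, @ refers to the function's local variables; note the window simply shrinks for the first few years rather than producing NaN):

```python
import pandas as pd

# Six Januaries, st = 1..6
df = pd.DataFrame({'index': [f"{y}01" for y in range(2010, 2016)],
                   'st': [1, 2, 3, 4, 5, 6]})
df['month'] = df['index'].str[-2:]
df['year'] = df['index'].str[:4].astype(int)

def get_stavg(df, year, month):
    # average 'st' over the five years before `year`, same month
    window = df.query('@year - 5 <= year < @year and month == @month')
    return window.st.mean()

df['stavg'] = df.apply(lambda r: get_stavg(df, r['year'], r['month']), axis=1)
print(df['stavg'].tolist())  # [nan, 1.0, 1.5, 2.0, 2.5, 3.0]
```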
If you are looking for a pandas only solution you could do something like
Dummy Data
Here we create a dummy datasets with 10 years of data with only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
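One caveat: rolling(5) includes the current row in the window. If the average should cover only the five preceding years, as the question asks, shift within each group first. A sketch with single-month dummy data:

```python
import pandas as pd

# Ten Januaries with n = 0, 2, 4, ..., 18
df = pd.DataFrame({"date": pd.to_datetime([f"{y}-01-01" for y in range(2010, 2020)]),
                   "n": range(0, 20, 2)})

# shift(1) first so the 5-row window covers the five *previous*
# occurrences of that month, excluding the current row
df["n_mean_prev"] = (df.groupby(df["date"].dt.month)["n"]
                       .transform(lambda s: s.shift().rolling(5).mean()))
print(df["n_mean_prev"].tolist())
# [nan, nan, nan, nan, nan, 4.0, 6.0, 8.0, 10.0, 12.0]
```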
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)

How to write a function that transforms the columns of my dataframe into a single column?

I have a dataframe like this:
A = ID Material1 Materia2 Material3
14 0 0 0
24 1 0 0
12 1 1 0
25 0 0 2
I want to have all information in one column like this:
A = ID Materials
14 Nan
24 Material1
12 Material1
12 Material2
25 Material3
25 Material3
Can anyone help me write a function, please?
Use DataFrame.melt, then repeat each row by its count with Index.repeat and DataFrame.loc:
df1 = df.melt('ID', var_name='Materials')
df1 = df1.loc[df1.index.repeat(df1['value'])].drop('value', axis=1).reset_index(drop=True)
print (df1)
ID Materials
0 24 Material1
1 12 Material1
2 12 Materia2
3 25 Material3
4 25 Material3
EDIT: To also include IDs whose material counts are all 0 (as missing values), use DataFrame.merge with a left join on a one-column DataFrame of the original df['ID'], de-duplicated with DataFrame.drop_duplicates:
df1 = df.melt('ID', var_name='Materials')
df0 = df[['ID']].drop_duplicates()
print (df0)
ID
0 14
1 24
2 12
3 25
df2 = df1.loc[df1.index.repeat(df1['value'])].drop('value', axis=1).reset_index(drop=True)
df2 = df0.merge(df2, on='ID', how='left')
print (df2)
ID Materials
0 14 NaN
1 24 Material1
2 12 Material1
3 12 Materia2
4 25 Material3
5 25 Material3
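For reference, both steps can be chained into one expression; a sketch on the question's data (column names as posted, including the 'Materia2' spelling):

```python
import pandas as pd

df = pd.DataFrame({'ID': [14, 24, 12, 25],
                   'Material1': [0, 1, 1, 0],
                   'Materia2': [0, 0, 1, 0],
                   'Material3': [0, 0, 0, 2]})

long = df.melt('ID', var_name='Materials')
out = (df[['ID']]  # keep every ID, even the all-zero ones
         .merge(long.loc[long.index.repeat(long['value'])]
                    .drop(columns='value'),
                on='ID', how='left'))
print(out)
```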

Shift time series with missing dates in Pandas

I have a times series with some missing entries, that looks like this:
date value
---------------
2000 5
2001 10
2003 8
2004 72
2005 12
2007 13
I would like to do create a column for the "previous_value". But I only want it to show values for consecutive years. So I want it to look like this:
date value previous_value
-------------------------------
2000 5 nan
2001 10 5
2003 8 nan
2004 72 8
2005 12 72
2007 13 nan
However just applying pandas shift function directly to the column 'value' would give 'previous_value' = 10 for 'time' = 2003, and 'previous_value' = 12 for 'time' = 2007.
What's the most elegant way to deal with this in pandas? (I'm not sure if it's as easy as setting the 'freq' attribute).
In [588]: df = pd.DataFrame({ 'date':[2000,2001,2003,2004,2005,2007],
'value':[5,10,8,72,12,13] })
In [589]: df['previous_value'] = df.value.shift()[ df.date == df.date.shift() + 1 ]
In [590]: df
Out[590]:
date value previous_value
0 2000 5 NaN
1 2001 10 5
2 2003 8 NaN
3 2004 72 8
4 2005 12 72
5 2007 13 NaN
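The same masking can also be written with diff, keeping the shifted value only where the gap to the previous row is exactly one year:

```python
import pandas as pd

df = pd.DataFrame({'date': [2000, 2001, 2003, 2004, 2005, 2007],
                   'value': [5, 10, 8, 72, 12, 13]})

# shift() gives the previous value; where() blanks it out
# unless the year-to-year gap is exactly 1
df['previous_value'] = df['value'].shift().where(df['date'].diff().eq(1))
print(df['previous_value'].tolist())  # [nan, 5.0, nan, 8.0, 72.0, nan]
```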
Also see here for a time series approach using resample(): Using shift() with unevenly spaced data
Your example doesn't look like real time series data with timestamps. Let's take another example with the missing date 2020-01-03:
df = pd.DataFrame({"val": [10, 20, 30, 40, 50]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))
df.drop(pd.Timestamp('2020-01-03'), inplace=True)
val
2020-01-01 10
2020-01-02 20
2020-01-04 40
2020-01-05 50
To shift by one day you can set the freq parameter to 'D':
df.shift(1, freq='D')
Output:
val
2020-01-02 10
2020-01-03 20
2020-01-05 40
2020-01-06 50
To combine original data with the shifted one you can merge both tables:
df.merge(df.shift(1, freq='D'),
         left_index=True,
         right_index=True,
         how='left',
         suffixes=('', '_previous'))
Output:
val val_previous
2020-01-01 10 NaN
2020-01-02 20 10.0
2020-01-04 40 NaN
2020-01-05 50 40.0
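Instead of a merge, you can also realign the shifted series back onto the original index with reindex:

```python
import pandas as pd

df = pd.DataFrame({"val": [10, 20, 30, 40, 50]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))
df = df.drop(pd.Timestamp("2020-01-03"))

# Shift the index forward a day, then realign to the original index:
# each row picks up the value dated exactly one day earlier, or NaN
df["val_previous"] = df["val"].shift(1, freq="D").reindex(df.index)
print(df["val_previous"].tolist())  # [nan, 10.0, nan, 40.0]
```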
You can find other offset aliases here.
