Match, update and replace values from one dataset to another in Pandas - python

I have two datasets, one with old data and one with updated data. I'd like to create a new dataset by updating values wherever the area and date values match.
Data
df1
area date aa bb cc
japan 10/1/2027 1 0 0
us 1/1/2022 5 5 5
fiji 11/2/2026 1 1 1
df2
area date aa bb cc stat
japan 10/1/2027 0 5 5 yes
fiji 11/2/2026 0 0 10 no
I wish to replace the values in the [aa], [bb], and [cc] columns of df2 with the updated values from df1 wherever the date and area values match.
Desired
area date aa bb cc stat
japan 10/1/2027 1 0 0 yes
fiji 11/2/2026 1 1 1 no
Doing
df['date'] = df.date.apply(lambda x: np.nan if x == ' ' else x)
I am not exactly sure how to set this up, though I have an idea. Any suggestion is appreciated.

You can merge and combine_first:
cols = ['area', 'date']
out = (df2[cols].merge(df1, on=cols, how='left')  # pull the updated aa/bb/cc from df1
                .combine_first(df2)               # fill stat (and any misses) from df2
                [df2.columns])                    # restore df2's column order
output:
area date aa bb cc stat
0 japan 10/1/2027 1 0 0 yes
1 fiji 11/2/2026 1 1 1 no
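For reference, a self-contained sketch of this answer using the question's sample data:
import pandas as pd

df1 = pd.DataFrame({'area': ['japan', 'us', 'fiji'],
                    'date': ['10/1/2027', '1/1/2022', '11/2/2026'],
                    'aa': [1, 5, 1], 'bb': [0, 5, 1], 'cc': [0, 5, 1]})
df2 = pd.DataFrame({'area': ['japan', 'fiji'],
                    'date': ['10/1/2027', '11/2/2026'],
                    'aa': [0, 0], 'bb': [5, 0], 'cc': [5, 10],
                    'stat': ['yes', 'no']})

cols = ['area', 'date']
out = (df2[cols].merge(df1, on=cols, how='left')
                .combine_first(df2)[df2.columns])
print(out)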

Using .merge and making sure date columns in both dfs are set to datetime.
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])
df3 = (pd.merge(left=df1, right=df2, on=["area", "date"], how="right")
         .filter(regex=r".*(?<!_y)$"))  # keep only columns not ending in _y (df2's stale copies)
df3.columns = df3.columns.str.split("_").str[0]  # strip the _x suffix left on df1's columns
print(df3)
area date aa bb cc stat
0 japan 2027-10-01 1 0 0 yes
1 fiji 2026-11-02 1 1 1 no
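A variant of the same merge that avoids the regex by naming the suffixes up front (a sketch of my own, not the original answer's code):
import pandas as pd

df1 = pd.DataFrame({'area': ['japan', 'us', 'fiji'],
                    'date': ['10/1/2027', '1/1/2022', '11/2/2026'],
                    'aa': [1, 5, 1], 'bb': [0, 5, 1], 'cc': [0, 5, 1]})
df2 = pd.DataFrame({'area': ['japan', 'fiji'],
                    'date': ['10/1/2027', '11/2/2026'],
                    'aa': [0, 0], 'bb': [5, 0], 'cc': [5, 10],
                    'stat': ['yes', 'no']})

# keep df1's value columns unsuffixed; tag df2's stale copies with _old, then drop them
df3 = (pd.merge(df1, df2, on=["area", "date"], how="right", suffixes=("", "_old"))
         .drop(columns=["aa_old", "bb_old", "cc_old"]))
print(df3)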

I think this can possibly be simplified to:
output = df1[df1['area'].isin(df2['area']) & df1['date'].isin(df2['date'])]
OUTPUT:
area date aa bb cc
japan 10/1/2027 1 0 0
fiji 11/2/2026 1 1 1
(Note this filters df1, so only df1's columns survive; the stat column would still have to be brought over from df2.)
Even when df1 looks like this:
DF1:
area date aa bb cc
0 japan 10/1/2027 1 0 0
1 us 1/1/2022 5 5 5
2 fiji 11/2/2026 1 1 1
3 fiji 12/5/2025 9 9 9
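One caveat worth flagging (an editorial note, not part of the original answer): isin checks area and date independently, so a df1 row can pass even when no single df2 row has that exact (area, date) pair. A small sketch of the pitfall with hypothetical data:
import pandas as pd

df1 = pd.DataFrame({'area': ['japan', 'us', 'fiji', 'japan'],
                    'date': ['10/1/2027', '1/1/2022', '11/2/2026', '11/2/2026'],
                    'aa': [1, 5, 1, 7]})
df2 = pd.DataFrame({'area': ['japan', 'fiji'],
                    'date': ['10/1/2027', '11/2/2026'],
                    'stat': ['yes', 'no']})

out = df1[df1['area'].isin(df2['area']) & df1['date'].isin(df2['date'])]
print(out)
# the (japan, 11/2/2026) row slips through even though df2 has no such pair,
# because 'japan' and '11/2/2026' each appear somewhere in df2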

Related

Pandas.DataFrame - create a new column, based on whether a value in another column has occurred or not

I'm an amateur user with some VBA experience, trying to switch to Python because my beautiful new MBP runs VBA miserably.
I'm trying to create a df column based on whether another column's value has occurred already. If it has, then the new column's value on that row is 0; if not, 1.
For example: I want to create column C in the example below. How do I do it quickly?
A B C (to create column C)
0 001 USA 1
1 002 Canada 1
3 003 China 1
4 004 India 1
5 005 UK 1
6 006 Japan 1
7 007 USA 0
8 008 UK 0
You can check for duplicates on the 'B' column and set duplicates to 0. Then set any non-duplicates to 1 like this:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': ['USA', 'Canada', 'China', 'India', 'UK', 'Japan', 'USA', 'UK']})
df.loc[df['B'].duplicated(), 'C'] = 0    # repeat occurrences get 0
df['C'] = df['C'].fillna(1).astype(int)  # everything else (first occurrences) gets 1
print(df)
Output:
A B C
0 1 USA 1
1 2 Canada 1
2 3 China 1
3 4 India 1
4 5 UK 1
5 6 Japan 1
6 7 USA 0
7 8 UK 0
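The two steps can also be collapsed into a single vectorized line (a condensation of the same idea, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': ['USA', 'Canada', 'China', 'India', 'UK', 'Japan', 'USA', 'UK']})

# duplicated() marks every repeat after the first; invert and cast to get 1/0
df['C'] = (~df['B'].duplicated()).astype(int)
print(df)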
After creating your dataframe :
import pandas as pd

data = [["001", "USA"], ["002", "Canada"], ["003", "China"],
        ["004", "India"], ["005", "UK"], ["006", "Japan"], ["007", "USA"], ["008", "UK"]]
# Create a dataframe
df = pd.DataFrame(data, columns=["A", "B"])
You can apply a function to each value of one of the columns (in your case, the B column) and have the output of the function as the value of your column.
df["C"] = df.B.apply(lambda x: 1 if df.B.value_counts()[x] == 1 else 0)
This checks whether the value in the B column appears anywhere else in the column, returning 1 if it is unique and 0 if it is duplicated. Note that, unlike the duplicated() approach above, this marks every occurrence of a repeated value as 0, including the first, which is why USA and UK get 0 on all of their rows below.
The dataframe looks like this :
A B C
0 001 USA 0
1 002 Canada 1
2 003 China 1
3 004 India 1
4 005 UK 0
5 006 Japan 1
6 007 USA 0
7 008 UK 0
If you want the values to be recalculated, you need to re-run
df["C"] = df.B.apply(lambda x: 1 if df.B.value_counts()[x] == 1 else 0)
each time after you add a row.
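As a side note, calling value_counts inside apply recomputes the counts for every row, which is quadratic in the length of the column. A vectorized equivalent with the same semantics (every occurrence of a repeated value gets 0) could look like this sketch:
import pandas as pd

data = [["001", "USA"], ["002", "Canada"], ["003", "China"],
        ["004", "India"], ["005", "UK"], ["006", "Japan"], ["007", "USA"], ["008", "UK"]]
df = pd.DataFrame(data, columns=["A", "B"])

# duplicated(keep=False) flags every occurrence of a repeated value;
# inverting and casting gives 1 only for values that appear exactly once
df["C"] = (~df["B"].duplicated(keep=False)).astype(int)
print(df)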

Update all values in a pandas dataframe based on all instances of a specific value in another dataframe

My apologies beforehand! I have done this before a few times, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 that match a specific value in df1, while not changing the other values in df2. I can do this pretty easily with np.where on columns of a single dataframe; the brain fog is about how I did this previously with 2 dataframes!
Goal: Set values in Df2 to 0 if they are 0 in DF1 - otherwise keep the DF2 value
Example
df1
A B C
4 0 1
0 2 0
1 4 0
df2
A B C
1 8 1
9 2 7
1 4 6
Expected df2 after our element swap
A B C
1 0 1
0 2 0
1 4 0
Brain fog is bad! Thank you for the assistance!
Using fillna
>>> df2[df1 != 0].fillna(0)
You can try
df2[df1.eq(0)] = 0
print(df2)
A B C
0 1 0 1
1 0 2 0
2 1 4 0
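For completeness, a self-contained sketch with the sample data; DataFrame.mask is an equivalent, non-mutating spelling of the same idea:
import pandas as pd

df1 = pd.DataFrame({'A': [4, 0, 1], 'B': [0, 2, 4], 'C': [1, 0, 0]})
df2 = pd.DataFrame({'A': [1, 9, 1], 'B': [8, 2, 4], 'C': [1, 7, 6]})

# wherever df1 is 0, replace the aligned df2 value with 0
out = df2.mask(df1.eq(0), 0)
print(out)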

Creating dummy variable depending on year and category in pandas

This is my first time posting a question here, so please let me know if my question is lacking in any way.
Let's say I have the following dataframe, where "Value" contains only the integers 1 or 2. Basically, I want to create a column ("Desired") with a dummy variable that is 1 as long as the firm has had Value=1 since its first appearance. Once the firm has Value=2, the dummy should be 0 from then on, even if the firm later reverts back to Value=1.
Firm_ID Year Value Desired
0000001 2000 1 1
0000001 2001 1 1
0000001 2002 2 0
0000001 2003 2 0
0000001 2004 1 0
0000001 2005 1 0
0000002 2000 2 0
0000002 2001 2 0
0000002 2002 2 0
0000003 2000 1 1
0000003 2001 1 1
0000003 2002 1 1
0000003 2003 1 1
d = {'firm_id': ["0000001", "0000001", "0000001", "0000001", "0000001", "0000001",
                 "0000002", "0000002", "0000002",
                 "0000003", "0000003", "0000003", "0000003"],
     'year': [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2000, 2001, 2002, 2003],
     'Value': [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)
Currently, the code I am running is the following.
for i in range(df.shape[0]):
    firm = df.loc[i, 'firm_id']
    year = df.loc[i, 'year']
    temp_df = df[df['firm_id'] == firm]
    if (temp_df.groupby(['year']).max()[['Value']].max() == 2)[0]:  # at some point this firm becomes Value==2
        # get the earliest year of becoming Value==2
        year_df = temp_df.groupby(['year']).max()[['Value']]
        ch_year = year_df[year_df['Value'] == 2].index.min()  # year the firm becomes Value==2
        if year >= ch_year:
            df.loc[i, 'Desired'] = 0
        else:
            df.loc[i, 'Desired'] = 1
    else:  # the firm always remains Value==1
        df.loc[i, 'Desired'] = 1
However, this code is taking too long for the size of my current dataframe.
Is there a more efficient way to write this?
# df = df.sort_values(["firm_id", "year"])
df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)
# Value
# 0 1
# 1 1
# 2 0
# 3 0
# 4 0
# 5 0
# 6 0
# 7 0
# 8 0
# 9 1
# 10 1
# 11 1
# 12 1
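A self-contained sketch spelling the same one-liner out step by step (the sort matters, as the commented-out line above hints):
import pandas as pd

d = {'firm_id': ["0000001"] * 6 + ["0000002"] * 3 + ["0000003"] * 4,
     'year': [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2000, 2001, 2002, 2003],
     'Value': [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1]}
df = pd.DataFrame(d).sort_values(["firm_id", "year"])

not_one = df.Value.ne(1)                         # True on every Value==2 row
seen_two = not_one.groupby(df.firm_id).cumsum()  # running count of 2s per firm
df["Desired"] = seen_two.lt(1).astype(int)       # 1 only while no 2 has appeared yet
print(df)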

Counting mode occurrences for all columns in a dataframe

I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
Now I want to create a dataframe that gives me the count of the mode for each column: for column AA the count is 3 (mode 10), and for column CC the count is 4 (mode 0). But BB has two modes, 1 and 2, so for BB I want the sum of the counts of its modes: 2 + 2 = 4.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Another slightly more scalable solution using list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64
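A self-contained sketch of this approach (assuming, as the output above implies, that df holds only the value columns, with In left out):
import pandas as pd

df = pd.DataFrame({'AA': [10, 11, 10, 9, 10, 1],
                   'BB': [1, 2, 6, 1, 3, 2],
                   'CC': [0, 3, 0, 0, 1, 0]})

# one boolean frame per mode row (NaN slots compare False), stacked and summed per column
counts = pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
print(counts)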
You can compare each column with its modes and count the matches with sum:
df = pd.DataFrame({'Columns': df.columns,
                   'Val': [df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
First we get the modes of the columns with DataFrame.mode.
Then we compare each column to its modes with Series.isin and sum the matches.
modes = df.iloc[:, 1:].mode()
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
Using the pyjanitor module for its groupby_agg function, and returning a dataframe:
(df.melt(id_vars='In')
   .groupby('variable')
   .agg(numbers=('value', 'value_counts'))
   # subtract the max of numbers (for each group) from each number in the group
   .groupby_agg(by='variable',
                agg=lambda x: x - x.max(),
                agg_column_name='numbers',
                new_column_name='test')
   .query('test==0')
   .groupby('variable')
   .agg(count=('numbers', 'sum'))
)
count
variable
AA 3
BB 4
CC 4
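For readers without pyjanitor, the same "sum the counts that tie the per-column maximum" idea can be written in plain pandas (an editorial sketch, not the original answer's code):
import pandas as pd

df = pd.DataFrame({'In': [0, 1, 2, 3, 4, 5],
                   'AA': [10, 11, 10, 9, 10, 1],
                   'BB': [1, 2, 6, 1, 3, 2],
                   'CC': [0, 3, 0, 0, 1, 0]})

def mode_count(s):
    # count occurrences, then sum the counts that tie the maximum
    vc = s.value_counts()
    return vc[vc == vc.max()].sum()

print(df[['AA', 'BB', 'CC']].apply(mode_count))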

Efficiently transform pandas dataFrame using column name as factor

I would like to transform a DataFrame produced by a piece of software into a more Python-usable one, and I can't do it simply with pandas because I need information contained in the column names. Here is a simple example:
import pandas as pd
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
pd.DataFrame(d)
00 01 10 11
0 1 11 111 1111
The column names contain the factors that I need to use as rows; I would like to get something like this:
df = {'trt': [0,0,1,1], 'grp': [0,1,0,1], 'value':[1,11,111,1111]}
pd.DataFrame(df)
grp trt value
0 0 0 1
1 1 0 11
2 0 1 111
3 1 1 1111
Any ideas on how to do it properly?
Solution with MultiIndex.from_arrays, created by indexing the column names with str, followed by a transpose with T:
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print (df)
0 1
0 1 0 1
0 1 11 111 1111
df1 = df.T.reset_index()
df1.columns = ['trt','grp','value']  # per the desired output, the first character is trt, the second grp
print (df1)
trt grp value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
Similar solution with rename_axis and rename index:
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
df = pd.DataFrame(d)
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df.rename_axis(('trt','grp'), axis=1).rename(index={0:'value'}).T.reset_index())
trt grp value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
To me the simplest solution is just melting the original frame and splitting the column names in a second step. Something like this:
df = pd.DataFrame(d)
mf = pd.melt(df)
mf[['trt', 'grp']] = mf.pop('variable').apply(lambda x: pd.Series(tuple(x)))
Here's mf after melting:
variable value
0 00 1
1 01 11
2 10 111
3 11 1111
And the final result, after splitting the variable column:
value trt grp
0 1 0 0
1 11 0 1
2 111 1 0
3 1111 1 1
I'd encourage you to read up more on melting here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html. It can be incredibly useful.
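As a variant of the splitting step (an editorial sketch, not the original answer's code), the apply/tuple trick can be replaced with vectorized .str indexing on the melted frame:
import pandas as pd

d = {'00': [1], '01': [11], '10': [111], '11': [1111]}
mf = pd.melt(pd.DataFrame(d))

# each column name is two characters: the first is trt, the second grp
mf['trt'] = mf['variable'].str[0]
mf['grp'] = mf['variable'].str[1]
mf = mf.drop(columns='variable')
print(mf)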
