VLOOKUP in pandas across two dataframe columns - python

I have two dataframes, and I want to match the IDs in df1 against df2 and merge the matching URLs into df1 as separate columns. Multiple columns need to be added, because df2 has several different country entries for a given ID.
df1 looks like below:
ID URL
A example.com/1
B example.com/2
C example.com/3
D example.com/4
df2 is like this:
ID country URL
A usa example.com/usa-A
B uk example.com/uk-B
C canada example.com/canada-C
A uk example.com/uk-A
C usa example.com/usa-C
What I am expecting df1 to look like:
ID URL uk usa canada
A example.com/1 example.com/uk-A example.com/usa-A NaN
B example.com/2 example.com/uk-B NaN NaN
C example.com/3 NaN example.com/usa-C example.com/canada-C
D example.com/4 NaN NaN NaN
In other words: if a df1 ID (say A) is found in df2 against a country, bring that country's URL into df1 in the corresponding country column.
The way I am trying to achieve this is using a for loop with a map call below:
final = pd.DataFrame()
for a in countries_list:
    b = df2.loc[df2["country"] == a]
    df1["country"] = df1['ID'].map(df2.set_index('ID')['URL'])
    final = pd.concat([final, df1])
It runs for a certain number of countries and then starts throwing InvalidIndexError: Reindexing only valid with uniquely valued Index objects, which I tried to overcome by calling reset_index() on both df1 and df2, but after a certain number of iterations it still throws the same error.
Can someone suggest a more efficient way to do this, or any way I could run it over all the iterations?
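For reproducibility, the sample frames can be constructed like this (reconstructed from the tables above):
import pandas as pd

df1 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                    'URL': ['example.com/1', 'example.com/2',
                            'example.com/3', 'example.com/4']})
df2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'C'],
                    'country': ['usa', 'uk', 'canada', 'uk', 'usa'],
                    'URL': ['example.com/usa-A', 'example.com/uk-B',
                            'example.com/canada-C', 'example.com/uk-A',
                            'example.com/usa-C']})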
Thanks,

Try as follows:
res = df1.merge(df2.pivot(index='ID', columns='country', values='URL'),
                left_on='ID', right_index=True, how='left')
print(res)
ID URL canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C
3 D example.com/4 NaN NaN NaN
Explanation
First, use DataFrame.pivot on df2. We get:
print(df2.pivot(index='ID',columns='country',values='URL'))
country canada uk usa
ID
A NaN example.com/uk-A example.com/usa-A
B NaN example.com/uk-B NaN
C example.com/canada-C NaN example.com/usa-C
Next, use DataFrame.merge to join df1 with the pivoted df2 on ID, passing 'left' to the how parameter to "use only keys from left frame". (Leave how='left' out if you are not interested in rows, like ID D, whose added columns are all NaN.)
If you're set on a particular column order, use e.g.:
res = res.loc[:,['ID','URL','uk','usa','canada']]
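Note that DataFrame.pivot raises a ValueError if df2 contains duplicate (ID, country) pairs. If that can happen in your real data, a safer sketch is pivot_table with an aggregation (aggfunc='first' here is an assumption about which URL to keep):
res = df1.merge(df2.pivot_table(index='ID', columns='country',
                                values='URL', aggfunc='first'),
                left_on='ID', right_index=True, how='left')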

# map df1 to df2 for URL
df2['URL2'] = df2['ID'].map(df1.set_index('ID')['URL'])
# pivot
(df2.pivot(index=['ID', 'URL2'], columns='country', values='URL')
    .reset_index()
    .rename_axis(columns=None)
    .rename(columns={'URL2': 'URL'}))
ID URL canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C
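Note that this variant starts from df2, so IDs that appear only in df1 (D in the sample) are absent from the result; use the merge-based approach above if those rows must be kept.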

Related

How to change row names to column names in a dataframe using python?

I have two dataframes:
df1:
colname value
gender M
status business
age 60
df2:
colname value
name Susan
Place Africa
gender F
Is there a way I can concatenate these two dataframes to produce the expected output? I tried an outer join but it does not work. Thank you in advance.
Note: the dataframes do not always share the same attributes, and I am also trying to remove the colname and value columns.
Expected output:
gender status age name Place
0 M business 60 0 0
1 F 0 0 Susan Africa
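For reference, the sample frames can be reconstructed as follows:
import pandas as pd

df1 = pd.DataFrame({'colname': ['gender', 'status', 'age'],
                    'value': ['M', 'business', 60]})
df2 = pd.DataFrame({'colname': ['name', 'Place', 'gender'],
                    'value': ['Susan', 'Africa', 'F']})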
You can convert to Series with colname as index and concat:
dfs = [df1, df2]
out = pd.concat([d.set_index('colname')['value'] for d in dfs],
                axis=1, ignore_index=True).T
output:
colname gender status age name Place
0 M business 60 NaN NaN
1 F NaN NaN Susan Africa
Try this:
pd.concat([df1.set_index('colname').T, df2.set_index('colname').T]).fillna(0).reset_index(drop=True)
gender status age name Place
0 M business 60 0 0
1 F 0 0 Susan Africa
The fillna will replace the NaN values with 0.
The lines below resolve your issue; you can use the pandas transpose function to solve this problem. (Note that each frame must be transposed before concatenating, otherwise the colname entries stay as rows.)
df_new = pd.concat([df1.set_index('colname').T,
                    df2.set_index('colname').T], ignore_index=True)
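Here set_index('colname').T turns each two-column frame into a single wide row (one column per original colname value), so concatenating the transposed frames yields one row per source dataframe, aligned on the union of the column names.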

Python pandas insert empty rows after each row

Hello, I am trying to insert 3 empty rows after each row of the current data using pandas, then export the data. For example, sample data could be:
name profession
Bill cashier
Sam stock
Adam security
Ideally what I want to achieve:
name profession
Bill cashier
NaN NaN
NaN NaN
NaN NaN
Sam stock
NaN NaN
NaN NaN
NaN NaN
Adam security
NaN NaN
NaN NaN
NaN NaN
I have experimented with itertools, however I am not sure how I can precisely get three empty rows after each row using this method. Any help, guidance, or samples would definitely be appreciated!
Using append on a dataframe is quite inefficient, I believe (it has to reallocate memory for the entire data frame each time).
DataFrames were meant for analyzing data and easily adding columns, but not rows.
So I think a good approach would be to create a new dataframe of the correct size and then transfer the data over to it. The easiest way to do that is using an index.
import numpy as np
import pandas as pd

# Demonstration data
data = 'name profession Bill cashier Sam stock Adam security'
data = np.array(data.split()).reshape((4, 2))
df = pd.DataFrame(data[1:], columns=data[0])
# Add n blank rows
n = 3
new_index = pd.RangeIndex(len(df)*(n+1))
new_df = pd.DataFrame(np.nan, index=new_index, columns=df.columns)
ids = np.arange(len(df))*(n+1)
new_df.loc[ids] = df.values
print(new_df)
Output:
name profession
0 Bill cashier
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 Sam stock
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Adam security
9 NaN NaN
10 NaN NaN
11 NaN NaN
insert_rows = 3  # how many blank rows to insert after each row
step = insert_rows + 1
df.index = range(0, step * len(df), step)
# create new_df with the blank rows added
new_df = df.reindex(index=range(step * len(df)))
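Note that this assumes df starts with a default RangeIndex (as in the demonstration data above): the original rows are spread out to every fourth index position, and reindex fills the gaps with all-NaN rows.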
More information would be helpful, but one thing that comes to mind is this command:
df.append(pd.Series(), ignore_index=True)
This will add an empty row to your data frame, though note that you have to pass ignore_index=True, otherwise the append won't work.
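Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On recent versions, a minimal equivalent sketch (assuming a default RangeIndex) is:
import numpy as np
df.loc[len(df)] = np.nan  # appends one all-NaN row in place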
The code below includes a function to add empty rows between the existing rows of a dataframe.
This might not be the best approach for what you want to do; it might be better to add the blank rows when you export the data.
import pandas as pd
def add_blank_rows(df, no_rows):
    df_new = pd.DataFrame(columns=df.columns)
    for idx in range(len(df)):
        df_new = df_new.append(df.iloc[idx])
        for _ in range(no_rows):
            df_new = df_new.append(pd.Series(), ignore_index=True)
    return df_new
df = pd.read_csv('test.csv')
df_with_blank_rows = add_blank_rows(df, 3)
print(df_with_blank_rows)
This works:
df_new = pd.DataFrame()
for i, row in df.iterrows():
    df_new = df_new.append(row)
    for _ in range(3):
        df_new = df_new.append(pd.Series(), ignore_index=True)
df, of course, is the original DataFrame.
Here is a function to do that with one loop:
def NAN_rows(df):
    row = df.shape[0]
    x = np.empty((3, 2))  # 3 empty rows and 2 columns; adjust to match your original df
    x[:] = np.nan
    df_x = pd.DataFrame(columns=['name', 'profession'])
    for i in range(row):
        temp = np.vstack([df.iloc[i].tolist(), x])
        df_x = pd.concat([df_x, pd.DataFrame(temp, columns=['name', 'profession'])], axis=0)
    return df_x

df = pd.DataFrame({
    'name': ['Bill', 'Sam', 'Adam'],
    'profession': ['cashier', 'stock', 'security']
})
print(NAN_rows(df))
#Output:
name profession
0 Bill cashier
1 nan nan
2 nan nan
3 nan nan
0 Sam stock
1 nan nan
2 nan nan
3 nan nan
0 Adam security
1 nan nan
2 nan nan
3 nan nan

Derive multiple df from single df such that each df has no NaN values

I want to convert this table
movie name rating
0 thg John 3.0
1 thg James 4.0
2 mol NaN 5.0
3 mol NaN NaN
4 lob NaN NaN
into the following tables:
df1
movie name rating
0 thg John 3.0
1 thg James 4.0
df2
movie rating
2 mol 5.0
df3
movie
3 mol
4 lob
where each dataframe has no NaN values. Also, please suggest the method to use if I need to split on blank values instead of NaN.
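For reference, the source frame can be reconstructed as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob'],
                   'name': ['John', 'James', np.nan, np.nan, np.nan],
                   'rating': [3.0, 4.0, 5.0, np.nan, np.nan]})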
I think that the start of a new target DataFrame should occur not only when the number of NaN values changes (compared to the previous row), but also when the number is the same but the NaN values are in different columns.
So I propose the following formula:
dfs = [g.dropna(how='all', axis=1) for _, g in
       df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]
You can print the partial DataFrames (any number of them) by running:
n = 0
for grp in dfs:
    print(f'\ndf No {n}:\n{grp}')
    n += 1
The advantage of my solution over the other becomes obvious when you add another row to the source DataFrame:
5 NaN NaN 3.0
It also contains one non-null value (like the two previous rows).
The other solution will treat all these rows as one partial DataFrame containing:
movie rating
3 mol NaN
4 lob NaN
5 NaN 3.0
As you can see, it still has NaN values, whereas my solution divides these rows into 2 separate DataFrames, without any NaN.
Create a list of dfs, with a groupby and dropna:
dfs = [g.dropna(how='all',axis=1) for _,g in df.groupby(df.isna().sum(1))]
print(dfs[0],'\n\n',dfs[1],'\n\n',dfs[2])
Or dict:
d = {f"df{e+1}": g[1].dropna(how='all',axis=1)
for e,g in enumerate(df.groupby(df.isna().sum(1)))}
print(d['df1'],'\n\n',d['df2'],'\n\n',d['df3']) #read the keys of d
movie name rating
0 thg John 3.0
1 thg James 4.0
movie rating
2 mol 5.0
movie
3 mol
4 lob

Adding columns from one data frame to another using Python

I am trying to add a column from one pandas data-frame to another pandas data-frame.
Here is data frame 1:
print (df.head())
ID Name Gender
1 John Male
2 Denver 0
0 NaN NaN
3 Jeff Male
Note: Both ID and Name are indexes
Here is the data frame 2:
print (df2.head())
ID Appointed
1 R
2
3 SL
Note: ID is the index here.
I am trying to add the Appointed column from df2 to df1 based on the ID. I tried inserting the column and copying it over from df2, but the Appointed column keeps returning all NaN values. So far I have had no luck; any suggestions would be greatly appreciated.
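For reference, the frames can be reconstructed roughly as follows (the lone 0 row in df1 is inferred from the join output below):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 0, 3],
                    'Name': ['John', 'Denver', np.nan, 'Jeff'],
                    'Gender': ['Male', 0, np.nan, 'Male']}).set_index(['ID', 'Name'])
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'Appointed': ['R', np.nan, 'SL']}).set_index('ID')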
If I understand your problem correctly, you should get what you need using this:
df1.reset_index().merge(df2.reset_index(), left_on='ID', right_on='ID')
ID Name Gender Appointed
0 1 John Male R
1 2 Denver 0 NaN
2 3 Jeff Male SL
Or, as an alternative, as pointed out by Wen, you could use join:
df1.join(df2)
Gender Appointed
ID Name
1 John Male R
2 Denver 0 NaN
0 NaN NaN NaN
3 Jeff Male SL
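The join works without any reset here because df2's ID index aligns with the matching ID level of df1's MultiIndex.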
Reset the index on both dataframes, then create a column named 'Appointed' in df1 and assign the same column of df2 to it.
After resetting the index, both dataframes have an index beginning from 0. When we assign the column, the values automatically align by index, which is a property of the pandas dataframe. Note that reset_index returns a new frame, so the result has to be assigned back:
df1 = df1.reset_index()
df2 = df2.reset_index()
df1['Appointed'] = df2['Appointed']

Joining two dataframes in pandas using full outer join

I've two dataframes in pandas as shown below. EmpID is a primary key in both dataframes.
import numpy as np
import pandas as pd

df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000],
                         [4, 'D', 8000], [5, 'E', 6000]],
                        columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'],
                          [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'],
                          [5, 'E', 'Programming', np.nan],
                          [10, 'K', 'Analytics', 'Mumbai']],
                         columns=['EmpID', 'Name', 'Department', 'Location'])
I want to join these two dataframes on EmpID so that:
Missing data in one dataframe can be filled with the value from the other table, if it exists and the key matches.
If there are observations with new keys, they should be appended to the resulting dataframe.
I've used the below code to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns, which I don't want, so I used only the unique columns from both tables for merging.
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I don't get duplicate columns, but I don't get the values either for observations where the key matches.
I'd really appreciate it if someone could help me with this.
Regards,
Kailash Negi
It seems you need combine_first with set_index to match by the indices created from column EmpID:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
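For context: combine_first fills the NaN values in the calling frame with values from the other frame at the matching index/column positions, and keeps the union of both row indexes, which covers both of the requirements above.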
EDIT:
For a specific column order, reindex is needed:
# concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
              .combine_first(df_second.set_index('EmpID'))
              .reset_index()
              .reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN
