In Python, I have a DataFrame that looks like this:
Name ID
Anna 1
Sarah 2
Max 3
And a second DataFrame that looks like this:
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the DataFrames so that the ID column looks like this:
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example; my actual dataset has thousands of values. I'm basically merging DataFrames and want the IDs in numerical order (continuing from the previous DataFrame) instead of restarting from one each time.
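For reference, a minimal setup that reproduces the example frames above (names and values taken from the question):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Anna', 'Sarah', 'Max'], 'ID': [1, 2, 3]})
df2 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam'], 'ID': [1, 2, 3]})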
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
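Offsetting df2's IDs by df1['ID'].max() preserves df2's internal numbering even if df1's IDs have gaps; the index-based answers below instead renumber every row from 1.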
Concatenate the two DataFrames, reset_index, and use the new index to assign the "ID"s:
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat the DataFrames with ignore_index=True and then set the ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
I have multiple DataFrames with data for each quarter of the year. My goal is to concatenate all of them so I can sum the values and get a view of my entire year.
I managed to concatenate the four DataFrames (which have the same column names and the same row names) into one, but I keep getting NaN in two columns even though I have the data. It goes like this:
df1:
my_data 1st_quarter
0 occurrence_1 2
1 occurrence_3 3
2 occurrence_2 0
df2:
my_data 2nd_quarter
0 occurrence_1 5
1 occurrence_3 10
2 occurrence_2 3
df3:
my_data 3th_quarter
0 occurrence_1 10
1 occurrence_3 2
2 occurrence_2 1
So I run this:
df_results = pd.concat(
    (d.set_index('my_data') for d in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()
What happens is this output:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 NaN 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
If I use join='inner', the first row disappears. Note that the rows have the exact same name in all DataFrames.
How can I solve the NaN problem? Or, after doing pd.concat, how can I reorganize my DataFrame to "fill" the NaNs with the correct numbers?
Update: My original dataset (which I unfortunately cannot post publicly) has an inconsistency in the first row name. Any suggestions on how I can get around it? Can I rename a row, or combine two rows after concatenating the DataFrames?
I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
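Here combine_first fills the NaNs in row 0 with the corresponding non-NaN values from row 3, merging the two partial occurrence_1 rows into one.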
So I got this:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 5 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
Then I dropped the last row:
df_results = df_results.drop([3])
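If the inconsistency is something like stray whitespace or inconsistent casing in the key column (an assumption, since the original data can't be shown), normalizing 'my_data' before concatenating avoids the duplicate row entirely:

# hypothetical fix: assumes the mismatch is whitespace or casing in 'my_data'
for d in (df1, df2, df3):
    d['my_data'] = d['my_data'].str.strip().str.lower()

df_results = pd.concat(
    (d.set_index('my_data') for d in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()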
I have two CSV files:
Name ID
0 Jack 1|2|3
1 Mac 4|5
2 Turtle 6|8
3 Rosh 9||10
Id Address
0 1 Adr1
1 2 Adr2
2 3 Adr3
3 4 Adr4
4 5 Adr5
5 6 Adr6
6 7 Adr7
7 8 Adr8
8 9 Adr9
9 10 Adr10
How do I join the two on the ID values using DataFrames and get the output below:
Name ID
0 Jack Adr1|Adr2|Adr3
1 Mac Adr4|Adr5
2 Turtle Adr6|Adr8
3 Rosh Adr9||Adr10
The solution I am trying is to read both files separately using pandas.read_csv and then iterate over the rows of the first DataFrame:
for i, j in df_first_file.iterrows():
    x = j['ID'].split('|')
    for val in x:
        print(val)
But after that I am struggling to join it with the other DataFrame, since each ID is now a string after iterating through the rows.
Use:
# read both CSV files into DataFrames
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# build a dict mapping Id (as a string) to Address
df2['Id'] = df2['Id'].astype(str)
d = df2.set_index('Id')['Address'].to_dict()

# map each split value (missing keys become empty strings) and rejoin with |
df1['ID'] = df1['ID'].apply(lambda x: '|'.join(d.get(y, '') for y in x.split('|')))
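Note that the empty segment in '9||10' splits to an empty string, and d.get('', '') maps it back to an empty string, which is what preserves the double pipe in 'Adr9||Adr10'.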
You could split and explode the ID column to be able to merge both DataFrames:
df = df1.assign(ID=df1['ID'].str.split('|')).explode('ID').merge(
    df2.astype(str), left_on='ID', right_on='Id', how='left').fillna('')
df = df[['Name', 'Address']].groupby('Name').agg(list).reset_index()
df['Address'] = df['Address'].transform('|'.join)
It gives the expected values, though note that groupby sorts the result by Name:
Name Address
0 Jack Adr1|Adr2|Adr3
1 Mac Adr4|Adr5
2 Rosh Adr9||Adr10
3 Turtle Adr6|Adr8
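If the original row order from df1 matters, pass sort=False so groupby keeps the groups in order of first appearance instead of sorting by Name:

df = df[['Name', 'Address']].groupby('Name', sort=False).agg(list).reset_index()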
For the life of me I cannot figure out how to implement the following:
Suppose I have a dataframe called df1
ID Name Gender
0 Bill M
1 Adam M
2 Kat F
1 Adam M
Then I have another dataframe called df2
ID Name Age
5as Sam 34
1as Adam 64
2as Kat 50
All I want to do is check whether an ID from df1 is in the ID column of df2; if so, grab the corresponding Age value and attach it to df1.
Ideal Solution:
ID Name Gender Age
0 Bill M
1 Adam M 64
2 Kat F 50
1 Adam M 64
I implemented the following solution, which at first seemed to work, but I realized it was failing to match a lot of values near the end of the DataFrame. I'm not sure if that is because of what I wrote or because of the size of my CSV, which is large.
y_list = df2.ID.dropna().unique()
for x in df1.ID.unique():
    if x in y_list:
        df1.loc[df1.ID == x, 'Age'] = df2.Age
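(The likely culprit: df1.loc[df1.ID == x, 'Age'] = df2.Age aligns the right-hand side on the row index, not on ID, so many rows receive no value.)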
Any help is appreciated!
Here's what you can do:
df3 = df1.join(df2.set_index('ID'), on='ID', lsuffix='_left')
if you want to join on the 'ID' column. If instead you want to join on 'Name', set the index on 'Name' as well: df1.join(df2.set_index('Name'), on='Name', lsuffix='_left').
An alternative is to use merge:
df1.merge(df2, on='Name', how='left')
Output
ID Name_x Gender Name_y Age
0 0 Bill M NaN NaN
1 1 Adam M Adam 64.0
2 2 Kat F Kat 50.0
3 1 Adam M Adam 64.0
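merge also accepts a suffixes parameter if you prefer different names than Name_x/Name_y, e.g.:

df1.merge(df2, on='Name', how='left', suffixes=('', '_df2'))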
Here's the output when using df1.set_index('ID').join(df2.set_index('ID'), lsuffix='_left'):
Name_left Gender Name Age
ID
0 Bill M NaN NaN
1 Adam M Adam 64.0
1 Adam M Adam 64.0
2 Kat F Kat 50.0
You can do it as below:
name_age_dict = dict(zip(df2['Name'], df2['Age']))
df1['Age'] = df1['Name'].map(name_age_dict).fillna('')
Another method
df1['Age'] = df1['Name'].map(df2.set_index('Name')['Age']).fillna('')
Output
ID Name Gender Age
0 0 Bill M
1 1 Adam M 64
2 2 Kat F 50
3 1 Adam M 64
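Note that .fillna('') puts empty strings into an otherwise numeric column; drop the fillna if you need Age to stay numeric (missing values will then show as NaN).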
I have a dataset which I grouped by 2 different parameters and got something like this:
idx name time
a andy 2
a andy 5
a andy 4
b andy 3
b andy 7
b andy 9
and so on.
What I need is to generate features so the dataset will look like this:
idx name time1 time2 time3
a andy 2 4 5
Times should be sorted and their order should be used to generate features.
I am struggling to come up with any idea how to implement it.
You need to sort, then generate a column index with groupby + cumcount. Now it's a pivot_table problem, and we can clean up the MultiIndex in the end.
df = df.sort_values(['idx', 'time'])
df['idx2'] = df.groupby('idx').cumcount()+1
df1 = df.pivot_table(index=['idx', 'name'], columns='idx2').rename_axis([None, None], axis=1)
# Flatten the MultiIndex columns into plain names like 'time1'
df1.columns = [''.join(map(str, x)) for x in df1.columns]
df1 = df1.reset_index()
Output (df1):
idx name time1 time2 time3
0 a andy 2 4 5
1 b andy 3 7 9
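On newer pandas (1.1+, where pivot accepts a list of index columns; an assumption about your environment), the same reshape can be written as a single chain. A sketch:

out = (df.sort_values(['idx', 'time'])
         .assign(n=lambda d: d.groupby('idx').cumcount() + 1)  # 1-based position per idx
         .pivot(index=['idx', 'name'], columns='n', values='time')
         .add_prefix('time')        # 1 -> 'time1', 2 -> 'time2', ...
         .rename_axis(None, axis=1)
         .reset_index())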
I have a table in a pandas DataFrame df:
id key_no
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Here, each key_no is associated with multiple ids.
I want to create a new DataFrame which has these columns:
keyno start_id end_id
1 1 2
2 3 5
3 6 7
i.e., create 'start_id' and 'end_id' columns for each key_no in a new DataFrame df2.
Can I use df.groupby for this? How do I create the new df2 with it? I'm new to Python; any leads?
Use groupby + agg with 'first' and 'last', then rename the columns with a dict:
d = {'first':'start_id','last':'end_id'}
df = df.groupby('key_no')['id'].agg(['first','last']).rename(columns=d)
print (df)
start_id end_id
key_no
1 1 2
2 3 5
3 6 7
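If you want key_no back as a regular column (as in your desired output), add reset_index(). And if the rows aren't guaranteed to be sorted by id, min/max is a safer pair than first/last; a sketch using named aggregation:

df2 = (df.groupby('key_no')['id']
         .agg(start_id='min', end_id='max')  # min/max don't depend on row order
         .reset_index())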