Merging two pandas dataframes many-to-one - python

How do I merge the following datasets:
df = A
date abc
1 a
1 b
1 c
2 d
2 dd
3 ee
3 df
df = B
date ZZZ
1 a
2 b
3 c
I want to get something like this:
date abc ZZZ
1 a a
1 b a
1 c a
2 d b
2 dd b
3 ee c
3 df c
I tried this code:
aa = pd.merge(A, B, left_on="date", right_on="date", how="left", validate="m:1")
But I get the following error:
TypeError: merge() got an unexpected keyword argument 'validate'
I updated pandas (using conda update pandas), but I still get the same error.
Please advise me on this issue.

According to the df.merge docs, validate was added in version 0.21.0. You are using an older version, so you need to upgrade the pandas you are using to at least 0.21.0.

As @DeepSpace mentioned, you may need to upgrade your pandas.
To replicate the check in earlier versions, you can do something like this:
import pandas as pd

# m:1 holds: each key of df2 that also appears in df1 occurs only once in df2
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c'])
x = [i for i in df2.index if i in set(df1.index)]
len(x) == len(set(x))  # True

# m:1 is violated: 'a' appears twice in df2, so the right-hand keys are not unique
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c', 'a'])
y = [i for i in df2.index if i in set(df1.index)]
len(y) == len(set(y))  # False
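To get the same guarantee without validate= on older versions, the uniqueness check above can be wrapped in a small reusable helper. This is only a sketch: check_m1 is a made-up name, not a pandas API, and it assumes the merge key is a column rather than the index:

```python
import pandas as pd

def check_m1(left, right, on):
    # validate="m:1" requires the merge key to be unique on the right side
    if not right[on].is_unique:
        raise ValueError("Merge keys are not unique in right dataset; "
                         "not a many-to-one merge")

A = pd.DataFrame({'date': [1, 1, 2], 'abc': ['a', 'b', 'd']})
B = pd.DataFrame({'date': [1, 2, 3], 'ZZZ': ['a', 'b', 'c']})

check_m1(A, B, on='date')  # passes silently: B['date'] is unique
aa = pd.merge(A, B, on='date', how='left')
```

Calling check_m1 right before the merge gives roughly the same early failure that validate="m:1" would raise.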

How can I concatenate a dataframe and a series?

Code:
df_columns = ['A', 'B', 'C', 'D']
df_series = pd.Series([1,2,3,'N/A'],index = df_columns)
df = pd.DataFrame(df_series)
df
When I run the code above I receive the following output:
A 1
B 2
C 3
D 'N/A'
How can I write the code so that my output looks like the following, with df_columns as the column labels?
A B C D
1 2 3 'N/A'
So this would work; note the double brackets when loading in the data, which designate a single row.
import pandas as pd
df_columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame([[1,2,3,'N/A']],columns= df_columns)
print(df)
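A variant worth mentioning (a sketch reusing the question's df_series): build the Series as before, convert it to a one-column frame, and transpose it so df_columns become the column labels:

```python
import pandas as pd

df_columns = ['A', 'B', 'C', 'D']
df_series = pd.Series([1, 2, 3, 'N/A'], index=df_columns)
# to_frame() yields a 4x1 frame; .T flips it into a single row
df = df_series.to_frame().T
print(df)
```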

How to replace df.loc with df.reindex without KeyError

I have a huge dataframe which I get from a .csv file. After defining the columns, I only want to use the ones I need. With Python 3.8.1 it worked great, although it raised the warning "FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative."
If I try to do the same in Python 3.10.x, I now get a KeyError: "['empty'] not in index"
In order to get slice/get rid of columns I don't need I use the .loc function like this:
df = df.loc[:, ['laenge','Timestamp', 'Nick']]
How can I get the same result with .reindex function (or any other) without getting the KeyError?
Thanks
If you need only the columns which exist in the DataFrame, use numpy.intersect1d:
import numpy as np

df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
You get the same output if you use DataFrame.reindex and then drop the columns that are entirely missing:
df = df.reindex(['laenge','Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
Sample:
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
print (df)
laenge col Nick
0 0 1 2
1 5 7 8
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
print (df)
Nick laenge
0 2 0
1 8 5
Use reindex:
df = pd.DataFrame({'A': [0], 'B': [1], 'C': [2]})
# A B C
# 0 0 1 2
df.reindex(['A', 'C', 'D'], axis=1)
output:
A C D
0 0 2 NaN
If you need to get only the common columns, you can use Index.intersection:
cols = ['A', 'C', 'E']
df[df.columns.intersection(cols)]
output:
A C
0 0 2
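A third option, not shown in the answers above, is DataFrame.filter: its items= parameter silently ignores labels that don't exist, which sidesteps the KeyError entirely (a sketch on the same sample frame):

```python
import pandas as pd

df = pd.DataFrame({'laenge': [0, 5], 'col': [1, 7], 'Nick': [2, 8]})
# keeps only the listed columns that are present; 'Timestamp' is ignored
out = df.filter(items=['laenge', 'Timestamp', 'Nick'])
print(out)
```

Unlike intersect1d, filter preserves the order you listed the columns in.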

Join multiple columns into a new one by mapping them through a dictionary

I am trying to figure out how I can create a new column by joining the dictionary values that correspond to each of the other columns' entries... I explain hereafter:
Assume this dictionary and the following dataframe:
my_dict={np.nan:0, 'A':10, 'B':22, 'C':23, 'D':50, 'E':7}
my_df=pd.DataFrame({'col_1':['D', 'A', 'C', 'E'], 'col_2':['B', 'A', np.nan, 'C'], 'col_3':['D', 'A', 'E', 'C']})
Desired output is:
col_1 col_2 col_3 new_col
0 D B D 50-22-50
1 A A A 10-10-10
2 C NaN E 23-0-7
3 E C C 7-23-23
Any nice GENERIC ideas, please? I know I can map every column individually and then join, but I'd prefer something more general for cases with really many columns...
Many thanks!
I believe replace and agg/apply:
my_df['new_col'] = my_df.replace(my_dict).astype(str).agg('-'.join, axis=1)
Output:
col_1 col_2 col_3 new_col
0 D B D 50-22-50
1 A A A 10-10-10
2 C NaN E 23-0-7
3 E C C 7-23-23
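An equivalent generic approach (a sketch) maps every column through the dictionary explicitly with Series.map, then joins row-wise; as with replace, this assumes the dict lookup matches the np.nan key:

```python
import numpy as np
import pandas as pd

my_dict = {np.nan: 0, 'A': 10, 'B': 22, 'C': 23, 'D': 50, 'E': 7}
my_df = pd.DataFrame({'col_1': ['D', 'A', 'C', 'E'],
                      'col_2': ['B', 'A', np.nan, 'C'],
                      'col_3': ['D', 'A', 'E', 'C']})
# map each column through the dict, stringify, then join values row-wise
my_df['new_col'] = (my_df.apply(lambda s: s.map(my_dict))
                         .astype(str)
                         .agg('-'.join, axis=1))
print(my_df)
```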

Merge pandas dataframe with overwrite of columns

What is the quickest way to merge two pandas data frames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id, with values from b overwriting those from a. Are there any ways to do this using pandas operations? How I've implemented it right now is coded below:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
c_dict = a_dict.copy()
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))
Here, c would be equivalent to
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
id letter
0 1 A
1 2 b
2 3 C
3 4 D
combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
One way may be as follows:
concatenate dataframe a onto dataframe b
drop duplicates based on id
sort the remaining rows by id
reset the index and drop the old index
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)
print(c)
Try this
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
id letter
0 1 A
1 2 b
2 3 C
3 4 D
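Note that the sort_values('letter') trick above only keeps the uppercase rows because 'A' sorts before 'a'; a sketch that instead expresses the priority directly through the concat order and keep='last':

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', 'C', 'D']})
# rows from b come last, so keep='last' lets b overwrite a on duplicate ids
c = (pd.concat([a, b])
       .drop_duplicates('id', keep='last')
       .sort_values('id')
       .reset_index(drop=True))
print(c)
```

This works for arbitrary values, not just ones whose sort order happens to favour b.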

How to change a fragment of text in a pandas data frame

I have a problem with replacing text in a df. I tried to use the df.replace() function, but in my case it failed. So here is my example:
df = pd.DataFrame({'col_a':['A', 'B', 'C'], 'col_b':['_world1_', '-world1_', '*world1_']})
df = df.replace(to_replace='world1', value='world2')
Unfortunately this code doesn't change anything; I still have world1 in my df.
Does someone have any suggestions?
Use vectorised str.replace to replace string matches in your text:
In [245]:
df = pd.DataFrame({'col_a':['A', 'B', 'C'], 'col_b':['_world1_', '-world1_', '*world1_']})
df['col_b'] = df['col_b'].str.replace('world1', 'world2')
df
Out[245]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
The value you want to replace does not exist as a full cell value: without regex=True, df.replace only matches entire cells, never substrings. This one works, but only for the cell that is exactly *world1_ (and it replaces the whole cell with 'world2'):
import pandas as pd
df = pd.DataFrame({'col_a':['A', 'B', 'C'], 'col_b':['_world1_', '-world1_', '*world1_']})
print(df)
df = df.replace(to_replace='*world1_', value='world2')
print(df)
Here you go:
df.col_b = df.apply(lambda x: x.col_b.replace('world1','world2'), axis = 1)
In [13]: df
Out[13]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
There could be many more options; however, the replace function you are referring to can also be used with regex:
In [21]: df.replace('(world1)','world2',regex=True)
Out[21]:
col_a col_b
0 A _world2_
1 B -world2_
2 C *world2_
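One caveat with the regex route: if the substring being replaced ever contains regex metacharacters (say, replacing '*world1_' itself), pass regex=False to str.replace so the pattern is treated literally; since pandas 2.0 that is the default for str.replace anyway. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col_a': ['A', 'B', 'C'],
                   'col_b': ['_world1_', '-world1_', '*world1_']})
# regex=False treats the pattern as a literal substring, so metacharacters
# like * would need no escaping
df['col_b'] = df['col_b'].str.replace('world1', 'world2', regex=False)
print(df)
```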
