Merging Pandas DataFrames, averaging values where both have values - python

I have two DataFrames.
print(df1)
key value
0 A 2
1 B 3
2 C 2
3 D 3
print(df2)
key value
0 B 3
1 D 1
2 E 1
3 F 3
What I want is to do an outer merge on key and pick whichever value is not NaN.
Which one it chooses if both are ints (or floats) is not that important, though taking the mean would be a nice touch.
print(df3)
key value
0 A 2
1 B 3
2 C 2
3 D 2
4 E 1
5 F 3
I tried:
df3 = df1.merge(df2, on='key', how='outer')
but it generates two new columns (value_x and value_y). I could just do my calculations afterwards, but I am sure there is an easier solution that I just could not find.
Thanks for your help.

This works for me. The duplicates are dropped in order of DataFrame entry, so the dupes from df1 are dropped and the rows from df2 are kept. If any keys don't match, or both happen to be NaN, we can drop them with .dropna():
dfs = pd.concat([df1, df2]).drop_duplicates(subset=['key'], keep='last').dropna(how='any')
key value
0 A 2
2 C 2
0 B 3
1 D 1
2 E 1
3 F 3
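For the mean the question asks about, a groupby after concat gives it directly; a minimal sketch (note the averaged values come back as floats):
import pandas as pd

df1 = pd.DataFrame({'key': list('ABCD'), 'value': [2, 3, 2, 3]})
df2 = pd.DataFrame({'key': list('BDEF'), 'value': [3, 1, 1, 3]})

# Stack both frames, then average per key; keys present in only
# one frame simply keep their single value.
df3 = pd.concat([df1, df2]).groupby('key', as_index=False)['value'].mean()
print(df3)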

try loc a column with list pandas - index not found

I have a fixed array, e.g. sort_by = ['a','b','c','d','e','f']. My dataframe looks like this (I have made Column1 my index):
Column1 | Column2 | ...
d 1
d 2
b 3
a 4
a 5
b 6
c 7
I want to use .loc with the sort_by list to sort them; however, sometimes not all values of sort_by are in Column1, which results in a KeyError (index not found). How do I get it to "try" to the best of its ability?
s.set_index('mitre_attack_tactic', inplace=True)
print(s.loc[sort_by]) --> doesn't work
print(s.loc[['a','b','c','d']]) --> does work, however Column1 could have e, f, g
Let us try pd.Categorical
out = df.iloc[pd.Categorical(df.Column1,['a','b','c','d']).argsort()]
Out[48]:
Column1 Column2
3 a 4
4 a 5
2 b 3
5 b 6
6 c 7
0 d 1
1 d 2
You can use the key parameter of df.sort_values. The idea is to create a value-to-index dictionary from the sort_by list, then map the dictionary onto the column and sort by the resulting index.
key = {v:k for k, v in enumerate(sort_by)}
df = df.sort_values('Column1', key=lambda col: col.map(key))
print(df)
Column1 Column2
3 a 4
4 a 5
2 b 3
5 b 6
6 c 7
0 d 1
1 d 2
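A side note on this approach: labels absent from sort_by map to NaN, and sort_values places NaN last by default (na_position='last'), so unknown values won't raise. A small sketch with a hypothetical extra label 'g' (the key parameter of sort_values requires pandas >= 1.1):
import pandas as pd

sort_by = ['a', 'b', 'c', 'd', 'e', 'f']
key = {v: k for k, v in enumerate(sort_by)}

# 'g' is not in sort_by: it maps to NaN and sorts to the end
# instead of raising a KeyError.
df_extra = pd.DataFrame({'Column1': ['d', 'g', 'a'], 'Column2': [1, 2, 3]})
print(df_extra.sort_values('Column1', key=lambda col: col.map(key)))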
If you create your sort_by as an ordered categorical dtype:
sort_by = pd.api.types.CategoricalDtype(["a","b","c","d","e","f"], ordered=True)
Then change your column to a categorical:
s['Column1'] = s['Column1'].astype(sort_by)
You can then sort it:
s.sort_values('Column1')
Use Index.intersection:
df.loc[pd.Index(sort_by).intersection(df.index)]
Column2
a 4
a 5
b 3
b 6
c 7
d 1
d 2
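A plain-Python pre-filter achieves the same best-effort lookup while keeping the order of sort_by; a sketch, assuming s is indexed by Column1 as in the question:
# Drop labels missing from the index before calling .loc, so no
# KeyError is raised for e, f, or any other absent value.
present = [label for label in sort_by if label in s.index]
print(s.loc[present])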

Pandas Inner Join with axis=1

I was working with an inner join using concat in pandas, using the two DataFrames below:
df1 = pd.DataFrame([['a',1],['b',2]], columns=['letter','number'])
df3 = pd.DataFrame([['c',3,'cat'],['d',4,'dog']],
columns=['letter','number','animal'])
pd.concat([df1,df3], join='inner')
The output is below:
letter number
0 a 1
1 b 2
0 c 3
1 d 4
But after using axis=1, the output is as below:
pd.concat([df1,df3], join='inner', axis=1)
letter number letter number animal
0 a 1 c 3 cat
1 b 2 d 4 dog
Why is it showing the animal column when doing an inner join with axis=1?
In pd.concat(), the axis argument defines whether to concatenate the dataframes along the index or along the columns:
axis=0 // along the index (default value)
axis=1 // along the columns
When you concatenated df1 and df3, pandas used the index to combine the dataframes, and thus the output is
letter number
0 a 1
1 b 2
0 c 3
1 d 4
But when you used axis=1, pandas combined the data based on columns. That's why the output is
letter number letter number animal
0 a 1 c 3 cat
1 b 2 d 4 dog
EDIT:
You asked: "But an inner join only joins the same columns, right? Then why is it showing the 'animal' column?"
Because with axis=1 the join works on indexes, not columns, and right now both dataframes have the same two row indexes, every row matches.
To explain, I have added another row to df3.
Let's suppose df3 is
0 1 2
0 c 3 cat
1 d 4 dog
2 e 5 bird
Now, If you concat the df1 and df3
pd.concat([df1,df3], join='inner', axis=1)
letter number 0 1 2
0 a 1 c 3 cat
1 b 2 d 4 dog
pd.concat([df1,df3], join='outer', axis=1)
letter number 0 1 2
0 a 1.0 c 3 cat
1 b 2.0 d 4 dog
2 NaN NaN e 5 bird
As you can see, with the inner join only indexes 0 and 1 are in the output, but with the outer join all the indexes are in the output, with NaN fill values.
The default value of axis is 0, so in the first concat call the concatenation happens along the rows. When you set axis=1, the operation is similar to:
df1.merge(df3, how="inner", left_index=True, right_index=True)
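A self-contained version of the frames from the EDIT, for quick experimentation (a sketch):
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog'], ['e', 5, 'bird']])

# axis=1 aligns on the row index: 'inner' keeps only the shared
# labels 0 and 1, 'outer' keeps all labels and fills gaps with NaN.
print(pd.concat([df1, df3], join='inner', axis=1))
print(pd.concat([df1, df3], join='outer', axis=1))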

Filter rows with more than 1 value in a set and count their occurrences - pandas python

Let's assume I have the following dataframe:
Id Combinations
1 (A,B)
2 (C,)
3 (A,D)
4 (D,E,F)
5 (F)
I would like to filter the Combinations column down to rows with more than one value in the set, something like below, and I would like to count the number of occurrences of each value across the whole Combinations column. For example, IDs 2 and 5 should be removed since their sets contain only one value.
The result I am looking for is:
ID Combination Frequency
1 A 2
1 B 1
3 A 2
3 D 2
4 D 2
4 E 1
4 F 2
Can anyone help me get the above result in Python pandas?
First, if necessary, convert the values to lists:
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
If you need the count after filtering out the single-value rows with Series.str.len and boolean indexing, use DataFrame.explode and count the values with Series.map and Series.value_counts:
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print (df1)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 1
Or, if you need the count before removing rows, filter with Series.duplicated in the last step:
df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())
df2 = df2[df2['Id'].duplicated(keep=False)]
Alternative:
df2 = df2[df2.groupby('Id').Id.transform('size') > 1]
Or:
df2 = df2[df2['Id'].map(df2['Id'].value_counts()) > 1]
print (df2)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 2
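As a possible alternative to map with value_counts, groupby with transform('size') attaches the same frequency column; a sketch using the frames defined above:
# Size of each Combinations group, broadcast back to the rows;
# equivalent to mapping value_counts onto the column.
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1.groupby('Combinations')['Combinations'].transform('size')
print(df1)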

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is less than 3, until each group reaches 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million lines and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is present in pandas >= 0.23.0; remove it if using a lower version.
EDIT: If the data contains multiple value columns, make all columns part of the index except the one to repeat, then repeat and reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
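Note for newer pandas: DataFrame.append was removed in pandas 2.0. The same repeat-and-stack step can be written with pd.concat; a sketch of the equivalent:
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
repeat_map = 3 - counts[counts < 3]
repeat_num = df['Type'].map(repeat_map).fillna(0).astype(int)

# Repeat each row by its repeat_num (0 for groups already at 3),
# then stack the copies under the original frame.
extra = df.loc[df.index.repeat(repeat_num)]
df1 = pd.concat([df, extra], ignore_index=True)
print(df1)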

merge two dataframes without repeats pandas

I am trying to merge two dataframes, one with the columns customerId, full name, and emails, and the other with the columns customerId, amount, and date. I want the first dataframe to be the main dataframe, with the other dataframe's information included only where the customerIds match up. I tried:
merge = pd.merge(df, df2, on='customerId', how='left')
but the dataframe that is produced contains a lot of repeats and looks wrong:
customerId full name emails amount date
0 002963338 Star shine star.shine#cdw.com $2,910.94 2016-06-14
1 002963338 Star shine star.shine#cdw.com $9,067.70 2016-05-27
2 002963338 Star shine star.shine#cdw.com $6,507.24 2016-04-12
3 002963338 Star shine star.shine#cdw.com $1,457.99 2016-02-24
4 986423367 palm tree tree.palm#snapchat.com,tree#.com $4,604.83 2016-07-16
This can't be right, please help!
The problem is that you have duplicates in the customerId column. The solution is to remove them, e.g. with drop_duplicates:
df2 = df2.drop_duplicates('customerId')
Sample:
df = pd.DataFrame({'customerId':[1,2,1,1,2], 'full name':list('abcde')})
print (df)
customerId full name
0 1 a
1 2 b
2 1 c
3 1 d
4 2 e
df2 = pd.DataFrame({'customerId':[1,2,1,2,1,1], 'full name':list('ABCDEF')})
print (df2)
customerId full name
0 1 A
1 2 B
2 1 C
3 2 D
4 1 E
5 1 F
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 1 a C
2 1 a E
3 1 a F
4 2 b B
5 2 b D
6 1 c A
7 1 c C
8 1 c E
9 1 c F
10 1 d A
11 1 d C
12 1 d E
13 1 d F
14 2 e B
15 2 e D
df2 = df2.drop_duplicates('customerId')
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 2 b B
2 1 c A
3 1 d A
4 2 e B
I do not see repeats as whole rows, but there are repetitions in customerId. You could remove them using:
df.drop_duplicates('customerId', inplace=True)
where df could be the dataframe corresponding to amount, or the one obtained after the merge. In case you want fewer rows (say n) per customerId, you could use:
df.groupby('customerId').head(n)
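If each customer legitimately has several transactions, another option is to reduce df2 to one row per customer before merging, for example keeping only the most recent transaction; a sketch, assuming the date column sorts chronologically as given:
# Sort by date and keep the latest row per customerId, so the left
# merge produces at most one match per customer.
df2_latest = df2.sort_values('date').drop_duplicates('customerId', keep='last')
merged = df.merge(df2_latest, on='customerId', how='left')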
