Merging 2 dataframes in Pandas - python

Sorry, I have a very simple question. I have two dataframes that look like
Dataframe 1:
columns: a b c d e f g h
Dataframe 2:
columns: e ef
I'm trying to join Dataframe 2 onto Dataframe 1 at column e, which should yield
columns: a b c d e ef g h
or
columns: a b c d e f g h ef
However:
df1.merge(df2, how='inner', on='e') yields a blank dataframe when I print it out.
An 'outer' merge only extends the dataframe vertically (like an append).
Would appreciate some help, thank you!

You need the same dtype in the join columns, so convert first:
# convert the string column to int
df1['e'] = df1['e'].astype(int)
# 'inner' is the default value, so it can be omitted
df1.merge(df2, on='e')
Sample:
import pandas as pd

df1 = pd.DataFrame({'a':list('abcdef'),
                    'b':[4,5,4,5,5,4],
                    'c':[7,8,9,4,2,3],
                    'd':[1,3,5,7,1,0],
                    'e':['5','3','6','9','2','4'],
                    'f':list('aaabbb'),
                    'g':[1,3,5,7,1,0]})
print (df1)
a b c d e f g
0 a 4 7 1 5 a 1
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 5
3 d 5 4 7 9 b 7
4 e 5 2 1 2 b 1
5 f 4 3 0 4 b 0
df2 = pd.DataFrame({'ef':[10,30,50,70,10,100],
                    'e':[5,3,6,9,0,7]})
print (df2)
e ef
0 5 10
1 3 30
2 6 50
3 9 70
4 0 10
5 7 100
df1['e'] = df1['e'].astype(int)
df = df1.merge(df2, on='e')
print (df)
a b c d e f g ef
0 a 4 7 1 5 a 1 10
1 b 5 8 3 3 a 3 30
2 c 4 9 5 6 a 5 50
3 d 5 4 7 9 b 7 70
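To spot the problem up front, compare the dtypes of the join columns before the conversion (a quick diagnostic sketch using the sample above):
print (df1['e'].dtype)   # object - the values are strings
print (df2['e'].dtype)   # int64
An inner merge then compares '5' == 5, which never matches, hence the empty result.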

Instead of
df1.merge(...)
try:
pd.merge(left=df1, right=df2, on='e', how='inner')
(pd.merge and DataFrame.merge are the same operation under the hood, so if the dtype mismatch above is the cause, you will still need the astype conversion first.)

You can do it like this:
def mergeDfs(df1, df2):
    newDf = dict()
    dfList = []
    # copy every column of df1
    for i in df1:
        l = len(df1[i])   # number of rows, not len(i), which is the length of the column name
        row = []
        for j in range(l):
            row.append(df1[i][j])
        newDf[i] = row
        dfList.append(i)
    # add the columns of df2 that df1 does not already have
    for i in df2:
        if i not in dfList:
            l = len(df2[i])
            row = []
            for j in range(l):
                row.append(df2[i][j])
            newDf[i] = row
    df = pd.DataFrame(newDf)
    return df
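A usage sketch (note that this helper pairs rows purely by position: unlike merge, it does not align rows on the values in column e, so it only gives the expected result when both frames are already in the same row order):
df = mergeDfs(df1, df2)
print (df)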

Related

Join two dataframes into one and fill the missing gaps

Here is the first df:
index name
4 a
8 b
10 c
Here is the second df:
index name
4 d
5 d
6 d
7 d
8 e
9 e
10 f
Is there a way to join them into the df below?
index name1 name2
4 a d
5 a d
6 a d
7 a d
8 b e
9 b e
10 c f
Basically, join the two dfs based on index, then auto-fill the gaps in name1 based on the first value.
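For reference, a minimal setup reproducing the two frames (column names taken from the tables above):
df = pd.DataFrame({'index': [4, 8, 10], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'index': range(4, 11), 'name': list('ddddeef')})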
# merge the two DFs on the index column, and forward-fill the null values
df2 = df2.merge(df, on='index', how='left', suffixes=('2', '1')).ffill()
# reorder the columns alphabetically: index, name1, name2
df2.reindex(sorted(df2.columns), axis=1)
index name1 name2
0 4 a d
1 5 a d
2 6 a d
3 7 a d
4 8 b e
5 9 b e
6 10 c f

Summing every row of dataframe with a series

I am trying to sum every row of a dataframe with a series.
I have a dataframe with [107 rows and 42 columns] and a series of length 42. I would like to sum every row with the series such that every column in the dataframe would have the same number added to it. I tried df.add(series) but the result was a dataframe with 107 rows and 84 columns with all NaN values.
For example
dataframe:
Index a b c
d 1 2 3
e 4 5 6
f 7 8 9
g 0 0 0
series: 1 2 3
result would be
Index a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
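For reference, a minimal reproduction of the small frame above (index labels d-g as shown):
import pandas as pd
df = pd.DataFrame({'a': [1, 4, 7, 0],
                   'b': [2, 5, 8, 0],
                   'c': [3, 6, 9, 0]}, index=list('defg'))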
You can use DataFrame.add or + with a numpy array if the Series index values differ from the DataFrame column names:
s = pd.Series([1,2,3])
df = df.add(s.to_numpy())
#alternative
#df = df + s.to_numpy()
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
Or, starting again from the original df, assign the DataFrame's column names as the Series index so the labels align:
s = pd.Series([1,2,3])
s.index = df.columns
df = df.add(s)
#alternative
#df = df + s
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
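Why the original attempt produced an all-NaN result: df.add(series) aligns the Series index (here 0..41) with the DataFrame column labels, and when nothing matches, every original column and every Series label becomes its own all-NaN column. A tiny demonstration on a fresh copy of the small frame above:
s_bad = pd.Series([1, 2, 3])   # index 0, 1, 2 does not match columns a, b, c
print (df.add(s_bad))          # columns a b c 0 1 2, every value NaN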

Separate pandas df by repeating range in a column

Problem:
I'm trying to split a pandas data frame by the repeating ranges in column A. My data and desired output are as follows. The ranges in column A are always increasing and do not skip values; the values in column A start and stop arbitrarily, however.
Data:
import pandas as pd
data = {"A": [1,2,3,2,3,4,3,4,5,6],
        "B": ["a","b","c","d","e","f","g","h","i","k"]}
df = pd.DataFrame(data)
df
A B
0 1 a
1 2 b
2 3 c
3 2 d
4 3 e
5 4 f
6 3 g
7 4 h
8 5 i
9 6 k
Desired output:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 2 d
1 3 e
2 4 f
df3
A B
0 3 g
1 4 h
2 5 i
3 6 k
Thanks for any advice!
Answer times:
from timeit import default_timer as timer

start = timer()
for x, y in df.groupby(df.A.diff().ne(1).cumsum()):
    print(y)
end = timer()
aa = end - start

start = timer()
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _, g_ in g:
    print(g_)
end = timer()
bb = end - start

start = timer()
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
end = timer()
cc = end - start

print(aa, bb, cc)
0.0176649530000077 0.018132143000002543 0.018715283999995336
Create the groupby key by using diff and cumsum:
for x, y in df.groupby(df.A.diff().ne(1).cumsum()):
    print(y)
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
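To see why this works: diff marks every position where A does not increase by exactly 1, and cumsum turns those marks into running group ids (shown for the sample data):
print(df.A.diff().ne(1).cumsum().tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]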
Just group by the difference:
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _, g_ in g:
    print(g_)
Outputs
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
One-liner
because that's important
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
Print it
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Assign it
df1, df2, df3 = (d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))
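If the number of groups isn't known ahead of time, unpacking into fixed names breaks, so collecting into a list is safer; reset_index also reproduces the 0-based indices in the desired output (a small variant of the answer above):
dfs = [d.reset_index(drop=True) for _, d in df.groupby(df.A.diff().ne(1).cumsum())]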

Creating python function to create categorical bins in pandas

I'm trying to create a reusable function in python 2.7 (pandas) to form categorical bins, i.e. group less frequent categories as 'other'. Can someone help me turn the code below into a function? col1, col2, etc. are different categorical variable columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
                   'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
    a = df[c].value_counts()
    # get the top n values of the index
    vals = a[:n].index
    # assign the column back
    df[c] = df[c].where(df[c].isin(vals), 'other')
    # rename the processed column
    df = df.rename(columns={c: c + '_new'})
    return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
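Note that replace_under_top modifies the passed frame in place before renaming, which is why df2 above already shows the binned A column. A non-mutating variant of the same logic (a sketch, not from the original answer; replace_under_top_copy is just an illustrative name):
def replace_under_top_copy(df, c, n):
    out = df.copy()   # work on a copy so the caller's frame is untouched
    vals = out[c].value_counts()[:n].index
    out[c] = out[c].where(out[c].isin(vals), 'other')
    return out.rename(columns={c: c + '_new'})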

Selecting values from a series in pandas

I have a dataset D with columns from [A - Z], 26 columns in total. I have done some tests and learned which columns are useful to me, stored in a series S.
D #Dataset with columns from A - Z
S
B 0.78
C 1.04
H 2.38
S holds the column names with a value associated with each, so I now know their importance and would like to keep only those columns in the dataset, e.g. (B, C, D). How can I do it?
IIUC you can use:
cols = ['B','C','D']
df = df[cols]
Or if column names are in Series as values:
S = pd.Series(['B','C','D'])
df = df[S]
Sample:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
S = pd.Series(['B','C','D'])
print (S)
0 B
1 C
2 D
dtype: object
print (df[S])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
Or use the Series' index values:
S = pd.Series([1,2,3], index=['B','C','D'])
print (S)
B 1
C 2
D 3
dtype: int64
print (df[S.index])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
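If you'd rather derive the kept columns from the importance scores instead of hard-coding them, filter S by a cutoff first (a sketch; the 0.5 threshold is an assumed example value):
S = pd.Series({'B': 0.78, 'C': 1.04, 'H': 2.38})
keep = S[S > 0.5].index   # columns whose importance exceeds the cutoff
df = D[keep]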
