Join two dataframes into one and fill the missing gaps - python

Here is the first df:
index name
4 a
8 b
10 c
Here is the second df:
index name
4 d
5 d
6 d
7 d
8 e
9 e
10 f
Is there a way to join them into the df below?
index name1 name2
4 a d
5 a d
6 a d
7 a d
8 b e
9 b e
10 c f
Basically, join the two dfs based on index, then automatically fill the gaps in name1 with the last preceding value.

# merge the two DataFrames on the 'index' column, then forward-fill the null values
df2 = df2.merge(df, on='index', how='left', suffixes=('2', '1')).ffill()
# reorder the columns alphabetically (returns a new DataFrame; df2 itself is unchanged)
df2.reindex(sorted(df2.columns), axis=1)
index name1 name2
0 4 a d
1 5 a d
2 6 a d
3 7 a d
4 8 b e
5 9 b e
6 10 c f
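For reference, here is a self-contained sketch of the same approach. The two frames are rebuilt from the tables in the question, and it assumes that `index` is an ordinary column rather than the DataFrame index; if it is the actual index, a `reset_index()` first would be needed.

```python
import pandas as pd

# Frames rebuilt from the question; 'index' is assumed to be a regular column.
df = pd.DataFrame({"index": [4, 8, 10], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({
    "index": [4, 5, 6, 7, 8, 9, 10],
    "name": ["d", "d", "d", "d", "e", "e", "f"],
})

# Left-merge onto the denser frame, then forward-fill the gaps in name1.
out = df2.merge(df, on="index", how="left", suffixes=("2", "1")).ffill()

# Put the columns in alphabetical order: index, name1, name2.
out = out.reindex(sorted(out.columns), axis=1)
print(out)
```

Using `how="left"` with `df2` on the left keeps every row of the denser frame, which is what leaves the gaps in `name1` for `ffill` to fill.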

Related

I want to groupby and drop groups if the shape is 3 and none of the values in a column is zero

I want to groupby and drop groups if it satisfies two conditions (the shape is 3 and column A doesn't contain zeros).
My df
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
my desired df would be
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep the row if either condition above is met
# this is equivalent to dropping groups that contain no 0 AND have size 3
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
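As a cross-check, the sketch below rebuilds the frame from the question and compares the mask approach with `groupby(...).filter`, which states the drop condition directly. The variable names `out_mask` and `out_filter` are only illustrative.

```python
import pandas as pd

# Frame rebuilt from the question.
df = pd.DataFrame({
    "ID": list("AAABBCCCDDDEEEFF"),
    "value": [3, 2, 0, 1, 1, 3, 3, 4, 0, 5, 5, 6, 7, 7, 3, 2],
})

# Mask approach: keep rows whose group contains a 0 or is not of size 3.
g = df["value"].eq(0).groupby(df["ID"])
out_mask = df[g.transform("any") | g.transform("size").ne(3)]

# filter approach: drop a group only if it has 3 rows and no zero.
out_filter = df.groupby("ID").filter(
    lambda grp: not (len(grp) == 3 and not grp["value"].eq(0).any())
)

print(out_mask.equals(out_filter))
```

The mask version is usually faster on large frames because it avoids calling a Python function per group; the `filter` version may be easier to read.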

Summing every row of dataframe with a series

I am trying to sum every row of a dataframe with a series.
I have a dataframe with [107 rows and 42 columns] and a series of length 42. I would like to sum every row with the series such that every column in the dataframe would have the same number added to it. I tried df.add(series) but the result was a dataframe with 107 rows and 84 columns with all NaN values.
For example
dataframe:
Index a b c
d 1 2 3
e 4 5 6
f 7 8 9
g 0 0 0
series: 1 2 3
result would be
Index a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
You can use DataFrame.add or + with a NumPy array when the Series index differs from the DataFrame's column names:
s = pd.Series([1,2,3])
df = df.add(s.to_numpy())
#alternative
#df = df + s.to_numpy()
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
Or set the Series index to the column names so that the labels align:
s = pd.Series([1,2,3])
s.index = df.columns
df = df.add(s)
#alternative
#df = df + s
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
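The all-NaN result in the question comes from label alignment: `DataFrame.add` matches the Series index against the column names, and when none match, the result has the union of both label sets. A small sketch with the example data (the names `misaligned` and `positional` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7, 0], "b": [2, 5, 8, 0], "c": [3, 6, 9, 0]},
    index=list("defg"),
)
s = pd.Series([1, 2, 3])  # default index 0, 1, 2

# Labels a, b, c and 0, 1, 2 do not overlap, so every cell becomes NaN.
misaligned = df.add(s)
print(misaligned.shape)  # (4, 6)

# A plain NumPy array carries no labels, so it is added position by position.
positional = df.add(s.to_numpy())
print(positional)
```

With 42 columns and a Series indexed 0 to 41, the same effect produces the 84 columns described in the question.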

Drop group if another column has duplicate values - pandas dataframe

I have the following df
id value many other variables
A 5
A 5
A 8
A 9
B 3
B 4
B 5
B 9
C 10
C 11
C 19
D 6
D 6
D 10
E 0
E 0
E 0
...
I want to drop the whole id group if there are duplicate values in the value column (except zeros). So the output should be
id value many other variables
B 3
B 4
B 5
B 9
C 10
C 11
C 19
E 0
E 0
E 0
...
You can use duplicated to flag the duplicates, then transform groupby.any to flag the groups that contain duplicates. To keep the rows with 0s as well, negate that mask and combine it (with |) with a boolean mask that flags 0s:
out = df[~df.duplicated(['id','value']).groupby(df['id']).transform('any') | df['value'].eq(0)]
Output:
id value many_other_variables
4 B 3
5 B 4
6 B 5
7 B 9
8 C 10
9 C 11
10 C 19
14 E 0
15 E 0
16 E 0
Note: groupby.any is an aggregation; transform broadcasts that aggregate back to the length of the original DataFrame. The goal is to create a boolean mask to filter df with, and a boolean mask must have the same length as the original df, so we transform the aggregate here.
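If the intent is that zeros should simply never count as duplicates, one possible variant is to exclude them before the group check. This is a sketch under that assumption, not the answer's exact logic: it differs for groups that mix repeated zeros with other values, where the one-liner above would keep only the zero rows. The frame is rebuilt from the question without the extra columns.

```python
import pandas as pd

# Frame rebuilt from the question (other columns omitted).
df = pd.DataFrame({
    "id": list("AAAABBBBCCCDDDEEE"),
    "value": [5, 5, 8, 9, 3, 4, 5, 9, 10, 11, 19, 6, 6, 10, 0, 0, 0],
})

# Flag repeated (id, value) pairs, but never treat a zero as a duplicate.
nonzero_dup = df.duplicated(["id", "value"]) & df["value"].ne(0)

# Broadcast "this group has a non-zero duplicate" back to every row.
bad_group = nonzero_dup.groupby(df["id"]).transform("any")

out = df[~bad_group]
print(out)
```

On the example data this gives the same rows as the answer above (groups B, C and E).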

find inverse/mirror pair and assign a pair number

I am trying to find inverse pairs and assign a pair number to each pair, but I am stuck moving forward from the below.
df1:
col1 col2 no. of records
A B 2
B A 5
C D 4
D C 6
E F 4
G H 6
I am trying to get this result.
col1 col2 pair 1 no. of records totalcount
A B 1 2 7
B A 1 5 7
C D 2 4 10
D C 2 6 10
E F 3 4 4
G H 4 6 6
I tried this method, but it only returned True/False.
I made a duplicate dataframe df2 and used the isin function, but I have been stuck for a long time on grouping the pairs together.
df1['row_matched'] = np.where((df1.col1+df1.col2).isin(df2.col2+ df2.col1), df2['row'], "")
Any help would be appreciated!
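Use the rank of an order-independent key built from col1 and col2, which you can set up with sorted (joining a set directly does not guarantee the element order, so A-B and B-A could end up as different keys):
In [37]: df['pair'] = (df.apply(lambda x: '-'.join(sorted(x[['col1', 'col2']])), 1)
                         .rank(method='dense').astype(int))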
In [38]: df['totalcount'] = df.groupby('pair')['no.ofrecords'].transform('sum')
In [39]: df
Out[39]:
col1 col2 no.ofrecords pair totalcount
0 A B 2 1 7
1 B A 5 1 7
2 C D 4 2 10
3 D C 6 2 10
4 E F 4 3 4
5 G H 6 4 6
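A self-contained sketch of the same idea, assuming the column is named `no.ofrecords` as in the output above. It swaps `rank` for `pd.factorize`, which numbers the pairs by first appearance rather than alphabetically; on this data both give the same numbering.

```python
import pandas as pd

# Frame rebuilt from the question.
df = pd.DataFrame({
    "col1": list("ABCDEG"),
    "col2": list("BADCFH"),
    "no.ofrecords": [2, 5, 4, 6, 4, 6],
})

# Order-independent key: sorting the two labels makes A-B and B-A identical.
key = df[["col1", "col2"]].apply(lambda row: "-".join(sorted(row)), axis=1)

# factorize assigns 0-based codes in order of first appearance, so add 1.
df["pair"] = pd.factorize(key)[0] + 1
df["totalcount"] = df.groupby("pair")["no.ofrecords"].transform("sum")
print(df)
```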

Merging 2 dataframes on Pandas

Sorry I have a very simple question. So I have two dataframes that look like
Dataframe 1:
columns: a b c d e f g h
Dataframe 2:
columns: e ef
I'm trying to join Dataframe 2 on Dataframe 1 at column e, which should yield
columns: a b c d e ef g h
or
columns: a b c d e f g h ef
However:
df1.merge(df2, how = 'inner', on = 'e') yields a blank dataframe when I print it out.
'outer' merge only extends the dataframe vertically (like using an append function).
Would appreciate some help thank you!
The join columns need to have the same dtype, so convert one of them first:
#convert string column to int
df1['e'] = df1['e'].astype(int)
#inner is the default value, so it can be omitted
df1.merge(df2, on = 'e')
Sample:
df1 = pd.DataFrame({'a':list('abcdef'),
                    'b':[4,5,4,5,5,4],
                    'c':[7,8,9,4,2,3],
                    'd':[1,3,5,7,1,0],
                    'e':['5','3','6','9','2','4'],
                    'f':list('aaabbb'),
                    'g':[1,3,5,7,1,0]})
print (df1)
a b c d e f g
0 a 4 7 1 5 a 1
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 5
3 d 5 4 7 9 b 7
4 e 5 2 1 2 b 1
5 f 4 3 0 4 b 0
df2 = pd.DataFrame({'ef':[10,30,50,70,10,100],
                    'e':[5,3,6,9,0,7]})
print (df2)
e ef
0 5 10
1 3 30
2 6 50
3 9 70
4 0 10
5 7 100
df1['e'] = df1['e'].astype(int)
df = df1.merge(df2, on = 'e')
print (df)
a b c d e f g ef
0 a 4 7 1 5 a 1 10
1 b 5 8 3 3 a 3 30
2 c 4 9 5 6 a 5 50
3 d 5 4 7 9 b 7 70
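A quick way to confirm this diagnosis is to compare the key dtypes before merging. The sketch below uses two small hypothetical frames (not the asker's data) and `how='left'`, which keeps every row of df1 and leaves ef empty where no key matches. Depending on the pandas version, merging on mismatched key dtypes may raise an error rather than return an empty frame, so the conversion is done before the merge here.

```python
import pandas as pd

# Hypothetical small frames: 'e' is stored as strings in df1, integers in df2.
df1 = pd.DataFrame({"a": list("abc"), "e": ["5", "3", "6"]})
df2 = pd.DataFrame({"e": [5, 3, 0], "ef": [10, 30, 10]})

# Inspect the key dtypes; a mismatch here explains a failed or empty merge.
print(df1["e"].dtype, df2["e"].dtype)

# Convert the key in df1 to the dtype used in df2.
df1["e"] = df1["e"].astype(df2["e"].dtype)

# Left join: all rows of df1 are kept, unmatched keys get NaN in 'ef'.
merged = df1.merge(df2, on="e", how="left")
print(merged)
```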
Instead of
df1.merge(...)
try:
pd.merge(left=df1, right=df2, on ='e', how='inner')
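You can do it like this (note that this copies columns position by position and does not match rows on a key, so both frames need the same number of rows):
def mergeDfs(df1, df2):
    newDf = dict()
    dfList = []
    # copy every column of df1
    for i in df1:
        l = len(df1[i])  # number of rows, not the length of the column name
        row = []
        for j in range(l):
            row.append(df1[i].iloc[j])
        newDf[i] = row
        dfList.append(i)
    # copy only the columns of df2 that df1 does not already have
    for i in df2:
        l = len(df2[i])
        row = []
        if i not in dfList:
            for j in range(l):
                row.append(df2[i].iloc[j])
            newDf[i] = row
    df = pd.DataFrame(newDf)
    return df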
