Merging two dataframes with pandas - python

This is a subset of data frame
F1:
id code s-code
l.1 1 11
l.2 2 12
l.3 3 13
f.1 4 NA
f.2 3 1
h.1 2 1
h.3 1 1
I need to compare F1.id with F2.id, append any ids that are missing from F2 to the F2 data frame, and fill in the remaining columns for those added ids with 0.
This is the second data frame
F2:
id head sweat pain
l.1 1 0 1
l.3 1 0 0
f.2 3 1 1
h.3 1 1 0
The output should be like this:
F3:
id head sweat pain
l.1 1 0 1
l.3 1 0 0
f.2 3 1 1
h.3 1 1 0
l.2 0 0 0
h.1 0 0 0
f.1 0 0 0
I tried different solutions, such as
F1[(F1.index.isin(F2.index)) & (F1.isin(F2))]
to return the differences, but none of them worked.
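For reference, a minimal setup that reproduces F1 and F2 (a sketch; column names are taken from the tables above, with NA read as NaN):

import numpy as np
import pandas as pd

# F1 holds the full set of ids; s-code has one missing value
F1 = pd.DataFrame({'id': ['l.1', 'l.2', 'l.3', 'f.1', 'f.2', 'h.1', 'h.3'],
                   'code': [1, 2, 3, 4, 3, 2, 1],
                   's-code': [11, 12, 13, np.nan, 1, 1, 1]})
# F2 only covers a subset of the ids
F2 = pd.DataFrame({'id': ['l.1', 'l.3', 'f.2', 'h.3'],
                   'head': [1, 1, 3, 1],
                   'sweat': [0, 0, 1, 1],
                   'pain': [1, 0, 1, 0]})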

By using reindex:
F2.set_index('id').reindex(F1.id).fillna(0).reset_index()
Out[371]:
id head sweat pain
0 l.1 1.0 0.0 1.0
1 l.2 0.0 0.0 0.0
2 l.3 1.0 0.0 0.0
3 f.1 0.0 0.0 0.0
4 f.2 3.0 1.0 1.0
5 h.1 0.0 0.0 0.0
6 h.3 1.0 1.0 0.0
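The float columns appear because reindex introduces NaN before fillna. If the original integer dtypes matter, they can be restored afterwards (a sketch, assuming the column names above):

out = F2.set_index('id').reindex(F1.id).fillna(0).reset_index()
out = out.astype({'head': int, 'sweat': int, 'pain': int})  # back to ints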

Use an outer merge + fillna:
F1[['id']].merge(F2, how='outer')\
   .fillna(0).astype(F2.dtypes)
id head sweat pain
0 l.1 1 0 1
1 l.2 0 0 0
2 l.3 1 0 0
3 f.1 0 0 0
4 f.2 3 1 1
5 h.1 0 0 0
6 h.3 1 1 0

Outside the Box
i = np.setdiff1d(F1.id, F2.id)
F2.append(pd.DataFrame(0, range(len(i)), F2.columns).assign(id=i))
id head sweat pain
0 l.1 1 0 1
1 l.3 1 0 0
2 f.2 3 1 1
3 h.3 1 1 0
0 f.1 0 0 0
1 h.1 0 0 0
2 l.2 0 0 0
With a normal index
i = np.setdiff1d(F1.id, F2.id)
F2.append(
pd.DataFrame(0, range(len(i)), F2.columns).assign(id=i),
ignore_index=True
)
id head sweat pain
0 l.1 1 0 1
1 l.3 1 0 0
2 f.2 3 1 1
3 h.3 1 1 0
4 f.1 0 0 0
5 h.1 0 0 0
6 l.2 0 0 0
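Note: DataFrame.append was removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat (a sketch, reusing i from above):

i = np.setdiff1d(F1.id, F2.id)
pd.concat(
    [F2, pd.DataFrame(0, index=range(len(i)), columns=F2.columns).assign(id=i)],
    ignore_index=True
)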

Related

Sum values of columns that start with the same text string

I want to take the row-wise sum of values across columns that start with the same text string. Below is my original df with the number of fails per course.
Original df:
ID P_English_2 P_English_3 P_German_1 P_Math_1 P_Math_3 P_Physics_2 P_Physics_4
56 1 3 1 2 0 0 3
11 0 0 0 1 4 1 0
6 0 0 0 0 0 1 0
43 1 2 1 0 0 1 1
14 0 1 0 0 1 0 0
Desired df:
ID P_English P_German P_Math P_Physics
56 4 1 2 3
11 0 0 5 1
6 0 0 0 1
43 3 1 0 2
14 1 0 1 0
Tried code:
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)

categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)
Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
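A side note: groupby(..., axis=1) is deprecated in recent pandas. One way to get the same result on newer versions is to transpose, group the rows, and transpose back (a sketch):

# same subjects list as above; group the transposed rows instead of columns
subjects = [x[0] for x in df.columns.str.rsplit('_', n=1)]
df.T.groupby(subjects).sum().T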
Or you can use wide_to_long, assuming the ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
                 i=['ID'], j='count', sep='_')
   .groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0

Fill column with nan if sum of multiple columns is 0

Task
I have a df where I compute some ratios, grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
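Equivalently, Series.mask hides values where the condition is true, which reads as the inverse of where (a sketch):

# NaN out c wherever a + b == 0
df['new_c'] = df['c'].mask(df[['a', 'b']].sum(axis=1).eq(0))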
Alternatively, you can build a new dataframe with the same shape and set c row by row:
import numpy as np

new_df = df.copy()
for i, row in df.iterrows():
    if row['a'] + row['b'] == 0:
        new_df.loc[i, 'c'] = np.nan

How to label same pandas dataframe rows?

I have a large pandas dataframe like this:
log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1
How can I label the rows that are the same, for example:
log apple watermelon orange lemon grapes ID
1 1 1 yes 0 0 1
1 2 0 1 0 0 2
1 True 0 0 0 2 3
2 0 0 0 0 2 4
2 1 1 yes 0 0 1
2 0 0 0 0 2 4
2 0 0 0 0 2 4
3 True 0 0 0 2 3
4 0 0 0 0 2.1 5
4 0 0 0 0 2.1 5
I tried:
df['ID'] = df.groupby('log')[df.columns].transform('ID')
and
df['personid'] = df['log'].clip_upper(2) - 2*df.duplicated(subset='apple')
df
However, neither works: I literally have a lot of columns, and it's not giving me the expected output. Any idea of how to group and label this dataframe?
Given
import io
import pandas as pd

x = io.StringIO("""log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1""")
df2 = pd.read_table(x, delim_whitespace=True)
You can first use transform with tuple to make each row hashable and comparable, and then play with indexes and range to create unique ids
f = df2.transform(tuple,1).to_frame()
k = f.groupby(0).sum()
k['id'] = range(1,len(k.index)+1)
And finally
df2['temp_key'] = f[0]
df2 = df2.set_index('temp_key')
df2['id'] = k.id
df2.reset_index().drop('temp_key', axis=1)
log apple watermelon orange lemon grapes id
0 1 1 1 yes 0 0.0 1
1 1 2 0 1 0 0.0 2
2 1 True 0 0 0 2.0 3
3 2 0 0 0 0 2.0 4
4 2 1 1 yes 0 0.0 5
5 2 0 0 0 0 2.0 4
6 2 0 0 0 0 2.0 4
7 3 True 0 0 0 2.0 6
8 4 0 0 0 0 2.1 7
9 4 0 0 0 0 2.1 7
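For a more direct route, groupby over all columns with ngroup assigns one label per unique row (a sketch; sort=False numbers groups in order of first appearance and avoids comparing the mixed-type values in these columns):

df2['id'] = df2.groupby(list(df2.columns), sort=False).ngroup() + 1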

Merging row data using pandas in python

I am trying to write a small Python application that creates a CSV file containing data for a recipe system.
Imagine the following structure of Excel data:
Manufacturer Product Data 1 Data 2 Data 3
Test 1 Product 1 1 2 3
Test 1 Product 2 4 5 6
Test 2 Product 1 1 2 3
Test 3 Product 1 1 2 3
Test 3 Product 1 4 5 6
Test 3 Product 1 7 8 9
When merged, I would like the data to be displayed in the following format:
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Test 1 Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
Any help would be gratefully received; so far I can read the pandas dataset and convert it to a CSV.
Regards
Lee
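For anyone reproducing the answers below, the sample data can be built like this (a sketch with the values from the table above):

import pandas as pd

df = pd.DataFrame({'Manufacturer': ['Test 1', 'Test 1', 'Test 2', 'Test 3', 'Test 3', 'Test 3'],
                   'Product': ['Product 1', 'Product 2', 'Product 1', 'Product 1', 'Product 1', 'Product 1'],
                   'Data 1': [1, 4, 1, 1, 4, 7],
                   'Data 2': [2, 5, 2, 2, 5, 8],
                   'Data 3': [3, 6, 3, 3, 6, 9]})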
Use melt, groupby, pd.Series, and unstack:
(df.melt(['Manufacturer','Product'])
   .groupby(['Manufacturer','Product'])['value']
   .apply(lambda x: pd.Series(x.tolist()))
   .unstack(fill_value=0)
   .reset_index())
Output:
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 4 7 2 5 8 3 6 9
With groupby
df.groupby(['Manufacturer','Product']).agg(tuple).sum(1).apply(pd.Series).fillna(0)
Out[85]:
0 1 2 3 4 5 6 7 8
Manufacturer Product
Test1 Product1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Product2 4.0 5.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0
Test2 Product1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Test3 Product1 1.0 4.0 7.0 2.0 5.0 8.0 3.0 6.0 9.0
cols = ['Manufacturer', 'Product']
d = df.set_index(cols + [df.groupby(cols).cumcount()]).unstack(fill_value=0)
d
Gets me
Data 1 Data 2 Data 3
0 1 2 0 1 2 0 1 2
Manufacturer Product
Test 1 Product 1 1 0 0 2 0 0 3 0 0
Product 2 4 0 0 5 0 0 6 0 0
Test 2 Product 1 1 0 0 2 0 0 3 0 0
Test 3 Product 1 1 4 7 2 5 8 3 6 9
Followed up with
d.sort_index(axis=1, level=1).pipe(
    lambda d: d.set_axis(range(d.shape[1]), axis=1).reset_index()
)
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
Or
cols = ['Manufacturer', 'Product']
pd.Series({
    n: d.values.ravel() for n, d in df.set_index(cols).groupby(cols)
}).apply(pd.Series).fillna(0, downcast='infer').rename_axis(cols).reset_index()
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
With defaultdict and itertools.count
from itertools import count
from collections import defaultdict
c = defaultdict(count)
pd.Series({
    (m, p, next(c[(m, p)])): v
    for _, m, p, *V in df.itertuples()
    for v in V
}).unstack(fill_value=0)
0 1 2 3 4 5 6 7 8
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9

Use fillna-method per specific segments in dataframe

Currently I have the following dataframe, where F1-F4 are segment indicators:
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 NaN 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 NaN 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 NaN 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 NaN 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 NaN 0 0 1 0
20:45 1 3 5 7 NaN 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
What is the best approach to get to the next dataset, where the missing E values are filled per segment like this?
E(06:15) = MEAN( AVG[E(06:00-06:30)], AVG[06:15(A-E)] )   # F1 == 1
E(20:45) = MEAN( AVG[E(20:45-21:00)], AVG[20:45(A-E)] )   # F4 == 1
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 [X0] 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 [X1] 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 [X2] 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 [X3] 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 [X4] 0 0 1 0
20:45 1 3 5 7 [X5] 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
I was trying to use an idea like the one below, but without success so far:
In[89]: df.groupby(['F1', 'F2', 'F3', 'F4'], as_index=False).median()
Out[89]:
F1 F2 F3 F4 A B C D E
0 0 0 0 1 2.0 3.0 2.0 2.0 0.0
1 0 0 1 0 1.5 2.0 3.0 3.5 1.0
2 0 1 0 0 6.0 7.0 6.0 7.0 9.0
3 1 0 0 0 3.0 4.0 3.0 4.0 4.0
and now I am struggling with how to access the E values (e.g. E == 0.0) via the key F4 == 1.
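No accepted solution here, but one possible sketch, assuming the F1-F4 flags are mutually exclusive one-hot indicators and that the first average is taken over the whole segment (NaNs are skipped in both averages):

seg = df[['F1', 'F2', 'F3', 'F4']].idxmax(axis=1)       # active segment label per row
seg_avg = df.groupby(seg)['E'].transform('mean')        # segment average of E
row_avg = df[['A', 'B', 'C', 'D', 'E']].mean(axis=1)    # row average over A-E
df['E'] = df['E'].fillna((seg_avg + row_avg) / 2)       # fill only the missing E values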
