Task
I have a df where I compute some ratios grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(axis=1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
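Equivalently, if you want to overwrite c in place rather than create a new column, you can assign through a boolean mask with .loc. A minimal sketch on an assumed toy frame with the same columns:

```python
import numpy as np
import pandas as pd

# Toy frame with the same a/b/c columns (assumed for illustration)
df = pd.DataFrame({'a': [3, 0, 1], 'b': [1, 0, 1], 'c': [1, 2, 2]})

# Overwrite c with NaN wherever a + b == 0 (upcasts c to float)
df.loc[df['a'] + df['b'] == 0, 'c'] = np.nan
```

Note that assigning NaN forces c to float, just like `new_c` in the `where` version.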
Alternatively, you can build the new values row by row, although this is much slower than the vectorized where above:
import numpy as np

new_c = []
for _, row in df.iterrows():
    new_c.append(np.nan if row['a'] + row['b'] == 0 else row['c'])
df['new_c'] = new_c
Related
I have the following dataframe:
p l w s_w v
1 1 1 1 2
1 1 2 1 2
1 1 3 0 5
1 1 4 1 5
1 1 5 1 5
2 1 1 1 1
2 1 2 0 2
2 1 3 0 3
2 1 4 0 4
2 1 5 1 5
2 1 6 1 4
I want to add a new column where, for each row in which s_w is 1, the value is the sum of v over the two previous rows where s_w == 1 plus the sum of v over the two following rows where s_w == 1. The qualifying rows are not necessarily consecutive; I am not interested in any number of zeros in between.
So the resulting dataframe looks like this:
p l w s_w v c_s
1 1 1 1 2 Null
1 1 2 1 2 Null
1 1 3 0 5 Null
1 1 4 1 5 10
1 1 5 1 5 13
2 1 1 1 1 19
2 1 2 0 2 Null
2 1 3 0 3 Null
2 1 4 0 4 Null
2 1 5 1 5 Null
2 1 6 1 4 Null
The last two rows are Null because there are no two rows with s_w == 1 after them. In other words, the sum is computed only if there are two such rows both before and after; otherwise the value is Null.
A new edit to the original question:
For each group of p, l, only where the value in the check column is 1, find the pattern described above in the s_w column: sum(v) over the two previous rows where s_w == 1 (not necessarily consecutive) and sum(v) over the two following rows where s_w == 1 (not necessarily consecutive).
p l w s_w check v
1 1 1 1 0 2
1 1 2 1 0 2
1 1 3 0 0 5
1 1 4 1 0 5
1 1 5 1 1 5
2 1 1 1 0 1
2 1 2 0 0 2
2 1 3 0 0 3
2 1 4 0 0 4
2 1 5 1 0 5
2 1 6 1 0 4
The idea is to filter the rows where s_w is 1 and use a rolling sum with shifted values for correct alignment:
s = df.loc[df['s_w'].eq(1), 'v']
df['c_s'] = s.rolling(2).sum().shift().add(s.iloc[::-1].rolling(2).sum().shift())
print (df)
p l w s_w v c_s
0 1 1 1 1 2 NaN
1 1 1 2 1 2 NaN
2 1 1 3 0 5 NaN
3 1 1 4 1 5 10.0
4 1 1 5 1 5 13.0
5 2 1 1 1 1 19.0
6 2 1 2 0 2 NaN
7 2 1 3 0 3 NaN
8 2 1 4 0 4 NaN
9 2 1 5 1 5 NaN
10 2 1 6 1 4 NaN
Another idea:
df['c_s'] = s.shift(-1).add(s.shift(-2)).add(s.shift(2)).add(s.shift(1))
EDIT:
Solution per group:
s = df[df['s_w'].eq(1)]
f = lambda x: x.rolling(2).sum().shift()
df['c_s'] = s.groupby(['p','l'])['v'].apply(f).add(s.iloc[::-1].groupby(['p','l'])['v'].apply(f))
g = df[df['s_w'].eq(1)].groupby(['p','l'])['v']
df['c_s'] = g.shift(-1).add(g.shift(-2)).add(g.shift(2)).add(g.shift(1))
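For the edited question with the check column, one reading (an assumption on my part, not spelled out in the answer above) is to compute the per-group sums as before and then keep them only on rows where check is 1. A sketch using the shift-based variant:

```python
import numpy as np
import pandas as pd

# Sample frame from the edited question
df = pd.DataFrame({
    'p':     [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'l':     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'w':     [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6],
    's_w':   [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    'check': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    'v':     [2, 2, 5, 5, 5, 1, 2, 3, 4, 5, 4],
})

# Per-group sums of the two previous / two following s_w == 1 rows
g = df[df['s_w'].eq(1)].groupby(['p', 'l'])['v']
df['c_s'] = g.shift(1).add(g.shift(2)).add(g.shift(-1)).add(g.shift(-2))

# Keep the sum only on rows flagged by check (assumed interpretation)
df.loc[df['check'].ne(1), 'c_s'] = np.nan
```

On this sample every c_s comes out NaN, because within each p, l group no checked row has two s_w == 1 rows both before and after it.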
There is a dataframe:
0 1 2 3
0 a c e NaN
1 b d NaN NaN
2 b c NaN NaN
3 a b c d
4 a b NaN NaN
5 b c NaN NaN
6 a b NaN NaN
7 a b c e
8 a b c NaN
9 a c e NaN
I would like to one-hot encode it like this:
a c e b d
0 1 1 1 0 0
1 0 0 0 1 1
2 0 1 0 1 0
3 1 1 0 1 1
4 1 0 0 1 0
5 0 1 0 1 0
6 1 0 0 1 0
7 1 1 1 1 0
8 1 1 0 1 0
9 1 1 1 0 0
pd.get_dummies does not work here, because it actually encodes each column independently. How can I get this? Btw, the order of the columns doesn't matter.
Try this:
df.stack().str.get_dummies().max(level=0)
Out[129]:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
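Note that `max(level=0)` has since been removed from pandas; on current versions the equivalent is to group by the first index level instead. A sketch on a small assumed sample (the first three rows of the question's frame):

```python
import numpy as np
import pandas as pd

# First three rows of the question's frame (assumed for illustration)
df = pd.DataFrame([['a', 'c', 'e', np.nan],
                   ['b', 'd', np.nan, np.nan],
                   ['b', 'c', np.nan, np.nan]])

# stack() flattens to one value per row; groupby(level=0).max()
# replaces the removed max(level=0)
one_hot = df.stack().str.get_dummies().groupby(level=0).max()
```

The result has one column per distinct value and one row per original row, exactly as before.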
One way using str.join and str.get_dummies:
one_hot = df.apply(lambda x: "|".join([i for i in x if pd.notna(i)]), axis=1).str.get_dummies()
print(one_hot)
Output:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
Probably this question already has an answer, but I could not find one.
I want to get items from a second dataframe appended to a new column in the first dataframe wherever there is a match between the two dataframes.
Here is some sample data quite similar to the case I am confronting.
import pandas as pd
import numpy as np
a = np.arange(3).repeat(3)
b = np.tile(np.arange(3),3)
df1 = pd.DataFrame({'a':a, 'b':b})
a b
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
a2 = np.arange(1, 4).repeat(3)
b2 = np.tile(np.arange(3),3)
c = np.random.randint(0, 10, size=a2.size)
df2 = pd.DataFrame({'a2':a2, 'b2':b2, 'c':c})
a2 b2 c
0 1 0 3
1 1 1 1
2 1 2 9
3 2 0 5
4 2 1 8
5 2 2 4
6 3 0 1
7 3 1 6
8 3 2 1
The desired output should be like
a b c
0 0 0 nan
1 0 1 nan
2 0 2 nan
3 1 0 3
4 1 1 1
5 1 2 9
6 2 0 5
7 2 1 8
8 2 2 4
Unfortunately, I could not come up with any way to solve it.
Use merge with a left join and rename the column names:
df = df1.merge(df2.rename(columns={'a2':'a', 'b2':'b'}), on=['a','b'], how='left')
print (df)
a b c
0 0 0 NaN
1 0 1 NaN
2 0 2 NaN
3 1 0 3.0
4 1 1 1.0
5 1 2 9.0
6 2 0 5.0
7 2 1 8.0
8 2 2 4.0
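If you also want to see which rows actually found a match, merge accepts an indicator flag. A minimal sketch on assumed toy frames:

```python
import pandas as pd

# Toy frames (assumed): df2 matches only the last row of df1
df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [0, 1, 0]})
df2 = pd.DataFrame({'a2': [1], 'b2': [0], 'c': [3]})

# indicator=True adds a _merge column marking where each row came from
out = df1.merge(df2.rename(columns={'a2': 'a', 'b2': 'b'}),
                on=['a', 'b'], how='left', indicator=True)
```

Rows without a partner in df2 show up as 'left_only' with NaN in c, matched rows as 'both'.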
I am trying to write a small Python application that creates a CSV file containing data for a recipe system.
Imagine the following structure of Excel data:
Manufacturer Product Data 1 Data 2 Data 3
Test 1 Product 1 1 2 3
Test 1 Product 2 4 5 6
Test 2 Product 1 1 2 3
Test 3 Product 1 1 2 3
Test 3 Product 1 4 5 6
Test 3 Product 1 7 8 9
When merged, I would like the data to be displayed in the following format:
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Test 1 Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
Any help would be gratefully received. So far I can read the pandas dataset and convert it to a CSV.
Regards
Lee
Use melt, groupby, pd.Series, and unstack. Note that melt stacks the Data columns one after another, so for Test 3 the values come out in column order (1 4 7 2 5 8 3 6 9) rather than row order:
(df.melt(['Manufacturer','Product'])
.groupby(['Manufacturer','Product'])['value']
.apply(lambda x: pd.Series(x.tolist()))
.unstack(fill_value=0)
.reset_index())
Output:
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 4 7 2 5 8 3 6 9
With groupby
df.groupby(['Manufacturer','Product']).agg(tuple).sum(axis=1).apply(pd.Series).fillna(0)
Out[85]:
0 1 2 3 4 5 6 7 8
Manufacturer Product
Test 1 Product 1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Product 2 4.0 5.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0
Test 2 Product 1 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
Test 3 Product 1 1.0 4.0 7.0 2.0 5.0 8.0 3.0 6.0 9.0
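The reason agg(tuple) followed by a row-wise sum works: summing with + concatenates tuples, so the three Data columns are joined end to end. A minimal illustration:

```python
import pandas as pd

# One row of the aggregated frame: one tuple per Data column
row = pd.Series([(1, 4, 7), (2, 5, 8), (3, 6, 9)])

# sum() reduces with +, and + on tuples concatenates them
combined = row.sum()
```

This also explains the column-major order of Test 3's values in the output: each tuple already holds one Data column's values.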
cols = ['Manufacturer', 'Product']
d = df.set_index(cols + [df.groupby(cols).cumcount()]).unstack(fill_value=0)
d
Gets me
Data 1 Data 2 Data 3
0 1 2 0 1 2 0 1 2
Manufacturer Product
Test 1 Product 1 1 0 0 2 0 0 3 0 0
Product 2 4 0 0 5 0 0 6 0 0
Test 2 Product 1 1 0 0 2 0 0 3 0 0
Test 3 Product 1 1 4 7 2 5 8 3 6 9
Followed up with
d.sort_index(axis=1, level=1).pipe(lambda d: d.set_axis(range(d.shape[1]), axis=1).reset_index())
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
Or
cols = ['Manufacturer', 'Product']
pd.Series({
n: d.values.ravel() for n, d in df.set_index(cols).groupby(cols)
}).apply(pd.Series).fillna(0, downcast='infer').rename_axis(cols).reset_index()
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
With defaultdict and itertools.count
from itertools import count
from collections import defaultdict
c = defaultdict(count)
pd.Series({(
m, p, next(c[(m, p)])): v
for _, m, p, *V in df.itertuples()
for v in V
}).unstack(fill_value=0)
0 1 2 3 4 5 6 7 8
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
I have the following short dataframe:
A B C
1 1 3
2 1 3
3 2 3
4 2 3
5 0 0
I want the output to look like this:
A B C
1 1 3
2 1 3
3 0 0
4 0 0
5 0 0
1 1 3
2 1 3
3 2 3
4 2 3
5 0 0
Use pd.MultiIndex.from_product with the unique values of A and B, then reindex:
cols = list('AB')
mux = pd.MultiIndex.from_product([df.A.unique(), df.B.unique()], names=cols)
df.set_index(cols).reindex(mux, fill_value=0).reset_index()
A B C
0 1 1 3
1 1 2 0
2 1 0 0
3 2 1 3
4 2 2 0
5 2 0 0
6 3 1 0
7 3 2 3
8 3 0 0
9 4 1 0
10 4 2 3
11 4 0 0
12 5 1 0
13 5 2 0
14 5 0 0