I have a data frame which looks like:
D Type Value
0 1 A 2
1 1 B 4
2 2 C 1
3 1 A 1
I want to group by D and Type and sum the values.
data=df.groupby(['D','Type']).sum()
print(data)
Which gives me this result:
D Type Value
1 A 3
B 4
2 C 3
But I want it in this format:
D A B C
1 3 4 NaN
2 NaN NaN 3
UPDATE:
r = df.pivot_table(index=['D'], columns='Type', aggfunc='sum').reset_index()
r.columns = [tup[1] if tup[1] else tup[0] for tup in r.columns]
r.to_csv('c:/temp/out.csv', index=False)
Result:
D,A,B,C
1,3.0,4.0,
2,,,1.0
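Side note: passing values='Value' to pivot_table should give flat columns in the first place, so the manual column flattening would not be needed; a minimal sketch of that variant:
r = df.pivot_table(index='D', columns='Type', values='Value', aggfunc='sum').reset_index()
r.columns.name = None
r.to_csv('c:/temp/out.csv', index=False)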
Original answer:
You can use the pivot_table() method:
In [7]: df.pivot_table(index=['D'], columns='Type', aggfunc='sum', fill_value=0)
Out[7]:
Value
Type A B C
D
1 3 4 0
2 0 0 1
or with NaNs:
In [8]: df.pivot_table(index=['D'], columns='Type', aggfunc='sum')
Out[8]:
Value
Type A B C
D
1 3.0 4.0 NaN
2 NaN NaN 1.0
PS: I think you have a typo in your groupby... section:
In [10]: df.groupby(['D','Type']).sum()
Out[10]:
Value
D Type
1 A 3
B 4
2 C 1
There should be C --> 1 instead of C --> 3.
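For completeness, the groupby result itself can also be reshaped into the requested wide format by unstacking the Type level; a minimal sketch:
df.groupby(['D', 'Type'])['Value'].sum().unstack()
which gives one row per D with columns A, B, C and NaN where a combination is missing.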
I hope you're doing well.
I want to drop duplicates rows based on some conditions.
For example :
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 foo 2 3 5 nan
3 bar 1 2 8 nan
4 bar 1 2 1 nan
The result should be
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 bar 1 2 nan nan
So we have duplicated rows (based on columns A, B, and C). First we check the value in column E: if it's nan we drop the row, but if all values in column E are nan (like rows 3 and 4 for the name 'bar'), we should keep one row and set the value in column D to nan.
Thanks in advance.
This works:
import pandas as pd
import numpy as np
import io
table = """
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 foo 2 3 5 nan
3 bar 1 2 8 nan
4 bar 1 2 1 nan
"""
df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)
# Index for duplicated in A,B,C and all nan in E
index_1 = set(df[df.duplicated(['A','B','C','E'], keep=False)]["E"].isna().index)
# Index for duplicated ABC and nan in E
index_2 = set(df[df[df.duplicated(['A','B','C'], keep=False)]["E"].isna()].index)
# Set nan for D in index_1
df.loc[index_1, 'D'] = np.nan
# Drop nan E with duplicated ABC except index_1
df.drop(index_2-index_1, inplace=True)
# Drop other duplicates
df.drop_duplicates(['A','B','C','D'], inplace=True)
print(df)
This is what was required:
A B C D E
0 foo 2 3 4.0 100.0
1 foo 2 3 1.0 3.0
3 bar 1 2 NaN NaN
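A more direct way to express the rule from the question (within each A/B/C group keep the rows that have a value in E; if the whole group has only NaN in E, keep a single row with D blanked out) is a groupby-apply. This is only a sketch, with a hypothetical helper name dedup_group:
import numpy as np
def dedup_group(group):
    # keep only the rows that carry a value in E
    has_value = group['E'].notna()
    if has_value.any():
        return group[has_value]
    # every E in the group is NaN: keep one row and blank out D
    keep = group.iloc[[0]].copy()
    keep['D'] = np.nan
    return keep
result = df.groupby(['A', 'B', 'C'], group_keys=False, sort=False).apply(dedup_group)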
I want to deal with duplicates in a pandas df:
df=pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],'C':[2,2,1,1,1],'D':['a','c','a','c','c']})
df
I want to keep only rows with unique values of A, B, C and create binary columns D_a and D_c, so the result will be something like this, without doing super slow loops on each row:
result= pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})
Thanks a lot
You can use:
df1 = (df.groupby(['A','B','C'])['D']
.value_counts()
.unstack(fill_value=0)
.add_prefix('D_')
.clip(upper=1)
.reset_index()
.rename_axis(None, axis=1))
print (df1)
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
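Essentially the same table can be produced with pd.crosstab, which may read a bit more directly; a minimal sketch:
df1 = (pd.crosstab([df['A'], df['B'], df['C']], df['D'])
         .clip(upper=1)
         .add_prefix('D_')
         .reset_index()
         .rename_axis(None, axis=1))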
Using get_dummies + sum -
df = df.set_index(['A', 'B', 'C'])\
.D.str.get_dummies()\
.groupby(level=[0, 1, 2]).sum()\
.add_prefix('D_')\
.reset_index()
df
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
You can do something like this:
df.loc[df['D']=='a', 'D_a'] = 1
df.loc[df['D']=='c', 'D_c'] = 1
This will put a 1 in a new column wherever an "a" or "c" appears.
A B C D D_a D_c
0 1 2 2 a 1.0 NaN
1 1 2 2 c NaN 1.0
2 1 1 1 a 1.0 NaN
3 2 2 1 c NaN 1.0
4 1 1 1 c NaN 1.0
but then you have to replace the NaN with a 0.
df = df.fillna(0)
Next you only have to select the columns you need and then drop the duplicates.
df = df[["A","B","C", "D_a", "D_c"]].drop_duplicates()
Hope this is the solution you were looking for.
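One caveat: drop_duplicates keeps two separate rows for a group that contains both 'a' and 'c' (for example A=1, B=2, C=2 in the sample data), so to fully match the desired output the indicator columns still need to be collapsed per group, e.g. with a groupby max after the fillna step; a sketch:
df = df.groupby(['A', 'B', 'C'], as_index=False)[['D_a', 'D_c']].max()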
I just cannot get my head around this. I have a data frame with the following values:
df = pd.DataFrame([
(1,np.nan,"a"),
(1,"a",np.nan),
(1,np.nan,"b"),
(1,"c","b"),
(2,"a",np.nan),
(2,np.nan,"b"),
(3,"a",np.nan)], columns=["A", "B", "C"])
That translates into
A B C
0 1 NaN a
1 1 a NaN
2 1 NaN b
3 1 c b
4 2 a NaN
5 2 NaN b
6 3 a NaN
What I want is that if I have a null value / empty field in "B" it should be replaced with the value from "C". Like this:
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
I can of course filter for the values:
df.loc[df.B.isnull()]
but I cannot manage to assign values from the other column:
df.loc[df.B.isnull()] = df.C
I understand that I am trying to replace three NaN values with all seven entries of column C, so the shapes don't match. So how do I get the corresponding values over?
You can use:
df.loc[df.B.isnull(), 'B'] = df.C
Output:
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
Or, as suggested in the comments, you can also use:
df.B.where(pd.notnull, df.C, inplace=True)
You can use combine_first; it also seems to be much faster:
df.B = df.B.combine_first(df.C)
1000 loops, best of 3: 764 µs per loop
df.loc[df.B.isnull(), 'B'] = df.C
100 loops, best of 3: 1.54 ms per loop
You get
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
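fillna also accepts a Series and fills by index alignment, so the same result can be written as:
df.B = df.B.fillna(df.C)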
Consider the following dataset:
df=pd.DataFrame({'A':pd.date_range('2012-02-02','2012-02-07'),
'ID':['A','B','A','D','A',np.nan]})
df
Out[122]:
A ID
0 2012-02-02 A
1 2012-02-03 B
2 2012-02-04 A
3 2012-02-05 D
4 2012-02-06 A
5 2012-02-07 NaN
I would like to get the number of unique values of ID, up to time t. That means the output should look like
Out[122]:
A uniqueID
0 2012-02-02 1
1 2012-02-03 2
2 2012-02-04 2
3 2012-02-05 3
4 2012-02-06 3
5 2012-02-07 3
Indeed, on Feb 3rd, we know there are two unique values of ID ('A' and 'B'). On Feb 4th we see 'A', but we know that already so we don't increment our count of unique ID values.
I don't see a simple way to do this with groupby.agg('nunique'). Any idea is welcome.
Thanks!
EDIT:
trying to understand EdChum's solution...
df.apply(lambda x: df['ID'].iloc[:x.name+1],axis=1)
Out[134]:
0 1 2 3 4 5
0 A NaN NaN NaN NaN NaN
1 A B NaN NaN NaN NaN
2 A B A NaN NaN NaN
3 A B A D NaN NaN
4 A B A D A NaN
5 A B A D A NaN
Apply a lambda that slices the df using loc up to the row's index value (obtained via .name) and calculates the nunique count of the ID column:
In [5]:
df['Unique_ID'] = df.apply(lambda x: df['ID'].loc[:x.name].nunique(),axis=1)
df
Out[5]:
A ID Unique_ID
0 2012-02-02 A 1
1 2012-02-03 B 2
2 2012-02-04 A 2
3 2012-02-05 D 3
4 2012-02-06 A 3
5 2012-02-07 NaN 3
EDIT
Here's a breakdown; if we modify the df so the index is not an auto-generated integer one:
In [19]:
df=pd.DataFrame({'A':pd.date_range('2012-02-02','2012-02-07'),
'ID':['A','B','A','D','A',np.nan]}, index=list('abcdef'))
df
Out[19]:
A ID
a 2012-02-02 A
b 2012-02-03 B
c 2012-02-04 A
d 2012-02-05 D
e 2012-02-06 A
f 2012-02-07 NaN
So we see that .name in this case is in fact the row's index value:
In [20]:
df.apply(lambda x: print(x.name),axis=1).tolist()
a
b
c
d
e
f
So we can use this to slice the df using loc with a range up to and including this index value:
In [22]:
df.apply(lambda x: print(df['ID'].loc[:x.name]),axis=1)
a A
Name: ID, dtype: object
a A
b B
Name: ID, dtype: object
a A
b B
c A
Name: ID, dtype: object
a A
b B
c A
d D
Name: ID, dtype: object
a A
b B
c A
d D
e A
Name: ID, dtype: object
a A
b B
c A
d D
e A
f NaN
Name: ID, dtype: object
So you can see from the above that we are growing the slice on each row; we can then call nunique on it to return the number of unique values seen in this range:
In [24]:
df.apply(lambda x: print(df['ID'].loc[:x.name].nunique()),axis=1)
1
2
2
3
3
3
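If the per-row apply is too slow on a large frame, the same running count can be built with duplicated and cumsum, since a value only adds to the count the first time it appears (and NaN is excluded, matching nunique); a minimal sketch:
df['Unique_ID'] = (df['ID'].notna() & ~df['ID'].duplicated()).cumsum()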
Given a 3-column DataFrame, df:
a b c
0 NaN a True
1 1 b True
2 2 c False
3 3 NaN False
4 4 e True
[5 rows x 3 columns]
I would like to place a NaN in column c for each row where a NaN exists in any other column. My current approach is as follows:
for col in df:
df['c'][pd.np.isnan(df[col])] = pd.np.nan
I strongly suspect that there is a way to do this via logical indexing instead of iterating through columns as I am currently doing.
How could this be done?
Thank you!
If you don't care about the bool/float issue, I propose:
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b 1
2 2 c 0
3 3 NaN NaN
4 4 e 1
[5 rows x 3 columns]
If you really do, then starting again from your frame df you could:
>>> df["c"] = df["c"].astype(object)
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b True
2 2 c False
3 3 NaN NaN
4 4 e True
[5 rows x 3 columns]
df.c[df.loc[:, :'c'].apply(lambda r: any(r.isnull()), axis=1)] = np.nan
Note that you may need to change the type of column c to float first, or you'll get an error about being unable to assign nan to an integer column.
Filter and select the rows where you have NaN in either 'a' or 'b' and assign NaN to 'c':
In [18]:
df.loc[pd.isnull(df.a) | pd.isnull(df.b), 'c'] = np.nan
In [19]:
df
Out[19]:
a b c
0 NaN a NaN
1 1 b 1
2 2 c 0
3 3 NaN NaN
4 4 e 1
[5 rows x 3 columns]
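An equivalent one-liner uses mask, which sets NaN wherever the condition holds; a minimal sketch:
df['c'] = df['c'].mask(df[['a', 'b']].isnull().any(axis=1))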