I have a dataframe:
A B C V
1 4 7 T
2 6 8 T
3 9 9 F
and I want to create a new column summing each row's numeric values where V is 'T'.
So I want:
A B C V D
1 4 7 T 12
2 6 8 T 16
3 9 9 F
Is there any way to do this without iteration?
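A reproducible construction of this frame (assuming A, B, C are integers and V holds the strings 'T' and 'F'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 6, 9],
                   'C': [7, 8, 9], 'V': ['T', 'T', 'F']})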
Mask the values before summing (mask replaces the positions where the condition is True with NaN, or with a value you supply):
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
# Or,
df.select_dtypes(np.number).mask(df['V'] != 'T').sum(axis=1, skipna=False)
0 12.0
1 16.0
2 NaN
dtype: float64
df['D'] = df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
If you actually wanted blanks, use
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T', '')
0 24
1 32
2
dtype: object
This returns an object column mixing numbers and empty strings (not recommended).
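Note the sums above are 24 and 32 rather than 12 and 16: once D has been assigned, select_dtypes(np.number) includes it in the sum. To keep the sum stable regardless of columns added later, a small variant of the same idea with an explicit column list:
cols = ['A', 'B', 'C']   # fix the columns to sum up front
df[cols].sum(axis=1).mask(df['V'] != 'T', '')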
Alternatively, using np.where:
np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
# array([12., 16., nan])
df['D'] = np.where(
df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
Use NumPy where:
import numpy as np
df['D'] = np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), None)
Note that None makes D an object column; np.nan keeps it float.
Sum only the rows where V is 'T'; the assignment aligns on the index, so the filtered-out rows become NaN:
df['D'] = df[['A', 'B', 'C']][df['V'] == 'T'].sum(axis=1)
In [51]: df
Out[51]:
   A  B  C  V     D
0  1  4  7  T  12.0
1  2  6  8  T  16.0
2  3  9  9  F   NaN
I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
Defining the function raises no error, but when I run df.apply(check) there are a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach you should take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
    Require that many non-NA values.
So for a vectorized solution, subtract the allowed number of NA values from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the whole DataFrame (df.apply(check) passes each column separately), and changing df.column to df[column], because dot notation fails with dynamic column names from a variable (it tries to select a column literally named 'column'):
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    # iterate over a snapshot of the labels so dropping in place is safe
    for column in list(df):
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
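For reference, a minimal demonstration of why the dot notation fails (my own illustration, not from the thread): df.column is attribute access, so pandas looks up an attribute literally named 'column' instead of the loop variable's value:
import pandas as pd

demo = pd.DataFrame({'A': [1, 2]})
for column in demo:
    try:
        demo.column            # attribute lookup of the literal name 'column'
    except AttributeError as e:
        print(e)               # 'DataFrame' object has no attribute 'column'
    print(demo[column])        # bracket notation uses the variable's value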
Alternatively, you can use count, which counts non-null values; keep a column when it has at least len(df) - 2 of them:
In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I have a sample dataframe like this:
df1 =
A B C
a 1 2
b 3 4
b 5 6
c 7 8
d 9 10
I would like to replace the part of this dataframe where column A is b or c with this dataframe:
df2 =
A B C
b 9 10
b 11 12
c 13 14
I would like to get the result below:
df3 =
A B C
a 1 2
b 9 10
b 11 12
c 13 14
d 9 10
I tried
df1[df1.A.isin("bc")]...
but I couldn't figure out how to do the replacement. Can someone show how to replace part of a dataframe?
Try update; it aligns on the index and overwrites the matching cells of df1 in place:
import pandas as pd

df1 = pd.DataFrame({"A": ['a','b','b','c'], "B": [1,2,4,6], "C": [3,2,1,0]})
df2 = pd.DataFrame({"A": ['b','b','c'], "B": [100,400,300], "C": [39,29,100]})
# set_index must come in a second step so df2 exists when referenced
df2 = df2.set_index(df1.loc[df1.A.isin(df2.A), :].index)
df1.update(df2)
df1
Out[75]:
A B C
0 a 1.0 3.0
1 b 100.0 39.0
2 b 400.0 29.0
3 c 300.0 100.0
You need combine_first or update keyed by column A, but because of the duplicates you first need a helper key from cumcount (both operations also cast to float, hence the astype(int)):
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df3 = df2.combine_first(df1).reset_index(level=1, drop=True).astype(int).reset_index()
print (df3)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
Another solution:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df1.update(df2)
df1 = df1.reset_index(level=1, drop=True).astype(int).reset_index()
print (df1)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
If the duplicates of column A in df1 are the same as in df2 and have the same length:
df2.index = df1.index[df1.A.isin(df2.A)]
df3 = df2.combine_first(df1)
print (df3)
A B C
0 a 1.0 2.0
1 b 9.0 10.0
2 b 11.0 12.0
3 c 13.0 14.0
4 d 9.0 10.0
You could solve your problem with the following; the .loc assignment works because df2 is first reindexed to the matching rows of df1 (a .loc assignment aligns on the index):
import pandas as pd

df1 = pd.DataFrame({'A': ['a','b','b','c','d'], 'B': [1,3,5,7,9], 'C': [2,4,6,8,10]})
df2 = pd.DataFrame({'A': ['b','b','c'], 'B': [9,11,13], 'C': [10,12,14]})
# as above, set the index in a second step so df2 exists when referenced
df2 = df2.set_index(df1.loc[df1.A.isin(df2.A), :].index)
df1.loc[df1.A.isin(df2.A), ['B', 'C']] = df2[['B', 'C']]
df1
Out[108]:
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
I have a dataframe on which I would like to perform a division based on the entries. To illustrate the problem, say I have the following dataframe:
import pandas as pd
df = pd.DataFrame([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]],
                  columns=['A', 'B', 'C', 'D'], index=['x', 'y', 'z'])
So I get the following as df:
A B C D
x 1 2 3 4
y 5 6 7 8
z 9 10 11 12
What I would like to do is see how much each value in column D changed as I went from x to y, and then from y to z.
The dataframe I'd get would be:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
How do I do this in a systematic way?
You can use div with the column shifted:
In [21]:
df['D'] = df['D'].div(df['D'].shift())
df
Out[21]:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
Or more succinctly:
In [23]:
df['D'] /= df['D'].shift()
df
Out[23]:
A B C D
x 1 2 3 NaN
y 5 6 7 2.0
z 9 10 11 1.5
You can use pct_change(), which computes (current - previous) / previous, so adding 1 gives the ratio current / previous:
In [57]: df.D.pct_change() + 1
Out[57]:
x NaN
y 2.0
z 1.5
Name: D, dtype: float64
or as DF (on the fly):
In [58]: df.assign(D=df.D.pct_change() + 1)
Out[58]:
A B C D
x 1.0 2.0 3.0 NaN
y 5.0 6.0 7.0 2.0
z 9.0 10.0 11.0 1.5
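If the same ratio were wanted for every column at once (my own extension, not part of the question), the whole frame can be divided by its shifted self:
df.div(df.shift())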
I'm trying to do something to a pandas dataframe, as follows: if, say, row 2 has a NaN value in the 'start' column, then I replace all of that row's entries with 999999:
if pd.isnull(dfSleep.ix[2, 'start']):
    dfSleep.ix[2, :] = 999999
The above code works, but I want to do it for every row. I've tried replacing the '2' with a ':', but that does not work:
if pd.isnull(dfSleep.ix[:, 'start']):
    dfSleep.ix[:, :] = 999999
and I've tried something like this:
for row in df.iterrows():
    if pd.isnull(dfSleep.ix[row, 'start']):
        dfSleep.ix[row, :] = 999999
but again no luck. Any ideas?
I think row in your approach is not a row index; iterrows() yields (index, row) tuples, so row[0] is the index.
You can use this instead:
for row in df.iterrows():
    if pd.isnull(dfSleep.ix[row[0], 'start']):
        dfSleep.ix[row[0], :] = 999999
UPDATE:
In [63]: df
Out[63]:
a b c
0 0 3 NaN
1 3 7 5.0
2 0 5 NaN
3 4 1 6.0
4 7 9 NaN
In [64]: df.ix[df.c.isnull()] = [999999] * len(df.columns)
In [65]: df
Out[65]:
a b c
0 999999 999999 999999.0
1 3 7 5.0
2 999999 999999 999999.0
3 4 1 6.0
4 999999 999999 999999.0
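.ix has since been deprecated and removed from pandas; a sketch of the same row-wise replacement with .loc:
df.loc[df['c'].isnull(), :] = 999999   # set every column of the rows where c is NaN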
You can use a vectorized approach (the .fillna() method); note it fills only the c column rather than replacing the whole row:
In [50]: df
Out[50]:
a b c
0 1 8 NaN
1 8 8 6.0
2 5 2 NaN
3 9 4 1.0
4 4 2 NaN
In [51]: df.c = df.c.fillna(999999)
In [52]: df
Out[52]:
a b c
0 1 8 999999.0
1 8 8 6.0
2 5 2 999999.0
3 9 4 1.0
4 4 2 999999.0
I have a dataframe with a 2-level MultiIndex:
import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_tuples(list(enumerate(np.random.choice(['A', 'B'], 5))))
df = pd.DataFrame({'Val': np.random.randint(0, 30, 5)}, index=ix).unstack().fillna(0)
df
Val
A B
0 27 0
1 0 3
2 0 7
3 9 0
4 0 19
I would like to add a column for each existing sublevel ('A' and 'B') that is equal to half of the Val column. My intuition was to do
df['Half_val'] = df.Val / 2
which raises a ValueError: Wrong number of items passed 2, placement implies 1.
I can manually do
res = df.Val / 2
df.loc[:, ('Half_val', 'A')] = res.A
df.loc[:, ('Half_val', 'B')] = res.B
which gives what I'm after:
>>> df
Val Half_val
A B A B
0 27 0 13.5 0.0
1 0 3 0.0 1.5
2 0 7 0.0 3.5
3 9 0 4.5 0.0
4 0 19 0.0 9.5
Is there a less verbose, more idiomatic way to make a multiindex column assignment like this (particularly one where I don't have to explicitly specify each sublevel on the left side)?
Edit:
I forgot to mention that trying
res = df.Val / 2
df.loc[:, res.columns] = res
raises a KeyError: "['A' 'B'] not in index".
Edit 2
It would be nice if the solution allowed pseudo-mixed level columns in the dataframe. In my example, I can do
In [5]: df['C'] = 'a'
In [6]: df
Out[6]:
Val C
A B
0 4 0 a
1 0 10 a
2 0 4 a
3 21 0 a
4 0 14 a
which adds a column with a single level. But since the column index already has 2 levels, it appears C gets an implicit second level of an empty string:
In [9]: list(df)
Out[9]: [('Val', 'A'), ('Val', 'B'), ('C', '')]
When I try a solution offered below, the single-level C column seems to break it:
In [7]: pd.concat([df,df['Val']/2],axis=1,keys=['Val', 'C', 'Half'])
==> AssertionError: Cannot concat indices that do not have the same number of levels
Is there some trick for the keys parameter to pass, or do I need to give C a different dummy value for the second level (since it looks like "" doesn't count) and then remove it after the concatenation?
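One possible workaround for the mixed-level case (my own sketch, not verified against the pandas version used in this thread): build only the halved block with concat and keys, then join it back, leaving the single-level C column untouched:
half = pd.concat([df['Val'] / 2], axis=1, keys=['Half'])   # columns ('Half', 'A'), ('Half', 'B')
df = df.join(half)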
You can iterate over the level values and do a direct assignment, one value at a time (a loop version follows the output below):
In [55]: df.columns.get_level_values(1)
Out[55]: Index([u'A', u'B'], dtype='object')
In [51]: df[('Half','A')] = df[('Val','A')]/2
In [52]: df[('Half','B')] = df[('Val','B')]/2
In [53]: df
Out[53]:
Val Half
A B A B
0 0 12 0.0 6.0
1 0 5 0.0 2.5
2 0 26 0.0 13.0
3 3 0 1.5 0.0
4 25 0 12.5 0.0
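The same idea as an explicit loop over the sublevels (a small sketch, assuming two column levels):
for sub in df['Val'].columns:
    df[('Half', sub)] = df[('Val', sub)] / 2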
You can do this as well:
In [59]: pd.concat([df['Val'], df['Val']/2], axis=1, keys=['Val', 'Half'])
Out[59]:
Val Half
A B A B
0 0 10 0.0 5.0
1 0 10 0.0 5.0
2 0 13 0.0 6.5
3 27 0 13.5 0.0
4 2 0 1.0 0.0
Here's an issue to track this bug/enhancement: https://github.com/pydata/pandas/issues/7475
I think this option is preferable to the concat option because you don't have to risk incorrectly re-labeling the 'Val' column. Please correct me if you disagree!
Given your input dataframe:
In [3]: df
Out[3]:
Val
A B
0 26 0
1 10 0
2 18 0
3 0 18
4 2 0
A third option worth considering is:
In [4]: df[pd.MultiIndex.from_product([['Half']] + df.columns.levels[1:])] = df['Val'] / 2
In [5]: df
Out[5]:
Val Half
A B A B
0 26 0 13 0
1 10 0 5 0
2 18 0 9 0
3 0 18 0 9
4 2 0 1 0
This approach also just works with an arbitrarily nested MultiIndex. (I don't know whether it's possible to do this assignment with sub-columns of a MultiIndex.)
In [1]: df = pd.DataFrame({'Val': np.random.randint(5, 30, 12)}, index=pd.MultiIndex.from_product([['A', 'B','C'], ['a', 'b'], [0, 1]])).unstack().unstack()
In [2]: df
Out[2]:
Val
0 1
a b a b
A 6 10 11 7
B 16 8 23 15
C 29 17 11 18
In [3]: df[pd.MultiIndex.from_product([['Half']] + df.columns.levels[1:])] = df['Val'] / 2
In [4]: df
Out[4]:
Val Half
0 1 0 1
a b a b a b a b
A 6 10 11 7 3.0 5.0 5.5 3.5
B 16 8 23 15 8.0 4.0 11.5 7.5
C 29 17 11 18 14.5 8.5 5.5 9.0