Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
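A minimal runnable sketch of combine_first on the question's frame: it keeps values from B and falls back to A wherever B is NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})

# combine_first takes B's value where present, A's value where B is NaN
df['C'] = df['B'].combine_first(df['A'])
print(df['C'].tolist())  # [1.0, 5.0, 3.0, 6.0]
```

The result is float because column B is float (NaN forces a float dtype).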
You can use where which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
df['C'] = df['B'].fillna(df['A'])
What .fillna does is fill every NaN in the Series it is called on. We can pass it any value; here we pass df['A'], so each NaN in 'B' is replaced by the corresponding value from 'A', and the final answer ends up in 'C'.
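As a self-contained sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})

# fillna replaces each NaN in B with the value of A at the same index
df['C'] = df['B'].fillna(df['A'])
print(df['C'].tolist())  # [1.0, 5.0, 3.0, 6.0]
```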
The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(list(zip(list1, list2)), columns=['Item', 'Value'])
df
gives:
  Item  Value
0    a      0
1    a      1
2    a      2
3    b      3
4    b      4
5    b      5
6    b      6
7    c      7
8    c      8
9    c      9
required: a GroupFirstValue column as shown below.
The idea is to use a lambda formula to get the 'first' value for each group... for example, "a"'s first value is 0, "b"'s first value is 3, and "c"'s first value is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this in 2 steps: one is the original df, the second is a grouped-by df, and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
Use groupby and first:
df.groupby('Item')['Value'].first()
Or you can use transform and assign the result to a new column in your frame:
df['new_col'] = df.groupby('Item')['Value'].transform('first')
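Both variants on the question's data, as a runnable sketch: first() collapses to one row per group, while transform('first') broadcasts each group's first value back onto every original row so it can be assigned as a column.

```python
import pandas as pd

list1 = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']
df = pd.DataFrame({'Item': list1, 'Value': range(len(list1))})

# One row per group: the first Value seen for each Item
firsts = df.groupby('Item')['Value'].first()

# Same length as df: the group's first Value repeated on every row
df['new_col'] = df.groupby('Item')['Value'].transform('first')
```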
Use mask and duplicated
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
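Why this works, as a small sketch: Item.duplicated() is True for every repeat of a group key, and Series.mask replaces values where the condition is True with NaN, leaving only each group's first Value visible.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['a', 'a', 'b'], 'Value': [0, 1, 2]})

dup = df['Item'].duplicated()   # False for the first 'a', True for the repeat
out = df['Value'].mask(dup)     # NaN wherever dup is True
```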
I am trying to create a new column in a pandas dataframe that sums the other columns. However, if any of the source columns is blank (NaN or 0), I need the new column to be written as blank (NaN) as well:
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the DataFrame sum method, like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
Replace your 0s with np.nan, then pass skipna=False:
df.replace(0, np.nan).sum(axis=1, skipna=False)
0 19.0
1 NaN
2 NaN
dtype: float64
df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
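Putting it together on the question's data, as a sketch: treating 0 as missing and then passing skipna=False lets any NaN poison the row sum.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 4],
                   'b': [5, 6, np.nan],
                   'c': [7, 0, 3],
                   'd': [4, 2, 7]})

# Turn 0 into NaN, then refuse to skip NaN when summing across the row
df['sum'] = df[['a', 'b', 'c', 'd']].replace(0, np.nan).sum(axis=1, skipna=False)
```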
If I have a DataFrame where I want to group rows with the same index name, say:
a b c
c 2 1 -
c nan 2 -
d 4 3 -
e 5 4 -
d 6 5 -
I want to merge rows with the same index name, taking the average of their values in columns a and b, so that df would turn into:
a b
c 2 1.5
d 5 4
e 5 4
If I do:
averaging = df.groupby(["Index"])[['a', 'b']].mean()
("Index" is the name set for the rows)
That works, except it doesn't ignore nan. So instead of my desired dataframe, I get:
a b
c nan 1.5
d 5 4
e 5 4
You can use mean with level=0:
pd.to_numeric(df.a,errors='coerce').mean(level=0)
Out[438]:
c 2.0
d 5.0
e 5.0
Name: a, dtype: float64
Also, the string 'nan' is not NaN; convert it using replace:
df=df.replace('nan',np.nan)
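A self-contained sketch combining both fixes: pd.to_numeric with errors='coerce' turns the string 'nan' into a real NaN, which the mean then skips. Note that Series.mean(level=0) has since been deprecated in pandas, so this uses the equivalent groupby(level=0).mean().

```python
import pandas as pd

# 'nan' is a string here, as in the question, so column a has object dtype
df = pd.DataFrame({'a': [2, 'nan', 4, 5, 6],
                   'b': [1, 2, 3, 4, 5]},
                  index=['c', 'c', 'd', 'e', 'd'])

# Coerce strings to numbers (real NaN for 'nan'), then average per index label
a_mean = pd.to_numeric(df['a'], errors='coerce').groupby(level=0).mean()
```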
Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function, using inputs from the first row of the df, whose index has no overlap with the existing df's columns:
s = pd.Series({'C': 4, 'D': 6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column, or at adding the Series as a new row at the end of the DataFrame, but not at updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6], which was suggested in another answer, but that only works if 'C' and 'D' are already existing columns in the DataFrame. df.assign(**s) works, but then assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
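End to end, as a runnable sketch: pd.DataFrame(s).T turns the Series into a one-row frame with index 0, and join then aligns it on the index, so only row 0 receives values and row 1 gets NaN. The new columns become float because of that NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]})
s = pd.Series({'C': 4, 'D': 6})

# One-row DataFrame with index 0 and columns C, D; join aligns it on the index
out = df.join(pd.DataFrame(s).T)
```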
I have the following dictionary, which contains DataFrames as values, each always having the same number of columns (1) with the same title:
test = {'A': pd.DataFrame(np.random.randn(10), index=range(10),columns=['values']),
'B': pd.DataFrame(np.random.randn(6), index=range(6),columns=['values']),
'C': pd.DataFrame(np.random.randn(11), index=range(11),columns=['values'])}
From this, I would like to create a single DataFrame where the index values are the keys of the dictionary (so A, B, C) and the columns are the union of the current index values across all of the DataFrames (so in this case 0, 1, 2, ..., 10). The values of this DataFrame would be the corresponding 'values' from the DataFrame for each row, and NaN where blank.
Is there a handy way to do this?
IIUC, use pd.concat, keys, and unstack:
pd.concat([test[i] for i in test], keys=test.keys()).unstack(1)['values']
Better yet,
pd.concat(test).unstack(1)['values']
Output:
0 1 2 3 4 5 6 \
A -0.029027 -0.530398 -0.866021 1.331116 0.090178 1.044801 -1.586620
C 1.320105 1.244250 -0.162734 0.942929 -0.309025 -0.853728 1.606805
B -1.683822 1.015894 -0.178339 -0.958557 -0.910549 -1.612449 NaN
7 8 9 10
A -1.072210 1.654565 -1.188060 NaN
C 1.642461 -0.137037 -1.416697 -0.349107
B NaN NaN NaN NaN
Don't overcomplicate things: just use concat and transpose.
pd.concat(test, axis=1).T
0 1 2 3 4 5 \
A values -0.592711 0.266518 -0.774702 0.826701 -2.642054 -0.366401
B values -0.709410 -0.463603 0.058129 -0.054475 -1.060643 0.081655
C values 1.384366 0.662186 -1.467564 0.449142 -1.368751 1.629717
6 7 8 9 10
A values 0.431069 0.761245 -1.125767 0.614622 NaN
B values NaN NaN NaN NaN NaN
C values 0.988287 -1.508384 0.214971 -0.062339 -0.011547
If you were dealing with Series instead of one-column DataFrames, it would make more sense to begin with:
test = {'A': pd.Series(np.random.randn(10), index=range(10)),
'B': pd.Series(np.random.randn(6), index=range(6)),
'C': pd.Series(np.random.randn(11), index=range(11))}
pd.concat(test, axis=1).T
0 1 2 3 4 5 6 \
A -0.174565 -2.015950 0.051496 -0.433199 0.073010 -0.287708 -1.236115
B 0.935434 0.228623 0.205645 -0.602561 1.860035 -0.921963 NaN
C 0.944508 -1.296606 -0.079339 0.629038 0.314611 -0.429055 -0.911775
7 8 9 10
A -0.704886 -0.369263 -0.390684 NaN
B NaN NaN NaN NaN
C 0.815078 0.061458 1.726053 -0.503471
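With deterministic data the shape is easier to check, as a sketch: pd.concat(test, axis=1) builds one column per dictionary key, aligned on the union of the Series indexes (filling NaN where a Series is shorter), and .T flips that into one row per key.

```python
import pandas as pd

test = {'A': pd.Series([10, 11, 12], index=range(3)),
        'B': pd.Series([20, 21], index=range(2))}

# Columns A and B aligned on indexes 0..2, then transposed to rows A and B
out = pd.concat(test, axis=1).T
```

The NaN introduced for B's missing index 2 forces a float dtype across the transposed frame.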