Given a DataFrame, I'd like to count the number of NaN values in each column, to show the proportion as a histogram.
I've come up with
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
nan_dict = {}
for col in df:
    nan_dict[col] = df[col].value_counts(dropna=False)[0]
and then build the histogram from the dict. This seems really cumbersome; also, it fails when there are no NaNs.
Is there a way I could apply value_counts along all columns so that I get back a Series with NaN values per column?
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
print(dict(zip(df.columns, df.isna().sum())))
Prints:
{'col1': 0, 'col2': 0}
For the DataFrame:
col1 col2
0 1 3.0
1 2 NaN
Prints:
{'col1': 0, 'col2': 1}
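Since df.isna().sum() is already a Series indexed by column name, the dict isn't strictly needed, and df.isna().mean() gives the proportion of NaNs rather than the count. A minimal sketch of going straight from the DataFrame to a bar plot, assuming matplotlib is installed for plotting:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3.0, np.nan]})

nan_counts = df.isna().sum()      # Series: NaN count per column
nan_fraction = df.isna().mean()   # Series: proportion of NaN values per column

nan_fraction.plot.bar()           # one bar per column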
I would like to add a new column based on another column's value, with a condition.
In pandas I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is
   a  b  c
0  1  3  1
1  2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col("b")).otherwise(pl.col("a")).alias("c")
)
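For comparison, the same conditional column can be built in pandas in one step instead of the two-step .loc assignment; a minimal sketch using numpy.where (numpy is assumed to be available):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# where b == 4 take b, otherwise take a -- same logic as when/then/otherwise
df['c'] = np.where(df['b'] == 4, df['b'], df['a'])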
I am trying something like this:
List append in pandas cell
But the problem is that the post is old and the approaches in it are deprecated and should not be used anymore.
import pandas as pd

d = {'col1': ['TEST', 'TEST'], 'col2': [[1, 2], [1, 2]], 'col3': [35, 89]}
df = pd.DataFrame(data=d)
   col1    col2  col3
0  TEST  [1, 2]    35
1  TEST  [1, 2]    89
My DataFrame looks like this, where col2 is the column I'm interested in. I need to append [0, 0] to the list in col2 for every row in the DataFrame. My real DataFrame has a dynamic shape, so I can't just set each cell individually.
The end result should look like this:
   col1          col2  col3
0  TEST  [1, 2, 0, 0]    35
1  TEST  [1, 2, 0, 0]    89
I fooled around with df.apply and df.assign but I can't seem to get it to work.
I tried:
df['col2'] += [0, 0]
df = df.col2.apply(lambda x: x.append([0,0]))
This returns a Series that looks nothing like what I need. I also tried:
df = df.assign(new_column=lambda x: x + list([0, 0]))
Not sure if this is the best way to go, but your second attempt works with a small modification:
import pandas as pd
d = {'col1': ['TEST', 'TEST'], 'col2': [[1, 2], [1, 2]], 'col3': [35, 89]}
df = pd.DataFrame(data=d)
df["col2"] = df["col2"].apply(lambda x: x + [0,0])
print(df)
Firstly, if you want to add all members of an iterable to a list, use .extend instead of .append. Your apply attempt doesn't work because .append (and .extend) mutate the list in place and return None, so the "col2" values become None; list concatenation with + avoids that. Finally, assign the modified column back to the original DataFrame rather than overwriting df itself; that overwrite is why you got back a bare Series.
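A minimal sketch of that pitfall in plain Python, without pandas:
lst = [1, 2]
result = lst.append([0, 0])   # append mutates lst in place...
print(result)                 # ...and returns None
print(lst)                    # [1, 2, [0, 0]] -- note the nested list

lst2 = [1, 2]
lst2.extend([0, 0])           # extend adds each element individually
print(lst2)                   # [1, 2, 0, 0]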
One idea is to use a list comprehension:
df["col2"] = [x + [0,0] for x in df["col2"]]
print (df)
col1 col2 col3
0 TEST [1, 2, 0, 0] 35
1 TEST [1, 2, 0, 0] 89
# mutate each list in place; extend adds both zeros
for val in df['col2']:
    val.extend([0, 0])
I have this dataset where I have NaN values in column 'a'. I want to group rows by 'user_id', compute the mean of column 'c' per 'user_id', and fill the NaN values in 'a' with that mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should get:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset, I need to do it in place.
I think you need to take the mean of column c instead; your attempt fills 'a' with the group mean of 'a' itself, which is still NaN for user_id 2 because every 'a' in that group is NaN:
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
Or, without inplace (recent pandas versions with copy-on-write may not propagate an inplace fillna on a column selection back to df, so assignment is the safer pattern):
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
a user_id c
0 0.0 1 3
1 7.0 2 7
2 7.0 2 7
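A short runnable sketch tying the pieces together; the group mean of 'c' for user_id 2 is 7.0, which is what fills the two NaNs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})

# per-row group mean of column 'c', aligned with df's index
group_mean_c = df.groupby('user_id')['c'].transform('mean')

# fill NaNs in 'a' from that aligned Series
df['a'] = df['a'].fillna(group_mean_c)
print(df)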
I have three dataframes that have the same format, and I want to simply add the three respective values on top of each other, so that df_new = df1 + df2 + df3. The new df would have the same number of rows and columns as each old df.
But what I tried only appends the columns. I have searched through the docs and there is a lot on merging etc. but nothing on adding values. I suppose there must be a one-liner for such a basic operation?
A possible solution is the following:
# pip install pandas
import pandas as pd

# set up test dataframes with the same structure but different values
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1]})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2]})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3]})

df_new = pd.DataFrame()
for col in df1.columns:
    df_new[col] = df1[col] + df2[col] + df3[col]
df_new
Returns
   col1  col2
0     6     6
1     6     6
2     6     6
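The column loop isn't strictly needed; since the frames share the same index and columns, element-wise addition works directly. A minimal sketch:
df_new = df1 + df2 + df3
# or, for an arbitrary list of frames:
df_new = sum([df1, df2, df3])
# if NaNs are possible, df1.add(df2, fill_value=0).add(df3, fill_value=0) avoids NaN propagation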
I was wondering if I can use pandas' .drop method to drop rows when chaining methods to construct a DataFrame.
Dropping rows is straightforward once the DataFrame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] !=2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A'] != 2)
)
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
)
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
)
print(df2)
AA B
0 1 5
2 3 3
You can use DataFrame.apply with DataFrame.dropna (note that the columns come back as float, because the intermediate np.nan row forces a float dtype):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1).dropna()
)
print(df2)
AA B
0 1.0 5.0
2 3.0 3.0
Maybe you can try adding a selection with .loc at the end of the line that defines the DataFrame.
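If you specifically want .drop in the chain, as the commented line suggested, one option is to go through .pipe so the intermediate frame is available when computing which index labels to drop; a minimal sketch:
import pandas as pd

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .pipe(lambda d: d.drop(index=d.index[d['AA'] == 2]))
)
print(df2)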