Comparing two columns and replacing NaN with numbers - Python

for i in range(len(df1)-1):
    if (df1['overall_rating'][i] == np.nan) and (df1['recommended'][i] == 0):
        df1['overall_rating'] = df1['overall_rating'][i].replace(np.nan, 1)
    else:
        df1['overall_rating']
print(df1['overall_rating'])
I am comparing the overall_rating and recommended columns in a pandas DataFrame. If both conditions hold (the rating is NaN and recommended is 0), I want to replace the NaN in the rating column with 1. But I get neither the expected result nor an error. Can anyone tell me where I am going wrong?

Use DataFrame.loc to set 1 by two conditions; to test for missing values, use the Series.isna function:
df1 = pd.DataFrame({'overall_rating': [np.nan, 2, 4, np.nan],
                    'recommended': [0, 0, 1, 1]})
df1.loc[df1['overall_rating'].isna() & (df1['recommended']==0), 'overall_rating'] = 1
print (df1)
   overall_rating  recommended
0             1.0            0
1             2.0            0
2             4.0            1
3             NaN            1
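As to why the original loop does nothing: np.nan == np.nan is False (NaN never compares equal to anything, including itself), so the if branch is never taken. A quick check:
import numpy as np

print(np.nan == np.nan)  # False: NaN never compares equal, even to itself
print(np.isnan(np.nan))  # True: use np.isnan / Series.isna to test for NaN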

Related

Pandas: How to delete rows where 2 conditions in 2 different columns need to be met

Let's say I have a data frame that looks like this. I want to delete everything with a certain ID if all of its Name values are empty. In this example, every Name value is missing in the rows where ID is 2. Even if I have 100 rows with ID 3 and only one Name value is present, I want to keep them.
ID  Name
1   NaN
1   Banana
1   NaN
2   NaN
2   NaN
2   NaN
3   Apple
3   NaN
So the desired output looks like this:
ID  Name
1   NaN
1   Banana
1   NaN
3   Apple
3   NaN
Everything I tried so far was wrong. In this attempt, I tried to count every NaN value that belongs to an ID, but it still returns too many rows. This is the closest I got to my desired outcome.
df = df[(df['ID']) & (df['Name'].isna().sum()) != 0]
You want to exclude rows from IDs that have as many NaNs as they have rows. Therefore, you can group by ID and count their number of rows and number of NaNs.
Based on this result, you can get the IDs from people whose row count equals their NaN count and exclude them from your original dataframe.
# Declare column that indicates if `Name` is NaN
df['isna'] = df['Name'].isna().astype(int)
# Declare a dataframe that counts the rows and NaNs per `ID`
counter = df.groupby('ID').agg({'Name':'size', 'isna':'sum'})
# Get ID's from people who have as many NaNs as they have rows
exclude = counter[counter['Name'] == counter['isna']].index.values
# Exclude these IDs from your data
df = df[~df['ID'].isin(exclude)]
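For reference, the same filter can be written in one line with groupby().transform, keeping only the rows whose ID group has at least one non-missing Name (a sketch, equivalent under the assumptions above):
# keep rows whose ID group contains at least one non-NaN Name
df = df[df.groupby('ID')['Name'].transform(lambda s: s.notna().any())]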
Using .groupby and .query. Grouping by Name silently drops the all-NaN groups, so ids keeps only the IDs with at least one non-missing Name; @ids references the local Python list inside query:
ids = df.groupby(["ID", "Name"]).agg(Count=("Name", "count")).reset_index()["ID"].tolist()
df = df.query("ID.isin(@ids)").reset_index(drop=True)
print(df)
Output:
  ID    Name
0  1     NaN
1  1  Banana
2  1     NaN
3  3   Apple
4  3     NaN

How to sum duplicate columns in dataframe and return nan if at least one value is nan

I have a dataframe with duplicate columns (number not known a priori) like this example:
   a    a  a  b  b
0  1    1  1  1  1
1  1  nan  1  1  1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a'.
   a  b
0  3  2
1  2  2
In the documentation of pandas.DataFrame.sum there is a skipna parameter which might suit my case. But I am using pandas.core.groupby.GroupBy.sum, which does not have this parameter; it does have min_count, which does what I want, except that the required count is not known in advance and would differ for each duplicated column.
For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.
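To illustrate (assuming the df defined above; note that groupby(axis=1) is deprecated in pandas 2.x):
print(df.groupby(axis=1, level=0).sum(min_count=3))
#      a   b
# 0  3.0 NaN
# 1  NaN NaN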
The result I want to achieve is:
     a  b
0    3  2
1  nan  2
One workaround is to use apply, which lets you reach DataFrame.sum and its skipna parameter:
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
     a    b
0  3.0  2.0
1  NaN  2.0
Another possible solution:
cols, ldf = df.columns.unique(), len(df)
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),  # the list is built row-major, so rows come first
    columns=cols)
Output:
     a    b
0  3.0  2.0
1  NaN  2.0
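A fully vectorized alternative (a sketch under the same assumptions): compute the plain groupby sum, then re-apply NaN wherever any of the duplicated columns was NaN:
summed = df.groupby(axis=1, level=0).sum()
mask = df.isna().groupby(axis=1, level=0).any()  # True where any duplicate is NaN
print(summed.mask(mask))
#      a    b
# 0  3.0  2.0
# 1  NaN  2.0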

Why do we sometimes have to add .values when doing elementwise operations in pandas?

Suppose I have a dataframe that looks like this:
A
0 0
1 1
2 2
3 3
and when I run:
a = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)]
I get
A
0 NaN
1 NaN
2 NaN
3 NaN
I know I could get the right result by writing
a = df.loc[np.arange(0,2)].values / df.loc[np.arange(2,4)]
b = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)].values
Can anyone explain why?
Pandas is index- and column-sensitive: when you do the calculation, the operands are first aligned on those hidden keys. If you only want the values to match positionally, removing the influence of index and columns, add .values or .to_numpy(); that said, index alignment brings advantages as well.
Example 1: the indexes do not match, so the result is NaN:
s1=pd.Series([1],index=[1])
s2=pd.Series([1],index=[999])
s1/s2
1 NaN
999 NaN
dtype: float64
s1.values/s2.values
array([1.])
Example 2: the indexes match, so pandas returns the value wherever they align:
s1=pd.Series([1],index=[1])
s2=pd.Series([1,999],index=[1,999])
s1/s2
1 1.0
999 NaN
dtype: float64
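Applied to the original question: df.loc[np.arange(0,2)] and df.loc[np.arange(2,4)] have disjoint indexes (0-1 vs 2-3), so every position misaligns and the result is all NaN. Besides .values, resetting the index also aligns them positionally (a sketch assuming the 4-row frame above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3]})
top = df.loc[np.arange(0, 2)].reset_index(drop=True)     # index 0, 1
bottom = df.loc[np.arange(2, 4)].reset_index(drop=True)  # index 0, 1
print(top / bottom)
#           A
# 0  0.000000
# 1  0.333333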

Replacing missing values in a pandas dataframe with elements of a list

I have a dataframe and want to fill the NaN values of a particular column with a list derived from another calculation.
df = pd.DataFrame([1, np.nan, 3, np.nan], columns=['A'])
values_to_be_filled = [100.942,90.942]
df
     A
0    1
1  NaN
2    3
3  NaN
output:
df2
         A
0        1
1  100.942
2        3
3   90.942
I tried using the replace function but was not able to replace with the list elements. Any help would be much appreciated.
df.loc[df["A"].isnull(), "A"] = values_to_be_filled

Accessing groupby sum results [duplicate]

I have a dataframe with 2 index levels:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
Which I want to turn into this:
Trial  measurement  value
    1            0     13
    1            1      3
    1            2      4
    2            0    NaN
    2            1     12
    3            0     34
How can I best do this?
I need this because I want to aggregate the data as instructed here, but I can't select my columns like that if they are in use as indices.
The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).
All you have to do call .reset_index() after the name of the DataFrame:
df = df.reset_index()
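A quick self-contained demo, rebuilding the frame from the question's data (level names assumed from the display above):
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=index)

df = df.reset_index()       # the index levels become ordinary columns
print(df.columns.tolist())  # ['Trial', 'measurement', 'value']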
This doesn't really apply to your case but could be helpful for others (like myself five minutes ago) to know. If one's MultiIndex levels have the same name, like this:
             value
Trial Trial
1     0         13
      1          3
      2          4
2     0        NaN
      1         12
3     0         34
df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same names.
So then you need to rename the multindex with df.index = df.index.set_names(['Trial', 'measurement']) to get:
                   value
Trial measurement
1     0               13
1     1                3
1     2                4
2     0              NaN
2     1               12
3     0               34
And then df.reset_index(inplace=True) will work like a charm.
I encountered this problem after grouping by year and month on a datetime column (not the index) called live_date, which meant that both year and month were named live_date.
There may be situations when df.reset_index() cannot be used (e.g., when you need the index, too). In this case, use index.get_level_values() to access index values directly:
df['Trial'] = df.index.get_level_values(0)
df['measurement'] = df.index.get_level_values(1)
This will assign index values to individual columns and keep the index.
See the docs for further info.
As @cs95 mentioned in a comment, to drop only one level, use:
df.reset_index(level=[...])
This avoids having to redefine your desired index after reset.
I ran into Karl's issue as well. I just found myself renaming the aggregated column then resetting the index.
df = pd.DataFrame(df.groupby(['arms', 'success'])['success'].sum()).rename(columns={'success':'sum'})
df = df.reset_index()
Short and simple
df2 = pd.DataFrame({'test_col': df['test_col'].describe()})
df2 = df2.reset_index()
A solution that might be helpful in cases when not every column has multiple index levels:
df.columns = df.columns.map(''.join)
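If the levels are not all strings, or you want a separator between them, a common variant is the following (a sketch, assuming df.columns is a MultiIndex; the underscore separator is an arbitrary choice):
# join each column tuple with '_', coercing levels to str and
# trimming separators left behind by empty level values
df.columns = ['_'.join(map(str, col)).strip('_') for col in df.columns]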
Similar to Alex's solution, in a more generalized form: it keeps the index untouched and adds each index level as a new column under its own name.
for i in df.index.names:
    df[i] = df.index.get_level_values(i)
which gives
                   value  Trial  measurement
Trial measurement
1     0               13      1            0
      1                3      1            1
      2                4      1            2
...
