I have a pandas DataFrame that I want to separate into observations for which there are no missing values and observations with missing values. I can use dropna() to get rows without missing values. Is there any analog to get rows with missing values?
# Example DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, np.nan, 3, 4, 5], 'col2': [6, 7, np.nan, 9, 10]})
# Get observations without missing values
df.dropna()
Check for nulls by row and filter with boolean indexing:
df[df.isnull().any(axis=1)]
# col1 col2
#1 NaN 7.0
#2 3.0 NaN
~ is the opposite (negation) operator :-)
df.loc[~df.index.isin(df.dropna().index)]
Out[234]:
col1 col2
1 NaN 7.0
2 3.0 NaN
Or
df.loc[df.index.difference(df.dropna().index)]
Out[235]:
col1 col2
1 NaN 7.0
2 3.0 NaN
I use the following expression as the opposite of dropna. In this case, it keeps rows where the specified column is null; anything with a value in that column is not kept.
csv_df = csv_df.loc[~csv_df['Column_name'].notna(), :]
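As a side note, the double negation ~Series.notna() is equivalent to Series.isna(); a minimal runnable sketch with a small hypothetical csv_df:
import pandas as pd
import numpy as np
# Hypothetical data, just for illustration
csv_df = pd.DataFrame({'Column_name': [1.0, np.nan, 3.0], 'other': ['a', 'b', 'c']})
# Keep only the rows where 'Column_name' is null
null_rows = csv_df.loc[csv_df['Column_name'].isna(), :]
print(null_rows)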
I want to merge/join two large dataframes, where the key column of the dataframe on the right is assumed to contain substrings of the left dataframe's 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': ['abc', 'adcfek', 'acefeasdq'],
                    'numbers': [1, 2, np.nan],
                    'add_info': [3123, np.nan, 312441]})
df2 = pd.DataFrame({'matching': ['adc', 'fek', 'acefeasdq', 'abcef', 'acce', 'dcf'],
                    'needed_info': [1, 2, 3, 4, 5, 6],
                    'other_info': [22, 33, 11, 44, 55, 66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So, as described, I want to merge the additional columns only when the 'matching' value is a substring of the 'id' value. If it is the other way around, e.g. 'abc' is a substring of 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where one 'id' contains multiple 'matching' values. For the moment it is okay-ish to ignore these cases, but I'd like to learn how I can tackle this issue. Additionally, is it possible to mark which rows were merged based on substrings and which were merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows, and then filter that dataframe using a boolean Series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
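To also flag which rows matched exactly rather than only as substrings (the follow-up in the question), here is one possible sketch building on the cross join above (assumes pandas >= 1.2 for how='cross'; the exact_match column name is just illustrative):
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': ['abc', 'adcfek', 'acefeasdq'],
                    'numbers': [1, 2, np.nan],
                    'add_info': [3123, np.nan, 312441]})
df2 = pd.DataFrame({'matching': ['adc', 'fek', 'acefeasdq', 'abcef', 'acce', 'dcf'],
                    'needed_info': [1, 2, 3, 4, 5, 6],
                    'other_info': [22, 33, 11, 44, 55, 66]})
cross = pd.merge(df1, df2, how='cross')
# Keep only rows where 'matching' is contained in 'id'
hits = cross[cross.apply(lambda r: r['matching'] in r['id'], axis=1)].copy()
# True for exact matches, False for pure substring matches
hits['exact_match'] = hits['matching'] == hits['id']
# Re-attach ids that matched nothing so they show up with NaNs, as in the desired output
result = df1.merge(hits.drop(columns=['numbers', 'add_info']), on='id', how='left')
print(result)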
If your processing can handle a CROSS JOIN (problematic with large datasets), then you can cross join and then delete/filter only the rows you want:
cross = pd.merge(df1, df2, how='cross')  # all combinations of rows
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)  # boolean mask: 'matching' is a substring of 'id'
final = cross[mask]  # keep only the rows where the condition was met
I am attempting to forward fill a filtered section of a DataFrame but it is not working the way I hoped.
I have a df that looks like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 NaN string
I want it to look like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 3 string
This is my current code:
filter = (df["Col2"] == "string")
df.loc[filter, "Col"].fillna(method="ffill", inplace=True)
But my code does not change the df at all. Any feedback is greatly appreciated
We can use boolean indexing to select the section of Col where Col2 == 'string', then forward fill and update the values only in that section:
m = df['Col2'].eq('string')
df.loc[m, 'Col'] = df.loc[m, 'Col'].ffill()
Col Col2
0 1.0 NaN
1 NaN NaN
2 3.0 string
3 3.0 string
I am not sure I understand your question, but if you want to fill the NaN values (or any specific values) you can use SimpleImputer:
import numpy as np
from sklearn.impute import SimpleImputer
Then you can define an imputer that fills these missing/NaN values with a specific strategy. For example, if you want to fill them with the mean of the column, you can write it as follows:
imputer=SimpleImputer(missing_values=np.nan, strategy= 'mean')
Or you can write it like this if you have the NaN as a string:
imputer=SimpleImputer(missing_values="NaN", strategy= 'mean')
And if you want to fill it with a specific value you can do this:
imputer=SimpleImputer(missing_values=np.nan, strategy= 'constant', fill_value = "YOUR VALUE")
Then you can use it like this:
df[["Col"]]=imputer.fit_transform(df[["Col"]])
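Putting it together, a minimal runnable sketch (the small frame below is hypothetical, roughly mirroring the question's numeric 'Col'):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Hypothetical numeric column with missing values
df = pd.DataFrame({"Col": [1.0, np.nan, 3.0, np.nan]})
# Replace NaN with the column mean ('median', 'most_frequent' or 'constant' also work)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Col"]] = imputer.fit_transform(df[["Col"]])
print(df)  # the NaNs in 'Col' become 2.0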
I have a dataframe like the following:
df = [[1,'NaN',3],[4,5,'Nan'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'Nan'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
So this dropna() is based on a condition... any idea how to handle this? Thanks!
If the values are np.nan and not the string 'NaN' (else replace them first), you can do:
Input:
df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0
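An alternative sketch using dropna itself, assuming the values really are np.nan: restrict the check to the first row via subset while dropping along the columns axis.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, np.nan, 3], [4, 5, np.nan], [7, 8, 9]])
# Drop every column that has a NaN in the row labelled 0
out = df.dropna(axis=1, subset=[0])
print(out)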
I have a dataframe that contains duplicate column names. Now I am trying to combine the duplicate columns into a single column using the following command (the dataframe below is for demo only; it doesn't contain duplicate column names, but the same problem occurs with duplicate column names as well).
d=pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2']=d['col2'].astype(str)
d['col1']=np.nan
d=d.groupby(lambda x:x, axis=1).sum(min_count=1)
the output is:
col1 col2
0 0.0 3.0
1 0.0 4.0
But I expect the output to be:
col1 col2
0 NaN 3.0
1 NaN 4.0
My hope is that, by using min_count=1, pandas will return NaN when the columns being summed up are all NaN. However, now it is returning 0 instead of NaN. Any idea why?
This depends on your version of pandas when you set min_count=1.
If you have a version < 0.22.0, then you would indeed get np.nan when there are fewer than 1 non-NA values.
From version 0.22.0 and up, the default value has been changed to 0 when there are only NA values.
This is also explained in the documentation.
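For reference, a small sketch of the min_count semantics on a plain Series, assuming a reasonably recent pandas:
import numpy as np
import pandas as pd
s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 -> by default an all-NA sum is 0
print(s.sum(min_count=1))  # nan -> with min_count=1, fewer than 1 valid value gives NaN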
I need to select rows from a dataframe where, between columns col1 and col2, at least one of the columns is not null.
Right now, I am trying the following, but it doesn't work:
df=df.loc[(df['Cat1_L2'].isnull()) & (df['Cat2_L3'].isnull())==False]
Setup
(Modifying U8-Forward's data)
df = pd.DataFrame({'Cat1_L2':[1,np.nan,3, np.nan], 'Cat3_L3': [np.nan,3,4, np.nan]})
df
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
3 NaN NaN
Indexing with isna + sum
Fixing your code: ensure the number of True cases (corresponding to NaN in the columns) is less than 2.
df[df[['Cat1_L2', 'Cat3_L3']].isna().sum(axis=1) < 2]
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
dropna with thresh
df.dropna(subset=['Cat1_L2', 'Cat3_L3'], thresh=1)
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
One way is to loop over every row using itertuples(). Be aware that this is computationally expensive.
1. Create a list that checks your condition for each row using itertuples()
condition_list = []
for row in df.itertuples():
    # pd.notna() is used here because NaN != None, so comparing against None would not catch NaN
    if pd.notna(row.Cat1_L2) or pd.notna(row.Cat2_L3):
        condition_list.append(1)
    else:
        condition_list.append(0)
2. Convert list to pandas series
condition_series = pd.Series(condition_list)
3. Append series to original df
df['condition_column'] = condition_series.values
4. Filter df
df_new = df[df.condition_column == 1]
del df_new['condition_column']
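As a side note, the same result can usually be obtained without a Python-level loop; a minimal vectorized sketch, assuming the missing markers are NaN as in the question:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Cat1_L2': [1, np.nan, 3, np.nan],
                   'Cat2_L3': [np.nan, 3, 4, np.nan]})
# Keep rows where at least one of the two columns is not null
df_new = df[df[['Cat1_L2', 'Cat2_L3']].notna().any(axis=1)]
print(df_new)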