Chained conditional count in Pandas - python

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID  Name    Postcode  Street        Employer  Salary
1   John    NaN       Craven Road   NaN       NaN
2   Sue     TD2       NAN           NaN       15000
3   Jimmy   MW6       Blake Street  Bank      40000
4   Laura   QE2       Mill Lane     NaN       20000
5   Sam     NW2       Duke Avenue   Farms     35000
6   Jordan  SE6       NaN           NaN       NaN
7   NaN     CB2       NaN           Startup   NaN
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name  Postcode  Street  Employer  Salary
6     5         3       2         2
Is there a good Pandas way of doing this? I suppose I could apply a mask so that if any previous boolean is zero, the current column's boolean is also forced to zero, and then count that, but I'm not sure whether that is the best approach.
Thanks!

I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
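Put together, a minimal reproducible version of that approach (with Sue's 'NAN' entered as a real NaN and ID used as the index) looks like:

```python
import pandas as pd
import numpy as np

# Rebuild the question's frame; Sue's street is a real NaN here
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7],
    'Name': ['John', 'Sue', 'Jimmy', 'Laura', 'Sam', 'Jordan', np.nan],
    'Postcode': [np.nan, 'TD2', 'MW6', 'QE2', 'NW2', 'SE6', 'CB2'],
    'Street': ['Craven Road', np.nan, 'Blake Street', 'Mill Lane',
               'Duke Avenue', np.nan, np.nan],
    'Employer': [np.nan, np.nan, 'Bank', np.nan, 'Farms', np.nan, 'Startup'],
    'Salary': [np.nan, 15000, 40000, 20000, 35000, np.nan, np.nan],
}).set_index('ID')

# notnull() gives booleans; cummin along each row stays True only while
# every column so far was filled; summing counts those rows per column
result = df.notnull().cummin(axis=1).sum(axis=0)
print(result)
```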

Maybe cumprod? BTW, you have the string 'NAN' in your df; I left it as-is, so notnull treats it as filled here (hence the different Street count):
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64

Related

Check if multiple rows are filled and if not blank all values

I have the following dataframe:
ID  Name   Date        Code  Manager
1   Paulo  %           10    Jonh's Peter
2   Pedro  2001-01-20  20
3   James              30
4   Sofia  2001-01-20  40    -
I need a way to check if multiple columns (in this case Date, Code and Manager) are filled with any value. In the case there is any blank in any of those 3 columns, blank all the values in all 3 columns, returning this data:
ID  Name   Date        Code  Manager
1   Paulo  %           10    Jonh's Peter
2   Pedro
3   James
4   Sofia  2001-01-20  40    -
What is the best solution for this case?
You can use pandas.DataFrame.loc to replace values in certain columns based on a condition.
Considering that your dataframe is named df:
You can use this code to get the expected output :
df.loc[df[["Date", "Code", "Manager"]].isna().any(axis=1), ["Date", "Code", "Manager"]] = ''
>>> print(df)
You can use isnull() with any(axis=1) to build the row mask, then use loc to set whatever value you want:
col_chk = ['Date', 'Code', 'Manager']
m = df[col_chk].isnull().any(axis=1)
df.loc[m , col_chk] = '' # or pd.NA
print(df)
   ID   Name        Date  Code       Manager
0   1  Paulo           %  10.0  Jonh's Peter
1   2  Pedro
2   3  James
3   4  Sofia  2001-01-20  40.0             -
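For completeness, a self-contained sketch of the mask-and-loc approach (the three checked columns are cast to object first so the empty string can be stored alongside numbers without dtype complaints):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Paulo', 'Pedro', 'James', 'Sofia'],
    'Date': ['%', '2001-01-20', np.nan, '2001-01-20'],
    'Code': [10.0, 20.0, 30.0, 40.0],
    'Manager': ["Jonh's Peter", np.nan, np.nan, '-'],
})

col_chk = ['Date', 'Code', 'Manager']
df[col_chk] = df[col_chk].astype(object)  # allow '' alongside numbers

# rows where any of the three columns is missing get all three blanked
m = df[col_chk].isnull().any(axis=1)
df.loc[m, col_chk] = ''
print(df)
```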

How do I get all values from one pandas column that correspond to a specific value in a second column?

I have a Dataframe that looks like this :
     name   age       occ   salary
0   Vinay  22.0  engineer  60000.0
1  Kushal   NaN    doctor  70000.0
2    Aman  24.0  engineer  80000.0
3   Rahul   NaN    doctor  65000.0
4  Ramesh  25.0    doctor  70000.0
and I'm trying to get all salary values that correspond to a specific occupation, to then compute the mean salary of that occ.
Here is an answer in a few steps:
temp_df = df.loc[df['occ'] == 'engineer']
temp_df.salary.mean()
All averages at once:
df_averages = df[['occ', 'salary']].groupby('occ').mean()
                salary
occ
doctor    68333.333333
engineer  70000.000000
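Both steps in one runnable sketch, using the question's data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Rahul', 'Ramesh'],
    'age': [22.0, np.nan, 24.0, np.nan, 25.0],
    'occ': ['engineer', 'doctor', 'engineer', 'doctor', 'doctor'],
    'salary': [60000.0, 70000.0, 80000.0, 65000.0, 70000.0],
})

# salaries for one occupation, then their mean
engineer_mean = df.loc[df['occ'] == 'engineer', 'salary'].mean()

# all occupations at once
all_means = df.groupby('occ')['salary'].mean()
print(engineer_mean)
print(all_means)
```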

Update a dataframe based on values from another: a traditional UPSERT task, with a new indicator column

I am trying to do an UPSERT task over two dataframes.
Here I am updating df2 with df1.
I have used something like this:
final_df=df1.set_index('EmpID').combine_first(df2.set_index('EmpID'))
final_df.reset_index()
My result here is:
  EmpID   Name  Salary      Status
0     A   John  1000.0        Left
1     B   Mary  2000.0     Working
2     C  Samie  3000.0        Left
3     D    Doe  4000.0         NaN
4     E  Lance  2500.0  Contractor
Also, I am not able to add the 'Indicator' column.
I did the following and almost achieved my goal, but is there a better way? And how should I add the indicator column?
df=pd.concat([df1, df2[~df2.EmpID.isin(df1.EmpID)]])
df=df.set_index('EmpID').join(df2.set_index('EmpID'),how='outer',rsuffix='_R')
df[['Name','Salary','Status_R']].reset_index()
  EmpID   Name  Salary    Status_R
0     A   John  1000.0        Left
1     B   Mary  2000.0     Working
2     C  Samie     NaN        Left
3     D    Doe  4000.0         NaN
4     E  Lance  2500.0  Contractor
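One way to get the indicator column is merge with indicator=True, which labels each key as 'left_only' (an insert), 'right_only' (an untouched existing row) or 'both' (an update). A sketch, where df1 and df2 are hypothetical inputs reconstructed from the question's output (the question never shows them):

```python
import pandas as pd

# Hypothetical inputs: df1 holds new/updated rows, df2 the existing table
df1 = pd.DataFrame({'EmpID': ['A', 'C', 'E'],
                    'Name': ['John', 'Samie', 'Lance'],
                    'Salary': [1000.0, 3000.0, 2500.0],
                    'Status': ['Left', 'Left', 'Contractor']})
df2 = pd.DataFrame({'EmpID': ['A', 'B', 'D'],
                    'Name': ['John', 'Mary', 'Doe'],
                    'Salary': [900.0, 2000.0, 4000.0],
                    'Status': ['Working', 'Working', None]})

# UPSERT: df1 wins wherever it has a value, df2 fills the rest
upserted = df1.set_index('EmpID').combine_first(df2.set_index('EmpID')).reset_index()

# merge(indicator=...) flags where each EmpID came from
flags = df1[['EmpID']].merge(df2[['EmpID']], on='EmpID',
                             how='outer', indicator='Indicator')
final = upserted.merge(flags, on='EmpID')
print(final)
```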

Python Pandas fill missing zipcode with values from another dataframe based on conditions

I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
      OWNER_CITY  OWNER_STATE  OWNER_ZIP
495  MIAMI SHORE           PA
496      SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
   OWNER_ZIP      CITY STATE
0      71937      Cove    AR
1      72044  Edgemont    AR
2      56171  Sherburn    MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know, np.where only takes one condition in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
    OWNER_CITY OWNER_STATE OWNER_ZIP
0  Miami Shore     Florida       111
1  Los Angeles  California       NaN
2      Houston         NaN       NaN
and df_coord:
   OWNER_ZIP         CITY       STATE
0        111  Miami Shore     Florida
1        222  Los Angeles  California
2        333      Houston       Texas
You can use pd.notna along with Index.isin like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
    OWNER_CITY OWNER_STATE OWNER_ZIP
0  Miami Shore     Florida       111
1  Los Angeles  California       222
2      Houston         NaN       NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
You can combine multiple rules in a single numpy.where condition, but with the element-wise & operator (and parentheses around each comparison); Python's and does not work on arrays. This gives you the row indices which abide by all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
Use:
print (df_coord)
   OWNER_ZIP         CITY STATE
0      71937         Cove    AR
1      72044     Edgemont    AR
2      56171     Sherburn    MN
3        123  MIAMI SHORE    PA
4        789      SEATTLE    AA
print (ca_df)
  OWNER_ZIP   OWNER_CITY OWNER_STATE
0       NaN          NaN         NaN
1     72044     Edgemont          AR
2     56171          NaN          MN
3       NaN  MIAMI SHORE          PA
4       NaN      SEATTLE         NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter each combination of missing and non-missing columns, add the new data with merge, set the merged result's index to match the filtered rows, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
  OWNER_ZIP   OWNER_CITY OWNER_STATE
0       NaN          NaN         NaN
1     72044     Edgemont          AR
2     56171     Sherburn          MN
3       123  MIAMI SHORE          PA
4       789      SEATTLE          AA
You can do a left join of these dataframes on the 'city' and 'state' columns. That gives you the zip-code corresponding to a city and state when both values are non-null in the first dataframe, and, since it is a left join, it also preserves rows that either don't have a zip-code or have null/empty city and state values.
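That suggestion can be sketched as follows; rows with a missing city or state simply find no matching lookup row (the lookup table has no missing keys), which gives the all-or-nothing behaviour for free. The sketch assumes the lookup's city/state pairs are unique, so the left join does not duplicate rows:

```python
import pandas as pd
import numpy as np

ca_df = pd.DataFrame({'OWNER_CITY': ['Miami Shore', 'Los Angeles', 'Houston'],
                      'OWNER_STATE': ['Florida', 'California', np.nan],
                      'OWNER_ZIP': ['111', np.nan, np.nan]})
df_coord = pd.DataFrame({'OWNER_ZIP': ['111', '222', '333'],
                         'CITY': ['Miami Shore', 'Los Angeles', 'Houston'],
                         'STATE': ['Florida', 'California', 'Texas']})

# rename the lookup columns so we can join on them directly
lookup = df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'})
merged = ca_df.merge(lookup, on=['OWNER_CITY', 'OWNER_STATE'],
                     how='left', suffixes=('', '_lookup'))

# fill only the missing zips; unmatched rows stay NaN
ca_df['OWNER_ZIP'] = ca_df['OWNER_ZIP'].fillna(merged['OWNER_ZIP_lookup'])
print(ca_df)
```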

Pandas fillna with DataFrame of values

According to the docs, the fillna value parameter can be one of the following:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
I have a data frame that looks like:
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S
And that is what I want to do:
NaN Cabin will be filled according to the median value given the Pclass feature value
NaN Age will be filled according to its median value across the data set
NaN Embarked will be filled according to the median value given the Pclass feature value
So after some data manipulation, I got this data frame:
   Pclass Cabin Embarked  Ticket
0       1     C        S      50
1       2     F        S      13
2       3     G        S       5
What it says is that for Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every null df['Cabin'] with C.
This is a small example and I could treat each possible null combination by hand with something like:
df_both.loc[(df_both['Pclass'] == 1) & (df_both['Cabin'].isna()), 'Cabin'] = 'C'
However, I wonder if I can use this derived data frame to do all this filling automatic.
Thank you.
If you want to fill all NaNs with something like the median or the mean of the specific column, you can do the following.
For the median:
df.fillna(df.median())
For the mean:
df.fillna(df.mean())
see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
Edit:
Alternatively, you can use a dictionary of specified values, whose keys map to column names. This way you can also impute missing values in string columns.
df.fillna({'col1':'a','col2': 1})
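As for the original question's derived frame: one way is to turn it into a Pclass-to-Cabin mapping and feed it to fillna via map, so every null Cabin is filled from its row's Pclass in one vectorized step. A hedged sketch with illustrative data (the Cabin values follow the question's derived table):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Pclass': [1, 2, 3, 1, 3],
                   'Cabin': [np.nan, 'F2', np.nan, 'C85', np.nan]})

# derived frame from the question: most common Cabin per Pclass
derived = pd.DataFrame({'Pclass': [1, 2, 3], 'Cabin': ['C', 'F', 'G']})

# turn it into a Series keyed by Pclass, map it onto each row,
# and use the mapped value only where Cabin is null
lookup = derived.set_index('Pclass')['Cabin']
df['Cabin'] = df['Cabin'].fillna(df['Pclass'].map(lookup))
print(df)
```

The same pattern extends to Embarked or any other column with a per-group fill value.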
