Multiple similar columns with similar values - Python

The dataframe looks like:
name education education_2 education_3
name_1 NaN some college NaN
name_2 NaN NaN graduate degree
name_3 high school NaN NaN
I just want to keep one education column. I tried comparing the columns to each other with conditional statements, but I got nothing but errors. I also looked into merge-based solutions, but in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance.
name education
name_1 some college
name_2 graduate degree
name_3 high school

One day I hope they'll have better functions for string-type rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education')  # keep only the education* columns
                     .T                         # transpose so each person becomes a column
                     .convert_dtypes()          # convert to nullable pandas dtypes (still somewhat new)
                     .max())                    # the max of each column is its single non-null value
df = df[['name', 'education']]
print(df)
Output:
name education
0 name_1 some college
1 name_2 graduate degree
2 name_3 high school
Looping this wouldn't be too hard, e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
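For reference, a minimal self-contained sketch of the filter/transpose/max approach above, assuming the question's column names:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['name_1', 'name_2', 'name_3'],
    'education': [np.nan, np.nan, 'high school'],
    'education_2': ['some college', np.nan, np.nan],
    'education_3': [np.nan, 'graduate degree', np.nan],
})

# max() over the transposed, nullable-dtype frame picks the single
# non-null string per original row
df['education'] = df.filter(like='education').T.convert_dtypes().max()
df = df[['name', 'education']]
print(df)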

You can use df.fillna to do so: fill the NaNs with an empty string and concatenate across each row with sum(axis=1). This works because at most one of the education columns is non-null per row.
df['combine'] = df[['education','education_2','education_3']].fillna('').sum(axis=1)
df
name education education_2 education_3 combine
0 name_1 NaN some college NaN some college
1 name_2 NaN NaN graduate degree graduate degree
2 name_3 high school NaN NaN high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)

Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education_2','education_3'])
name education
0 name_1 some college
1 name_2 graduate degree
2 name_3 high school
If there are other columns in between, then choose the columns to apply bfill to. In essence, if you have multiple education columns that you need to consolidate into a single column, select only the columns you back-fill across; subsequently, you can drop the columns you back-filled from.
df[['education','education_2','education_3']].bfill(axis=1).drop(columns=['education_2','education_3'])
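Putting it together, a minimal end-to-end sketch of the bfill approach on the question's sample data:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['name_1', 'name_2', 'name_3'],
    'education': [np.nan, np.nan, 'high school'],
    'education_2': ['some college', np.nan, np.nan],
    'education_3': [np.nan, 'graduate degree', np.nan],
})

# bfill(axis=1) pulls each row's first non-null value left into 'education'
df['education'] = df[['education', 'education_2', 'education_3']].bfill(axis=1)['education']
df = df.drop(columns=['education_2', 'education_3'])
print(df)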

Related

How to collapse all rows in pandas dataframe across all columns

I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name  job       value
bob   business  100
NaN   dentist   NaN
jack  NaN       NaN
I am trying to get the following output:
name      job                value
bob jack  business dentist   100
I am trying to group across all columns; I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
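For reference, a self-contained reproduction of this approach, assuming the sample data from the question (the 100.0 appears because the value column holds NaN and therefore passes through float before astype(str)):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['bob', np.nan, 'jack'],
    'job': ['business', 'dentist', np.nan],
    'value': [100, np.nan, np.nan],
})

# apply runs column-wise: drop the NaNs, stringify, and join with spaces
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
print(out)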
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0

keep one column's data in pandas and show all NANs from other columns only

Goal: I'd like to still show who each person is, so that I can display the NaNs associated with them and quickly find who is missing info.
Consider this dataset:
df:
Name Phone Address
John Doe NaN 123 lane
Jenny Gump 222-222-2222 NaN
Larry Bean NaN 561 road
Harry Smidlap 111-111-1111 555 highway
I'd like to clean the data up and show something like this (similar to an excel view when filtering for blanks):
Then maybe populate the non-missing cells with something that says "Data exists", or just leave them blank; I'm open to suggestions. Also drop the rows that have all data populated.
df:
Name Phone Address
John Doe NaN
Jenny Gump NaN
Larry Bean NaN
I've tried:
df[df.isnull().any(axis=1)]
That works great, but I have a big data source and I see a lot of unnecessary info that already has data. I only care about seeing each person's name and what they're missing.
Anyone have any ideas?
Since you require the Name column to be intact, you can select all columns except Name and mask them, then create another DataFrame df2 in which all rows containing NaN values are removed. After that, you can drop the indexes of df2 from df, which will leave you with only the rows containing NaN values, as follows.
df.mask((df.columns != 'Name') & (df.notnull()), "", inplace=True)
df2 = df.dropna()
df.drop(df2.index, inplace=True)
This should give you the following output.
Name Phone Address
John Doe NaN
Jenny Gump NaN
Larry Bean NaN
Mask (replacing values where the condition is true) any place where it is not null with an empty string.
df.mask(df.notnull(), '')
This operates over multiple dimensions, passing a 2D set of true/false answers to the question "replace or not?". Where the answer is true, it sends the contents to /dev/null; where it is not, it lets them remain.
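Combining the two answers into one self-contained sketch (assuming the question's sample data): first keep only the rows that are missing something, then blank out every present value except the Name column.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['John Doe', 'Jenny Gump', 'Larry Bean', 'Harry Smidlap'],
    'Phone': [np.nan, '222-222-2222', np.nan, '111-111-1111'],
    'Address': ['123 lane', np.nan, '561 road', '555 highway'],
})

# keep only rows with at least one missing value
missing = df[df.isnull().any(axis=1)].copy()

# blank out everything that is present, except the Name column
other_cols = missing.columns.difference(['Name'])
missing[other_cols] = missing[other_cols].mask(missing[other_cols].notnull(), '')
print(missing)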

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1@email.com 30
1 John Brown email2@email.com 35
2 Joe Max email3@email.com 40
3 Will Bill email4@email.com 25
4 Johnny Jacks email5@email.com 50
df2
ID Location Contact
0 5435 Austin email5@email.com
1 4234 Atlanta email1@email.com
2 7896 Toronto email3@email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2@email.com 35
3 Will Bill email4@email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, it identifies the matches, but df still contains all the original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped, or creating a new DataFrame that contains only the rows whose Email value in df has no match in the Contact column of df2. I appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data as follows:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
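A minimal reproduction with the question's sample data, for anyone who wants to verify that the filter keeps only the non-matching rows:
import pandas as pd

df = pd.DataFrame({
    'First': ['Adam', 'John', 'Joe', 'Will', 'Johnny'],
    'Last': ['Smith', 'Brown', 'Max', 'Bill', 'Jacks'],
    'Email': ['email1@email.com', 'email2@email.com', 'email3@email.com',
              'email4@email.com', 'email5@email.com'],
    'Age': [30, 35, 40, 25, 50],
})
df2 = pd.DataFrame({
    'ID': [5435, 4234, 7896],
    'Location': ['Austin', 'Atlanta', 'Toronto'],
    'Contact': ['email5@email.com', 'email1@email.com', 'email3@email.com'],
})

# keep only the rows of df whose Email has no match in df2.Contact
df3 = df[~df['Email'].isin(df2['Contact'])].copy()
print(df3)  # John Brown and Will Bill remain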

pandas - Can't merge df/series and groupby then count

TL;DR:
I have two dataframes of different sizes, with one 'id' column (in both) that is supposed to act as an index. I need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe with 'id' and 'sector', among other columns, from company personnel, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row* id sector other columns
1 0 Operational ...
2 0 Administrative ...
3 1 Sales ...
4 2 IT ...
5 3 Operational ...
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational ...
151 100 Sales ...
152 101 IT ...
*I don't really have a 'row' column; it's there just to make my problem easier to understand.
df2:
row* id gender
1 0 Male
2 1 Female
3 2 Female
4 3 Male
5 4 Male
[...]
101 100 Male
102 101 Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge them together and then count how many males and females are in each sector.
FIRST PROBLEM
Decided to make a new df to get only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
Tried using .join() instead of .merge() and I get:
['id'] not in index
Tried now with reset_index() - found in some of the answers around here, but it didn't really solve my issue.
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row* id sector gender
1 0 Operational Male
2 0 Administrative Female
3 1 Sales Female
4 2 IT Male
5 3 Operational Male
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational NaN
151 100 Sales NaN
152 101 IT NaN
It didn't respect the 'id' and just concatenated the column to the side. Since df2 only had 102 rows, I got NaN in the remaining rows (103 to 152), aside from the fact that 'gender' was no longer accurate.
SECOND PROBLEM
Decided to power through that in order to get the rest of the work done. I tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type < class 'pandas.core.frame.DataFrame'>
That doesn't really make sense to me, because I can call df3.gender and I get the (entire) expected Series. If I remove 'gender' from the line above, it actually groups, but that alone doesn't work for me. I also tried passing the column names before groupby, to no avail.
Expected result should be something like this:
sector gender sum
operational male 20
operational female 5
administrative male 10
administrative female 17
sales male 12
sales female 13
IT male 1
IT female 11
Not sure if I can answer my own question, but I think I should, since the issue is resolved.
The solutions were very simple, even though I don't understand some of the issues I got.
First problem: added on='id' to the merge.
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: just a missing list, as pointed out by @DYZ
df3.groupby(['sector','gender']).size()
Feeling quite stupid right now... Must be tired. Thanks DYZ and sorry for the trouble.
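For reference, a self-contained sketch combining both fixes, with the question's sample data abbreviated:
import pandas as pd

df1 = pd.DataFrame({
    'id': [0, 0, 1, 2, 3, 3, 4],
    'sector': ['Operational', 'Administrative', 'Sales', 'IT',
               'Operational', 'IT', 'Sales'],
})
df2 = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
})

# attach a gender to every (id, sector) row, then count each pair
df3 = df1[['id', 'sector']].merge(df2, on='id')
print(df3.groupby(['sector', 'gender']).size())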

Performing the appropriate join operation between two pandas DataFrame

nocity.head()
user_id business_id stars
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5
withcity.head()
business_id city
0 YDf95gJZaq05wvo7hTQbbQ Richmond Heights
1 mLwM-h2YhXl2NCgdS84_Bw Charlotte
2 v2WhjAB3PIBA8J8VxG3wEg Toronto
3 CVtCbSB1zUcUWg-9TNGTuQ Scottsdale
4 duHFBe87uNSXImQmvBh87Q Phoenix
The nocity dataframe has business_id values (they may repeat, since it also holds the rating each user_id gave for each business_id).
The withcity dataframe has the city associated with each business_id
The result I want is:
This is going to be super hard to word:
I want to look up the city associated with each business_id from the withcity dataframe and create a new column in nocity called cityname, which now has the city name associated with that business_id
Why I gave up trying and came here
I know this can be performed with some sort of join operation, but I don't understand which one exactly. I looked them up online and got a little confused as to what would happen if some business_id wasn't available in both dataframes when performing the join.
For example:
withcity has some business_id with a city value, and when performing whichever join is appropriate with nocity, it does not find that particular business_id.
So I came here for help.
What other alternative did I try?
area_dict = dict(zip(withcity.business_id, withcity.city))
emptylist = []
for rows in nocity['business_id']:
    for key, value in area_dict.items():
        if key == rows:
            emptylist.append(value)
I created a dictionary which held the business_id and the city from the withcity dataframe, and performed some sort of matching comparison with the nocity dataframe.
But my method will probably take a lot of time, since there are 4.7 million records to be exact.
IIUC (if I understand correctly), merge:
nocity.merge(withcity,on='business_id',how='left')
Out[855]:
user_id business_id stars city
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5 NaN
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3 NaN
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3 NaN
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4 NaN
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5 NaN
In general, whenever you have a situation like this, you want to avoid loops and iteration and instead perform a merge; afterwards, you massage the data to fit your needs. For example, Wen's solution above is the most apt way to do this.
However, there are a few things I would add. Say the two dfs below are mine:
Let's call the first and second dfs nocity and withcity, respectively.
You want to do:
nocity.merge(withcity, on='business_id', how='left')
However, if you end up getting NaN values as Wen did above, check the datatypes of your keys.
Meaning, if the business_id field in nocity was int (for some reason) while the business_id field in withcity was str, then pandas will have issues merging the dataframes and you will get NaN values instead of the desired city name.
To check, you would do:
# for all datatypes in the nocity df
print(nocity.dtypes)
# or just the field's dtype
print(nocity.business_id.dtypes)
Then you would convert to a common datatype like str if they were different...
# example conversion of a pandas column (Series) to a different datatype
nocity.business_id = nocity.business_id.astype(str)
withcity.business_id = withcity.business_id.astype(str)
# then perform the merge as usual
nocity = nocity.merge(withcity, on='business_id', how='left')
Hope this helps. Also, don't forget to rename the column from 'city' to 'cityname' if that is what you prefer (note that rename returns a new DataFrame unless you assign it back):
nocity = nocity.rename(columns={'city': 'cityname'})
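As an aside, the dictionary lookup from the question can also be done without an explicit loop via Series.map, which behaves like a one-column left join (a sketch, reusing the area_dict built in the question):
# vectorized equivalent of the question's nested loop; unmatched ids become NaN
area_dict = dict(zip(withcity.business_id, withcity.city))
nocity['cityname'] = nocity['business_id'].map(area_dict)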
