The dataframe looks like:
name education education_2 education_3
name_1 NaN some college NaN
name_2 NaN NaN graduate degree
name_3 high school NaN NaN
I just want to keep one education column. I tried to use the conditional statement and compared to each other, I got nothing but error though. I also looked through the merge solution, but in vain. Does anyone know how to deal with it using Python or pandas? Thank you in advance.
name education
name_1 some college
name_2 graduate degree
name_3 high school
One day I hope they'll have better functions for String type rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education') # Filters to only Education columns.
.T # Transposes
.convert_dtypes() # Converts to pandas dtypes, still somewhat in beta.
.max() # Gets the max value from the column, which will be the not-null one.
)
df = df[['name', 'education']]
print(df)
Output:
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
Looping this wouldn't be too hard e.g.:
cols = ['education', 'age', 'income']
for col in cols:
df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education2','education3']].fillna('').sum(axis=1)
df
name education education2 education3 combine
0 name1 NaN some college NaN some college
1 name2 NaN NaN graduate degree graduate degree
2 name3 high school NaN NaN high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
use bfill to fill the empty (NaN) values
df.bfill(axis=1).drop(columns=['education 2','education 3'])
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
if there are other columns in between then choose the columns to apply bfill
In essence, if you have multiple columns for education that you need to consolidate under a single column then choose the columns to which you apply the bfill. subsequently, you can delete those columns from which you back filled.
df[['education','education 2','education 3']].bfill(axis=1).drop(columns=['education 2','education 3'])
I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name
job
value
bob
business
100
NAN
dentist
Nan
jack
Nan
Nan
I am trying to get the following output:
name
job
value
bob jack
business dentist
100
I am trying to group across all columns, I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
Every month I collect data that contains details of employees to be stored in our database.
I need to find a solution to compare the data stored in the previous month to the data received and, for each row that any of the columns had a change, it would return into a new dataframe.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row do not find its Index match, do not return to the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
**The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to to this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
When datatypes don't match. You would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments(#Arkadiusz), it is enough to filter your data using the following
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
names = df({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank'],
'classification':['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
Ideally, I would have the values in those missing columns set to unknown. Resulting in a dataframe where some people are theives, some are good, and the rest are unknown.
EDIT:
One of the first answers I received suggested using merge outter which seems to do some weird things. Here is a code sample:
names = df({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where position is "devsys". Finally, even though bill does not appear in the names dataframe it shows up in the resulting dataframe. So I really need a way to say lookup a value in this other dataframe and if you find something tack on those columns.
In case you are still looking for an answer for this:
The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
names position classification
0 bob dev unknown
1 bob dev unknown
2 bob dev unknown
3 bob dev unknown
4 frank dev thief
5 james dev unknown
6 tim sys good
7 ricardo sys unknown
8 mike sys unknown
9 mark sup thief
10 joan sup unknown
11 joe sup thief
12 joe sup good
13 bill unknown thief
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
There is section showing the type of merges can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
For outer or inner join also join function can be used. In the case above let's suppose that names is the main table (all rows from this table must occur in result). Then to run left outer join use:
what = names.set_index('names').join(info.set_index('names'), how='left')
resp.
what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")
set_index functions are used to create temporary index column (same in both tables). When dataframes would have contain such index column, then this step wouldn't be necessary. For example:
# define index when create dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')
what = names.join(info, how='left')
To perform other types of join just change how attribute (left/right/inner/outer are allowed). More info here
Think of it as an SQL join operation. You need a left-outer join[1].
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
... which is fine. All the NaN results are ok if you look at both the tables.
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging