I have two dataframes
df1:
AccountNo   name    a/ctype
11.22.21    Henry   checking
11.22.22    Sam     Saving
11.22.23    John    Checking

df2:
AccountNo   name    a/ctype
11-22-21    Henry   checking
11-22-23    John    Checking
11-22-24    Rita    Checking
output:
df3:
A/cNO_df1   A/cNO_df2   result     Name_df1   Name_df2   result     a/ctype_df1   a/ctype_df2   result
11.22.21    11-22-21    Match      Henry      Henry      Match      checking      checking      Match
11.22.22                Notindf2   Sam                   Notindf2   checking                    Notindf2
11.22.23    11-22-23    Match      John       john       Match      checking      checking      Match
            11-22-24    Notindf1              Rita       Notindf1                 checking      Notindf2
I tried removing the non-numeric characters from the account numbers so I could compare the two data sets, using:

df1['AccountNo'] = df1.AccountNo.replace(regex=r'\D+', value='')
df2['AccountNo'] = df2.AccountNo.replace(regex=r'\D+', value='')

and then concatenating the two dataframes. But once I strip the characters I can no longer print the account numbers in their original format, and for accounts that are missing from df1 or from df2 I am not able to line the rows up for the concat. I also tried numpy.where to compare and concatenate.
Is there a way it can be done?
You can merge with an external Series as key:
# normalize df2's dashed account numbers to df1's dotted format so the keys line up
df1.merge(df2, left_on='AccountNo', right_on=df2['AccountNo'].str.replace('-', '.'),
          suffixes=('_df1', '_df2'), how='outer')
output:
AccountNo AccountNo_df1 name_df1 a/ctype_df1 AccountNo_df2 name_df2 a/ctype_df2
0 11.22.21 11.22.21 Henry checking 11-22-21 Henry checking
1 11.22.22 11.22.22 Sam Saving NaN NaN NaN
2 11.22.23 11.22.23 John Checking 11-22-23 John Checking
3 11.22.24 NaN NaN NaN 11-22-24 Rita Checking
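If you also need the Match / Notindf1 / Notindf2 result columns from your desired df3, one possible follow-up (a sketch that assumes the merge result above was saved as out) is to test which side of each column pair is missing, e.g. with numpy.where:

import numpy as np

# for every compared field, presence on each side decides the result
for col in ['AccountNo', 'name', 'a/ctype']:
    out[col + '_result'] = np.where(out[col + '_df1'].isna(), 'Notindf1',
                           np.where(out[col + '_df2'].isna(), 'Notindf2', 'Match'))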
Goal: I'd like to still show who the person is and display the NaNs associated with them, so that I can quickly find who is missing info.
Consider this dataset:
df:
Name           Phone         Address
John Doe       NaN           123 lane
Jenny Gump     222-222-2222  NaN
Larry Bean     NaN           561 road
Harry Smidlap  111-111-1111  555 highway
I'd like to clean the data up and show something like this (similar to an Excel view when filtering for blanks). Then maybe populate the empty data with something that says "Data exists", or just leave it blank; I'm open to suggestions. I'd also like to drop the rows that have all data populated.
df:
Name         Phone  Address
John Doe     NaN
Jenny Gump          NaN
Larry Bean   NaN
I've tried:
df[df.isnull().any(axis=1)]
That works great, but I have a big data source and I see a lot of unnecessary info that already has data. I only care about seeing the person's name and what they're missing.
Anyone have any ideas?
Since you require the Name column to stay intact, you can select the columns other than Name and mask them, then create a second dataframe df2 that keeps only the rows with no NaN values. Dropping df2's indexes from df then leaves exactly the rows that contain NaN values, as follows.
# blank out every populated cell outside the Name column
df.mask((df.columns != 'Name') & (df.notnull()), "", inplace=True)
# rows with no NaN left are the fully populated ones...
df2 = df.dropna()
# ...so dropping their indexes keeps only rows that are missing something
df.drop(df2.index, inplace=True)
This should give you the following output.
Name         Phone  Address
John Doe     NaN
Jenny Gump          NaN
Larry Bean   NaN
Mask (replace values where the condition is true) every place that is not null with an empty string:

df.mask(df.notnull(), '')

This operates over both dimensions at once, passing a 2D grid of true/false answers to the question "replace or not?". Where the answer is true the contents are discarded; where it is false they are left in place.
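Combining the two ideas in this thread, here is a minimal sketch (using the column names from the example data) that keeps Name, blanks out the populated cells, and drops the fully populated rows:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['John Doe', 'Jenny Gump', 'Larry Bean', 'Harry Smidlap'],
                   'Phone': [np.nan, '222-222-2222', np.nan, '111-111-1111'],
                   'Address': ['123 lane', np.nan, '561 road', '555 highway']})

# keep only rows that are missing something
missing = df[df.isnull().any(axis=1)].copy()

# blank out every populated cell except the Name column
cols = missing.columns.difference(['Name'])
missing[cols] = missing[cols].mask(missing[cols].notnull(), '')
print(missing)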
I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since these names appear in my df2. I tried using str.contains, but with that I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the documentation on the loc method) together with the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
# for each real name, overwrite any username that contains it
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
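A possible vectorized alternative (a sketch that assumes none of the real names contain regex metacharacters) is to build one alternation pattern and use Series.str.extract, keeping the original value wherever nothing matches:

import pandas as pd

df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})

# one regex alternation over all real names; extract the first one found
pattern = '(' + '|'.join(df2['realName']) + ')'
df1['userName'] = df1['userName'].str.extract(pattern, expand=False).fillna(df1['userName'])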
I am wondering if it is even possible to do what I want. I am right now using
df.loc[df.T_LOSS_DESC.str.contains("slip", na=False)]
It locates the column T_LOSS_DESC, and anywhere in that column where there is a specific word like "slip" it returns those rows. My first question: is there any way to put the results in their own column? And if so, is there any way to specify more than one possible keyword to look for? An example would be

df.loc[df.T_LOSS_DESC.str.contains("slip,Slip,Slipped", na=False)]

Is that viable, or can I only use one pattern?
what my dataframe looks like:
T_LOSS_DESC
1 Bob was running and Slipped
2 Jeff got burnt by the sun
3 James went for a walk
What I would like is this: if it finds matches inside the column I am searching, it should put the matches in a different column.
So my final dataframe would look like this:
T_LOSS_DESC Slippery
1 Bob was running and Slipped Bob was running and Slipped
2 Jeff got burnt by the sun
3 James went for a walk
So, since only one row matched the strings I was looking for in the column, that one match is brought over into a new column called Slippery.
Thanks in advance.
IIUC:
In [95]: df['new'] = df.loc[df.T_LOSS_DESC.str.contains("slip|Slip|Slipped", na=False), 'T_LOSS_DESC']
In [96]: df
Out[96]:
T_LOSS_DESC new
0 Bob was running and Slipped Bob was running and Slipped
1 Jeff got burnt by the sun NaN
2 James went for a walk NaN
alternatively you can do it this way:
In [116]: df.loc[df.T_LOSS_DESC.str.contains("slip|Slip|Slipped", na=False), 'Slippery'] = df.T_LOSS_DESC
In [117]: df
Out[117]:
T_LOSS_DESC Slippery
0 Bob was running and Slipped Bob was running and Slipped
1 Jeff got burnt by the sun NaN
2 James went for a walk NaN
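Note that all three keywords share the stem "slip", so a single case-insensitive match would also cover them; a small variation on the line above, using str.contains's case parameter:

df.loc[df.T_LOSS_DESC.str.contains("slip", case=False, na=False), 'Slippery'] = df.T_LOSS_DESC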
I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
Ideally, I would have the values in those missing columns set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.
EDIT:
One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:
names = df({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where the position is "devsys". Finally, even though bill does not appear in the names dataframe, it shows up in the resulting dataframe. So I really need a way to say: look up a value in this other dataframe and, if you find something, tack on those columns.
In case you are still looking for an answer for this:
The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
names position classification
0 bob dev unknown
1 bob dev unknown
2 bob dev unknown
3 bob dev unknown
4 frank dev thief
5 james dev unknown
6 tim sys good
7 ricardo sys unknown
8 mike sys unknown
9 mark sup thief
10 joan sup unknown
11 joe sup thief
12 joe sup good
13 bill unknown thief
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
There is a section showing the types of merges that can be performed: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
The join function can also be used for an outer or inner join. In the case above, let's suppose that names is the main table (all rows from this table must occur in the result). Then to run a left outer join use:
what = names.set_index('names').join(info.set_index('names'), how='left')
or, with the unknowns filled in:
what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")
The set_index calls are used to create a temporary index column (the same one in both tables). If the dataframes already contained such an index column, this step wouldn't be necessary. For example:
# define index when create dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')
what = names.join(info, how='left')
To perform other types of join, just change the how attribute (left/right/inner/outer are allowed). More info is in the pandas documentation linked above.
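For reference, a complete runnable version of the join approach using the question's original data:

import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})

what = names.set_index('names').join(info.set_index('names'), how='left').fillna('unknown')
print(what.reset_index())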
Think of it as an SQL join operation. You need a left-outer join[1].
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
... which is fine. All the NaN results make sense if you look at both tables: those names simply have no classification.
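If you would rather display 'unknown' than NaN, as the other answers do, you can fill afterwards:

a = a.fillna('unknown')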
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging