Preserving Column Order - Python Pandas and Column Concat

So my google-fu doesn't seem to be doing me justice with what seems like it should be a trivial procedure.
In pandas for Python I have 2 datasets that I want to merge. This works fine using .concat. The issue is that .concat reorders my columns. From a data-retrieval point of view this is trivial; from an "I just want to open the file and quickly see the most important column" point of view, it's annoying.
File1.csv
Name Username Alias1
Tom Tomfoolery TJZ
Meryl MsMeryl Mer
Timmy Midsize Yoda
File2.csv
Name Username Alias1 Alias2
Bob Firedbob Fire Gingy
Tom Tomfoolery TJZ Awww
Result.csv
Alias1 Alias2 Name Username
0 TJZ NaN Tom Tomfoolery
1 Mer NaN Meryl MsMeryl
2 Yoda NaN Timmy Midsize
0 Fire Gingy Bob Firedbob
1 TJZ Awww Tom Tomfoolery
The result is fine, but in the data file I'm working with I have 1,000 columns, and the 2-3 most important ones are now in the middle. Is there a way, in this toy example, I could've forced "Username" to be the first column and "Name" to be the second column, obviously preserving the values below each all the way down?
Also, as a side note: when I save to file it also saves that numbering on the side (0 1 2 0 1). If there's a way to prevent that too, that'd be cool. If not, it's not a big deal since it's a quick fix to remove.
Thanks!

Assuming the concatenated DataFrame is df, you can reorder the columns as follows:
important = ['Username', 'Name']
# important columns first, then every remaining column in its existing order
reordered = important + [c for c in df.columns if c not in important]
df = df[reordered]
print(df)
Output:
Username Name Alias1 Alias2
0 Tomfoolery Tom TJZ NaN
1 MsMeryl Meryl Mer NaN
2 Midsize Timmy Yoda NaN
0 Firedbob Bob Fire Gingy
1 Tomfoolery Tom TJZ Awww
The list of numbers [0, 1, 2, 0, 1] is the index of the DataFrame. To prevent them from being written to the output file, you can use the index=False option in to_csv():
df.to_csv('Result.csv', index=False, sep=' ')
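Putting it all together, a minimal end-to-end sketch, assuming the two space-separated files are saved as File1.csv and File2.csv exactly as shown in the question:

import pandas as pd

# read the two space-separated files
df1 = pd.read_csv('File1.csv', sep=' ')
df2 = pd.read_csv('File2.csv', sep=' ')

# stack them; columns missing from one file are filled with NaN
df = pd.concat([df1, df2])

# move the key columns to the front, keeping the rest in their existing order
important = ['Username', 'Name']
df = df[important + [c for c in df.columns if c not in important]]

# index=False suppresses the 0 1 2 0 1 row labels in the output file
df.to_csv('Result.csv', index=False, sep=' ')

As a side note, newer pandas versions also accept sort=False in pd.concat, which keeps the union of columns in order of first appearance rather than sorting them alphabetically.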

Related

Compare two dataframes with different format column values

I have two dataframes
df1:
AccountNo  name   a/ctype
11.22.21   Henry  checking
11.22.22   Sam    Saving
11.22.23   John   Checking
df2:
AccountNo  name   a/ctype
11-22-21   Henry  checking
11-22-23   John   Checking
11-22-24   Rita   Checking
output:
df3:
A/cNO_df1  A/cNO_df2  result    Name_df1  Name_df2  result    a/ctype_df1  a/ctype_df2  result
11.22.21   11-22-21   Match     Henry     Henry     Match     checking     checking     Match
11.22.22              Notindf2  Sam                 Notindf2  checking                  Notindf2
11.22.23   11-22-23   Match     John      john      Match     checking     checking     Match
           11-22-24   Notindf1            Rita      Notindf1               checking     Notindf1
I tried removing the non-numeric characters from the account numbers to compare both data sets using:
df1['AccountNo'] = df1.AccountNo.replace(regex=[r'\D+'], value='')
df2['AccountNo'] = df2.AccountNo.replace(regex=[r'\D+'], value='')
and then concatenating the two dataframes. But when I remove the characters I cannot print them in the original format, and for accounts not in df1 or not in df2 I am not able to concat them. I tried using numpy.where to compare and concat.
Is there a way it can be done?
You can merge with an external Series as the key:
df1.merge(df2, left_on='AccountNo',
          right_on=df2['AccountNo'].str.replace('-', '.'),
          suffixes=('_df1', '_df2'), how='outer')
output:
AccountNo AccountNo_df1 name_df1 a/ctype_df1 AccountNo_df2 name_df2 a/ctype_df2
0 11.22.21 11.22.21 Henry checking 11-22-21 Henry checking
1 11.22.22 11.22.22 Sam Saving NaN NaN NaN
2 11.22.23 11.22.23 John Checking 11-22-23 John Checking
3 11.22.24 NaN NaN NaN 11-22-24 Rita Checking
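A possible follow-up, not part of the original answer: a sketch deriving the Match/Notindf1/Notindf2 flags the question asks for, assuming df1 and df2 are the frames from the question and the merged frame is stored as what (the suffixed column names mirror the merge output shown above):

import numpy as np

what = df1.merge(df2, left_on='AccountNo',
                 right_on=df2['AccountNo'].str.replace('-', '.'),
                 suffixes=('_df1', '_df2'), how='outer')

for col in ['AccountNo', 'name', 'a/ctype']:
    left, right = what[col + '_df1'], what[col + '_df2']
    # a row missing on one side gets flagged; present on both sides counts as a match
    what[col + '_result'] = np.where(left.isna(), 'Notindf1',
                                     np.where(right.isna(), 'Notindf2', 'Match'))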

keep one column's data in pandas and show all NANs from other columns only

Goal: I'd like to still show who the person is so that I can display the NaNs associated with them and quickly find who is missing info.
Consider this dataset:
df:
Name           Phone         Address
John Doe       NAN           123 lane
Jenny Gump     222-222-2222  NAN
Larry Bean     NAN           561 road
Harry Smidlap  111-111-1111  555 highway
I'd like to clean the data up and show something like this (similar to an Excel view when filtering for blanks):
Then maybe populate the empty data with something that says "Data exists" or just leave it blank; I'm open to suggestions. I'd also drop the rows that have all data populated.
df:
Name        Phone  Address
John Doe    NAN
Jenny Gump         NAN
Larry Bean  NAN
I've tried:
df[df.isnull().any(axis=1)]
That works great, but I have a big data source and I see a lot of unnecessary info that already has data. I only care about seeing the person's name and what they're missing.
Anyone have any ideas?
Since you require the Name column to be intact, you can select all columns except Name and mask them, then create another data frame df2 which removes all the NaN values. After that you can drop the indexes of df2 from df, which gives you only the rows with NaN values, as follows:
df.mask((df.columns != 'Name') & (df.notnull()), "", inplace=True)
df2 = df.dropna()
df.drop(df2.index, inplace=True)
This should give you the following output.
Name        Phone  Address
John Doe    NAN
Jenny Gump         NAN
Larry Bean  NAN
Mask (replace values where the condition is true) any place where the value is not null with an empty string:
df.mask(df.notnull(), '')
This operates over both dimensions at once, passing a 2D set of true/false answers to the question "replace or not?". Where the answer is true, it sends the contents to /dev/null; where it is false, it allows them to remain.
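For reference, a self-contained sketch combining the two ideas above on the question's sample data; the where-based blanking is just one possible variant:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John Doe', 'Jenny Gump', 'Larry Bean', 'Harry Smidlap'],
    'Phone': [np.nan, '222-222-2222', np.nan, '111-111-1111'],
    'Address': ['123 lane', np.nan, '561 road', '555 highway'],
})

# keep only the rows that are missing something
incomplete = df[df.isnull().any(axis=1)].copy()

# blank out populated cells in every column except Name, leaving the NaNs visible
other = incomplete.columns != 'Name'
incomplete.loc[:, other] = incomplete.loc[:, other].where(incomplete.loc[:, other].isnull(), '')
print(incomplete)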

Check if pandas column contains text in another dataframe and replace values

I have two dfs, one for user names and another for real names. I'd like to know how I can check whether a real name appears inside a user name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545' since these names appear in my df2. I tried using str.contains, but with that I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the documentation for the loc method) and the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
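If df2 is large, a vectorized sketch using str.extract may also be worth considering. An assumption here: the real names contain no regex special characters (otherwise re.escape them first), and only the first match per user name is kept:

# build one alternation pattern from all real names and extract the first hit;
# rows with no match fall back to the original user name
pattern = '(' + '|'.join(df2['realName']) + ')'
df1['userName'] = df1['userName'].str.extract(pattern, expand=False).fillna(df1['userName'])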

Allocating specified strings into their own columns inside a dataframe

I am wondering if it is even possible to do what I want. Right now I am using
df.loc[df.T_LOSS_DESC.str.contains("slip", na=False)]
It locates the column T_LOSS_DESC and then, anywhere in that column where there is a specific word like "slip", it returns those rows. My first question: is there any way to put the results in their own column? Also, if so, is there any way to specify more than one possible keyword to look for? An example would be
df.loc[df.T_LOSS_DESC.str.contains("slip,Slip,Slipped", na=False)]
Is that viable, or can I only use one parameter?
what my dataframe looks like:
T_LOSS_DESC
1 Bob was running and Slipped
2 Jeff got burnt by the sun
3 James went for a walk
What I would like my dataframe to look like: if it finds matches inside the column I am looking at, I want it to put the matches in a different column.
So my final dataframe would look like such:
T_LOSS_DESC Slippery
1 Bob was running and Slipped Bob was running and Slipped
2 Jeff got burnt by the sun
3 James went for a walk
So, since only one of my strings matched the strings I was looking for in that column, it would bring that one match over into a new column called Slippery.
Thanks in advance.
IIUC:
In [95]: df['new'] = df.loc[df.T_LOSS_DESC.str.contains("slip|Slip|Slipped", na=False), 'T_LOSS_DESC']
In [96]: df
Out[96]:
T_LOSS_DESC new
0 Bob was running and Slipped Bob was running and Slipped
1 Jeff got burnt by the sun NaN
2 James went for a walk NaN
alternatively you can do it this way:
In [116]: df.loc[df.T_LOSS_DESC.str.contains("slip|Slip|Slipped", na=False), 'Slippery'] = df.T_LOSS_DESC
In [117]: df
Out[117]:
T_LOSS_DESC Slippery
0 Bob was running and Slipped Bob was running and Slipped
1 Jeff got burnt by the sun NaN
2 James went for a walk NaN
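A small variant, assuming any casing of "slip" should match: str.contains accepts case=False, which avoids spelling out each form in the pattern (and "slip" already matches "Slipped" as a substring):

df.loc[df.T_LOSS_DESC.str.contains('slip', case=False, na=False), 'Slippery'] = df.T_LOSS_DESC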

pandas merge dataframe with NaN (or "unknown") for missing values

I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
import pandas as pd
names = pd.DataFrame({'names': ['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position': ['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names': ['joe','mark','tim','frank'],
                     'classification': ['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe and add it to the names dataframe. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long; all of the rows that do not have supplemental info are dropped.
Ideally, I would have the values in those missing columns set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.
EDIT:
One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:
names = df({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where the position is "devsys". Finally, even though bill does not appear in the names dataframe, it shows up in the resulting dataframe. So I really need a way to look up a value in the other dataframe and, if something is found, tack on those columns.
In case you are still looking for an answer:
The "strange" things that you described are due to some minor errors in your code. The first (the appearance of "bobjames" and "devsys") is due to the fact that you are missing a comma between those two values in your source dataframes. The second is because pandas doesn't care about the name of your dataframe but about the names of your columns when merging (you have a dataframe called "names", but your column is also called "names"). Otherwise, the merge seems to be doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names': ['bob','frank','bob','bob','bob','james','tim','ricardo','mike','mark','joan','joe'],
                      'position': ['dev','dev','dev','dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names': ['joe','mark','tim','frank','joe','bill'],
                     'classification': ['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
names position classification
0 bob dev unknown
1 bob dev unknown
2 bob dev unknown
3 bob dev unknown
4 frank dev thief
5 james dev unknown
6 tim sys good
7 ricardo sys unknown
8 mike sys unknown
9 mark sup thief
10 joan sup unknown
11 joe sup thief
12 joe sup good
13 bill unknown thief
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
There is a section showing the types of merges that can be performed: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
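As a side note beyond the answers above, merge also accepts indicator=True, which adds a _merge column recording where each row came from; a one-line sketch on the same frames:

what = pd.merge(names, info, how='outer', indicator=True)
# _merge is 'left_only', 'right_only', or 'both' for each row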
For an outer or inner join, the join function can also be used. In the case above, suppose that names is the main table (all rows from this table must occur in the result). Then, to run a left outer join, use:
what = names.set_index('names').join(info.set_index('names'), how='left')
or, with the unknown fill applied:
what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")
The set_index calls create a temporary index column (the same in both tables). If the dataframes already contained such an index column, this step wouldn't be necessary. For example:
# define index when create dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')
what = names.join(info, how='left')
To perform other types of join, just change the how argument (left/right/inner/outer are allowed).
Think of it as an SQL join operation. You need a left-outer join[1].
names = pd.DataFrame({'names': ['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position': ['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names': ['joe','mark','tim','frank'],
                     'classification': ['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
... which is fine. All the NaN results are OK if you look at both tables.
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
