How to merge two datasets by specific column in pandas - python

I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18-dataset.
My problem is the name-column in these two datasets are using different format.
For example: "lionel messi" in one dataset and in the other it is "L. Messi"
I would like to convert "L. Messi" to the lowercase version "lionel messi" for all rows in the dataset.
What would be the most intelligent way to go about this?

One simple way is to convert the names in both dataframes into a common format so they can be matched.* Let's assume that in df1 names are in the L. Messi format and in df2 names are in the lionel messi format. What would a common format look like? You have several choices, but one option would be all lowercase, with just the first initial followed by a period: l. messi.
df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})
df1.names = df1.names.str.lower()
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])
df = df1.merge(df2, left_on='names', right_on='names')
*Note: This approach is totally dependent on the names being "matchable" in this way. There are plenty of cases that could cause this simple approach to fail. If a team has two members, Abby Wambach and Aaron Wambach, they'll both look like a. wambach. If one dataframe tries to differentiate them by using other initials in their name, like m.a. wambach and a.k. wambach, the naive matching will fail. How you handle this depends on the size of your data - maybe you can try to match most players this way, see who gets dropped, and write custom code for them.
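If you go that route, one way to see which players fail to match (a minimal sketch reusing the toy df1/df2 from above) is an outer merge with indicator=True and a look at the rows that didn't land in both frames:
import pandas as pd

df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})

# normalize both frames to the "l. messi" format, as above
df1.names = df1.names.str.lower()
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])

# indicator=True adds a _merge column telling you where each row came from
merged = df1.merge(df2, on='names', how='outer', indicator=True)

# rows tagged left_only/right_only failed to match and may need custom handling
# (empty here, since the toy names do match)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)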

Related

How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?

I have a bunch of keywords stored in a 620x2 pandas dataframe seen below. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the conditional data frame. You can use subset_df=df[df['column_name'].str.contains('string')] if you have only one condition.
Do the column split or any other processing before you make the filters, or run the filters again after processing.
Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
df.apply(lambda col: col.apply(lambda kw_set: all(kw in kw_set for kw in keyword_list))).sum().sum()
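A small self-contained run of this idea (the column names match the ones above; the sample rows and keywords are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'Author Keywords': ['computation theory; critical infrastructure', 'data mining', None],
    'Index Keywords': ['critical infrastructure; computation theory; security', 'machine learning', 'databases'],
})
keyword_list = ['computation theory', 'critical infrastructure']

# turn each semicolon-separated entry into a set of keywords
for col in ['Author Keywords', 'Index Keywords']:
    df[col] = df[col].fillna('').str.split(r';\s*').apply(set)

# count how many of the resulting sets contain every keyword in keyword_list
count = df.apply(lambda column: column.apply(lambda kw_set: all(kw in kw_set for kw in keyword_list))).sum().sum()
print(count)  # 2 for this toy data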

InvalidIndexError when using map() on a dataframe column

I'm getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects when I attempt to concatenate two dataframes. I believe the issue is present somewhere in this code where I map a column to another column.
mapping_g = {
    'Hospice': ['ALLCARE', 'CARING EDGE MINOT', 'CARING EDGE HERMANTOWN', 'CARING EDGE BISMARK',
                'BLUEBIRD HOSPICE', 'DOCTORS HOSPICE', 'FIRST CHOICE HOSPICE', 'KEYSTONE HOSPICE',
                'JOURNEY\'S HOSPICE', 'LIGHTHOUSE HOSPICE', 'SALMON VALLEY HOSPICE'],
    'Group': ['ACH1507', 'CE11507', 'CE21507', 'CE51507', 'BBH1507', 'DOC1507', 'FCH1507',
              'KEY1507', 'JOU1507', 'LHH1507', 'SVH1507']
}
g_mapping_df = pd.DataFrame(data=mapping_g)
g = dict(zip(g_mapping_df.Group, g_mapping_df.Hospice))
raw_pbm_data['Name Of Hospice'] = raw_pbm_data['GroupID'].map(g)
combined_data = pd.concat([raw_direct_data,raw_pbm_data], axis=0, ignore_index=True)
I think it's something to do with when I place the GroupID column into the Name of Hospice column in the second to last line.
Realized that raw_pbm_data['Name Of Hospice'] = raw_pbm_data['GroupID'].map(g)
should have actually been raw_pbm_data['HospiceName'] = raw_pbm_data['GroupID'].map(g)
I'm working with a bunch of different Excel files whose column names are all slightly different while being similar. I can't manually rename the Excel files because they're in the format my employer wants, so I'm juggling the information in dataframes instead. The change above solved my issue: the original line was appending a new column onto the dataframe instead of replacing an existing column, which I believe explains why the concatenation was breaking.
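For anyone hitting the same error: one common way to end up with pandas.errors.InvalidIndexError during concat is a frame that ends up with duplicate column labels, as can happen when a renamed column is appended next to an existing one. A minimal sketch (the column names here are made up, not the ones from the original files):
import pandas as pd

a = pd.DataFrame([[1, 2]], columns=['HospiceName', 'HospiceName'])  # duplicate label
b = pd.DataFrame([[3, 4]], columns=['HospiceName', 'GroupID'])

# aligning the differing columns requires reindexing, which fails on the duplicate label:
# pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
pd.concat([a, b], axis=0, ignore_index=True)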

How to merge/concatenate two dataframes whose key columns only partially string-match?

I used Chicago crime data for my analysis, but there is no community name given, so I collected community names in Chicago from an online source. However, the Redfin real estate data is collected by Region/neighborhood instead of community name. When I tried to merge the preprocessed Chicago crime data with the Redfin real estate data, the merge failed because the Region name in the Redfin data only partially matches the community name in the Chicago crime data. I tried regex to do partial matching first, then to merge the two dataframes by year and community name.
Is there any solution for merging two dataframes whose key columns only partially match? Can anyone point me in the right direction? Thanks
Preprocessed data:
Here is a public gist for viewing the data that I used:
example data snippet on public gist
my attempt
pd.merge(chicago_crime, redfin, left_on='community_name', right_on='Region')
but this gives me a lot of NaN values, which means the merge above is not correct. What should I do? Any idea how to make this right? Thanks
This is my approach. The first step is to apply split() to break the key column into individual words in both dataframes.
chicago_crime['community_name'] = [cn.split() for cn in chicago_crime['community_name']]
redfin['Region'] = [rg.split() for rg in redfin['Region']]
Then I compare each element of the resulting lists in chicago_crime with the corresponding list in redfin, and store the matched element in a new column named merge_ref in both dataframes.
idx, datavalue = [], []
for i, dv in enumerate(chicago_crime['community_name']):
    for d in dv:
        if d in redfin['Region'][i]:
            if i not in idx:
                idx.append(i)
                datavalue.append(d)
chicago_crime['merge_ref'] = datavalue
redfin['merge_ref'] = datavalue
Finally, merge both dataframes on merge_ref:
df_merge = pd.merge(chicago_crime[['community_area','community_name','merge_ref']], redfin, on='merge_ref')
However, since the values in merge_ref are not unique in either dataframe, the number of rows might increase. But at least it gives you a hint.
Updated
Using your mapping solution:
### mapping neighborhood to community name
code_pairs_neighborhoods = [[p[0], p[1]] for p in [pair.strip().split('\t') for pair in neighborhood_Map.strip().split('\n')]]
neighborhood_name_dic = {k[0]:k[1] for k in code_pairs_neighborhoods} #neighborhood -> community area
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
redfin['neighborhood'] = redfin['Region'].map(neighborhood_name_dic)
df_merge = pd.merge(chicago_crime, redfin, on='neighborhood')
print(df_merge)
From a quick look at the two datasets, it appears that chicago.Region is of the form Chicago, IL - region_name while redfin.community_name is just region_name. So I tried:
areas = ['Chicago, IL - ' + s for s in redfin.community_name.unique()]
# check if areas in the chicago.Region
a = [s in chicago.Region.unique() for s in areas]
sum(a), len(a)
# 63, 77
which matches 63 of 77 areas in redfin.community_name.unique(). If that's good enough, you can do:
pd.merge(redfin, chicago,
         left_on='Chicago, IL - ' + redfin.community_name,
         right_on='Region')
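An equivalent way to write this, under the same assumption about the 'Chicago, IL - ' prefix, is to strip the prefix into a normalized column and merge on that:
# strip the assumed prefix so both frames share a plain region name
chicago['region_name'] = chicago['Region'].str.replace('Chicago, IL - ', '', regex=False)

merged = pd.merge(redfin, chicago,
                  left_on='community_name',
                  right_on='region_name')
print(merged.shape)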

Python Data Analysis from SQL Query

I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using Python 2.7.14 (Anaconda) with cx_Oracle to query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number, may contain multiple), Account Flags (Flag strings, may contain multiple). (3 columns total)
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but If someone could just point me in the right direction so I can research or try some tutorials that would be very helpful. Thank you!
First you need to structure this data to make a good analysis; you can do it in your database engine or in Python (I will do it this way, using pandas as SNygard suggested).
First, I create some fake data (based on the example you provided):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This returns a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure so that we can apply group-by operations on department, relationship or flags.
We will convert the relationships and flags columns from strings to Python lists of strings. So the flags column will hold a list of flags, and the relationships column will hold a list of relations.
df['relationships'] = df['relationships'].str.replace(r'[()]', '', regex=True)
df['relationships'] = df['relationships'].str.split(r',\s*')
df['flags'] = df['flags'].str.replace(r'[()]', '', regex=True)
df['flags'] = df['flags'].str.split(r',\s*')
df
The result is:
dataframe_1
With our relationships column converted to lists, we can create a new dataframe with as many columns as the longest of those lists.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)
After that we need to stack our columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this point we have all of our relations in rows, but we still can't group by relation_type. So we will split the relations data into two columns, relation_type and department, using the : separator.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure is still not very useful. So we will convert the flag lists to columns and then stack them.
flags = pd.DataFrame(df['flags'].values.tolist(), index=rel.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is dataframe_4_clear_flags
Voilà! It's all ready to analyze.
So, for example, how many relations of each type we have, and which one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
All code: Github project
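As a side note, if a newer pandas is available (Series.explode was added in 0.25, which requires Python 3), a more compact sketch of the same counts, assuming the parsed df from above (lists in the relationships and flags columns):
# one row per 'relation_type:department' string
rel_long = df['relationships'].explode()
relation_type = rel_long.str.split(':').str[0]
department = rel_long.str.split(':').str[1]

print(relation_type.value_counts())          # how many with relationship 135, 212, 198, ...
print(department.value_counts())             # how many in department 2345678, 4354670, ...
print(df['flags'].explode().value_counts())  # how many with Flag1, Flag2, ...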
If you're willing to consider other packages, take a look at pandas, which is built on top of numpy. You can read SQL queries directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
0 1 2
0 12346 (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
1 12345 (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] == 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]

What's a better way to clean data before a merge?

I have two different data frames that I need to merge, and the merge column ('title') needs to be cleaned up before the merge can happen. A sample of the data looks like the following:
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
As expected, the merge will not succeed on the first title. I have been using the replace() method, but it gets unmanageable very quickly because I have hundreds of titles to correct due to spelling, case sensitivity, etc.
Any other suggestions regarding how to best cleanup and merge the data?
Full Example:
import pandas as pd
import numpy as np
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
data2['title'] = data2['title'].replace(to_replace=r"\s\(.*\)", value='', regex=True)
replacements = {
    'title': {
        r'a.b. company *.*': 'ab company'
    }
}
data1.replace(replacements, regex=True, inplace=True)
pd.merge(data1, data2, on='title')
First things first, there is no perfect solution for this problem, but I suggest doing two things:
Do any easy cleaning you can do before hand, including removing any characters you don't expect.
Apply some fuzzy matching logic
You'll see this isn't perfect, since even this example doesn't work 100%.
First, let's start by making your example a tiny bit more complicated, introducing a regular typo (coampany_b instead of company_b, something that won't get picked up by the easy cleaning below)
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9', 'csdfsjkbku'], 'title': ['a.b. company','company_b', 'coampany_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
Then let's assume you only expect [a-z] characters, as @Maarten Fabré mentioned. So let's lowercase everything and remove anything else.
data1['cleaned_title'] = data1['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')
data2['cleaned_title'] = data2['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')
Now, let's use difflib's get_close_matches (read more and other options here)
import difflib
data1['closestmatch'] = data1.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data2.cleaned_title)[0])
data2['closestmatch'] = data2.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data1.cleaned_title)[0])
Here is the resulting data1, looking good!
id title cleaned_title closestmatch
0 a12bcde0 a.b. company abcompany abcompany
1 b20bcde9 company_b companyb companyb
2 csdfsjkbku coampany_b coampanyb companyb
Now, here is data2, looking a bit less good... We asked it to find the closest match, so it found one for company_f, while you clearly don't want it.
serial_number title cleaned_title closestmatch
0 01a2b345 ab company abcompany abcompany
1 10ab2030 company_b (123) companyb companyb
2 40ab4060 company_f companyf companyb
The ideal scenario is if you have a clean list of company titles on the side, in which case you should find the closest match based on that. If you don't, you'll have to get creative or manually clean up the hits and misses.
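If you do have such a reference list, a minimal sketch of that variant (canonical_titles below is a made-up stand-in for your clean list) would map the cleaned_title columns built above onto it and merge on the result:
import difflib
import pandas as pd

canonical_titles = ['ab company', 'company b', 'company f']  # hypothetical clean reference list

def closest_canonical(name):
    # return the best canonical title, or None when nothing clears difflib's default cutoff
    matches = difflib.get_close_matches(name, canonical_titles, n=1)
    return matches[0] if matches else None

data1['canonical'] = data1['cleaned_title'].apply(closest_canonical)
data2['canonical'] = data2['cleaned_title'].apply(closest_canonical)
merged = pd.merge(data1, data2, on='canonical', suffixes=('_1', '_2'))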
To wrap this up, you can now perform a regular merge on 'closestmatch'.
You could try to make a simplified_name column in each of the two dataframes by setting all characters to lowercase and removing all the non-[a-z ] characters, and join on this column if this doesn't lead to collisions.
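A minimal sketch of that suggestion, using the data1/data2 frames from the question (here spaces are dropped too, a slight variation that also lets 'a.b. company' meet 'ab company'):
import pandas as pd

data1 = pd.DataFrame({'id': ['a12bcde0', 'b20bcde9'], 'title': ['a.b. company', 'company_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345', '10ab2030', '40ab4060'],
                      'title': ['ab company', 'company_b (123)', 'company_f']})

# lowercase and keep only [a-z] so punctuation, digits, underscores and spaces don't block the join
for frame in (data1, data2):
    frame['simplified_name'] = frame['title'].str.lower().str.replace(r'[^a-z]', '', regex=True)

merged = pd.merge(data1, data2, on='simplified_name', suffixes=('_1', '_2'))
print(merged)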
