Cross-checking dataframes in Python

I am working on a Pandas issue.
Currently in df1:
start   Stop
NYPenn  WUnion
GCTerm  30thSt
TUStat  LAUnio
JaStat  MillSt
ChiUnS  MonCen
OGTran  SouthS
Currently in df2 (Prime):
Train_Code  City
NYPenn      New York City
WUnion      D.C.
GCTerm      New York City
30thSt      Philadelphia
TUStat      Toronto
LAUnio      Los Angeles
MonCen      Montreal
OGTran      Chicago
SouthS      Boston
I want to use the train codes to determine which start/stop stations in df1 are prime stations. I need to check each element of both columns in df1 against df2's Train_Code column and output the results, indicating which station is prime (or whether both are), into another dataframe (df3).
df3 should end up being:
start   Stop    Results  City           Results  City
NYPenn  WUnion  Yes      New York City  Yes      D.C.
GCTerm  30thSt
TUStat  LAUnio
JaStat  MillSt  NO       NaN            NO       NaN
ChiUnS  MonCen  NO       NaN            Yes      Montreal
OGTran  SouthS
**Note: I didn't fill in df3 all the way, but the filled rows show examples of how it should look.
[If I added another column indicating a layover station, the code should also run against that layover column.]

This will get you close:
import numpy as np

# stack start/Stop into a single column, map in the cities, flag matches, unstack
df1s = df1.stack().rename('Train_Code').to_frame()
df1s['City'] = df1s['Train_Code'].map(df2.set_index('Train_Code')['City'])
df1s['Results'] = np.where(df1s['City'].notna(), 'Yes', 'NO')
df1s.unstack()
Output:
Train_Code City Results
start Stop start Stop start Stop
0 NYPenn WUnion New York City D.C. Yes Yes
1 GCTerm 30thSt New York City Philadelphia Yes Yes
2 TUStat LAUnio Toronto Los Angeles Yes Yes
3 JaStat MillSt NaN NaN NO NO
4 ChiUnS MonCen NaN Montreal NO Yes
5 OGTran SouthS Chicago Boston Yes Yes
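An alternative sketch, assuming the df1/df2 contents shown in the question, that flags prime stations with isin instead of stack/unstack (the _Results/_City column names are illustrative, and an extra layover column would just be another entry in the loop):

```python
import pandas as pd

df1 = pd.DataFrame({
    'start': ['NYPenn', 'GCTerm', 'TUStat', 'JaStat', 'ChiUnS', 'OGTran'],
    'Stop':  ['WUnion', '30thSt', 'LAUnio', 'MillSt', 'MonCen', 'SouthS'],
})
df2 = pd.DataFrame({
    'Train_Code': ['NYPenn', 'WUnion', 'GCTerm', '30thSt', 'TUStat',
                   'LAUnio', 'MonCen', 'OGTran', 'SouthS'],
    'City': ['New York City', 'D.C.', 'New York City', 'Philadelphia',
             'Toronto', 'Los Angeles', 'Montreal', 'Chicago', 'Boston'],
})

city = df2.set_index('Train_Code')['City']
df3 = df1.copy()
# one Results/City pair per station column; a layover column would be
# handled by adding it to this list
for col in ['start', 'Stop']:
    df3[f'{col}_Results'] = df3[col].isin(city.index).map({True: 'Yes', False: 'NO'})
    df3[f'{col}_City'] = df3[col].map(city)
```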

Related

How can I compare one column of a dataframe to multiple other columns using SequenceMatcher?

I have a dataframe with 6 columns; the first two are an id and a name column, and the remaining 4 are potential matches for the name column.
id name match1 match2 match3 match4
1 NXP Semiconductors NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN
4 Miami-Dade County County of Will County of Orange NaN NaN
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN
I'd like to use SequenceMatcher to compare the name column with each match column where there is a value, and return the match value with the highest ratio (the closest match) in a new column at the end of the dataframe.
So the output would be something like this:
id name match1 match2 match3 match4 best match
1 NXP Semiconductors NaN NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital Cincinnati Children's Hospital Medical Center
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN The State Board of Administration of Florida
4 Miami-Dade County County of Will County of Orange NaN NaN County of Orange
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia California Teacher's Association
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN Waste Management
I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:
import difflib as diff
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
However, I'm not sure how to loop over multiple columns in the same row. I also thought about reformatting my data so that the method above would work, something like this:
name match
name1 match1
name1 match2
name1 match3
However, I was running into issues dealing with the NaN values. I'm open to suggestions on the best route to accomplish this.
I ended up solving this using the second idea of reformatting the table. Using the melt function I was able to get a two column table of the name field with each possible match. From there I used the original lambda function to compare the two columns and output a ratio. From there it was relatively easy to go through and see the most likely matches, although it did require some manual effort.
import difflib as diff
import pandas as pd

df = pd.read_csv('output.csv')
# melt the four match columns into long form: one (name, candidate) pair per row
df1 = (df.melt(id_vars=['id', 'name'], var_name='match')
         .dropna()
         .drop(columns='match')
         .sort_values('name'))
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x.iloc[1].strip(), x.iloc[2].strip()).ratio(), axis=1)
df1.to_csv('comparison-output.csv', encoding='utf-8')
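The original wide layout can also be handled directly with a row-wise helper, avoiding the melt step. A minimal sketch, using two sample rows from the question (the best_match helper and its name are illustrative):

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['NXP Semiconductors',
             "Cincinnati Children's Hospital Medical Center"],
    'match1': [None, 'Montefiore Medical center'],
    'match2': [None, "Cincinnati Children's Hospital Medical Center"],
})
match_cols = ['match1', 'match2']

def best_match(row):
    # score each non-null candidate against `name`; return the top-ratio one
    candidates = row[match_cols].dropna()
    if candidates.empty:
        return float('nan')
    ratios = candidates.map(
        lambda m: difflib.SequenceMatcher(None, row['name'].strip(), m.strip()).ratio())
    return candidates[ratios.idxmax()]

df['best match'] = df.apply(best_match, axis=1)
```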

How to Split a column into two by comma delimiter, and put a value without comma in second column and not in first?

I have a column in a df that I want to split into two columns on a comma delimiter. If the value in that column does not have a comma, I want to put it into the second column instead of the first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
Location    Country
New York    USA
NaN         England
NaN         Russia
London      England
California  USA
NaN         USA
I used this code, but it always fills the columns in order, so comma-less values end up in the first column:
df[['Location', 'Country']] = df['Origin'].str.split(',', n=1, expand=True)
We can try using str.extract here:
df["Location"] = df["Origin"].str.extract(r'(.*),')
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$')
Here is a way by using str.extract() and named groups
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
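A non-regex sketch that gets the same result with str.split, assuming the sample Origin column above: split once on the comma, then route rows without a comma into Country and leave Location as NaN:

```python
import pandas as pd

df = pd.DataFrame({'Origin': ['New York, USA', 'England', 'Russia',
                              'London, England', 'California, USA', 'USA']})

# expand=True gives two columns; rows with no comma get None in parts[1]
parts = df['Origin'].str.split(',', n=1, expand=True)
# Country is the part after the comma, or the whole value when there is none
df['Country'] = parts[1].str.strip().fillna(parts[0])
# Location only exists when a comma was present
df['Location'] = parts[0].where(parts[1].notna())
```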

Merge is not working on two dataframes of multi level index

First DataFrame: housing. This DataFrame has a MultiIndex (State, RegionName) and relevant values in the other 3 columns.
State RegionName 2008q3 2009q2 Ratio
New York New York 499766.666667 465833.333333 1.072844
California Los Angeles 469500.000000 413900.000000 1.134332
Illinois Chicago 232000.000000 219700.000000 1.055985
Pennsylvania Philadelphia 116933.333333 116166.666667 1.006600
Arizona Phoenix 193766.666667 168233.333333 1.151773
Second DataFrame: list_of_university_towns. Contains the names of states and some regions, with a default numeric index.
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Arizona Phoenix
5 Illinois Chicago
Now the inner join of the two dataframes :
uniHousingData = pd.merge(list_of_university_towns,housing,how="inner",on=["State","RegionName"])
This gives no rows in the resulting uniHousingData dataframe, while it should contain the bottom two rows (index 4 and 5 of list_of_university_towns).
What am I doing wrong?
I found the issue: there was trailing whitespace in the RegionName column of the second dataframe. Using the str.strip() method to remove it made the merge work like a charm.
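A minimal sketch of the fix, with made-up sample frames reproducing the trailing-space problem (the Ratio values are taken from the question; the frames are otherwise illustrative):

```python
import pandas as pd

housing = pd.DataFrame({
    'State': ['Arizona', 'Illinois'],
    'RegionName': ['Phoenix', 'Chicago'],
    'Ratio': [1.151773, 1.055985],
}).set_index(['State', 'RegionName'])

# note the trailing spaces that silently break the join
towns = pd.DataFrame({'State': ['Arizona', 'Illinois'],
                      'RegionName': ['Phoenix ', 'Chicago ']})

# strip whitespace from the key column, then merge on the shared keys
towns['RegionName'] = towns['RegionName'].str.strip()
merged = pd.merge(towns, housing.reset_index(), how='inner',
                  on=['State', 'RegionName'])
```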

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using Pandas' read_html function. I was able to parse the table; however, the Capacity column came back as NaN, and I'm not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To strip anything inside square brackets, use (regex=True is required in recent pandas versions):
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
Pandas is only able to get the superscript (for whatever reason) rather than the actual value; if you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where there are footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since Pandas is picking up and therefore returning the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using a regex; below is my solution.
df4 = pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums', header=[0], flavor='bs4')
df4 = df4[0]
The change was to take out the "r" prefix presented by #anky_91 in lines 1 and 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object
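Once the footnotes are stripped, the column is still object dtype (strings with thousands separators). A sketch of the follow-on cleanup, using hypothetical Capacity values shaped like the output above:

```python
import pandas as pd

# hypothetical values as returned by read_html, with footnote markers
cap = pd.Series(['30,343[1]', '65000', '70,500[2]', '36,387[3]', '25000'],
                name='Capacity')

# drop the bracketed footnote, drop thousands separators, convert to int
clean = (cap.str.replace(r'\[.*?\]', '', regex=True)
            .str.replace(',', '')
            .astype(int))
```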

Pandas 'concat/upsert' dataframes

I am looking for an efficient way to select matching rows in two dataframes based on a shared key value, and upsert these into a third dataframe that compares them, mapping out the differences between their intersection.
**Example:**
DataFrame1
FirstName, City
Mark, London
Mary, Dallas
Abi, Madrid
Eve, Paris
Robin, New York
DataFrame2
FirstName, City
Mark, Berlin
Abi, Delhi
Eve, Paris
Mary, Dallas
Francis, Rome
In the dataframes, I have potential matching/overlapping on 'name', so the intersection on these is:
Mark, Mary, Abi, Eve
excluded from the join are:
Robin, Francis
I construct a dataframe that allows values from both to be compared:
DataFrameMatch
FirstName_1, FirstName_2, FirstName_Match, City_1, City_2, City_Match
And insert/update (upsert) so my output is:
DataFrameMatch
FirstName_1 FirstName_2 FirstName_Match City_1 City_2 City_Match
Mark Mark True London Berlin False
Abi Abi True Madrid Delhi False
Mary Mary True Dallas Dallas True
Eve Eve True Paris Paris True
I can then report on the difference between the two lists, and what particular fields are different.
merge
According to your output, you only want rows where 'FirstName' matches, plus another column that evaluates whether the cities match.
d1.merge(d2, on='FirstName', suffixes=['_1', '_2']).eval('City_Match = City_1 == City_2')
FirstName City_1 City_2 City_Match
0 Mark London Berlin False
1 Mary Dallas Dallas True
2 Abi Madrid Delhi False
3 Eve Paris Paris True
Details
You could do a simple merge, but you would end up with
FirstName City
0 Mary Dallas
1 Eve Paris
because merge joins on all common columns by default. So I had to restrict the join keys via the on argument, hence on='FirstName':
d1.merge(d2, on='FirstName')
FirstName City_x City_y
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Which gets us closer but now I want to adjust those suffixes.
d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
FirstName City_1 City_2
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Lastly, I add a new column that evaluates whether 'City_1' equals 'City_2'. I chose pandas.DataFrame.eval for this; you can see the result above.
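The whole pipeline can be sketched end to end with the sample frames from the question; this mirrors the answer, using assign instead of eval for the comparison column:

```python
import pandas as pd

d1 = pd.DataFrame({'FirstName': ['Mark', 'Mary', 'Abi', 'Eve', 'Robin'],
                   'City': ['London', 'Dallas', 'Madrid', 'Paris', 'New York']})
d2 = pd.DataFrame({'FirstName': ['Mark', 'Abi', 'Eve', 'Mary', 'Francis'],
                   'City': ['Berlin', 'Delhi', 'Paris', 'Dallas', 'Rome']})

# inner merge on the shared key drops Robin/Francis, then flag city differences
out = (d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
         .assign(City_Match=lambda d: d['City_1'] == d['City_2']))
```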
