I have a large pandas DataFrame (10 million records), shown below (snapshot):
CID Address
100 22 park street springvale nsw2655
101 U111 28 james road, Vic 2755
102 22 park st. springvale, nsw-2655
103 29 Bino Avenue , Mac - 3990
104 Unit 111 28 James rd, Vic 2755
105 Unit 111 28 James rd, Victoria 2755
I want to self-join the dataframe with itself to get, for each CID (Customer ID), a list of CIDs that have the same or similar address, as a pandas dataframe.
I have tried using fuzzywuzzy, but it's taking a long time just to find the matches.
Expected Output:
CID Address
100 [102]
101 [104,105]
102 [100]
103
104 [101,105]
105 [101,104]
What is the best way to solve this?
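A minimal sketch of one common speed-up, assuming exact matches after normalization are acceptable (the helper names and abbreviation map below are illustrative, not from the question): canonicalize every address into a blocking key, then group on the key instead of fuzzy-comparing all pairs. Fuzzy scoring, if still needed, can then run within each small block only.
import re
import pandas as pd

df = pd.DataFrame({
    "CID": [100, 101, 102, 103, 104, 105],
    "Address": [
        "22 park street springvale nsw2655",
        "U111 28 james road, Vic 2755",
        "22 park st. springvale, nsw-2655",
        "29 Bino Avenue , Mac - 3990",
        "Unit 111 28 James rd, Vic 2755",
        "Unit 111 28 James rd, Victoria 2755",
    ],
})

# Illustrative abbreviation map; a real one would be much larger.
ABBREV = {"street": "st", "road": "rd", "avenue": "ave",
          "unit": "u", "victoria": "vic"}

def normalize(addr: str) -> str:
    s = re.sub(r"[^a-z0-9 ]", " ", addr.lower())   # drop punctuation
    s = re.sub(r"([a-z])(\d)", r"\1 \2", s)        # split 'u111', 'nsw2655'
    return " ".join(ABBREV.get(t, t) for t in s.split())

df["key"] = df["Address"].map(normalize)

# Every CID sharing a key, minus the row itself
key_to_cids = df.groupby("key")["CID"].agg(list)
df["Matches"] = [
    [c for c in key_to_cids[k] if c != cid]
    for cid, k in zip(df["CID"], df["key"])
]
print(df[["CID", "Matches"]])
On the sample rows this reproduces the expected output (100 matches 102, and 101, 104, 105 match each other). The grouping step is roughly linear in the number of rows, which is what makes it viable at 10 million records where an all-pairs fuzzywuzzy join is not.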
I am trying to extract from the URL string "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte..."
the entire Make name, i.e. "Mercedes-Benz",
BUT my pattern only returns the first letter, i.e. "M".
Please help me come up with the correct pattern to use on a pandas df.
Thank you.
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # went from 5K rows to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE PATTERNS - the make path segment (e.g. Mercedes-Benz, Audi, LEXUS) is what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter notebook. Please find the code below.
I created a dummy pandas DataFrame (data_df) with the URLs in a urls column.
I then created a pattern based on the structure of the string to be extracted. Note that the year must be written as 20[0-9]{2}, not as a character class like [20][0-9{2}]:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
I used the pattern to extract the required data from the URLs and saved it in another column of the same DataFrame:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
I hope this is helpful.
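A small self-contained demo of that pattern, using URLs taken from the question (this data_df is a fresh stand-in for the dummy frame described above):
import pandas as pd

data_df = pd.DataFrame({"urls": [
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm",
    "/new-inventory/index.htm",
]})
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
data_df["Car Maker"] = data_df["urls"].str.extract(pattern)
print(data_df["Car Maker"].tolist())  # ['Audi', 'Porsche', nan]
Rows that do not follow the /used/<make>/<year>... shape come back as NaN.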
Your pattern only returns one letter because its capturing group, ([A-Z]), spans exactly one character, and str.extract returns only what the group captures. I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion that starts with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
The code below will give you the model and VIN values (note the raw strings, and that the dot before htm must be escaped):
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
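In place of the screenshot, here is a runnable stand-in on one URL from the question (data_df here is a fresh dummy frame):
import pandas as pd

data_df = pd.DataFrame({"urls": [
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
]})
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
print(data_df[['Model', 'VIN']])
# Model is '2020-Audi-S8', VIN is 'c6df09610a0e09af26b5cff998e0f96e'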
I am trying to run a hypothesis test using an OLS model for tweet count across the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with one of the four groups in a single "group" column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name", "group", "tweet_count", "retweet_count", "favorite_count"])
final_df

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
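A hedged note, assuming final_df as printed above: C(group) already expands the single group column into one 0/1 indicator per level (Athlete, CEO, Celebrity, Politician), so the four-term formula would be redundant rather than impossible. If the goal is per-pair comparisons after the one-way ANOVA, Tukey's HSD is one standard follow-up:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# The indicators that C(group) builds are the same ones get_dummies produces:
print(pd.get_dummies(final_df["group"]).head())

# Pairwise comparisons of mean tweet_count between all group pairs:
print(pairwise_tukeyhsd(final_df["tweet_count"], final_df["group"]))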
This is my dataset.
Country Type Disaster Count
0 CHINA P REP Industrial Accident 415
1 CHINA P REP Transport Accident 231
2 CHINA P REP Flood 175
3 INDIA Transport Accident 425
4 INDIA Flood 206
5 INDIA Storm 121
6 UNITED STATES Storm 348
7 UNITED STATES Transport Accident 159
8 UNITED STATES Flood 92
9 PHILIPPINES Storm 249
10 PHILIPPINES Transport Accident 84
11 PHILIPPINES Flood 71
12 INDONESIA Transport Accident 136
13 INDONESIA Flood 110
14 INDONESIA Seismic Activity 77
I would like to make a triple bar chart with the labels based on the column 'Type' and the bars grouped based on the column 'Country'.
I have tried (with df as the pandas DataFrame):
df.groupby('Country').plot.bar()
but the result came out as multiple separate bar charts, one per group in the 'Country' column.
The expected output is a single chart with one cluster of bars per country and one bar per disaster type.
What code do I need to run in order to achieve this graph?
There are two ways:
df.set_index('Country').pivot(columns='Type').plot.bar()
df.set_index(['Country','Type']).plot.bar()
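A fuller sketch of the first approach on a slice of the question's data (matplotlib assumed to be installed):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Country": ["INDIA", "INDIA", "INDIA",
                "UNITED STATES", "UNITED STATES", "UNITED STATES"],
    "Type": ["Transport Accident", "Flood", "Storm",
             "Storm", "Transport Accident", "Flood"],
    "Count": [425, 206, 121, 348, 159, 92],
})

# One cluster of bars per country, one bar and legend entry per disaster type
ax = df.pivot(index="Country", columns="Type", values="Count").plot.bar(rot=0)
ax.set_ylabel("Count")
plt.show()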
I can do this in Excel, but I am looking for a way to do it in Python. Do you know of a way to do the following?
Initial
District_1 District_2 District_3
Food 69 47 65
Water 87 86 32
Shelter 63 84 27
Mapping
District_1 London
District_2 London
District_3 Boston
Desired
London Boston
Food 116 65
Water 173 32
Shelter 147 27
I'm going to assume your mapping is a dictionary:
mapping = {'District_1': 'London', 'District_2': 'London', 'District_3': 'Boston'}
Then use groupby with axis=1:
df.groupby(mapping, axis=1).sum()
Boston London
Food 65 116
Water 32 173
Shelter 27 147
When you pass a dictionary to groupby, its get method gets applied to the axis of choice (axis=0 by default) and the result defines the groups.
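One caveat to add (my note, not part of the original answer): axis=1 in groupby is deprecated in recent pandas (2.1+), so an equivalent there is to transpose, group on the index, and transpose back:
import pandas as pd

df = pd.DataFrame(
    {"District_1": [69, 87, 63],
     "District_2": [47, 86, 84],
     "District_3": [65, 32, 27]},
    index=["Food", "Water", "Shelter"],
)
mapping = {"District_1": "London", "District_2": "London",
           "District_3": "Boston"}

print(df.T.groupby(mapping).sum().T)
#          Boston  London
# Food         65     116
# Water        32     173
# Shelter      27     147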
I'm working on a dataset called gradedata.csv in Python pandas, where I've created a new binned column called 'status' that is 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is a listing of the first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working:
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of: a mapping, a function, a label, or a list of labels. Passing 'gender' and 'status' as two separate positional arguments does not do what you want; here you need a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
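A tiny runnable illustration with made-up rows (not the actual gradedata.csv):
import pandas as pd

df = pd.DataFrame({
    "gender":   ["female", "female", "male"],
    "status":   ["Pass", "Fail", "Pass"],
    "exercise": [3, 2, 4],
})
res = df.groupby(["gender", "status"])["exercise"].mean()
print(res.get(("female", "Pass")))  # 3.0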