How to split 2 columns with patterns? - python

I have a dataset (df) with 2 columns of arbitrary length, and I need to split it up based on the value.
BUS    CODE
150 H.S.London-lon3 11£150 H.S.London-lon3 16£150 H.S.London-lon3 120    GERI
400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15£400 Airport Luton-ptr5 17    24£JTR
005 Plaza-cata-md6 08£005 Plaza-cata-md6 012£005 Plaza-cata-md6 18    78£TDE
I've been trying to split it to look like this:
bus  directions     zone  time  code  name
150  H.S.London     lon3  11    NaN   GERI
400  Airport Luton  ptr5  12    24    JTR
005  Plaza-cata     md6   08    78    TDE
So far I have tried to split using regex patterns, but it isn't working, and I'm out of ideas for how to split it another way.
bus = r'(?P<bus>[\d]+) (?P<direction>[\w\W]+)-(?P<zone>[\w]+)'
code = r'(?P<code>[\S]+)£(?P<name>\d+)'
df.BUS.str.extract(bus).join(df.CODE.str.extract(code))
I was wondering if anyone had a good solution to this.

You can use .str.extract with a regex pattern containing named capturing groups:
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'
df['BUS'].str.extract(bus).join(df['CODE'].str.extract(code))
bus directions zone time code name
0 150 H.S.London lon3 11 NaN GERI
1 400 Airport Luton ptr5 12 24 JTR
2 005 Plaza-cata md6 08 78 TDE
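For anyone who wants to run this end to end, here is a minimal sketch. The DataFrame is rebuilt by hand from the sample rows in the question, so its exact construction is an assumption; the two patterns are the ones shown above.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumed layout)
df = pd.DataFrame({
    "BUS": [
        "150 H.S.London-lon3 11£150 H.S.London-lon3 16£150 H.S.London-lon3 120",
        "400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15£400 Airport Luton-ptr5 17",
        "005 Plaza-cata-md6 08£005 Plaza-cata-md6 012£005 Plaza-cata-md6 18",
    ],
    "CODE": ["GERI", "24£JTR", "78£TDE"],
})

# Optional leading digits become "code"; the first run of letters becomes "name"
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
# Lazy "directions" plus a hyphen-free "zone" lets "Plaza-cata-md6" split at the last hyphen
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'

result = df["BUS"].str.extract(bus).join(df["CODE"].str.extract(code))
print(result)
```

Note that only the first £-separated entry of each BUS value is captured, and every extracted column holds strings, so values such as 005 and 08 keep their leading zeros.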

You could use split.
For your CODE column:
new_cols = ['code', 'name']
df[new_cols] = df.CODE.str.split(pat='£', expand=True)
I'm sure you can find a way to do the same for your first column; if you end up with duplicates, remove them after splitting.
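One thing to watch with the split approach: a value such as GERI has no £, so split() puts it in the first column and leaves the second empty. The sketch below is one possible way to handle that; the sample DataFrame and the `parts` / `no_sep` names are assumptions made for illustration.

```python
import pandas as pd

# Sample CODE values taken from the question
df = pd.DataFrame({"CODE": ["GERI", "24£JTR", "78£TDE"]})

parts = df["CODE"].str.split("£", expand=True)
parts.columns = ["code", "name"]

# Rows without a '£' contain only a name, which split() leaves in the first column
no_sep = parts["name"].isna()
parts.loc[no_sep, "name"] = parts.loc[no_sep, "code"]
parts.loc[no_sep, "code"] = None  # no numeric code for these rows

df = df.join(parts)
print(df)
```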

Related

How to regex extract CAR MAKE from URL in pandas df column

I am trying to extract the entire make name, i.e. "Mercedes-Benz", from a URL string such as "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...".
But my pattern only returns a single letter, e.g. "B" for Mercedes-Benz.
Please help me come up with the correct pattern to use on a pandas DataFrame column.
Thank you.
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
`example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
`
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE PATTERNS - I bolded what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I tried this in a Jupyter notebook.
Please find the code and screenshots below.
I created a dummy pandas DataFrame (data_df); below is a screenshot of it.
I built a pattern based on the structure of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
I then used the pattern to extract the required data from the URLs and saved it in another column of the same DataFrame:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
Below is a screenshot of the output.
I hope this is helpful.
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
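Here is a small runnable sketch comparing the two patterns. The `pages` DataFrame and the `Make_v2` column are assumptions for illustration; the URLs come from the question, with the long query string shortened.

```python
import pandas as pd

# A handful of URLs from the question (query string shortened)
pages = pd.DataFrame({"Page": [
    "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland",
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
    "/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm",
    "/new-inventory/index.htm",
    "/",
]})

# Version 1: the path segment right before "/<four-digit year>"
pages["Make"] = pages["Page"].str.extract(r'([^/]+)/\d{4}\b', expand=False)

# Version 2: simply the second path segment, whatever it contains
pages["Make_v2"] = pages["Page"].str.extract(r'/[^/]+/([^/]+)', expand=False)

print(pages[["Make", "Make_v2"]])
```

On this sample the first version leaves non-vehicle pages as NaN, while the second also captures index.htm from /new-inventory/index.htm, so the first is probably the safer choice unless the column only ever holds /used/ URLs.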
The code below will give you the model and VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
Here is a screenshot of the output:

Identify records that has at least one unmatched given 3 datasets

I have 3 datasets and I would like to know which IDs are missing from at least one of them when comparing Dataset A, Dataset B and Dataset C. How could I achieve this in Python?
Dataset A
ID Salary
12 12,000
14 13,004
16 1,400
17 500
19 900
20 12,000
Dataset B
ID Name
13 John
12 James
15 Jacob
19 Michael
20 Seth
Dataset C
ID State
16 WA
17 WA
15 VC
19 NSW
20 WA
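Since you mentioned Python, I assume you are using pandas DataFrames. An ID is unmatched if it is missing from at least one dataset, and the union of the pairwise symmetric differences gives exactly those IDs.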
import pandas as pd
DatasetA = pd.DataFrame({"ID":[12,14,16,17,19,20],"Salary":[12000,13004,1400,500,900,12000]})
DatasetB = pd.DataFrame({"ID":[13,12,15,19,20],"Name":["John","James","Jacob","Michael","Seth"]})
DatasetC = pd.DataFrame({"ID":[16,17,15,19,20],"State":["WA","WA","VC","NSW","WA"]})
IDs_A = set(DatasetA["ID"])
IDs_B = set(DatasetB["ID"])
IDs_C = set(DatasetC["ID"])
AB = IDs_A.symmetric_difference(IDs_B)
BC = IDs_B.symmetric_difference(IDs_C)
AC = IDs_A.symmetric_difference(IDs_C)
result = AB.union(BC).union(AC)
print(result)
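If it also matters which dataset each ID is missing from, a presence table is one possible extension. This is only a sketch: the lowercase frame names, the `frames` dict, and the `in_A` / `in_B` / `in_C` column names are assumptions made for illustration.

```python
import pandas as pd

dataset_a = pd.DataFrame({"ID": [12, 14, 16, 17, 19, 20],
                          "Salary": [12000, 13004, 1400, 500, 900, 12000]})
dataset_b = pd.DataFrame({"ID": [13, 12, 15, 19, 20],
                          "Name": ["John", "James", "Jacob", "Michael", "Seth"]})
dataset_c = pd.DataFrame({"ID": [16, 17, 15, 19, 20],
                          "State": ["WA", "WA", "VC", "NSW", "WA"]})

frames = {"A": dataset_a, "B": dataset_b, "C": dataset_c}

# Every ID that appears in any dataset
all_ids = pd.Index(sorted(set().union(*(f["ID"] for f in frames.values()))), name="ID")

# One boolean column per dataset: is the ID present there?
presence = pd.DataFrame(
    {f"in_{name}": all_ids.isin(frame["ID"]) for name, frame in frames.items()},
    index=all_ids,
)

# Keep only the IDs that are missing from at least one dataset
unmatched = presence[~presence.all(axis=1)]
print(unmatched)
```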

Splitting DataFrame and maintaining DataFrame group integrity

To whom it may concern,
I have a very large dataframe (MasterDataFrame) that contains ~180K groups that I would like to split into 5 smaller DataFrames and process each smaller DataFrame separately. Does anyone know of any way that I could achieve this split into 5 smaller DataFrames without accidentally splitting/jeopardizing the integrity of any of the groups from the MasterDataFrame? In other words, I would like for the 5 smaller DataFrames to not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller groups into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney
I assume that by "groups" you mean a number of rows.
For that, .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
# iloc slices exclude the end position, so the next slice starts at 100000
df_1 = df.iloc[0:100000, :]
df_2 = df.iloc[100000:200000, :]
....
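If "groups" instead means rows sharing the same Employer, slicing by row position can cut a group in half. A minimal sketch of one alternative follows: number the groups and deal whole groups out to the parts. The sample data is taken from the question; `n_parts`, `part_id`, and `sub_frames` are names assumed for illustration, and with the real data n_parts would be 5.

```python
import pandas as pd

master = pd.DataFrame({
    "Name": ["Tom", "Nick", "Chris", "Darren", "Nate", "Harry", "Uriel", "Matt"],
    "Age": [12, 15, 18, 19, 43, 23, 24, 23],
    "Employer": ["Walmart", "Disney", "Walmart", "KMart", "ESPN", "Walmart", "KMart", "Disney"],
})

n_parts = 2  # the question asks for 5; 2 keeps this small example readable

# ngroup() gives every row the number of its Employer group (0, 1, 2, ...);
# taking it modulo n_parts sends each whole group to exactly one part
part_id = master.groupby("Employer", sort=False).ngroup() % n_parts

sub_frames = [part for _, part in master.groupby(part_id)]

for i, part in enumerate(sub_frames, start=1):
    print(f"SubDataset{i}")
    print(part)
```

This distributes groups round-robin, so parts hold a similar number of groups but not necessarily a similar number of rows.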

cleaning a column of strings in a pandas dataframe with str comprehension

I have a dataframe (df1) constructed from a survey in which participants entered their gender as a string and so there is a gender column that looks like:
id gender age
1 Male 19
2 F 22
3 male 20
4 Woman 32
5 female 26
6 Male 22
7 make 24
etc.
I've been using
df1.replace('male', 'Male')
for example, but this is really clunky and involves knowing the exact format of each response to fix it.
I've been trying to use various string comprehensions and string operations in Pandas, such as .split(), .replace(), and .capitalize(), with np.where() to try to get:
id gender age
1 Male 19
2 Female 22
3 Male 20
4 Female 32
5 Female 26
6 Male 22
7 Male 24
I'm sure there must be a way to use regex to do this but I can't seem to get the code right.
I know that it is probably a multi-step process of removing " ", then capitalising the entry, then replacing the capitalised values.
Any guidance would be much appreciated pythonistas!
Kev
Adapt the code in my comment to replace every record that starts with an F with the word Female:
import re
df1["gender"] = df1.gender.apply(lambda s: re.sub(
r"^F[A-Za-z]*", # pattern: an entry starting with F
"Female", # replacement
s.strip().title()) # string, trimmed and title-cased first
)
Do the same for Male: use M instead of F in the pattern and Male as the replacement.
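As a rough sketch of the whole clean-up using pandas string methods, the snippet below normalises the sample column from the question. It rests on an assumption that is not in the original answer: entries starting with f or w are treated as Female and entries starting with m as Male, which handles Woman and the typo make but could misclassify other free-text answers.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "gender": ["Male", "F", "male", "Woman", "female", "Male", "make"],
    "age": [19, 22, 20, 32, 26, 22, 24],
})

# Trim whitespace and lower-case first so one pattern covers every spelling
gender = df1["gender"].str.strip().str.lower()

# Assumed heuristic: f... or w... means Female, m... means Male
gender = gender.str.replace(r"^(?:f|w).*$", "Female", regex=True)
gender = gender.str.replace(r"^m.*$", "Male", regex=True)

df1["gender"] = gender
print(df1)
```

Anything that matches neither pattern is left as its lower-cased original, which makes leftover odd values easy to spot with value_counts().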

Find non-matching pairs in 2 dataframes and make new missing dataframe Python

I have two uneven dataframes that have all the same variables except for a pair of ID values that vary from one to the other.
For example one of the dataframes, df1, looks like this:
Name Name ID State Gen ID Unit ID
Nikki 9 AZ 1-1 1
Nikki 9 AZ 1-2 2
Nikki 9 AZ 1-3 3
Mondip 101 NY 1A 1A
Mondip 101 NY 1B 1B
James 11 CA 12-1 12
James 11 CA 13-1 13
Sandra 88 NJ 1 1
.
.
.
The other dataframe df2 looks like this:
Name Name ID State Unit ID
Monte 97 PA 4-1
Monte 97 PA 4-2
Nikki Ltd 9 AZ 1
Nikki Ltd 9 AZ 2
Mondip 101 NY 1A
Mondip 101 NY 1B
James 11 CA 12-1
James 11 CA 13-1
.
.
.
As you can see the Gen ID column and the Unit ID column are somehow connected. Sometimes the Unit ID in df2 can be either the Gen ID or the Unit ID in df1.
What I want to do is create a new dataframe or list of each set of Name, Name ID, and State that appears in only one of df1 and df2. Sometimes the names match only approximately (e.g. Nikki and Nikki Ltd), so I need to handle this using the Name ID.
For example the new dataframe output df_missing would be:
Name Name ID State Gen ID Unit ID
Monte 97 PA 4-1
Monte 97 PA 4-2
Sandra 88 NJ 1 1
Is there an easy way to do this?
If we assume that you can identify names that are close enough, then the first step would be to replace instances of 'Nikki Ltd' with 'Nikki'. Once you do that, it's a simple matter to identify the names that are not common to both dataframes:
import pandas as pd
merged_df = pd.concat([df1, df2])
s1 = set(df1['Name'].unique())
s2 = set(df2['Name'].unique())
# read as: everyone in s1 that's not in s2, plus everyone in s2 that's not in s1
mutually_distinct_names = list((s1 - s2).union(s2 - s1))
missing_df = merged_df[merged_df['Name'].isin(mutually_distinct_names)]
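Since the question suggests matching on Name ID rather than on the name text, the same idea can be sketched with that column, which avoids having to clean up 'Nikki Ltd' first. The two DataFrames below are rebuilt by hand from the sample rows in the question, and `only_in_one` and `df_missing` are names assumed for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Nikki", "Nikki", "Nikki", "Mondip", "Mondip", "James", "James", "Sandra"],
    "Name ID": [9, 9, 9, 101, 101, 11, 11, 88],
    "State": ["AZ", "AZ", "AZ", "NY", "NY", "CA", "CA", "NJ"],
    "Gen ID": ["1-1", "1-2", "1-3", "1A", "1B", "12-1", "13-1", "1"],
    "Unit ID": ["1", "2", "3", "1A", "1B", "12", "13", "1"],
})
df2 = pd.DataFrame({
    "Name": ["Monte", "Monte", "Nikki Ltd", "Nikki Ltd", "Mondip", "Mondip", "James", "James"],
    "Name ID": [97, 97, 9, 9, 101, 101, 11, 11],
    "State": ["PA", "PA", "AZ", "AZ", "NY", "NY", "CA", "CA"],
    "Unit ID": ["4-1", "4-2", "1", "2", "1A", "1B", "12-1", "13-1"],
})

# Name IDs that occur in exactly one of the two dataframes
only_in_one = set(df1["Name ID"]) ^ set(df2["Name ID"])

# Stack both frames and keep the rows for those IDs; df2 rows get NaN for Gen ID
merged_df = pd.concat([df1, df2], ignore_index=True)
df_missing = merged_df[merged_df["Name ID"].isin(only_in_one)]
print(df_missing)
```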
