I am trying to extract the entire Make name, i.e. "Mercedes-Benz", from URL strings such as "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...",
BUT my pattern only returns a single letter (see the Make column below).
Please help me come up with the correct pattern to use on a pandas DataFrame.
Thank you
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE URLs - the make (the path segment after /used/) is what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter notebook.
PFB the code and screenshots:
I created a dummy pandas DataFrame (data_df); below is a screenshot of it.
I created a pattern based on the structure of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
Used the pattern to extract the required data from the URLs and saved it in another column of the same DataFrame:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
Below is a screenshot of the output
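Since the screenshots don't carry over here, a minimal runnable sketch of the same steps, using a few throwaway URLs in place of the data_df from the screenshot:

import pandas as pd

# Dummy frame standing in for the data_df shown in the screenshot
data_df = pd.DataFrame({
    "urls": [
        "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm",
        "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
        "/new-inventory/index.htm",
    ]
})

pattern = r"^/used/(.*)/(?=20[0-9]{2})"
data_df["Car Maker"] = data_df["urls"].str.extract(pattern, expand=False)
print(data_df["Car Maker"].tolist())  # ['Mercedes-Benz', 'Audi', nan]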
I hope this is helpful.
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
The code below will give you the model & VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
Here is a screenshot of the output:
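The screenshot isn't reproduced here, so as a rough sketch, this is what those two patterns pull out of the example URL from the question:

import re

example_url = ("/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-"
               "9f3d32130a0e09af63592c3c48ac5c24.htm")

pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'

print(re.search(pattern2, example_url).group(1))  # 2021-Mercedes-Benz-Sprinter+2500
print(re.search(pattern3, example_url).group(1))  # 9f3d32130a0e09af63592c3c48ac5c24

Note that pattern2 returns the whole year-make-model slug, and pattern3 returns the hash-like id at the end of the path (not a true 17-character VIN).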
Related
I have a large pandas dataframe (10 million records), shown below (snapshot):
CID Address
100 22 park street springvale nsw2655
101 U111 28 james road, Vic 2755
102 22 park st. springvale, nsw-2655
103 29 Bino Avenue , Mac - 3990
104 Unit 111 28 James rd, Vic 2755
105 Unit 111 28 James rd, Victoria 2755
I want to self-join the dataframe to get, for each CID (customer ID), a list of the CIDs that have the same/similar address, as a pandas dataframe.
I have tried using fuzzywuzzy, but it takes a long time just to find the matches.
Expected Output :
CID Address
100 [102]
101 [104,105]
102 [100]
103
104 [101,105]
105 [101,104]
What is the best way to solve this?
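One direction, sketched rather than a full answer: instead of fuzzy-scoring every pair, normalise each address into a canonical key and block on exact key matches. The sketch below reproduces the expected output on the sample rows, but the normalise() rules and the ABBREV map are assumptions you would have to extend for real data:

import re
import pandas as pd

df = pd.DataFrame({
    "CID": [100, 101, 102, 103, 104, 105],
    "Address": [
        "22 park street springvale nsw2655",
        "U111 28 james road, Vic 2755",
        "22 park st. springvale, nsw-2655",
        "29 Bino Avenue , Mac - 3990",
        "Unit 111 28 James rd, Vic 2755",
        "Unit 111 28 James rd, Victoria 2755",
    ],
})

# Hypothetical abbreviation map -- extend for your data
ABBREV = {"st": "street", "rd": "road", "ave": "avenue", "u": "unit", "vic": "victoria"}

def normalise(addr: str) -> str:
    addr = re.sub(r"[^a-z0-9 ]", " ", addr.lower())                 # drop punctuation
    addr = re.sub(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", " ", addr)  # split "nsw2655", "U111"
    return " ".join(ABBREV.get(tok, tok) for tok in addr.split())   # expand abbreviations

df["key"] = df["Address"].map(normalise)
groups = df.groupby("key")["CID"].apply(list)
df["matches"] = df.apply(lambda r: [c for c in groups[r["key"]] if c != r["CID"]], axis=1)
print(df[["CID", "matches"]])
# CID 100 -> [102], 101 -> [104, 105], 102 -> [100], 103 -> [], 104 -> [101, 105], 105 -> [101, 104]

On 10 million rows this reduces the problem to a single groupby; fuzzy matching (fuzzywuzzy/rapidfuzz) would then only be needed within small blocks (e.g. per postcode), not across every pair.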
So I have this df (table) coming from a PDF transformation, for example:
    ElementRow  ElementColumn  ElementPage  ElementText          X1    Y1    X2    Y2
1   50          0              1            Emergency Contacts   917   8793  2191  8878
2   51          0              1            Contact              1093  1320  1451  1388
3   51          2              1            Relationship         2444  1320  3026  1388
4   51          7              1            Work Phone           3329  1320  3898  1388
5   51          9              1            Home Phone           4260  1320  4857  1388
6   51          10             1            Cell Phone           5176  1320  5684  1388
7   51          12             1            Priority Phone       6143  1320  6495  1388
8   51          14             1            Contact Address      6542  1320  7300  1388
9   51          17             1            City                 7939  1320  7300  1388
10  51          18             1            State                8808  1320  8137  1388
11  51          21             1            Zip                  9134  1320  9294  1388
12  52          0              1            Silvia Smith         1093  1458  1973  1526
13  52          2              1            Mother               2444  1458  2783  1526
13  52          7              1            (123) 456-78910      5176  1458  4979  1526
14  52          10             1            Austin               7939  1458  8406  1526
15  52          15             1            Texas                8808  1458  8961  1526
16  52          20             1            76063                9134  1458  9421  1526
17  52          2              1            1234 Parkside Ct     6542  1458  9421  1526
18  53          0              1            Naomi Smith          1093  2350  1973  1526
19  53          2              1            Aunt                 2444  2350  2783  1526
20  53          7              1            (123) 456-78910      5176  2350  4979  1526
21  53          10             1            Austin               7939  2350  8406  1526
22  53          15             1            Texas                8808  2350  8961  1526
23  53          20             1            76063                9134  2350  9421  1526
24  53          2              1            3456 Parkside Ct     6542  2350  9421  1526
25  54          40             1            End Employee Line    6542  2350  9421  1526
25  55          0              1            Emergency Contacts   917   8793  2350  8878
I'm trying to separate each record into its own row, using the ElementRow column as a reference: keep the headers from the first rows and then iterate through the rows that follow. The X1 column indicates which header each value belongs under. I would like the data to end up like this:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
To take the rows between the two marker rows: I tried to slice using the first index and the last index, but it showed this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
That slicing does work with this numpy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when trying to substitute the indexes in, it showed this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this, but at some point the df reaches the end and I receive an index-out-of-bounds error.
emergencyContactRow1 = df[['ElementText','X1']].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText','X1']].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText','X1']].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText','X1']].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText','X1']].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText','X1']].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText','X1']].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText','X1']].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText','X1']].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText','X1']].iloc[emergStartIndex+10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, emergencyContactRow8, emergencyContactRow9, emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, avoid the index-out-of-bounds errors, and keep my headers, taking as a reference only the first rows after the Emergency Contacts row? I know I haven't tried to use the X1 column yet, but first I have to resolve how to iterate through those multiple indexes.
Each span from an Emergency Contacts index to an End Employee Line index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter variable to see how many times data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically you don't need the first or the last two rows, so if you get rid of those and then pivot the X1 and ElementText columns, you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to header.
df = df.iloc[1:-2][['ElementText','X1','ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
Split the dataframe into chunks whenever "Emergency Contacts" appears in column "ElementText"
Parse each chunk into the required format
Append to the output
import numpy as np
import pandas as pd

list_of_df = np.array_split(data, data[data["ElementText"]=="Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        output = pd.concat([output, temp], ignore_index=True)  # DataFrame.append() was removed in pandas 2.x; concat does the same job
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
I ran the below code in a Jupyter notebook. I was expecting the output to appear like an Excel table, but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()) (it produces slightly nicer output than without display()).
The print function renders any kind of object, such as a string or a computed value, as plain text, whereas display() will render the DataFrame as a formatted table in the notebook.
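A minimal sketch of the difference, assuming the same CSV file name; display() is already available inside a notebook cell, the explicit import only matters outside IPython:

import pandas as pd
from IPython.display import display

df = pd.read_csv("Robbery_2014_to_2019.csv")

display(df.head())   # rich HTML table instead of the wrapped plain-text output of print()

# Optional: widen the plain-text repr too, if you do want to keep using print()
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
print(df.head())

Also note that the last expression in a cell (a bare df.head()) is rendered the same way as display().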
I have got the following DF:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afrix 72
afr-ix 4
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at tokyo 1
at&t 5041
att 2
batelco 723
batelco 2
batelco (manual) 4
beeline 1702
beeline - 01 6
beeline - 02 6
I need to get a unique list of carrier_name, so I have done some basic housekeeping, as I only want to keep the names with no whitespace at the beginning or end of the observation, with the following code:
carrier = pd.DataFrame(data['sol_carrier'].value_counts(dropna=False))
carrier['carrier_name'] = carrier.index
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier['carrier_name'] = carrier['carrier_name'].str.replace('[^a-zA-Z]', ' ', regex=True)
carrier['carrier_name'] = np.where(carrier['carrier_name']==' ', np.nan, carrier['carrier_name'])
carrier['carrier_name'] = carrier['carrier_name'].str.strip()
carrier = carrier.reset_index(drop=True)
carrier = carrier[['carrier_name','sol_carrier']]
carrier.sort_values(by='carrier_name')
What happens here is that I get a list of carrier_name but still get some duplicate observations, like airtel or beeline for example. I don't understand why this is happening, as both observations are the same, there are no more whitespaces at the beginning or the end of the observation, and these observations are followed by their respective value_counts(), so there is no reason for them to be duplicated. Here is the same DF after the above code has been applied:
carrier_name sol_carrier
aapt 702
aapt carrier 185
afr ix 4
afrix 72
airtel 35
airtel 2
airtel dia and broadband 32
airtel mpls standard circuits 32
amt 6
anca test 1
appt 1
at t 5041
at tokyo 1
att 2
batelco 723
batelco 2
batelco manual 4
beeline 1702
beeline 6
beeline 6
That happens because you don't aggregate the results; you just change the values in the 'carrier_name' column.
To aggregate the results, call
carrier.groupby('carrier_name').sol_carrier.sum()
or modify the 'data' dataframe and then call
data['sol_carrier'].value_counts()
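Put together, a minimal sketch (assuming the raw frame is still called data with a sol_carrier column, as in the question): clean the names first, then count, so the duplicates collapse before they ever reach carrier_name:

import numpy as np
import pandas as pd

# Clean the raw names on the source frame first ...
name = (
    data["sol_carrier"]
    .str.strip()
    .str.replace("[^a-zA-Z]", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps left behind by stripped characters
    .str.strip()
    .replace("", np.nan)
)

# ... then aggregate, so 'airtel'/'airtel ' and 'beeline'/'beeline - 01' end up on a single row each
carrier = (
    name.value_counts(dropna=False)
        .rename_axis("carrier_name")
        .reset_index(name="sol_carrier")
)
print(carrier.sort_values("carrier_name"))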
I'm currently writing a piece of code to extract data from a list of observation sites (an example is given below). I have a list of regular expressions to remove any lines that don't contain the data I'm looking for. All of the regular expressions successfully match the lines containing metadata, except for the one searching for the date. When tested at regexr.com, the expression works just fine, but when running the code, I am unable to remove those lines. What am I missing to remove the lines containing dates?
Example of data
! CD = 2 letter state (province) abbreviation
! STATION = 16 character station long name
! ICAO = 4-character international id
! IATA = 3-character (FAA) id
! SYNOP = 5-digit international synoptic number
! LAT = Latitude (degrees minutes)
! LON = Longitude (degree minutes)
! ELEV = Station elevation (meters)
! M = METAR reporting station. Also Z=obsolete? site
! N = NEXRAD (WSR-88D) Radar site
! V = Aviation-specific flag (V=AIRMET/SIGMET end point, A=ARTCC T=TAF U=T+V)
! U = Upper air (rawinsonde=X) or Wind Profiler (W) site
! A = Auto (A=ASOS, W=AWOS, M=Meso, H=Human, G=Augmented) (H/G not yet impl.)
! C = Office type F=WFO/R=RFC/C=NCEP Center
! Digit that follows is a priority for plotting (0=highest)
! Country code (2-char) is last column
!
!2345678901234567890123456789012345678901234567890123456789012345678901234567890 1234567890
!
ALASKA 16-DEC-13
CD STATION ICAO IATA SYNOP LAT LONG ELEV M N V U A C
AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7 US
AK AKHIOK PAKH AKK 56 56N 154 11W 14 X 8 US
AK AMBLER PAFM AFM 67 06N 157 51W 88 X 7 US
AK ANAKTUVUK PASS PAKP AKP 68 08N 151 44W 642 X 7 US
AK ANCHORAGE INTL PANC ANC 70273 61 10N 150 01W 38 X T X A 5 US
AK ANCHORAGE/WFO PAFC AFC 61 10N 150 02W 48 F 8 US
AK ANCHORAG/NIKISKI PAHG AHG 60 44N 151 21W 74 X 8 US
AK ANCHORAGE/LAKE H PALH LHD 61 11N 149 58W 22 X A 7 US
AK ANCHORAGE/ARTCC PZAN ZAN 61 10N 149 59W 22 A 8 US
AK ANCHORAGE/MERRIL PAMR MRI 61 13N 149 51W 41 X A 7 US
AK ANGOON SEAPLANE PAGN 57 30N 134 35W 2 X 8 US
AK ANIAK PANI ANI 70232 61 35N 159 32W 26 X 7 US
AK ANNETTE ISLAND PANT ANN 70398 55 02N 131 34W 36 X X A 5 US
AK ANVIK PANV ANV 62 39N 160 11W 99 X 7 US
AK ARCTIC VILLAGE PARC ARC 68 07N 145 35W 636 X 7 US
AK ATQASUK BURNELL PATQ ATK 70 28N 157 26W 29 X 7 US
AK ATKA PAAK AKA 52 13N 174 12W 17 X 7 US
AK BARROW PABR BRW 70026 71 17N 156 48W 7 X T X A 5 US
AK BARROW ARM-NSA 70027 71 19N 156 37W 7 X 8 US
AK BARTER ISLAND PABA BTI 70086 70 08N 143 35W 2 X W 7 US
AK BETHEL PABE BET 70219 60 47N 161 51W 41 X T X A 5 US
AK BETHEL/88D PABC ABC 60 48N 161 53W 49 X 8 US
AK BETTLES PABT BTT 70174 66 55N 151 31W 195 X T A 6 US
AK BIG RIVER LAKES PALV LVR 60 49N 152 18W 12 X 7 US
AK BIRCHWOOD PABV BCV 61 25N 149 31W 29 X 7 US
AK BREVIG_MISSION PFKT 65 20N 166 28W 9 X 7 US
AK BUCKLAND PABL BVK 65 59N 161 09W 7 X 7 US
AK CANTWELL PATW TTW 63 23N 148 57W 668 X 7 US
AK CAPE LISBURNE PALU LUR 70104 68 53N 166 08W 3 X T W 6 US
AK CAPE NEWENHAM PAEH EHM 70305 58 39N 162 04W 161 X T 6 US
AK CAPE ROMANZOF PACZ CZF 70212 61 47N 166 02W 146 X T 6 US
AK CENTRAL PARL 65 34N 144 47W 284 X 7 US
AK CENTRAL PACE 65 34N 144 47W 286 X 7 US
AK CENTRAL AK PROF CEN 70197 65 30N 144 41W 259 W 8 US
AK CHANDALAR LAKE PALR WCR 67 30N 148 29W 585 X 7 US
AK CHEVAK PAVA 61 32N 165 36W 23 X 7 US
AK CHIGNIK BAY PAJC AJC 56 19N 158 22W 15 X 7 US
AK CIRCLE/PAFC RFC PACR CRC 65 50N 144 04W 182 X R 7 US
AK COLD BAY PACD CDB 70316 55 12N 162 43W 30 X T X A 5 US
AK CORDOVA PACV CDV 70296 60 30N 145 30W 12 X T A 6 US
AK DEADHORSE PASC SCC 70 12N 148 28W 15 X T A 6 US
AK DEERING PADE DEE 66 04N 162 46W 5 X A 7 US
AK DELTA JUNCTION PABI BIG 70267 64 00N 145 44W 386 X T A 6 US
My Code
import re

station_file = open('../DATA/stations.txt', 'r')
data = station_file.read()
skip_res = ['^$', '^.*d{2}\-[A-Z]{3}\-\d{2}', '^!']  # List of regular expressions which only match lines of metadata (not actual data)
data = data.split('\n')
for loop in data:
    breakcheck = False  # In the event a regular expression matches, this will turn to True and skip that line
    for check in skip_res:
        current = re.compile(check)
        if current.search(loop) == None:
            continue
        else:
            breakcheck = True
            break
    if breakcheck:
        continue
    else:
        print(loop)  # Should only print out lines containing actual data.
Your pattern for matching the date is missing a \ before the first d. Change it to:
r'\d{2}-[A-Z]{3}-\d{2}'
Since you are using re.search() you don't need to match from the beginning of the string. Also, you don't need to escape the -.
Note the use of a raw string (denoted by the r prefix) to specify the pattern. Generally you should use raw strings for regex patterns because some string escape sequences are also regex patterns, e.g. \b. As a normal string this represents the backspace character; in a raw string it is treated as \ followed by b, which is the regex pattern for "beginning or end of a word".
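A quick way to see the difference in an interactive session:

>>> len('\b'), '\b'      # one character: the backspace
(1, '\x08')
>>> len(r'\b'), r'\b'    # two characters: backslash + b, which is what the regex engine needs
(2, '\\b')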
Another thing worth mentioning is that you can check for a match of more than one pattern at a time by joining the patterns together with |. Think of it as "or". Then your code can be written more concisely:
import re

skip_res = [r'^$', r'\d{2}-[A-Z]{3}-\d{2}', r'^!']
skip_pattern = r'|'.join(skip_res)

with open('../DATA/stations.txt', 'r') as station_file:
    for line in station_file:
        if re.search(skip_pattern, line):
            continue
        print(line)
Compiling the regex pattern provides no benefit when there are only a handful of them because the re module will cache them.
Your date regex is missing a backslash before the first "d".
'^.*d{2}\-[A-Z]{3}\-\d{2}'
should be
'^.*\d{2}\-[A-Z]{3}\-\d{2}'