I am getting a string from the front end that contains both text and numbers, e.g. "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml".
Now I want to split 13.5ml into 13.5 as one value and ml as another, so that I can insert both values into the backend table.
You could try using re.findall with the regex pattern (\d+(?:\.\d+)?)(ml), which captures the number and the unit in separate groups:
import re

text = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml"
matches = re.findall(r'(\d+(?:\.\d+)?)(ml)', text)
print(matches)
This prints:
[('13.5', 'ml')]
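Since the goal is to insert the two parts into a table, here is a minimal sketch of turning the match into a numeric value and a unit. The helper name split_quantity is hypothetical, and it assumes only the first "<number>ml" occurrence matters:

```python
import re

def split_quantity(text):
    """Hypothetical helper: return (value, unit) for the first '<number>ml', or None."""
    match = re.search(r'(\d+(?:\.\d+)?)(ml)', text)
    if match is None:
        return None
    value, unit = match.groups()
    # Convert the numeric part so it can go into a numeric column
    return float(value), unit

product = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml"
print(split_quantity(product))  # (13.5, 'ml')
```

Returning None when nothing matches lets the caller decide how to store products without a size.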
Edit:
To handle capturing a known list of units, you may modify the above regex pattern to the following:
\d+(?:\.\d+)?(?:GM|KG|LIT)
This uses an alternation to represent each possible unit, and you may add new units as you see fit.
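A rough sketch of the unit-list idea, with capture groups added so the number and unit come back separately. The UNITS list, the optional whitespace, and the case-insensitive flag are assumptions, not part of the original answer:

```python
import re

# Assumed unit list; extend as needed. Longer units first so "KG" wins over a shorter "G".
UNITS = sorted(["GM", "KG", "LIT", "ML"], key=len, reverse=True)

pattern = re.compile(
    r'(\d+(?:\.\d+)?)\s*(' + '|'.join(map(re.escape, UNITS)) + r')\b',
    re.IGNORECASE,  # assumption: accept "ml" as well as "ML"
)

text = "Nail Paint 13.5ml, Sugar 2 KG, Milk 1.5lit"
print(pattern.findall(text))  # [('13.5', 'ml'), ('2', 'KG'), ('1.5', 'lit')]
```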
data = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml, 14dl"
for i in range(len(data)-1):
    try:
        # if number is before letter
        int(data[i])
        if data[i+1].isalpha():
            data = data[:i+1] + ' ' + data[i+1:]  # add space between number and letter
    except ValueError:
        # data[i] is not a digit, so there is nothing to do
        pass
print(data)
output:
L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5 ml, 14 dl
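The same spacing can probably be done in one pass with re.sub instead of a loop. This is a sketch of an alternative, not the answer's method, and it assumes ASCII letters only (str.isalpha in the loop above also accepts non-ASCII letters):

```python
import re

data = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml, 14dl"

# Insert a space wherever a digit is immediately followed by a letter
spaced = re.sub(r'(\d)([A-Za-z])', r'\1 \2', data)
print(spaced)
# L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5 ml, 14 dl
```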
Related
The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.
I have this dataset:
Date_Title                        Date  Copies
05-21 I. Don Quixote              1605  252
21-20 IV. Macbeth                 1629  987
10-12 ML. To Kill a Mockingbird   1960  478
12 V. Invisible Man               1897  136
Basically, I would like to split the "Date Title", so, when I print a row, I would get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
That is: first the numbers together with the Roman numeral, second only the Roman numeral, third the name, and the fourth and fifth values unchanged from the dataset.
You can use
import pandas as pd

df = pd.DataFrame({
    'Date_Title': ['05-21 I. Don Quixote', '21-20 IV. Macbeth', '10-12 ML. To Kill a Mockingbird', '12 V. Invisible Man'],
    'Date': [1605, 1629, 1960, 1897],
    'Copies': [252, 987, 478, 136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136
See the regex demo. Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
\d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
\s* - zero or more whitespaces
(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.
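To see what the three groups capture without pandas, the same pattern can be tried on single strings with the re module. This is only an illustrative sketch using two of the sample titles:

```python
import re

rx = re.compile(
    r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
)

for title in ["10-12 ML. To Kill a Mockingbird", "12 V. Invisible Man"]:
    # groups() returns (NumRoman, Roman, Name)
    print(rx.match(title).groups())
# ('10-12 ML', 'ML', 'To Kill a Mockingbird')
# ('12 V', 'V', 'Invisible Man')
```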
I am pretty sure there is a more optimal solution, but this would be a fast way of solving it:
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))
Or, with one new column per part (the column names here are only examples):
parts = df['Date_Title'].str.split()
df['Num'] = parts.str[0]
df['Roman'] = parts.str[1]
df['Name'] = parts.str[2:].str.join(' ')
Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth
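Wrapped in a function, the split above might look like the sketch below. The name split_title is hypothetical, and it assumes every value has exactly two space-separated tokens before the first dot; anything else raises ValueError:

```python
def split_title(value):
    """Hypothetical helper: '21-20 IV. Macbeth' -> ('21-20', 'IV', 'Macbeth')."""
    i = value.index(".")              # position of the first dot
    date, roman = value[:i].split()   # e.g. '21-20', 'IV'
    title = value[i + 1:].strip()     # text after the dot, without the leading space
    return date, roman, title

titles = ["05-21 I. Don Quixote", "21-20 IV. Macbeth", "12 V. Invisible Man"]
for t in titles:
    print(split_title(t))
```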
df = df.assign(
    x=df['Date_Title'].str.split(r'\.').str[0],
    y=df['Date_Title'].str.extract(r'(\w+(?=\.))', expand=False),
    z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','),
)
Required:
Check if the text passed includes a possible U.S. zip code, formatted as follows: exactly 5 digits, and sometimes, but not always, followed by a dash with 4 more digits. The zip code needs to be preceded by at least one space, and cannot be at the start of the text.
My Code:
import re
def check_zip_code (text):
    result = re.search(r"^.* +\d{5}", text)
    return result != None
For the occasional r"\-\d{4}" (a dash with 4 more digits), I tried to include it by changing line 3 to:
result = re.search(r"^.* +\d{5}|\-\d{4}", text)
But it does not work.
I have the following questions:
How to solve the above zip code problem?
How to partially use | in the whole raw string?
(e.g. "a1|2" can match either a1 or a2)
Some of the test cases:
print(check_zip_code("The zip codes for New York are 10001 thru 11104.")) # True
print(check_zip_code("90210 is a TV show")) # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001.")) # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9.")) # False
You are looking for an optional group, not an alternation. Additionally, add a negative lookahead at the beginning. That said, you can use:
(?!^)\b\d{5}(?:-\d{4})?\b
See a demo on regex101.com.
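Plugged into the function from the question, the pattern could be used roughly as follows. Note that \b is slightly more permissive than "preceded by a space" (a comma before the digits would also pass), so treat this as a sketch. The last line illustrates the second question: to apply | to only part of a pattern, wrap the alternatives in a group.

```python
import re

ZIP_RE = re.compile(r'(?!^)\b\d{5}(?:-\d{4})?\b')

def check_zip_code(text):
    # True if a 5-digit (optionally ZIP+4) code appears anywhere except the very start
    return ZIP_RE.search(text) is not None

print(check_zip_code("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code("90210 is a TV show"))  # False
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))  # False

# Partial alternation: a(?:1|2) matches "a1" or "a2", while a1|2 matches "a1" or "2"
print(bool(re.fullmatch(r'a(?:1|2)', 'a2')))  # True
```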
I have a pandas dataframe containing addresses. Some are formatted correctly, like 481 Rogers Rd York ON. Others are missing a space between the city quadrant and the city name, for example: 101 9 Ave SWCalgary AB or possibly even: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks
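You may use the following (regex=True is required on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)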
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
                           '101 9 Ave SWCalgary AB',
                           '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', text)
https://regex101.com/r/TcB4Ph/1
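A quick check of the lookahead version against the three sample addresses from the question, using plain re rather than pandas (a sketch; the list name addresses is made up for the example):

```python
import re

pattern = re.compile(r'([A-Z]{1,2})(?=[A-Z][a-z])')

addresses = [
    '481 Rogers Rd York ON',
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
]

# Re-insert group 1 followed by a space; the lookahead part is not consumed
fixed = [pattern.sub(r'\1 ', a) for a in addresses]
for line in fixed:
    print(line)
# 481 Rogers Rd York ON
# 101 9 Ave SW Calgary AB
# 101 9 Ave S Calgary AB
```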
I am trying to use regex to extract degrees/minutes/seconds and feet in a legal description for a land parcel. An example of a written legal description would be something like this:
CONT FROM THE PT ON THE NWLY ROW LN OF CO RD NO 31 N 56D 54M 00S W 365
FT TH S 32D 06M 00S W 91/89 FT TH S 61D 54M 00S E 335/77 FT TO THE
NWLY ROW OF SD CO RD NO 31 TH N 32D 06M 00S E 62/62 FT TO THE POB EXC
THAT PART CONVEYED IN BOOK 1132 PAGE 473 0/5900A
I have written a regex that goes through this and finds the parts I am looking for, such as: N 32D 06M 00S E 62/62 FT.
The problem is that sometimes the feet are not written directly after the degrees/minutes/seconds. For example, it might say instead: N 32D 06M 00S E along the road for 62/62 FT.
The "along the road for" part is what breaks my regex.
Is there a good way to get around this? Below is an example of my code
Input for user:
legal_input=input("Paste legal description from RW here: ")
Regex code to find cogo:
cogo_rgx = re.compile(r'([N]{,2}[S]{,2} \w{,1}\d{,2}D{,1} \d{,2}M{,1} \d{,2}S{,1}\s{,2}\w) (\s{,2}\d{1,4}\W{,1}\d{,2} FT){,1}')
full_legal=cogo_rgx.findall(legal_input)
Print message:
print("\nCogo below: \n")
Print each match as the first group followed by the second (DMS followed by feet). This makes it easier to read:
for key, value in full_legal:
    print(key, value)
Try Regex: ((?:N|S) \d{2}D \d{2}M \d{2}S (?:E|W) )(?:.)*?(?=\d+(?:\/\d+)? FT)(\d+(?:\/\d+)? FT)
and combine capture groups 1 and 2
Demo
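A sketch of how the suggested regex and the "combine groups 1 and 2" step might be used. The sample string is a shortened, single-line variant of the question's text; since . does not match newlines, pasted multi-line input would probably need its whitespace collapsed first, for example with " ".join(text.split()):

```python
import re

cogo_rgx = re.compile(
    r'((?:N|S) \d{2}D \d{2}M \d{2}S (?:E|W) )(?:.)*?(?=\d+(?:\/\d+)? FT)(\d+(?:\/\d+)? FT)'
)

legal = "TH N 32D 06M 00S E along the road for 62/62 FT TH S 32D 06M 00S W 91/89 FT"

# Group 1 is the bearing, group 2 the distance; any text in between is skipped
calls = [bearing + distance for bearing, distance in cogo_rgx.findall(legal)]
print(calls)
# ['N 32D 06M 00S E 62/62 FT', 'S 32D 06M 00S W 91/89 FT']
```

One caveat: if a bearing has no distance at all, the lazy skip may pair it with the next distance found further along the same line.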
For some reason this little part of my code is giving me a problem. I have been trying to figure out why it is giving me a "list index out of range" error
#This works fine, and finds a match
if re.search("Manufacturer\/Distributor name:?", arg) != None:
    #---->This is giving me the problem, "List index out of range"<----
    address = arg.split("Manufacturer\/Distributor name:?", 1)[1]
This is the arg I'm feeding it:
Product Name: Tio Nacho Shampoo Mexican Herbs Recommended Use: Shampoo Manufacturer/Distributor name: Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
When I have it set to [1], this is the result:
List index out of range
When I have the split set to [0], this is the result:
/Distributor name: Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
I'm trying to get this result:
Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
It's matching, but the split for some reason doesn't want to work. What am I missing? Why does it give that result for [0]?
Thanks for the help!
str.split() doesn't take a regular expression; you need to use re.split().
address = re.split(r'Manufacturer\/Distributor name:?', arg, maxsplit=1)[1]
You should also get in the habit of using raw strings for regular expressions, otherwise you need to escape the \.
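A small runnable sketch of the re.split approach. The arg value here is a shortened version of the question's input, and the trailing \s* plus the length check are additions for illustration, not part of the answer above:

```python
import re

# Shortened sample input (assumption: the real text is longer)
arg = ("Product Name: Tio Nacho Shampoo Mexican Herbs Recommended Use: Shampoo "
       "Manufacturer/Distributor name: Garcoa Laboratories, Inc. 26135 Mureau Road")

# \s* also swallows the space after the colon
parts = re.split(r'Manufacturer/Distributor name:?\s*', arg, maxsplit=1)

# Guard against "list index out of range" when the label is missing
address = parts[1] if len(parts) > 1 else None
print(address)  # Garcoa Laboratories, Inc. 26135 Mureau Road
```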
I'm assuming arg is a string. str.split() does not accept a regex as the delimiter; it treats its argument as literal text.
Instead, you should split on the literal label, including the colon, and strip the leading space: arg.split("Manufacturer/Distributor name:", 1)[1].strip().