Using regex to parse string patterns from a single string

Using regex to parse string patterns from a single string - python

I have read in product pricing for some products. As you will see below, not every product pricing string is set up the same. What I am trying to do is to parse out the sub-strings I do not want.
Below is the code I have which works, but there has to be a more efficient way to do this.
tmp1 = p_pricing.replace("from ", "")
tmp1 = tmp1.replace("Options Available on Open Box", "")
tmp1 = tmp1.replace("Open Box Price: From ", "")
tmp1 = re.sub(r'\([^)]*\)', '', tmp1)
tmp1 = re.split("[$]", tmp1)
Below is a small sample of my pricing string:
$11.99($6.00 per item)$14.99
from $13.99$18.25
$9.89($4.94 per item)$14.99
from $9.83($3.28 per item)
from $15.99$29.99
from $84.99$104.95
from $9.83($3.28 per item)
$3.47
$94.99$129.99
from $14.34$19.90
from $25.01$65.00Options Available on Open Box

It seems you just want to get the numeric values of all prices in each string.
You can use
re.findall(r'\$(\d+(?:\.\d+)?)', text)
See the regex demo.
Details
\$ - a $ char
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits, and then an optional occurrence of a . and one or more digits.
See the Python demo:
import re
pattern = r"\$(\d+(?:\.\d+)?)"
text = "$11.99($6.00 per item)$14.99\nfrom $13.99$18.25\n$9.89($4.94 per item)$14.99\nfrom $9.83($3.28 per item) \nfrom $15.99$29.99\nfrom $84.99$104.95\nfrom $9.83($3.28 per item) \n$3.47\n$94.99$129.99\nfrom $14.34$19.90\nfrom $25.01$65.00Options Available on Open Box"
print( re.findall(pattern, text) )
Output:
['11.99', '6.00', '14.99', '13.99', '18.25', '9.89', '4.94', '14.99', '9.83', '3.28', '15.99', '29.99', '84.99', '104.95', '9.83', '3.28', '3.47', '94.99', '129.99', '14.34', '19.90', '25.01', '65.00']

As you are replacing from and opening till closing parenthesis in your code using \([^)]*\) with an empty string, you can get all the prices outside of the parenthesis by matching from an opening parenthesis till a closing parenthesis.
Then use an alternation | and capture what you want to keep.
The digits are in capture group 1.
\([^()]*\)|\$(\d+(?:\.\d+))
\([^()]*\) Match from an opening till closing parenthesis
| Or
\$ Match a dollar sign
(\d+(?:\.\d+)) Capture group 1 Match 1+ digits and an optional decimal part
See a regex demo or a Python demo
Example code
import re
pattern = r"\([^()]*\)|\$(\d+(?:\.\d+))"
s = "$11.99($6.00 per item)$14.99 from $13.99$18.25 $9.89($4.94 per item)$14.99 from $9.83($3.28 per item) from $15.99$29.99 from $84.99$104.95 from $9.83($3.28 per item) $3.47 $94.99$129.99 from $14.34$19.90 from $25.01$65.00Options Available on Open Box"
print([s for s in re.findall(pattern, s) if s])
Output
['11.99', '14.99', '13.99', '18.25', '9.89', '14.99', '9.83', '15.99', '29.99', '84.99', '104.95', '9.83', '3.47', '94.99', '129.99', '14.34', '19.90', '25.01', '65.00']

Related

Split String based on multiple Regex matches

First of all, I checked these previous posts, and did not help me. 1 & 2 & 3
I have this string (or a similar case could be) that need to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is detect the word Table and the word(s) before if existed
detect the numbers following and they can be in this format: 6 or 6-2 or 66-22 or 66-2
Finally the rest of the string (in this case: Management of children study and actions)
After doing so, the return value must be like this:
return 1 and 2 as one string, the rest as another string
e.g. returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
print(parts_of_title)
print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement is returned true as should be but the second doesn't so, I changed the code and used compile but the regex functionality changed, the code is like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So based on this, how I can achieve this and stick to a clean and readable code? and why does using compile change the matching?

The matching changes because:
In the first part, you call .group().split() where .group() returns the full match which is a string.
In the second part, you call re.compile("...").split() where re.compile returns a regular expression object.
In the pattern, this part will match only a single word [a-zA-Z0-9]+[ ], and if this part should be in a capture group [0-9]([-][0-9]+)? the first (single) digit is currently not part of the capture group.
You could write the pattern writing 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want point 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')

you have already had answers but I wanted to try your problem to train myself so I give you all the same what I found if you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1

Find values using regex (includes brackets)

it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.

You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']

import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)

The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

Regex to catch only the certain part of the string

Is there universal regex to catch only the names of companies?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers

You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all start with alphabetical words and hyphen separated till end of string, for which above regex will work fine.
Here, [a-zA-Z]+ starts matching alphabetical company names, and (?:_[a-zA-Z]+)* further matches any alphabetical words having hyphen before them and $ ensures the matched string ends with the string.
Regex Demo
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if you have a single string of those company names, then you can use following code and use re.findall to list all company names,
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov raised, if name can contain numbers and in general any symbol, then this regex will get the name properly, which assumes company name starts after year part and underscore.
(?<=\d{4}_).*$
Demo handling any character in company name

You can use re.sub:
import re
data = [re.sub('\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc

Assuming there are only normal letters and the names are the end of each line :
grep -o '[A-Za-z][A-Za-z_]*$' names

regular expression for the extracting multiple patterns

I have string like this
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
t = []
t.append(m.group(1).strip())
if m.group(2):
t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
res.append(tuple(t))
print(res)
See the Python online demo. Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines with the second being optional (due to the (?:...)? optional non-capturing group), the first is captured into Group 1 and the subsequent one (that starts with [ and ends with ]) is captured into Group 2. (Note that \r?$ is necessary since in the multiline mode $ only matches before a newline and not a carriage return.) Group 1 value is added to a temporary list, then the contents of the second group is split with ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))) and then only add those items that contain a : in them to the temporary list.

You are getting three elements per result because you are using "capturing" regular expressions. Rewrite your regexp like this to combine the second and third match:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?

You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print re.findall(pattern, title)
A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.

There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the capture groups like so testing with this regex: (^[A-Z][a-z.-]+$) to ensure the matched group(current) matches the matched group(next).
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
i = 1
if m:
for i in range(len(m)):
if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
matchlist.append([m[i - 1], m[i]])
print matchlist
Output:
[
['Browser', 'Announcing'],
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]

If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping odd numbered pairs. An easy solution would be to research the pattern after skipping the first word like this:
m = re.match("[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and result2 together.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using regex to parse string patterns from a single string - python

Related

Split String based on multiple Regex matches

Find values using regex (includes brackets)

Regex to catch only the certain part of the string

regular expression for the extracting multiple patterns

Python regex: Match ALL consecutive capitalized words

Categories

Resources