Using regex to extract holding company - python

Given a string with the following structure:
" (subsidiary of <holding_company>) <post_>"
where
holding_company may contain letters and some special characters, including brackets
post_ may contain any character
Example string: " google (subsidiary of alphabet (inc.)) xyz"
How to extract the holding company name using regex?

The regular expression to extract that is as follows:
"subsidiary of\s+(.*)\)\s+\S+"
In Python, you'd do something like:
import re
regex = r"subsidiary of\s+(.*)\)\s+\S+"
test_str = "\" (subsidiary of <holding_company>) <post_>\""
m = re.search(regex, test_str)
if m:
    # if it found the pattern, the company name is in group(1)
    print(m.group(1))
See it in action here: https://repl.it/repls/ShyFocusedInstructions#main.py
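As a quick check, here is a minimal sketch that runs the pattern from this answer against the example string from the question. The helper name extract_holding_company is only illustrative and not part of the original answer.

```python
import re

# Pattern taken from the answer above. The greedy (.*) extends to the LAST ")"
# that is followed by whitespace and at least one non-space character.
HOLDING_RE = re.compile(r"subsidiary of\s+(.*)\)\s+\S+")

def extract_holding_company(text):
    """Return the holding company name, or None if the pattern is absent."""
    m = HOLDING_RE.search(text)
    return m.group(1) if m else None

example = " google (subsidiary of alphabet (inc.)) xyz"
print(extract_holding_company(example))  # alphabet (inc.)
```

Because (.*) is greedy, a post_ part that itself contains a ")" followed by a space and more text would pull the capture further right than intended, so this only suits inputs where that cannot happen.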

This should get you there:
(?<=\(subsidiary of )(.*)(?=\) )

This creates capture groups for your holding company and post.
You may need to expand the regex to include additional special characters.
Here's the regex on regex101 if you need to expand it
https://regex101.com/r/xpVfqU/1
#!/usr/bin/python3
import re

# Group 1 is the holding company, group 2 is everything after the closing bracket.
pattern = r'\s\(subsidiary\ of\ ([\w<>]*)\)\s*(.*)'
text = " (subsidiary of <holding_company>) <post_>"
holding_company = re.sub(pattern, r'\1', text)
post = re.sub(pattern, r'\2', text)
print(holding_company)
print(post)

Related

How can I find all paths in javascript file with regex in Python?

Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ or '/ using a regex. I've tried many URL regexes, but none of them worked for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
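A minimal sketch of the trimming step, assuming a shortened version of the sample JavaScript from the question (the content string below is abbreviated for illustration):

```python
import re

# Abbreviated sample; the real input would be the full JavaScript source.
content = '''r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),r.setAttribute("sdfdsfsfds",'/test/path')'''

# Each match starts with the opening quote, followed by the path itself.
matches = re.findall(r"""["']\/[^"']*""", content)

# Drop the leading quote character from every match.
endpoints = [m[1:] for m in matches]

for endpoint in endpoints:
    print(endpoint)
```

Strings such as "src" are skipped because their opening quote is not followed by a slash.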
Consider:
import re
txt = ...  # your JavaScript source as a string
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
    print(el[1])
Each el is a match of the pattern: a single or double quote, then a slash followed by as few characters as possible, then the same quote character that opened the match.
.* stands for any number of any characters (except newlines), and the following ? makes it non-greedy, i.e. it matches as few characters as possible. \1 refers to the first group, so it matches whichever type of quote was matched at the beginning. By specifying el[1] we return the second group, i.e. whatever was matched within the quotes.
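To make the effect of the non-greedy quantifier and the \1 backreference concrete, here is a small sketch. The sample string is made up for illustration and is not from the question.

```python
import re

# Made-up sample: note the apostrophe inside the first double-quoted path.
sample = '''x("/it's/ok"); y('/test/path'); z("/a") + "/b"'''

lazy = re.compile(r"""(["'])(/.*?)\1""")    # stops at the first matching quote
greedy = re.compile(r"""(["'])(/.*)\1""")   # runs to the last matching quote

lazy_paths = [m.group(2) for m in lazy.finditer(sample)]
greedy_paths = [m.group(2) for m in greedy.finditer(sample)]

print(lazy_paths)         # four separate paths
print(len(greedy_paths))  # 1: everything up to the final double quote
```

The backreference is what lets "/it's/ok" survive intact: the match was opened by a double quote, so the apostrophe inside it does not end the match.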

Having a problem with Python Regex: Prints "None" when printing "matches". Regex works in tester

I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see a few issues:
You're not using a raw string (prefix the string with an r), so Python processes some of your backslash escapes before the regex engine ever sees them. For example, \n becomes an actual newline character, which re.VERBOSE then ignores.
With re.VERBOSE, unescaped whitespace in the pattern is ignored, so the literal spaces in parts like <span class=... never match anything. Write them as \s instead.
match() only matches at the very start of the string. Use search() or finditer() to find the pattern anywhere in yelp_html.
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
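The following sketch isolates two behaviours that matter here: re.VERBOSE drops unescaped whitespace from the pattern, and match() only tries the start of the string. The html string and the "rank" class name are made up for illustration and are not the Yelp markup.

```python
import re

# Made-up input; the real page is much larger.
html = '<li><span class="rank">7</span></li>'

# In VERBOSE mode the literal space after "span" is ignored,
# so this pattern effectively looks for '<spanclass="rank">'.
broken = re.compile(r'''
    <span class="rank">    # literal space: silently dropped
    (?P<rank>\d{1,2})
''', re.VERBOSE)

# \s is an escape sequence, so it survives VERBOSE mode.
fixed = re.compile(r'''
    <span\sclass="rank">   # explicit whitespace
    (?P<rank>\d{1,2})
''', re.VERBOSE)

print(broken.search(html))                # None
print(fixed.match(html))                  # None: html does not START with <span
print(fixed.search(html).group('rank'))   # 7
```

search() and finditer() scan the whole string, which is why the corrected code above uses finditer() rather than match().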

Python Regex to Extract Domain from Text

I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say,
"this is www.website1.com and this is website2.com", I get:
['www.website1.com', 'website2.com']
How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...
Try this one (thanks @SunDeep for the update):
\s(?:www\.)?(\w+\.com)
Explanation
\s matches any whitespace character
(?:www\.)? non-capturing group, matches www. zero or one time
(\w+\.com) matches one or more word characters, followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple of notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
Next, I only look for .com domains. You could adjust my regular expression to something like:
\s(?:www\.)?(\w+\.(?:com|org|net))
to match whichever types of domains you are looking for.
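One limitation of a leading \s is that a domain at the very start of the text is missed. A possible variation, sketched here under the assumption that only .com, .org and .net matter, uses a word boundary instead. The pattern and sample text below are illustrative, not from the original answer.

```python
import re

# Assumption: only these three TLDs are of interest.
DOMAIN_RE = re.compile(r"\b(?:www\.)?((?:[a-zA-Z0-9-]+\.)+(?:com|org|net))\b")

text = "example.org is first, this is www.website1.com and this is website2.com"

# findall returns only the single capturing group, i.e. the domain without "www."
domains = DOMAIN_RE.findall(text)
print(domains)  # ['example.org', 'website1.com', 'website2.com']
```

The inner groups are non-capturing so that findall returns plain strings rather than tuples.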
Here's a try:
import re
s = "www.website1.com"
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
website1.com
If s = "website1.com" instead, the output is the same:
website1.com

Python: Regex: Detecting hyphenated names and non-hyphenated names with one regex

I need to extract people's names from a really long string.
Their names are in this format: LAST, FIRST.
Some of these people have hyphenated names. Some don't.
My attempt with a smaller string:
Input:
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
Expected output:
['Smith-Jones, Robert', 'Epson, Robert']
Actual output:
['-Jones', '']
What am I doing wrong?
Use
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(?:-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
# => ['Smith-Jones, Robert', 'Epson, Robert']
Just make the capturing group non-capturing. The thing is that findall returns the capture group values if any groups are specified in the regex pattern. So, the best way to solve this in this pattern is to replace (...)? with (?:...)?.
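A side-by-side sketch of the behaviour described above, using the strings from the question. The finditer variant is an additional option, not part of the original answer: it returns match objects, so group(0) gives the whole match even when the pattern contains capturing groups.

```python
import re

text = 'Smith-Jones, Robert&Epson, Robert'

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'[A-Za-z]+(?:-[A-Za-z]+)?,\sRobert', text)

# finditer keeps the original pattern and reads the whole match explicitly.
via_finditer = [m.group(0) for m in re.finditer(r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert', text)]

print(capturing)      # ['-Jones', '']
print(non_capturing)  # ['Smith-Jones, Robert', 'Epson, Robert']
print(via_finditer)   # ['Smith-Jones, Robert', 'Epson, Robert']
```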

Regex match following substring in string python

I've come up with a regex expression that works well enough for my purposes for finding phone numbers.
I would like to take it a step further and use it in large text blocks to identify matching strings that follow the words 'cell' or 'mobile' by at most 10 characters. I would like it to return the number in Cell Phone: (954) 555-4444 as well as Mobile 555-777-9999 but not Fax: (555) 444-6666
Something like this (in pseudocode):
regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
bigstring = ...  # some giant string added together from many globbed files
for match in regex.finditer(bigstring):
    if match follows 'cell' or match follows 'mobile':
        print(match.group(0))
You can do:
txt='''\
Call me on my mobile anytime: 555-666-1212
The office is best at 555-222-3333
Dont ever call me at 555-666-2345 '''
import re
print(re.findall(r'(?:(mobile|office).{0,15}(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}))', txt))
Prints:
[('mobile', '555-666-1212'), ('office', '555-222-3333')]
You can do that with your regular expression. In the re documentation, you will find that the pattern r'(?<=abc)def' matches 'def' only if it is preceded by 'abc'.
Similarly r'Hello (?=World)' matches 'Hello ' if followed by 'World'
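Python's re module requires lookbehind patterns to have a fixed width, so a gap of "at most 10 characters" cannot go inside (?<=...). One way around this, sketched below, is to match the keyword and the gap normally and capture only the number. The phone pattern here is an adjusted assumption (it uses \(? where the question's pattern has \+?), and the sample text is built from the examples in the question.

```python
import re

# Assumed phone pattern: optional "(" before the area code.
PHONE = r"\(?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}"

# Keyword, then up to 10 non-digit characters (lazy), then the captured number.
LABELLED_RE = re.compile(r"(?:cell|mobile)\D{0,10}?(" + PHONE + r")", re.IGNORECASE)

bigstring = (
    "Cell Phone: (954) 555-4444\n"
    "Mobile 555-777-9999\n"
    "Fax: (555) 444-6666\n"
)

numbers = LABELLED_RE.findall(bigstring)
print(numbers)  # ['(954) 555-4444', '555-777-9999']
```

The lazy \D{0,10}? keeps the gap as short as possible, so the opening bracket ends up in the captured number instead of being swallowed by the gap.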
