I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>
I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use an XPath expression to find everything between the b tag containing the Carson Daly text and the br tag, by checking preceding and following siblings:
from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

for html in l:
    tree = fromstring(html)
    results = ''
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            results += element.text.strip()
        else:
            text = element.strip(':')
            if text:
                results += text.strip()
    print results.split(', ')
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)
<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))
Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.
This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
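A minimal sketch of retrieving those Group 1 captures in Python (filtering out the empty captures produced by the left alternatives is our addition):

```python
import re

# The left alternatives consume tag pairs and lone tags;
# only the final parenthesized group captures the names we want.
pattern = r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))'

s = '<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'

# findall returns the Group 1 capture for every match; matches made by the
# left alternatives yield empty strings, so filter those out.
names = [m for m in re.findall(pattern, s) if m]
print(names)  # ['Ben Schwartz', 'Soko', 'Jacob Escobedo']
```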
Related
When using re.findall like my example below, is there a way to include the final four characters (.JPG)? As they may be lower- or uppercase, I can't just stitch it together with another string and be certain it will be correct. (In reality it's a list of dozens/hundreds of JPGs, some uppercase and some lowercase.)
I actually found the answer to this about 2 weeks ago but have since lost it (despite a lot of Googling).
I've done a lot of searching/reading and apologize if this exact problem has been asked before.
import re
examplestring = '/home/folder/image.JPG 200x400 20/12/2018'
print(re.findall(r'^(.*?).jpg', examplestring, flags=re.IGNORECASE))
Actual output:
['/home/folder/image']
I'm wanting the output to be:
['/home/folder/image.JPG']
Firstly, make sure to escape the dot since it's a special character in regex.
Either include .jpg in the group
^(.*?\.jpg)
or don't use a group at all
^.*?\.jpg
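A quick check of both variants, using the example string from the question:

```python
import re

examplestring = '/home/folder/image.JPG 200x400 20/12/2018'

# Variant 1: include .jpg inside the capture group
m1 = re.findall(r'^(.*?\.jpg)', examplestring, flags=re.IGNORECASE)
print(m1)  # ['/home/folder/image.JPG']

# Variant 2: no group at all, so findall returns the whole match
m2 = re.findall(r'^.*?\.jpg', examplestring, flags=re.IGNORECASE)
print(m2)  # ['/home/folder/image.JPG']
```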
Method 1
Maybe
(?i)\S+\.jpg
or, to also cover jpeg files,
(?i)\S+\.jpe?g
might simply work.
RegEx Demo 1
We can add further boundaries if necessary, such as a start anchor.
Note, though, that this expression fails if there are any spaces in the directory or file names.
Method 2
If there might be horizontal spaces in the image path, then
(?i)^[^\r\n]+\.jpg
or
(?i)^[^\r\n]+\.jpe?g
would be options to explore.
RegEx Demo 2
Test
import re
string = '''
/home/folder/image.JPG 200x400 20/12/2018
/home/folder/image.jpg 200x400 20/12/2018
/home/folder/image.jpeg 200x400 20/12/2018
'''
expression = r'(?i)\S+\.jpe?g'
print(re.findall(expression, string))
Output
['/home/folder/image.JPG', '/home/folder/image.jpg', '/home/folder/image.jpeg']
If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com. You can also see there how it matches against some sample inputs.
I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html)
>>> pred = lambda tag: tag.attrs['href'].endswith('sk=info')
>>> [tag.attrs['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
Debuggex Demo
With only .*, the match starts at the first facebook.com and then continues until sk=info. Since there is another facebook.com in between, the match spans both of them.
The one thing in between that you don't want to cross is a > (or a <, among other characters), so changing "anything" to "anything but >" makes the match start at the facebook.com closest to sk=info, as you want.
And yes, using regex for HTML should only be used in basic tasks. Otherwise, use a parser.
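Here is a minimal sketch of that difference; the two-link HTML below is a made-up stand-in for the page in the question:

```python
import re

# Hypothetical snippet with two facebook.com URLs, only the second
# of which ends in sk=info.
html = ('<a href="https://www.facebook.com/some-other-page">x</a>'
        '<a href="https://www.facebook.com/pages/Example/123?id=123&sk=info">about</a>')

# .*? starts matching at the FIRST facebook.com and runs across the
# intervening tags until it reaches sk=info.
lazy_dot = re.findall(r'facebook\.com\/.*?sk=info', html)
print(lazy_dot)

# [^>]*? cannot cross a >, so the match must begin at the
# facebook.com closest to sk=info.
no_gt = re.findall(r'facebook\.com\/[^>]*?sk=info', html)
print(no_gt)  # ['facebook.com/pages/Example/123?id=123&sk=info']
```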
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after it, the engine adds to the (possible) match result all the characters (including ", >, or spaces) until it finds sk=info, since . can match any character except a newline.
This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: to make the pattern fail at this first position in the string.
Using an HTML parser is the easiest way to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole document tree for a single task.
The problem is that you have another facebook.com part earlier in the string. You can restrict the .* not to match " so that the match has to stay within one attribute:
facebook\.com\/[^"]*sk=info
I'm parsing some TV episodes that have been transcribed by different people, meaning I need to search for a variety of formats. For example, new scenes are indicated one of two ways:
[A coffee shop]
or
INT. Coffee shop - NIGHT
Right now, I match this with the following regex in Python:
re.findall("(^\[(.+?)\]$)|(^[INTEXT]{3}\. .+?$)", text)
where "text" is the text of the entire script (hence using findall). This always appears on its own line, hence the ^$
This gives me something like: (None, None, "INT. Coffee Shop - NIGHT") for example.
My question: How do you construct a regex to search for one of two complex patterns, using the | notation, without also creating submatches that you don't really want? Or is there a better way?
Many thanks.
UPDATE: I had overlooked the idea of non-capturing groups. I can accomplish what I want with:
"(?:^\[.+?\]$)|(?:^[INTEX]{3}\. .+?$)"
However, this raises a new question. I don't actually want the brackets or the INT/EXT in the scenes, just the location. I thought that I could use actual groups within the none-capturing groups, but I'm still getting those blank matches for the other expression, like so:
import re

pattern = "(?:^\[(.+?)\]$)|(?:^[INTEX]{3}\. (.+?)$)"
examples = [
    "[coffee shop]",
    "INT. COFFEE SHOP - DAY",
    "EXT. FIELD - NIGHT",
    "[Hugh's aparment]"
]

for example in examples:
    print re.findall(pattern, example)
'''
[('coffee shop', '')]
[('', 'COFFEE SHOP - DAY')]
[('', 'FIELD - NIGHT')]
[("Hugh's aparment", '')]
'''
I can just join() them, but is there a better way?
Based on the limited examples you've provided, how about using assertions for the brackets:
re.findall("((?<=^\[)[^[\]]+(?=\]$)|^[INTEXT]{3}\. .+?$)", text)
You may be better off just using two expressions.
patterns = [r'^\[(.+?)\]$', r'^(?:INT|EXT)\. (.+?)$']
for example in examples:
    print re.findall(patterns[0], example) or re.findall(patterns[1], example)
This seems to do what you want:
(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))(?:\[\1\]|[INTEX]{3}\. \1)$
First the lookahead peeks at the text of the scene marker, capturing it in group #1. Then the rest of the regex consumes the whole line containing the marker. Although, now that I think about it, you don't really have to consume anything. This works, too:
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", subject)
The marker text is still captured in group #1, so it still gets added to the result of findall(). Then again, I don't see why you would want to use findall() here. If you're trying to normalize the scene markers by replacing them in place, you'll have to use the consuming version of the regex.
Also, notice the (?m). In your examples you always apply the regex to the scene markers in isolation. To pluck them out of the whole script, you have to set the MULTILINE flag, turning ^ and $ into line anchors.
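A sketch of the non-consuming version against a small multi-line script (the scene-marker lines are taken from the question's examples; the dialogue line is made up):

```python
import re

script = ("[coffee shop]\n"
          "INT. COFFEE SHOP - DAY\n"
          "some dialogue here\n"
          "EXT. FIELD - NIGHT")

# The lookahead captures the marker text without consuming the line;
# (?m) makes ^ anchor at the start of each line.
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", script)
print(result)  # ['coffee shop', 'COFFEE SHOP - DAY', 'FIELD - NIGHT']
```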
How to find all words except the ones in tags using RE module?
I know how to find something, but how do I do it the opposite way? I want to search for every word except everything inside tags, and the tags themselves.
So far I managed this:
f = open (filename,'r')
data = re.findall(r"<.+?>", f.read())
Well, it prints everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried ^ at the start of the pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything that's inside <> and these tags themselves?
So, let's say the string is 'hello 123 <b>Bold</b> <p>end</p>'
with re.findall, would return:
['hello', '123', 'Bold', 'end']
Using regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
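Applied to the example string from the question, a minimal sketch:

```python
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'

# The negative lookahead rejects a word if, scanning ahead without
# crossing a <, we can reach a > (i.e. the word sits inside a tag).
words = re.findall(r'\b\w+\b(?![^<]*>)', s)
print(words)  # ['hello', '123', 'Bold', 'end']
```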
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
text = "".join(soup.findAll(text=True))
From there, you can get the list of words with split:
words = text.split()
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.
I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found related question extract contents of regex
but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
    ^^^^^^^^^^^         ^^^^^^^^^^^^^         ^^^^^^^         ^^^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (The line with the '^'s and the one naming the groups are just there to help you see the capture groups as specified by the parentheses.)
dataPattern = re.compile(r"<td>[a-zA-Z]+</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using regexes to crack HTML source is one of the paths toward madness. Many potential surprises lurk in your input HTML that are perfectly valid HTML but will easily defeat your regex:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
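A runnable sketch of the grouped pattern (the sample row is made up; note that the . between the digit groups is left unescaped, exactly as in the question's pattern):

```python
import re

row = '<td>Alpha</td><td>1.5</td><td>42</td><td>3.14</td>'

dataPattern = re.compile(
    r"<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>")

match = dataPattern.search(row)
print(match.groups())  # ('Alpha', '1.5', '42', '3.14')
```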
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular-expression string is left after removing the tags, what you can parse with it is probably not what you want, i.e. it is not going to automatically match what it matched before, minus the <td> tags!