Python finding regex in a String - python

I'm trying to find all cases of money values in a string called webpage.
String webpage is the text from this webpage, in my program it's just hardcoded because that's all that is needed, but I won't paste it all here.
regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
it's returning [], I expected it to return [$131bn, £100bn, $100bn, $17.4bn]

Without knowing the text it has to search, you could use the regex:
([€|$|£]+[0-9a-zA-Z\,\.]+)
to capture everything that contains €, £ or $, and then print the amount without following words or letters. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.
Using this regex we get this code:
import re
webpage = '''
one
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€|$]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)
with the output:
['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']
EDIT: Using the same regex on the provided website, it returns the output of:
['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
If you modify the regex further to find e.g. 500million, you can add 0-9 to your first bracket, as you then search for either £, €, $ or anything that starts with 0-9.
Output of:
webpage = '''
one
million
€1293.1205 million
500million
'''
regex = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
Therefore becomes:
['€1293.1205', '500million']

the first error on your regex is the ^ at the beginning of the string, which will only match the first character on the string, which isn't helpful when using findall.
Also you are defining a lot of groups (()) , that I assume you don't really need, so escape all of them (adding ?: next to the opened parenthesis) and you are going to get very close to what you want:
regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)

A webscraping solution:
import urllib
import itertools
from bs4 import BeautifulSoup as soup
import re
s = soup(str(urllib.urlopen('http://www.bbc.com/news/business-41779341').read()), 'lxml')
final_data = list(itertools.chain.from_iterable(filter(lambda x:x, [re.findall('[€\$£][\w\.]+', i.text) for i in s.findAll('p')])))
Output:
[u'$131bn', u'\xa3100bn', u'$100bn', u'$17.4bn.']

Related

Convert vimeo link into an embed link in python

I am using Python and django, and I have some vimeo URLs I need to convert to their embed versions. For example, this:
https://vimeo.com/76979871
has to be converted into this:
https://player.vimeo.com/video/76979871
but Not converted
My code is below:
_vm = re.compile(
r'/(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)/', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'
def replace(match):
groups = match.groups()
print(_vm_format)
return _vm_format.format(groups[5])
return _vm.sub(replace, text)
The given regular expression fits several variants of Vimeo URL:
https://vimeo.com/76979871
https://vimeo.com/channels/76979871
https://vimeo.com/groups/sdf/videos/76979871
https://vimeo.com/album/12321/video/76979871
The video number, provided it is really the only thing that you need for your player, will be in capture group 1 (groups[1]) after you slightly correct the regular expression: r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/(?:[^\/]*)\/videos\/|album\/(?:\d+)\/video\/|)(\d+)(?:$|\/|\?)'. All other parentheses are non-capturing groups.
If, however, the player code is different for different URL types, then you better split your regular expression in four; and there will be different replacements for each.
You have to remove \ from both the end and use capture group 3 to get video id
(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)
Example
import re
_vm = re.compile(
r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'
def replace(match):
groups = match.groups()
return _vm_format.format(groups[2])
urls=["https://vimeo.com/76979871",
"https://vimeo.com/channels/76979871",
"https://vimeo.com/groups/sdf/videos/76979871",
"https://vimeo.com/album/12321/video/76979871"]
for u in urls:
print(_vm.sub(replace, u))
Output
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871

How can I get more than one digit using parenthesis in regular expressions

I was trying to extract values from a html code using urllib and regular expressions in python3 and when I tried to run this code, it only gave me one of the digits of the number instead of both values even though I added a "+" sign meaning one or more times. What's wrong here?
import re
import urllib.error,urllib.parse,urllib.request
from bs4 import BeautifulSoup
finalnums=[]
sumn=0
urlfile = urllib.request.urlopen("http://py4e-data.dr-chuck.net/comments_42.html")
html=urlfile.read()
soup = BeautifulSoup( html,"html.parser" )
spantags = soup("span")
for span in spantags:
span=span.decode()
numlist=re.findall(".+([0-9].*)<",span)
print(numlist)
finalnums.extend(numlist)
for anum in finalnums:
sumn=sumn+int(anum)
print("Sum = ",sumn)
This is an example of the string I'm trying to extract the number from:
<span class="comments">54</span>
Use numlist=re.findall("\d+",span) to search for all contiguous groups of digit characters.
\d is a character class that's equivalent to [0-9], so it would also work if you did numlist=re.findall("[0-9]+",span)
Since there is only one number in each <span> tag:
sumn = 0
for span in spantags:
sumn += int(re.search(r'\d+', span.decode()).group(0))

What am i doing wrong with this regular expression

links = re.findall('href="(http(s?)://[^"]+)"',page)
I have this regular expression to find all links in a website, I am getting this result:
('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')
When what I want is only this:
http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013
If I eliminate the "( after the href it gives me loads of errors, can someone explain why?
If you use more than 1 capturing group, re.findall return list of tuples instead of list of strings. Try following (only using single group):
>>> import re
>>> page = '''
... here
... there
... '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
According to re.findall documentation:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
Try getting rid of the second group (the (s?) in your original pattern):
links = re.findall('href="(https?:\/\/[^"]+)"',page)
What you are doing wrong is trying to parse HTML with Regex. And that sir, is a sin.
See here for the horrors of Regex parsing HTML
An alternative is to use something like lxml to parse the page and extract the links something like this
urls = html.xpath('//a/#href')
You're going to run into problems too if it's a single quote before the https? instead of double.
(https?:\/\/[^\"\'\>]+) will capture the entire string; what you could then do is prepend (href=.?) to it, and you'd end up with two capture groups:
Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)
MATCH 1
[Group 1] href='
[Group 2] http://asecuritysite.com
MATCH 2
[Group 1] href='
[Group 2] https://www.sans.org/webcasts/archive/2013
http://regex101.com/r/gO8vV7 here is a working example

Extracting text from a line: Regex in Python

I'm working with regular expressions in Python and I'm struggling with this.
I have data in a file of lines like this one:
|person=[[Old McDonald]]
and I just want to be able to extract Old McDonald from this line.
I have been trying with this regular expression:
matchLine = re.match(r"\|[a-z]+=(\[\[)?[A-Z][a-z]*(\]\])", line)
print matchLine
but it doesn't work; None is the result each time.
Construct [A-Z][a-z]* does not match Old McDonald. You probably should use something like [A-Z][A-Za-z ]*. Here is code example:
import re
line = '|person=[[Old McDonald]]'
matchLine = re.match ('\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print matchLine.group (1)
The output is Old McDonald for me. If you need to search in the middle of the string, use re.search instead of re.match:
import re
line = 'blahblahblah|person=[[Old McDonald]]blahblahblah'
matchLine = re.search ('\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print matchLine.group (1)

python regex search addition to parse a tag in a text file

I got some help with this earlier today but I cannot figure out the last part of the problem I am having. This regex search returns all of the matches in the open file from the input. What I need to do is also find which part of the file that the match comes from.
Each section is opened and closed with a tag. For example one of the tags opens with <opera> and ends with </opera>. What I want to be able to do is when I find a match I want to either go backwards to the open tag or forwards to the close tag and include the contents of the tag, in this case "opera" in the output. My question is can I do this with an addition to the regular expression or is there a better way? Here is the code I have that works great already:
text = open_file.read()
#the test string for this code is "NNP^CC^NNP"
grammarList = raw_input("Enter your grammar string: ");
tags = grammarList.split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
from re import findall
print(findall(tags_pattern, text))
One way to do it would be to find all occurrences of your start and end section tags (say they're <opera> and </opera>), get the indices, and compare them to each match of tags_pattern. This uses finditer which is like findall but returns indices too. Something like:
startTags = re.finditer("<opera>",text)
endTags = re.finditer("</opera>",text)
matches = re.finditer(tags_pattern,text)
# Now, [m.start() for m in matches] gives the starting index into `text`.
# if <opera> starts at subindices 0, 1000, 2345
# and you get a match starting at subindex 1100,
# then it's in the 1000-2345 block.
for m in matches:
# find first
sec = [i for i in xrange(len(startTags)) if i>startTags[i].start()]
if len(sec)=0:
print "err couldn't find it"
else:
sec = sec[0]
print "found in\n" + text[startTags[sec].start():endTags[sec].end()]
(Note: you can get the matched text with m.group() Default () has group 0 (ie entire string), and you can use m.group(i) for the ith capturing group).
from BeautifulSoup import BeautifulSoup
tags = """stuff outside<opera>asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuff
<asdf>asdf</asdf></opera>stuff outside"""
soup = BeautifulSoup(tags)
soup.opera.text
Out[22]: u'asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuffasdf'
str(soup.opera)
Out[23]: '<opera>asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuff
<asdf>asdf</asdf></opera>'

Categories

Resources