I have a problem with a regex and my string.
I need my regex code to extract a float from a string. I don't know why this code doesn't work.
from bs4 import BeautifulSoup
import urllib2
from re import sub
url = 'http://www.ebay.es/itm/PET-SHOP-BOYS-OFFICIAL-PROMO-BARCELONA-ELECTRIC-TOUR-BEER-CERVEZA-20cl-BOTTLE-/111116266655' #raw_input('Dime la url que deseas: ')
code = urllib2.urlopen(url).read();
soup = BeautifulSoup(code)
info = soup.find('span', id='v4-27').contents[0]
print info
info = sub("[\D]+,+[\D]", "", info)
i = float(info)
print i
\D means Non-digits. You need to use \d instead. Look here for details: http://en.wikipedia.org/wiki/Regular_expression#Character_classes
Updated
I see that your approach is to remove all non-digit characters. To my mind, matching the information you need is clearer:
>>> import re
>>> s = "15,00 EUR"
>>> price_string = re.search(r'(\d+,\d+)', s).group(1)
>>> price_string
'15,00'
>>> float(price_string.replace(',', '.'))
15.0
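For reuse, the same idea can be wrapped in a small helper. This is only a minimal Python 3 sketch: the name parse_price is hypothetical, and it assumes a comma as the decimal separator and no thousands separators, as in the "15,00 EUR" example.

```python
import re

def parse_price(text):
    """Return the first comma-decimal number in text as a float, or None."""
    match = re.search(r'\d+,\d+', text)
    if match is None:
        return None  # no price-like number found
    # Convert the decimal comma to a dot before calling float().
    return float(match.group().replace(',', '.'))

print(parse_price("15,00 EUR"))
print(parse_price("no price here"))
```

Returning None instead of raising keeps the caller in charge of what to do when the page layout changes and no price is found.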
Related
I have the string "/browse/advanced-computer-science-modules?title=machine-learning" in Python. I want to print the string in between the second "/" and the "?", which is "advanced-computer-science-modules".
I've created a regular expression that is as follows ^([a-z]*[\-]*[a-z])*?$ but it prints nothing when I run the .findall() function from the re module.
I created my own regex and imported the re module in python. Below is a snippet of my code that returned nothing.
regex = re.compile(r'^([a-z]*[\-]*[a-z])*?$')
str = '/browse/advanced-computer-science-modules?title=machine-learning'
print(regex.findall(str))
Since this appears to be a URL, I'd suggest you use URL-parsing tools instead:
>>> from urllib.parse import urlsplit
>>> url = '/browse/advanced-computer-science-modules?title=machine-learning'
>>> s = urlsplit(url)
>>> s
SplitResult(scheme='', netloc='', path='/browse/advanced-computer-science-modules', query='title=machine-learning', fragment='')
>>> s.path
'/browse/advanced-computer-science-modules'
>>> s.path.split('/')[-1]
'advanced-computer-science-modules'
The regex is as follows:
\/[a-zA-Z\-]+\?
Then take the first match and strip the leading "/" and the trailing "?":
regex.findall(str)[0][1:-1]
Very specific to this problem, but it should work.
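A capturing group avoids the slicing step. The sketch below assumes the wanted segment is the last path component before the "?"; the variable names are illustrative. The pattern from the question returned nothing because ^ and $ force it to match the entire string, which contains characters such as "/", "?" and "=" that the pattern does not allow.

```python
import re

url = '/browse/advanced-computer-science-modules?title=machine-learning'

# Capture the run of characters between a "/" and the "?" that
# contains neither "/" nor "?".
match = re.search(r'/([^/?]+)\?', url)
segment = match.group(1) if match else None
print(segment)
```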
Alternatively, you can use the string's split method:
str = '/browse/advanced-computer-science-modules?title=machine-learning'
result = str.split('/')[-1].split('?')[0]
print(result)
#advanced-computer-science-modules
My text is
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
I am trying to extract the value of posted_data, which is 2e54eba66f8f2881c8e78be8342428xd
My code :
extract_posted_data = re.search(r'(\"posted_data\": \")(\w*)', my_text)
print (extract_posted_data)
and it prints None
Thank you
This particular example doesn't seem like it needs regular expressions at all.
>>> my_text
'"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> import json
>>> result = json.loads('{%s}' % my_text)
>>> result
{'posted_data': '2e54eba66f8f2881c8e78be8342428xd', 'isropa': False, 'rx': 'NO', 'readal': 'false'}
>>> result['posted_data']
'2e54eba66f8f2881c8e78be8342428xd'
With BeautifulSoup:
>>> import json
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script type="text/javascript"> "posted_data":"2738273283723hjasda" </script>')
>>> result = json.loads('{%s}' % soup.script.text)
>>> result
{'posted_data': '2738273283723hjasda'}
>>> result['posted_data']
'2738273283723hjasda'
This is because your original code has an additional space. It should be:
extract_posted_data = re.search(r'(\"posted_data\":\")(\w*)', my_text)
And in fact, '\' is unnecessary here. Just:
extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
Then:
extract_posted_data.group(2)
is what you want.
>>> my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
>>> extract_posted_data.group(2)
'2e54eba66f8f2881c8e78be8342428xd'
You need to change your regex to use lookarounds, as follows:
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
extract_posted_data = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
print(extract_posted_data[0])
This prints 2e54eba66f8f2881c8e78be8342428xd.
Note that re.search() returns a Match object, so indexing it with 0 (equivalent to .group(0), available since Python 3.6) gives the full matched text, which here is the only match.
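For comparison, a capturing group gives the same value without lookarounds. The following is just a small sketch checking that both forms agree on the sample text from the question:

```python
import re

my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'

# Lookaround version: the whole match is the value itself.
lookaround = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
# Capturing-group version: the value is group 1.
grouped = re.search(r'"posted_data":"(\w*)"', my_text)

print(lookaround[0] == grouped.group(1))
```

The lookbehind works here because the text inside it has a fixed length, which Python's re module requires.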
As others have mentioned, json would be a better tool for this data, but you can also use this regex (I added a \s* in case there are spaces in between in the future):
regex: "posted_data":\s*"(?P<posted_data>[^"]+)"
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
m = re.search(r'"posted_data":\s*"(?P<posted_data>[^"]+)"', my_text)
if m:
    print(m.group('posted_data'))
I'm trying to find all cases of money values in a string called webpage.
String webpage is the text from this webpage, in my program it's just hardcoded because that's all that is needed, but I won't paste it all here.
regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
It's returning [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn]
Without knowing the text it has to search, you could use the regex:
([€|$|£]+[0-9a-zA-Z\,\.]+)
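to capture every token that starts with €, £ or $ and continues with digits, letters, commas or periods. The amount is returned together with a directly attached suffix such as bn, but without any separate words that follow. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.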
Using this regex we get this code:
import re
webpage = '''
one
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€|$]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)
with the output:
['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']
EDIT: Using the same regex on the provided website, it returns the output of:
['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
To extend the regex so that it also finds values such as 500million, add 0-9 to the first character class, so that a match can start with a currency symbol or a digit.
For this input and regex:
webpage = '''
one
million
€1293.1205 million
500million
'''
regex = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
the result is:
['€1293.1205', '500million']
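One detail worth noting: inside a character class, | is a literal pipe character, not alternation, so [€|$|£] also accepts "|". The sketch below illustrates the difference on a made-up sample string; the tighter pattern is only a suggestion and additionally requires a digit right after the currency symbol.

```python
import re

loose = r'[€|$|£]+[0-9a-zA-Z,.]+'    # "|" is a literal inside [...]
tight = r'[€$£][0-9][0-9a-zA-Z,.]*'  # currency symbol, then a digit

sample = 'a|b costs $5bn or £3.50'

print(re.findall(loose, sample))  # ['|b', '$5bn', '£3.50']
print(re.findall(tight, sample))  # ['$5bn', '£3.50']
```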
The first error in your regex is the ^ at the beginning, which anchors the match to the start of the string, so findall can find at most one match there.
You are also defining a lot of capturing groups (()) that I assume you don't really need, so make all of them non-capturing (add ?: right after the opening parenthesis) and you will get very close to what you want:
regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
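The groups matter because re.findall returns the captured groups rather than the whole match whenever the pattern contains capturing groups. Here is a simplified sketch of that behaviour, using a shortened pattern and a made-up sample string rather than the real page text:

```python
import re

sample = 'Profits of $131bn and £100bn, then $17.4bn more.'

# With a capturing group, findall returns only what the group captured
# (an empty string when the optional group did not take part).
with_group = re.findall(r'[$£€]\d+(\.\d+)?bn', sample)
# With a non-capturing group, findall returns the full matches.
without_group = re.findall(r'[$£€]\d+(?:\.\d+)?bn', sample)

print(with_group)     # ['', '', '.4']
print(without_group)  # ['$131bn', '£100bn', '$17.4bn']
```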
A webscraping solution:
import urllib
import itertools
from bs4 import BeautifulSoup as soup
import re
s = soup(str(urllib.urlopen('http://www.bbc.com/news/business-41779341').read()), 'lxml')
final_data = list(itertools.chain.from_iterable(filter(lambda x:x, [re.findall('[€\$£][\w\.]+', i.text) for i in s.findAll('p')])))
Output:
[u'$131bn', u'\xa3100bn', u'$100bn', u'$17.4bn.']
I have a URL string, which is https://example.com/about/hello/
I want to split the string into 'https://example.com', 'about', 'hello'
How can I do this?
Use the urlparse module to parse the URL correctly (this is Python 2; in Python 3 the module is urllib.parse):
import urlparse
url = 'https://example.com/about/hello/'
parts = urlparse.urlparse(url)
paths = [p for p in parts.path.split('/') if p]
print 'Scheme:', parts.scheme # https
print 'Host:', parts.netloc # example.com
print 'Path:', parts.path # /about/hello/
print 'Paths:', paths # ['about', 'hello']
At the end of the day, the information you want is in the parts.scheme, parts.netloc and paths variables.
You can do this:
First split on '/'.
Then join with '/' only the parts before the third occurrence.
Code:
text="https://example.com/about/hello/"
groups = text.split('/')
print( "/".join(groups[:3]),groups[3],groups[4])
Output:
https://example.com about hello
Inspired by Hai Vu's answer. This solution is for Python 3.
from urllib.parse import urlparse
url = 'https://example.com/about/hello/'
parts = [p for p in urlparse(url).path.split('/') if p]
parts.insert(0, '/'.join(url.split('/')[:3]))  # 'https://example.com'
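The base URL can also be rebuilt from the parsed scheme and netloc instead of splitting the string by hand. A minimal Python 3 sketch, in which the function name split_url is hypothetical:

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return ['scheme://host', segment, segment, ...] for a URL."""
    parts = urlsplit(url)
    base = '{}://{}'.format(parts.scheme, parts.netloc)
    # Drop the empty strings produced by leading and trailing slashes.
    segments = [p for p in parts.path.split('/') if p]
    return [base] + segments

print(split_url('https://example.com/about/hello/'))
```

This keeps working when the URL carries a query string or fragment, since those are parsed into separate fields and never reach the path split.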
There are lots of ways to do this. You could use re.split() to split on a regular expression, for instance.
>>> import re
>>> re.split(r'\b/\b', 'https://example.com/about/hello/'.rstrip('/'))
['https://example.com', 'about', 'hello']
re is part of the standard library, documented here.
https://docs.python.org/3/library/re.html#re.split
The regex itself uses \b, which means a boundary between a "word" character and a "non-word" character. You can use regex101 to explore how it works. https://regex101.com/r/mY8fV8/1
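To see which slashes the pattern treats as split points, here is a small sketch comparing the URL with and without its trailing slash. Only a "/" with a word character on both sides matches, so the "//" after the scheme and a trailing "/" are left alone:

```python
import re

pattern = r'\b/\b'

# Without a trailing slash every path segment is separated cleanly.
no_trailing = re.split(pattern, 'https://example.com/about/hello')
# A trailing slash is not a split point, so it stays on the last item.
with_trailing = re.split(pattern, 'https://example.com/about/hello/')

print(no_trailing)    # ['https://example.com', 'about', 'hello']
print(with_trailing)  # ['https://example.com', 'about', 'hello/']
```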
Suppose I have a URL as below:
url = 'https://www.advertise-example.com/ads/2022/presents'
I am trying to extract the integer value 2022 from the above URL. List slicing would work here, but the number of digits can grow, so I tried regular expressions and couldn't get the exact result. Can anyone tell me how to do this?
Thanks in advance.
>>> import re
>>> url = 'https://www.advertise-example.com/ads/2022/presents'
>>> int(re.search(r'\d+', url).group())
2022
from urlparse import urlsplit
import re
url = 'https://www.advertise-example.com/ads/2022/presents'
spliturl = urlsplit(url)
int(re.search(r'\d+', spliturl.path).group())
Consider re.findall if you expect, or want to handle, more than one number in the URL.
Alternatively, not using re:
digits = [int(el) for el in spliturl.path.split('/') if el.isdigit()]
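A Python 3 sketch of the findall variant, assuming the numbers of interest live in the path (the variable names are illustrative):

```python
import re
from urllib.parse import urlsplit

url = 'https://www.advertise-example.com/ads/2022/presents'

# Search only the path so that digits in the host name are ignored.
path = urlsplit(url).path
numbers = [int(n) for n in re.findall(r'\d+', path)]
print(numbers)  # [2022]
```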
Here is a solution without using regex
>>> import itertools
>>> url = 'https://www.advertise-example.com/ads/2022/presents'
>>> int(next(''.join(g) for k, g in itertools.groupby(url, str.isdigit) if k))
2022
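To see what groupby is doing there, the following sketch collects every run of digits instead of only the first one. It is the same technique, just unpacked for readability:

```python
import itertools

url = 'https://www.advertise-example.com/ads/2022/presents'

# groupby splits the string into consecutive runs of digits (key True)
# and runs of non-digits (key False); keep only the digit runs.
digit_runs = [
    ''.join(group)
    for is_digit, group in itertools.groupby(url, str.isdigit)
    if is_digit
]
print(digit_runs)  # ['2022']
```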