My text is
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
I am trying to extract the value of posted_data, which is 2e54eba66f8f2881c8e78be8342428xd.
My code:
extract_posted_data = re.search(r'(\"posted_data\": \")(\w*)', my_text)
print (extract_posted_data)
It prints None.
Thank you
This particular example doesn't seem like it needs regular expressions at all.
>>> my_text
'"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> import json
>>> result = json.loads('{%s}' % my_text)
>>> result
{'posted_data': '2e54eba66f8f2881c8e78be8342428xd', 'isropa': False, 'rx': 'NO', 'readal': 'false'}
>>> result['posted_data']
'2e54eba66f8f2881c8e78be8342428xd'
With BeautifulSoup:
>>> import json
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script type="text/javascript"> "posted_data":"2738273283723hjasda" </script>', 'html.parser')
>>> result = json.loads('{%s}' % soup.script.text)
>>> result
{'posted_data': '2738273283723hjasda'}
>>> result['posted_data']
'2738273283723hjasda'
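The JSON approach above can be wrapped in a small helper. This is only a minimal sketch, assuming the fragment is always a comma-separated run of valid JSON key/value pairs; the name extract_field is hypothetical and not part of any library.

```python
import json

def extract_field(fragment, key):
    """Wrap a brace-less JSON fragment in {} and return the value for key.

    Returns None if the fragment is not valid JSON or the key is absent.
    """
    try:
        data = json.loads('{%s}' % fragment)
    except json.JSONDecodeError:
        return None
    return data.get(key)

my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
print(extract_field(my_text, 'posted_data'))  # 2e54eba66f8f2881c8e78be8342428xd
```

Returning None on a parse failure mirrors what re.search() does on a miss, so the caller can check the result the same way in both approaches.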
This is because the pattern in your original code has an extra space after the colon. It should be:
extract_posted_data = re.search(r'(\"posted_data\":\")(\w*)', my_text)
And in fact, the backslashes are unnecessary here. Just:
extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
Then:
extract_posted_data.group(2)
is what you want.
>>> my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
>>> extract_posted_data.group(2)
'2e54eba66f8f2881c8e78be8342428xd'
You can also change your regex to use lookarounds, as follows:
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
extract_posted_data = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
print(extract_posted_data[0])
This prints 2e54eba66f8f2881c8e78be8342428xd.
re.search() returns a Match object, so indexing it with [0] (equivalent to .group(0)) returns the entire matched text.
As others have mentioned, json would be a better tool for this data, but you can also use this regex (I added \s* in case spaces appear after the colon in the future):
regex: "posted_data":\s*"(?P<posted_data>[^"]+)"
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
m = re.search(r'"posted_data":\s*"(?P<posted_data>[^"]+)"', my_text)
if m:
    print(m.group('posted_data'))
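To see what the \s* buys, here is a small sketch comparing a strict pattern with the tolerant one on two inputs. The sample strings are made up for illustration and are not from the original question.

```python
import re

strict = re.compile(r'"posted_data":"(\w*)')
tolerant = re.compile(r'"posted_data":\s*"(?P<posted_data>[^"]+)"')

samples = [
    '"posted_data":"abc123","rx":"NO"',   # no space after the colon
    '"posted_data": "abc123","rx":"NO"',  # one space after the colon
]

for text in samples:
    s = strict.search(text)
    t = tolerant.search(text)
    # Guard against None before calling .group()
    print(s.group(1) if s else None, t.group('posted_data') if t else None)
# abc123 abc123
# None abc123
```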
Related
I have a file with the following format:
===Subtitle 1===
text ....
===Subtitle 2===
text ....
How could I replace ===Subtitle 1=== with Section Subtitle 1 using python?
I tried this:
import re
s = '===Subtitle 1==='
lst = re.findall('===[\S+a-zA-Z0-9]===', s)
print lst
But I cannot print out anything.
You don't need a regex here; just use str.replace() like this:
>>> a = '===Subtitle 1==='
>>> a.replace('=', '')
'Subtitle 1'
>>>
But if you'd like to use a regex...
>>> import re
>>> re.findall('=+(.+?)=+', a)
['Subtitle 1']
>>> re.findall('=+(.+?)=+', a)[0]
'Subtitle 1'
>>>
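Neither snippet above produces the "Section Subtitle 1" form the question asks for. A minimal sketch using re.sub() with a backreference, assuming each heading sits alone on its line and is wrapped in exactly three equals signs on each side:

```python
import re

text = """===Subtitle 1===
text ....
===Subtitle 2===
text ...."""

# With re.MULTILINE, ^ and $ match at the start and end of every line;
# \1 re-inserts the captured heading text.
result = re.sub(r'^===(.+?)===$', r'Section \1', text, flags=re.MULTILINE)
print(result)
# Section Subtitle 1
# text ....
# Section Subtitle 2
# text ....
```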
common is always present regardless of string. Using that information, I'd like to grab the substring that comes just before it, in this case, "banana":
string = "apple_orange_banana_common_fruit"
In this case, "fruit":
string = "fruit_common_apple_banana_orange"
How would I go about doing this in Python?
You can use re.search() to extract the substring:
>>> import re
>>> s = 'apple_orange_banana_common_fruit'
>>> re.search(r'([a-zA-Z]+)_common', s).group(1)
'banana'
This will return a list of matches:
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.findall("[A-Za-z]+(?=_common)", string)
If common only occurs once per string, you might be better off using hwnd's solution.
import re
string = "apple_orange_banana_common_fruit"
# The lookahead requires "_common" to follow without including it in the match.
preceding_word = re.search(r'([a-zA-Z]+)(?=_common)', string)
print(preceding_word.group(1))
>>> string = "fruit_common_apple_banana_orange"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
fruit
>>> string = "apple_orange_banana_common_fruit"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
banana
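One caveat with the split-based version: if common is the first token, index - 1 becomes -1 and silently returns the last token, and if common is missing, list.index() raises ValueError. A minimal sketch that handles both cases; the name word_before and its defaults are hypothetical.

```python
def word_before(s, marker='common', sep='_'):
    """Return the token immediately before marker, or None if there is none."""
    parts = s.split(sep)
    try:
        i = parts.index(marker)
    except ValueError:
        return None  # marker is not present at all
    # i == 0 means nothing precedes the marker; avoid wrapping to parts[-1]
    return parts[i - 1] if i > 0 else None

print(word_before("apple_orange_banana_common_fruit"))  # banana
print(word_before("fruit_common_apple_banana_orange"))  # fruit
print(word_before("common_apple"))                      # None
```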
I have a problem with regex and my string...
I need my regex to extract a float from a string, and I don't know why this code doesn't work.
from bs4 import BeautifulSoup
import urllib2
from re import sub
url = 'http://www.ebay.es/itm/PET-SHOP-BOYS-OFFICIAL-PROMO-BARCELONA-ELECTRIC-TOUR-BEER-CERVEZA-20cl-BOTTLE-/111116266655' #raw_input('Dime la url que deseas: ')
code = urllib2.urlopen(url).read();
soup = BeautifulSoup(code)
info = soup.find('span', id='v4-27').contents[0]
print info
info = sub("[\D]+,+[\D]", "", info)
i = float(info)
print i
\D matches non-digit characters. You need to use \d instead. See here for details: http://en.wikipedia.org/wiki/Regular_expression#Character_classes
Updated
I see that your approach is to remove all non-digit characters. To my mind, matching the information you need is clearer:
>>> import re
>>> s = "15,00 EUR"
>>> price_string = re.search(r'(\d+,\d+)', s).group(1)
>>> price_string
'15,00'
>>> float(price_string.replace(',', '.'))
15.0
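The match-then-convert steps above could be combined into one function. This is a sketch under the assumption that the comma is the decimal separator and the text contains no thousands separators; parse_price is a hypothetical name.

```python
import re

def parse_price(text):
    """Return the first number such as '15,00' in text as a float, else None."""
    match = re.search(r'\d+(?:,\d+)?', text)
    if match is None:
        return None
    # float() expects a dot as the decimal separator
    return float(match.group(0).replace(',', '.'))

print(parse_price("15,00 EUR"))  # 15.0
print(parse_price("EUR"))        # None
```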
I need to find anything between
show_detail&
and
;session_id=1445045
in
https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0
using regex in python.
I know I need to use lookbehind/lookahead, but I can't seem to make it work!
please help!
thanks :)
Why use a regex?
>>> url = 'https://ww.site.gov.....'
>>> start = url.index('show_detail&') + len('show_detail&')
>>> end = url.index(';session_id=')
>>> url[start:end]
'id=4035219;num=1'
>>> s= "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> s.split(";session_id=1445045")[0].split("show_detail&")[-1]
'id=4035219;num=1'
>>>
You can use a non-greedy match (.*?) between your markers.
>>> import re
>>> url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> m = re.search("show_detail&(.*?);session_id=1445045", url)
>>> m.group(1)
'id=4035219;num=1'
regex = re.compile(r"(?<=show_detail&).*?(?=;session_id=1445045)")
should work. See here for more info on lookaround assertions.
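A sketch of how the lookaround pattern might be generalized to arbitrary markers. The helper name between_markers is hypothetical; re.escape() is used on the assumption that the markers should be matched literally.

```python
import re

def between_markers(text, start, end):
    """Return the shortest text between start and end, or None if absent."""
    # re.escape makes the markers literal, so characters such as '?' or '.'
    # inside them are not treated as regex syntax.
    pattern = r'(?<=%s).*?(?=%s)' % (re.escape(start), re.escape(end))
    match = re.search(pattern, text)
    return match.group(0) if match else None

url = ("https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi"
       "?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0")
print(between_markers(url, 'show_detail&', ';session_id=1445045'))  # id=4035219;num=1
```

A literal marker always has a fixed width, which is what Python's lookbehind requires.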
import re
url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
pattern = r"([^>].+)(show_detail&)([^>].+)(;session_id=1445045)([^>].+)"
reg = re.compile(pattern, flags=re.S)
match = reg.search(url)
print(match.group(3))  # the text between the two markers
This should work, I think.
I need help with two regex operations.
Get all text until an open bracket.
e.g. 'this is so cool (234)' => 'this is so cool'
Get the text inside the brackets, so the number '234'
Up until the paren: regex = re.compile(r"(.*?)\s*\(")
Inside the first set of parens: regex = re.compile(r".*?\((.*?)\)")
Edit: Single regex version: regex = re.compile(r"(.*?)\s*\((.*?)\)")
Example output:
>>> import re
>>> r1 = re.compile(r"(.*?)\s*\(")
>>> r2 = re.compile(r".*?\((.*?)\)")
>>> text = "this is so cool (234)"
>>> m1 = r1.match(text)
>>> m1.group(1)
'this is so cool'
>>> m2 = r2.match(text)
>>> m2.group(1)
'234'
>>> r3 = re.compile(r"(.*?)\s*\((.*?)\)")
>>> m3 = r3.match(text)
>>> m3.group(1)
'this is so cool'
>>> m3.group(2)
'234'
>>>
Note, of course, that this won't work correctly with multiple or nested sets of parens, as it expects only one parenthesized block of text (as per your example). The language of balanced parentheses nested to arbitrary depth is not regular.
Sounds to me like you could just do this:
re.findall('[^()]+', mystring)
Splitting would work, too:
re.split('[()]', mystring)
Either way, the text before the first parenthesis will be the first item in the resulting list, and the text inside the first set of parens will be the second item.
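For the example string from the question, the two calls behave as follows. This is a small sketch; note that the first item keeps its trailing space, so strip() is applied if that matters.

```python
import re

mystring = "this is so cool (234)"

# Runs of characters that are not parentheses.
print(re.findall(r'[^()]+', mystring))  # ['this is so cool ', '234']

# Splitting on either parenthesis leaves a trailing empty string
# because the input ends with ')'.
print(re.split(r'[()]', mystring))      # ['this is so cool ', '234', '']

before, inside = re.findall(r'[^()]+', mystring)[:2]
print(before.strip(), inside)           # this is so cool 234
```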
No need for regular expression.
>>> s="this is so cool (234)"
>>> s.split("(")[0]
'this is so cool '
>>> s="this is so cool (234) test (123)"
>>> for i in s.split(")"):
...     if "(" in i:
...         print(i.split("(")[-1])
...
234
123
Here is my own library function version without regex.
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after
s="this is so cool (234)"
print('\n'.join(between('(',')',s)))