I have a string and I want to extract a substring from it.
Some sample strings are:
http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y
http://domain.com/xxxxx/xxxxxx?tags=%7C12784%7C102496&index=28&showFromBeginning=true&
I want to get the tags value.
In this case:
val = %7C105651%7C102496
val = %7C12784%7C102496
Is there any chance to get that?
Edit
tags = re.search('tags=(.+?)&Asidebar', url)
print tags
if tags:
    found = tags.group(1)
    print (found)
output: None
Note: I've just tried to get something from the first string only
Using urlparse.urlparse and cgi.parse_qs (Python 2.x):
>>> import urlparse
>>> import cgi
>>>
>>> s = 'http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'
>>> cgi.parse_qs(urlparse.urlparse(s).query)
{'dnr': ['y'], 'Asidebar': ['1'], 'tags': ['|105651|102496']}
>>> cgi.parse_qs(urlparse.urlparse(s).query)['tags'][0]
'|105651|102496'
In Python 3.x, use urllib.parse.urlparse and urllib.parse.parse_qs:
>>> import urllib.parse
>>>
>>> s = 'http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'
>>> urllib.parse.parse_qs(urllib.parse.urlparse(s).query)['tags'][0]
'|105651|102496'
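Note that parse_qs decodes percent-escapes, so it returns '|105651|102496' rather than the '%7C105651%7C102496' shown in the question. If the still-encoded value is what you need, one option is to split the query string yourself. This is a minimal sketch for Python 3; the helper name raw_query_value is made up for illustration:

```python
from urllib.parse import urlsplit


def raw_query_value(url, key):
    """Return the still percent-encoded value of `key`, or None if absent.

    Hypothetical helper: only the first occurrence of `key` is returned.
    """
    for pair in urlsplit(url).query.split('&'):
        name, _, value = pair.partition('=')
        if name == key:
            return value
    return None


url1 = 'http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'
url2 = 'http://domain.com/xxxxx/xxxxxx?tags=%7C12784%7C102496&index=28&showFromBeginning=true&'
print(raw_query_value(url1, 'tags'))  # %7C105651%7C102496
print(raw_query_value(url2, 'tags'))  # %7C12784%7C102496
```

The trailing "&" in the second URL only produces an empty pair, which never matches the key.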
You're almost there. You don't need Asidebar in your regex, because your second input string doesn't contain the substring Asidebar at all.
tags = re.search('tags=(.+?)&', url)
if tags:
    found = tags.group(1)
    print (found)
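One caveat with the pattern above: it needs a literal "&" after the value, so it finds nothing when tags is the last parameter in the URL. A slightly more tolerant sketch captures everything up to the next "&" or "#" instead; the names TAGS_RE and find_tags are illustrative only:

```python
import re

# Match "tags=" only at the start of a parameter (after "?" or "&"),
# then capture everything up to the next "&" or "#", or the end of the string.
TAGS_RE = re.compile(r'[?&]tags=([^&#]*)')


def find_tags(url):
    """Return the raw tags value, or None if the URL has no tags parameter."""
    m = TAGS_RE.search(url)
    return m.group(1) if m else None


print(find_tags('http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'))
print(find_tags('http://domain.com/xxxxx/xxxxxx?tags=%7C12784%7C102496'))
```

Anchoring on "?" or "&" also avoids matching a different parameter whose name merely ends in "tags".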
Related
I have the string "/browse/advanced-computer-science-modules?title=machine-learning" in Python. I want to print the string in between the second "/" and the "?", which is "advanced-computer-science-modules".
I've created a regular expression that is as follows ^([a-z]*[\-]*[a-z])*?$ but it prints nothing when I run the .findall() function from the re module.
I created my own regex and imported the re module in python. Below is a snippet of my code that returned nothing.
regex = re.compile(r'^([a-z]*[\-]*[a-z])*?$')
str = '/browse/advanced-computer-science-modules?title=machine-learning'
print(regex.findall(str))
Since this appears to be a URL, I'd suggest you use URL-parsing tools instead:
>>> from urllib.parse import urlsplit
>>> url = '/browse/advanced-computer-science-modules?title=machine-learning'
>>> s = urlsplit(url)
>>> s
SplitResult(scheme='', netloc='', path='/browse/advanced-computer-science-modules', query='title=machine-learning', fragment='')
>>> s.path
'/browse/advanced-computer-science-modules'
>>> s.path.split('/')[-1]
'advanced-computer-science-modules'
The regex is as follows:
regex = re.compile(r'/[a-zA-Z\-]+\?')
Then take the first match and strip the leading "/" and the trailing "?":
regex.findall(str)[0][1:-1]
Very specific to this problem, but it should work.
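A capture group avoids having to slice the delimiters off afterwards. The following is a small sketch of the same idea; SEGMENT_RE and segment_before_query are names invented for this example:

```python
import re

# "/" followed by letters or hyphens, followed by "?"; only the middle part is captured.
SEGMENT_RE = re.compile(r'/([a-zA-Z-]+)\?')


def segment_before_query(url):
    """Return the path segment directly before the "?", or None if there is none."""
    m = SEGMENT_RE.search(url)
    return m.group(1) if m else None


print(segment_before_query('/browse/advanced-computer-science-modules?title=machine-learning'))
# advanced-computer-science-modules
```

Like the original pattern, this only accepts letters and hyphens in the segment, so digits or underscores would need to be added to the character class.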
Alternatively, you can use the split method of a string:
str = '/browse/advanced-computer-science-modules?title=machine-learning'
result = str.split('/')[-1].split('?')[0]
print(result)
#advanced-computer-science-modules
My text is
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
I am trying to extract value of posted_data which is 2e54eba66f8f2881c8e78be8342428xd
My code :
extract_posted_data = re.search(r'(\"posted_data\": \")(\w*)', my_text)
print (extract_posted_data)
and it prints None
Thank you
This particular example doesn't seem like it needs regular expressions at all.
>>> my_text
'"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> import json
>>> result = json.loads('{%s}' % my_text)
>>> result
{'posted_data': '2e54eba66f8f2881c8e78be8342428xd', 'isropa': False, 'rx': 'NO', 'readal': 'false'}
>>> result['posted_data']
'2e54eba66f8f2881c8e78be8342428xd'
With BeautifulSoup:
>>> import json
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script type="text/javascript"> "posted_data":"2738273283723hjasda" </script>', 'html.parser')
>>> result = json.loads('{%s}' % soup.script.text)
>>> result
{'posted_data': '2738273283723hjasda'}
>>> result['posted_data']
'2738273283723hjasda'
This is because your original code has an additional space. It should be:
extract_posted_data = re.search(r'(\"posted_data\":\")(\w*)', my_text)
And in fact, '\' is unnecessary here. Just:
extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
Then:
extract_posted_data.group(2)
is what you want.
>>> my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
>>> extract_posted_data.group(2)
'2e54eba66f8f2881c8e78be8342428xd'
Another option is to change your regex to use lookarounds, so that only the value itself is matched:
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
extract_posted_data = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
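print (extract_posted_data[0])
This prints 2e54eba66f8f2881c8e78be8342428xd. re.search() returns a Match object, so index 0 of the match (the same as .group(0), available from Python 3.6) gives the full matched text. Because lookarounds are not part of the match, that text is just the value.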
As others have mentioned, json would be a better tool for this data, but you can also use this regex (I added a \s* in case there are spaces in between in the future):
regex: "posted_data":\s*"(?P<posted_data>[^"]+)"
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
m = re.search(r'"posted_data":\s*"(?P<posted_data>[^"]+)"', my_text)
if m:
    print(m.group('posted_data'))
common is always present regardless of string. Using that information, I'd like to grab the substring that comes just before it, in this case, "banana":
string = "apple_orange_banana_common_fruit"
In this case, "fruit":
string = "fruit_common_apple_banana_orange"
How would I go about doing this in Python?
You can use re.search() to extract the substring:
>>> import re
>>> s = 'apple_orange_banana_common_fruit'
>>> re.search(r'([a-zA-Z]+)_common', s).group(1)
'banana'
This will return a list of matches:
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.findall("[A-Za-z]+(?=_common)", string)
If common only occurs once per string, you might be better off using hwnd's solution.
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.search('([a-zA-Z]+)(?=_common)', string)
print (preceding_word.group(1))
>>> string = "fruit_common_apple_banana_orange"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
fruit
>>> string = "apple_orange_banana_common_fruit"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
banana
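The split-based approach has two edge cases worth guarding against: parts.index raises ValueError when "common" is missing, and when "common" is the first word, index - 1 becomes -1 and silently returns the last word. Here is a hedged sketch that handles both; the function name word_before and its parameters are my own invention:

```python
def word_before(text, marker='common', sep='_'):
    """Return the word immediately before `marker`, or None if there is none."""
    parts = text.split(sep)
    if marker not in parts:
        return None              # marker absent: avoid ValueError from index()
    i = parts.index(marker)
    if i == 0:
        return None              # marker comes first: avoid wrapping around to parts[-1]
    return parts[i - 1]


print(word_before('apple_orange_banana_common_fruit'))  # banana
print(word_before('fruit_common_apple_banana_orange'))  # fruit
print(word_before('common_apple'))                      # None
```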
Suppose I have a URL as below:
url = 'https://www.advertise-example.com/ads/2022/presents'
I am trying to extract the integer value 2022 from this URL. List slicing would work here, but the number of digits may grow, so I tried regular expressions and couldn't get the exact result. Can anyone tell me how to do this?
Thanks in advance.
>>> import re
>>> url = 'https://www.advertise-example.com/ads/2022/presents'
>>> int(re.search(r'\d+', url).group())
2022
from urlparse import urlsplit  # Python 2; on Python 3: from urllib.parse import urlsplit
import re
url = 'https://www.advertise-example.com/ads/2022/presents'
spliturl = urlsplit(url)
int(re.search(r'\d+', spliturl.path).group())
Possibly look at re.findall if you expect, or want to handle, more than one number in the URL.
Alternatively, not using re:
digits = [int(el) for el in spliturl.path.split('/') if el.isdigit()]
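For Python 3, the non-regex variant can be wrapped up roughly as follows. This is a sketch under the assumption that the numbers of interest are whole path segments; numeric_segments is an illustrative name:

```python
from urllib.parse import urlsplit


def numeric_segments(url):
    """Return every path segment that consists only of decimal digits, as ints."""
    path = urlsplit(url).path
    # Working on the path alone keeps digits in the host name or port out of the result.
    return [int(seg) for seg in path.split('/') if seg.isdecimal()]


print(numeric_segments('https://www.advertise-example.com/ads/2022/presents'))  # [2022]
```

Returning a list makes the "more than one number" case explicit: the caller decides whether to take the first element or handle several.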
Here is a solution without using regex
>>> import itertools
>>> url = 'https://www.advertise-example.com/ads/2022/presents'
>>> int(next(''.join(g) for k, g in itertools.groupby(url, str.isdigit) if k))
2022
I have the following string
http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342
What is the best way to extract the id value, in this case 32434242423423234?
Regards,
Mladjo
You could just use a regular expression, e.g.:
import re
s = "http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342"
m = re.search(r'controller/id(\d+)\?', s)
if m:
    print("Found the id:", m.group(1))
If you need the value as a number rather than a string, you can use int(m.group(1)). There are plenty of other ways of doing this that might be more appropriate, depending on the larger goal of your code, but without more context it's hard to say.
>>> import urlparse
>>> res=urlparse.urlparse("http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342")
>>> res.path
'/variable/controller/id32434242423423234'
>>> import posixpath
>>> posixpath.split(res.path)
('/variable/controller', 'id32434242423423234')
>>> directory,filename=posixpath.split(res.path)
>>> filename[2:]
'32434242423423234'
Using urlparse and posixpath might be too much for this case, but I think it is the clean way to do it.
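The session above uses the Python 2 urlparse module. In Python 3 the same idea might look like the sketch below, where urllib.parse replaces urlparse; the helper controller_id and its prefix parameter are assumptions made for this example:

```python
import posixpath
from urllib.parse import urlsplit


def controller_id(url, prefix='id'):
    """Return the last path segment with `prefix` removed, or None if it lacks the prefix."""
    last_segment = posixpath.basename(urlsplit(url).path)
    if not last_segment.startswith(prefix):
        return None
    return last_segment[len(prefix):]


url = 'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
print(controller_id(url))  # 32434242423423234
```

Checking the prefix explicitly is a little safer than filename[2:], which would drop the first two characters of whatever the last segment happens to be.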
>>> s
'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
>>> s.split("id")
['http://example.com/variable/controller/', '32434242423423234?param1=321&param2=4324342']
>>> s.split("id")[-1].split("?")[0]
'32434242423423234'
>>>
While regex is THE way to go, for simple things I have written a string parser. In a way, it is the (incomplete) reverse of a string formatting operation with PEP 3101. This is very convenient because it means that you do not have to learn another way of specifying the strings.
For example:
>>> 'The answer is {:d}'.format(42)
'The answer is 42'
The parser does the opposite:
>>> Parser('The answer is {:d}')('The answer is 42')
42
For your case, if you want an int as output
>>> url = 'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
>>> fmt = 'http://example.com/variable/controller/id{:d}?param1=321&param2=4324342'
>>> Parser(fmt)(url)
32434242423423234
If you want a string:
>>> fmt = 'http://example.com/variable/controller/id{:s}?param1=321&param2=4324342'
>>> Parser(fmt)(url)
'32434242423423234'
If you want to capture more things in a dict:
>>> fmt = 'http://example.com/variable/controller/id{id:s}?param1={param1:s}&param2={param2:s}'
>>> Parser(fmt)(url)
{'id': '32434242423423234', 'param1': '321', 'param2': '4324342'}
or in a tuple:
>>> fmt = 'http://example.com/variable/controller/id{:s}?param1={:s}&param2={:s}'
>>> Parser(fmt)(url)
('32434242423423234', '321', '4324342')
Give it a try, it is hosted here
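To illustrate the general idea of reversing a format string, here is a rough standard-library sketch. It is not the Parser class described above and makes no claim about how that class is implemented; simple_parse is a hypothetical function that only understands the {:d} and {:s} specs, by turning each field into a regex group:

```python
import re
from string import Formatter


def simple_parse(fmt, text):
    """Hypothetical reverse of str.format for {:d} and {:s} fields only."""
    pattern, converters, names = [], [], []
    for literal, field, spec, _conversion in Formatter().parse(fmt):
        pattern.append(re.escape(literal))   # literal text must match exactly
        if field is None:                    # trailing literal with no field after it
            continue
        if spec == 'd':
            pattern.append(r'(-?\d+)')
            converters.append(int)
        else:                                # treat everything else like {:s}
            pattern.append(r'(.+?)')
            converters.append(str)
        names.append(field)
    m = re.fullmatch(''.join(pattern), text)
    if m is None:
        return None
    values = [conv(raw) for conv, raw in zip(converters, m.groups())]
    if names and all(names):                 # every field is named: return a dict
        return dict(zip(names, values))
    return values[0] if len(values) == 1 else tuple(values)


url = 'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
print(simple_parse('The answer is {:d}', 'The answer is 42'))  # 42
print(simple_parse(
    'http://example.com/variable/controller/id{id:d}?param1={param1:s}&param2={param2:s}',
    url))
```

The sketch returns None when the text does not fit the format, a single value for one unnamed field, a tuple for several unnamed fields, and a dict when all fields are named.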