I am using Beautiful Soup to find all links by their href attribute.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see the ???? placeholder in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation is interpreted specially by the regex engine. For example, the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would also match, e.g., 'abc123zcom' (and many other strings of a similar nature).
"abc123" should give you what you want
If that doesn't work, then BS is probably using re.match, in which case you would want ".*abc123.*".
If you want all the links whose href contains 'abc123', you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))
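Putting the pieces together, here is a minimal, self-contained sketch; the sample HTML and the bs4 import are my own assumptions (older scripts import from the BeautifulSoup module instead):
import re
from bs4 import BeautifulSoup

html = '<a href="http://abc123.com/x">one</a> <a href="http://other.com/">two</a>'
myhtml = BeautifulSoup(html, 'html.parser')
# findAll is the BS3-era spelling used in the question; bs4 keeps it as an alias of find_all.
# re.escape makes the dot in 'abc123.com' literal, as explained above.
links = myhtml.findAll('a', href=re.compile(re.escape('abc123.com')))
print([a['href'] for a in links])
# ['http://abc123.com/x']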
I have
/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw
/url?q=http://mp3lees.org/index.php%3Fq%3DSia%2B-%2BElastic%2BHeart%2B(Feat.%2BThe%2BWeeknd%2B%2B%2BDiplo)&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggZMAE&usg=AFQjCNED4J0NRY5dmpC_cYMDJP9YM_Oxww
I am trying to find the http:// link out of the provided google search results link.
I have tried href = re.findall('/url?q=(+/S)&', mixed), where mixed is the variable name in which the unformatted link is stored.
You do not really need a regex to parse query strings. Use urlparse:
import urlparse  # Python 2; in Python 3 use: from urllib import parse as urlparse
s = '/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw'
res = urlparse.parse_qs(urlparse.urlparse(s).query)
if 'q' in res:
    print(res['q'][0])
If you absolutely want a regex solution, for reasons you have not explained, I'd suggest
r'/url\?(?:\S*?&)?q=([^&]+)'
The (?:\S*?&)? part allows the q= parameter to be matched anywhere inside the query string, and ([^&]+) matches one or more characters other than & and captures them into the group that re.findall returns.
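Applied to the first sample link as a quick sketch, the regex returns the embedded URL (note that, unlike parse_qs, re.findall will not percent-decode the result for links like the second sample):
import re

s = '/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw'
print(re.findall(r'/url\?(?:\S*?&)?q=([^&]+)', s))
# ['http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/']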
I want a Python regex expression that can pull the contents between script[" and "], but there are other "] occurrences, which worries me.
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]
You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
Try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate relates only to the "text" content of the script node (and not to an attribute, a positional index, or an axis).
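As a quick check, here is a small sketch applying that pattern to the target string from the question:
import re

p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
target = r'//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]'
m = p.search(target)
if m:
    print(m.group(1))
# {bunch of javascript here. [\"apple\"] test}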
It's very hard to understand exactly what you want to achieve because of the way you have written the question. However, if you are looking for the first instance of "] AFTER a }, then try this:
\["([^}]+}.*?)"\]
This also would work:
\["(.*?}.*?)"\]
I just hit a snag with a regex and have no idea why it isn't working.
Here is what BeautifulSoup doc says:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
Here is my html:
Aouate</span><span class="pos_text pos3_l_4">
and I'm trying to match the span tag (last position).
>>> if soup.find(class_=re.compile("pos_text pos3_l_\d{1}")):
...     print "Yes"
...
# prints nothing - indicating there is no such pattern in the html
So, I'm just repeating the BS4 docs, except my regex is not working. Sure enough, if I replace the \d{1} with 4 (as originally in the html), it succeeds.
Try "\\d" in your regex. It's probably interpreting "\d" as trying to escape 'd'.
Alternatively, a raw string ought to work. Just put an 'r' in front of the regex, like this:
re.compile(r"pos_text pos3_l_\d{1}")
I'm not entirely sure, but this worked for me:
soup.find(attrs={'class':re.compile('pos_text pos3_l_\d{1}')})
You are matching not for a class but for a specific combination of classes in a specific order.
From the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
So you should probably first match for pos_text and then, in the results of that search, try to match with a regexp, as in the sketch below.
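For illustration, here is a small sketch of that two-step approach; the markup is adapted from the question and the exact filtering logic is my own assumption:
import re
from bs4 import BeautifulSoup

html = '<span class="pos_text pos3_l_4">Aouate</span>'
soup = BeautifulSoup(html, 'html.parser')
# First match on the single class 'pos_text', then apply the regex to each of
# the tag's individual classes.
pattern = re.compile(r'pos3_l_\d')
for tag in soup.find_all(class_='pos_text'):
    if any(pattern.match(c) for c in tag.get('class', [])):
        print(tag)
# <span class="pos_text pos3_l_4">Aouate</span>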
How to find all words except the ones in tags using RE module?
I know how to find something, but how do I do it the opposite way? That is, I write something to search for, but actually I want to find every word except everything inside tags and the tags themselves.
So far I managed this:
f = open(filename, 'r')
data = re.findall(r"<.+?>", f.read())
Well, it prints everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried ^ at the start of the pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td etc.) and appending the words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything that's inside <> and these tags themselves?
So, let's say the string is 'hello 123 <b>Bold</b> <p>end</p>';
with re.findall, it would return:
['hello', '123', 'Bold', 'end']
Using regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
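For example, on the sample string from the question (a quick sketch):
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'
print(re.findall(r'\b\w+\b(?![^<]*>)', s))
# ['hello', '123', 'Bold', 'end']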
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
text = "".join(soup.findAll(text=True))
From there, you can get the list of words with split:
words = text.split()
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
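On the same sample string, that one-liner gives:
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'
print(re.compile(r'<[^>]+>').sub('', s).split())
# ['hello', '123', 'Bold', 'end']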
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are < characters in the strings other than as tag delimiters, or if the HTML is not well formed. In that case, it is better to use an HTML parser.
I am dealing with the problem of writing a Python regex 'not' to identify a certain pattern within href tags.
My aim is to replace all occurrences of DSS[a-z]{2}[0-9]{2} with an href link as shown below, but without replacing the same pattern where it occurs inside href tags.
Present Regex:
replaced = re.sub("[^http://*/s](DSS[a-z]{2}[0-9]{2})", "\\1", input)
I need to add this new regex using an OR operator to the existing one I have
EDIT:
I am trying to use a regex just for a simple operation. I want to replace occurrences of the pattern anywhere in the html, except where they occur within <a></a>.
The answer to any question having regexp and HTML in the same sentence is here.
In Python, the best HTML parser is indeed Beautiful Soup.
If you want to persist with a regexp, you can try a negative lookbehind to avoid anything preceded by a ". At your own risk.
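For illustration only, here is a minimal regex-only sketch. It uses a negative lookahead rather than the lookbehind mentioned above, the link template and sample HTML are my own placeholders, and it will break if other tags are nested inside <a>...</a>:
import re

# Skip any match that is followed by '</a>' before the next '<', i.e. matches
# sitting inside an <a> tag's attributes or text.
pattern = r'(DSS[a-z]{2}[0-9]{2})(?![^<]*</a>)'
html = 'See DSSab12 and <a href="http://example.com/DSScd34">DSScd34</a>.'
replaced = re.sub(pattern, r'<a href="/lookup/\1">\1</a>', html)  # '/lookup/' href template is a placeholder
print(replaced)
# See <a href="/lookup/DSSab12">DSSab12</a> and <a href="http://example.com/DSScd34">DSScd34</a>.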