I've been working on a web crawler to parse readable content out of a news site I like, and I've been using regex pretty heavily in Python 2. I visited https://regexr.com/ to double-check that I had the correct expression for this use case, but I keep getting results different from what regexr shows me. Here is the expression:
re.compile(ur"[\s\S\]*<p.*>([\s\S]+?)<\/p>")
And here is the HTML I am attempting to match:
</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic ...
The element doesn't end up getting closed for some time, but the program doesn't grab this section at all, and only after the line
ygen levels</a>, and even blood pressure.</p>
does it begin to grab the HTML (EDIT: the p elements). I guess I am confused by the inconsistencies between different regex engines, and I am trying to figure out when and where to modify my syntax, in this case to grab the entire p element, but also in general. This is my first time posting here, so I may have this formatted incorrectly, but thank you all in advance. Been lurking for a while now.
The expression [\s\S]* will match everything, and so will go straight past the beginning of the tag.
Within the tag, your expression p.* is greedy, and will not stop at the nearest closing bracket. Use .*? for non-greedy.
You seem to have a number of other syntax errors in the regex also. Cut and paste a valid regex.
In general it is much easier and less error-prone to use a proper HTML parsing library, even for quite simple tasks. See, for example, the parsers in lxml.
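For instance, here is a minimal sketch of the lxml approach using the snippet from the question (the variable name html is mine):

import lxml.html

# hypothetical input; stands in for whatever the crawler fetched
html = '</figcaption></figure><p>Researchers at MIT and several other institutions have developed a method for making photonic ...</p>'

doc = lxml.html.document_fromstring(html)   # wraps the fragment in a full document
for p in doc.findall('.//p'):               # every <p> element
    print(p.text_content())                 # the element's text, nested tags flattened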
Perhaps it's because your parentheses end up unbalanced: the stray \] in [\s\S\] keeps the character class from closing, so the opening ( is swallowed into the class and the closing ) is left unmatched.
Try starting with this, then build it out:
import re

# sample HTML to test against
s = """</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic</p>"""

# capture everything between <p> and </p>
r = re.compile(r"<p>([\w\W]*)</p>")
a = r.search(s)
print(a.group(1))
Note that you don't have to escape the forward slash.
In this case, I ended up getting the response I desired with @marekful's expression substituted into the regex mentioned in the post. Thank you all for the assistance!
re.compile(ur"[\s\S]*?<p[^>]*>([\w\W]*?)</p>")
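For reference, a quick check of that corrected pattern against the snippet from the question (a sketch; the names are mine, and the ur"" prefix is Python 2 only):

import re

html = u'</figcaption></figure><p>Researchers at MIT and several other institutions have developed a method for making photonic ...</p>'
pattern = re.compile(ur"[\s\S]*?<p[^>]*>([\w\W]*?)</p>")
match = pattern.search(html)
if match:
    print(match.group(1))   # the text of the first <p> element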
There is a JS file with the following structure:
https://sub-domain.domain.com/pa-th/subpath/v1/js/jsfile.js?_=1481126808853
and I'm using Telerik Fiddler Web Debugger v4.6.3.44034 to match that URL with a regex, but it always fails.
Within the AutoResponder tab I added a rule with the following content:
REGEX:https://sub-domain\.domain\.de/pa-th/subpath/v1/js/jsfile\.js(.*)
I tried various configurations, including escaping the slashes, adding an escaped question mark, escaping the hyphens, etc., but none of them replaced the file. What is the correct way to do it? It seems I can't even test the regex on regex101.com, because Fiddler's regex looks like a Python regex but is slightly different from normal Python regex patterns, and my research was even more confusing because different Fiddler versions seem to use different regex flavors. Can anyone provide a solution or some tips?
Your regex is looking for .de, but your URL is .com. On regex101.com this is a match:
URL text
https://sub-domain.domain.com/pa-th/subpath/v1/js/jsfile.js?_=1481126808853
Regex
https://sub-domain\.domain\.com/pa-th/subpath/v1/js/jsfile\.js(.*)
I don't recall what regex language Fiddler uses, but you may need to escape the forward slashes as well, like
https:\/\/sub-domain\.domain\.com\/pa-th\/subpath\/v1\/js\/jsfile\.js(.*)
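If you want to sanity-check the pattern outside Fiddler, a quick sketch in Python (assuming Python's re flavor is close enough to Fiddler's for a pattern this simple):

import re

url = 'https://sub-domain.domain.com/pa-th/subpath/v1/js/jsfile.js?_=1481126808853'
pattern = r'https://sub-domain\.domain\.com/pa-th/subpath/v1/js/jsfile\.js(.*)'
m = re.match(pattern, url)
print(m.group(1) if m else 'no match')   # -> ?_=1481126808853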
I'm parsing a website with the requests module, and I'm trying to get specific URLs out of tags without using BeautifulSoup (a whole table of them, actually, since the tags are used more than once). Here's part of the HTML I'm trying to parse:
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don't Spam in Any Way
</div>
I'm trying to get the URL inside the tag's href attribute:
/Forum/ShowPost.aspx?PostID=80631954
The thing is, because I'm parsing a forum site, there are multiple uses of those divider tags. I'd like to retrieve a list of post URLs with str.split, using code similar to this:
htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]
There is nothing in the HTML code to indicate a post number on the page, just links.
In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished using the code you want. Here's one way, using a list comprehension:
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
I tried to model it as closely on your original code as possible, but I did simplify one of the split arguments to avoid whitespace issues.
In case regular expressions are fair game, here's how you could do it:
import re

# non-greedy (.*?) stops at the first '">' after the href value
target = '<a class="post-list-subject" href="(.*?)">'
results = re.findall(target, htmltext)
Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.
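A minimal sketch of that approach (bs4 assumed installed; 'html.parser' is the pure-Python parser, 'lxml' the fast one):

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext, 'html.parser')   # swap in 'lxml' for speed
urls = [a['href'] for a in soup.find_all('a', class_='post-list-subject')]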
It seems really dicey to try to pre-optimize without establishing that your bottleneck is going to be HTML parsing. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.
That said, this will technically do what you want, but it seriously is not more performant than using an HTML parser like lxml in the long run. Explicitly avoiding an HTML parser will also probably drastically increase your development time as you figure out obscure string manipulation snippets rather than just using the nice tree structure that you get for free with HTML.
def strcleaner(x):
    # strip newlines, spaces, and tabs so the split targets line up
    return x.replace('\n', '').replace(' ', '').replace('\t', '')

S = strcleaner(htmltext)
S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]
The problem with the code you posted is that whitespace and newlines are characters too.
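For comparison, a sketch of the lxml route mentioned above (assuming htmltext parses; the class name comes from the HTML in the question):

import lxml.html

doc = lxml.html.fromstring(htmltext)
# the href attribute of every <a> with the post-list-subject class
urls = doc.xpath('//a[@class="post-list-subject"]/@href')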
I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
However, I want to use it to crawl sites other than eBay and customize it to my needs.
I am fairly new to python and have limited re experience.
I am unsure of what this line achieves.
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below: regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
As for your question, that line uses a regular expression to find matching patterns in a text (in this case, HTML). re.findall() is a function that returns a list of all matches, so if we focus on just that:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
The r prefix indicates that the string is 'raw', meaning that backslashes are passed through to the regex engine literally instead of being treated as string escapes.
href="([^"]+)
The parentheses indicate a group (the part of the match we care about), and [^"]+ means 'match one or more characters that aren't a double quote'. As you can probably guess, this group will return the URL of the link.
.*class="vip"
The .* matches anything (well, almost anything; by default it won't cross a newline) 0 or more times, which here could include other tags, the closing quote of the link, whitespace, etc. Nothing special about class="vip": it just needs to appear.
title=\'([^\']+)', lines):
Here you see an escaped quote and then another group, as we saw above. This time we are capturing anything between the two apostrophes after the title attribute. (The trailing ', lines): is the rest of the findall() call, not part of the pattern.)
The end result is that you iterate through a list of all matches, and each match looks something like (my_matched_link, my_matched_title), which is unpacked into for url, title, after which further processing is done.
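To see it in action, here is a small self-contained sketch; the HTML line is a made-up example shaped like the listing markup that script presumably expects:

import re

lines = '<a href="http://example.com/item/123" class="vip" title=\'Vintage Camera\'>Vintage Camera</a>'
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
    print(url + ' -> ' + title)   # http://example.com/item/123 -> Vintage Camera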
I am not sure if this answers your question, but you could consider Scrapy (http://scrapy.org) for crawling various websites. It is a nice framework that provides a lot of flexibility and is easy to customize to specific needs.
Regular expressions are bad for parsing HTML
The above is the main idea I would like to communicate to you. For why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, the HTML can change as text (e.g. a new attribute can be added, the order of attributes can change, or some other changes may be introduced) and still be interpreted by web browsers as exactly the same document, while completely breaking your script.
HTML should be parsed using specialized HTML parsers or web scrapers; they know which textual differences matter and which do not.
What to use for scraping?
There are multiple solutions, but one of the most notable is Scrapy. Try it; you may start to love it.
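A minimal spider sketch, just to give the flavor (the URL and the CSS selector are placeholders, not anything from the question):

import scrapy

class PostSpider(scrapy.Spider):
    name = 'posts'
    start_urls = ['http://example.com/forum']   # placeholder

    def parse(self, response):
        # placeholder selector; adapt to the real page structure
        # (.getall() is .extract() on older Scrapy versions)
        for href in response.css('a.post-list-subject::attr(href)').getall():
            yield {'url': response.urljoin(href)}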
I'm trying to build a small wiki, but I'm having problems writing the regex rules for its URLs.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one of them, or both, match the same things.
Following a more concise approach, I'd define the edit URL as:
http://example.com/<pagename>/edit
This is clearer and more guessable, in my humble opinion.
Then, remember that Django loops through your URL patterns in the order you defined them and stops at the first one that matches the incoming request, so the order in which they are defined really matters.
Coming to the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Please always remember the starting caret and the final dollar sign: they say that the URL must start and stop exactly where our regexp does. Otherwise, any leading or trailing characters would make the regexp match as well (while you likely want to show a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you prefer the user-friendly style commonly referred to as REST URLs, though RESTfulness is a concept that has nothing to do with URL style).
Summarizing, put the following in your URLconf:
url(r'^(?P<page_name>[\w]+)$', views.page)
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit)
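Put together, a minimal urls.py would look roughly like this (a sketch using the old-style url() API from the question; import paths assumed for that Django era):

from django.conf.urls import url   # older Django versions
from . import views

urlpatterns = [
    url(r'^(?P<page_name>[\w]+)$', views.page),
    url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
]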
You can easily forbid particular characters in your URLs by replacing \w with a character set of your own.
To learn more about Django URL dispatching, read here.
Note: regexps are as powerful as they are dangerous, especially when applied to input coming over the network. Keep it simple, and be sure you really understand what you are defining; otherwise your web application may be exposed to several security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Please try the following URL patterns, which are simpler:
url(r'_edit/(?P<page_name>[\w-]+)', views.edit),
url(r'(?P<page_name>[\w-]+)', views.page),
Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html
t = lxml.html.fromstring("...")   # "..." stands in for your HTML string
t.text_content()                  # all the text, tags stripped, entities resolved
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
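For example, a sketch of the clean module (clean_html keeps the markup but strips scripts and similar unsafe bits):

from lxml.html.clean import clean_html

safe = clean_html('<p onclick="evil()">hello <script>alert(1)</script></p>')
print(safe)   # the paragraph survives; the script and the handler do not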
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
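Something like this sketch (bs4 syntax; older BeautifulSoup releases spell some of these names differently):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello &amp; goodbye</p>', 'html.parser')
print(soup.get_text())   # Hello & goodbye; tags gone, entities resolved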
While I agree with Lucas that regular expressions are not all that scary, I still think you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box (HTMLParser).
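A sketch of a tag stripper built on that standard-library parser (Python 2 import path, matching the era of this thread; in Python 3 it lives at html.parser):

from HTMLParser import HTMLParser   # Python 3: from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # called with the text between tags
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)

stripper = TagStripper()
stripper.feed('<p>Researchers at <b>MIT</b> developed a method</p>')
print(stripper.get_text())   # Researchers at MIT developed a method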
You should also check out the python bindings for TidyLib which can clean up broken HTML, making the success rate of any HTML parsing much higher.
How about parsing the HTML and extracting the data with the help of a parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with regex will return the string "5 " and treat
< 7</div>
as a single tag and strip it out.
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html. It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.