I am trying to parse a website for
blahblahblah
<a  href="the link I want">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah
(there are many of these, and I want all of them in some tokenized form). The problem is that "a  href" actually has two spaces, not just one (there are some "a href" instances with one space that I do NOT want to retrieve), so using tree.xpath('//a/@href') doesn't quite work. Does anyone have any suggestions on what to do?
Thanks!
This code works as expected:
from lxml import etree
url = "file:///path/to/file.html"  # can be an http URL too
doc = etree.parse(url)
print(doc.xpath('//a/@href')[0])
Edit: as far as I know, it's not possible to do what you want with lxml, because the parser discards the extra whitespace before XPath ever sees it.
You can use a regex on the raw source instead.
Don't know about lxml, but you can definitely use BeautifulSoup: find all <a> tags on the page, then loop over them and check whether each <a href=...> matches your regex pattern; if it matches, scrape the URL.
"(there are some that are "a href" with one space that I do NOT want to retrieve)"
I think this means that you only want to locate elements where there is more than one space between the a and the href. XML allows any amount of whitespace between a tag name and an attribute (spaces, tabs, and newlines are all allowed). That whitespace is discarded by the time the text is parsed and the document tree is created. lxml and XPath work with node objects in the document tree, not with the original text that was parsed to build the tree.
One option is to use regular expressions to find the text you want. But really, since this is perfectly valid XML/HTML, why bother to remove a few spaces?
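A minimal sketch of the regex approach, run against the raw HTML rather than the parsed tree (the sample markup and URLs are made up for illustration):

```python
import re

# Raw markup: the two-space "<a  href" is the one we want to keep.
html = ('before <a  href="http://keep.example">wanted</a> '
        'after <a href="http://skip.example">unwanted</a>')

# Exactly two spaces between the tag name and href; capture the URL.
two_space_hrefs = re.findall(r'<a {2}href="([^"]+)"', html)
print(two_space_hrefs)  # ['http://keep.example']
```

This only works because the check runs on the source text before any parser gets a chance to normalize the whitespace.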
Use an XPath expression to find all the nodes, then iterate through those nodes looking for a match. You can obtain a string representation of a node with:
etree.tostring(node)
For further reference: http://lxml.de/tutorial.html#elements-carry-attributes-as-a-dict
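One caveat: tostring() re-serializes the node with normalized whitespace, so the extra space is gone by then. A workaround sketch (the sample markup is assumed) uses lxml's sourceline attribute to check each node's original source line instead:

```python
from lxml import etree

# Illustrative markup; in practice this comes from a file or URL.
html = '''<html><body>
<a  href="http://keep.example">two spaces</a>
<a href="http://skip.example">one space</a>
</body></html>'''

doc = etree.fromstring(html, etree.HTMLParser())
lines = html.splitlines()

# sourceline is 1-based; look for the two-space form in the raw line.
wanted = [a.get('href') for a in doc.xpath('//a')
          if '<a  href' in lines[a.sourceline - 1]]
print(wanted)  # ['http://keep.example']
```

This assumes each anchor sits on its own source line; anchors sharing a line would need a finer-grained check.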
Related
I am trying to remove HTML tags from text in Python. The issue is with the format of the tags present. Ex:
[click internet options div on the right]
div - is the HTML tag
Expected:
[click internet options on the right]
It does not have the format like <> etc. Currently I have manually created a list of HTML tags and am removing them using "not in". Is there a better way to clean this? P.S. I am not asking for the code as such; any suggestions on the approach would be great.
You can use a regular expression, but you will need a list of the HTML tag names you want to remove. Take a look at the re.sub documentation; it will help you write your regexp, like this one:
re.sub(r"\b(div|section|aside)\b", "", toCheck)
The first parameter is the pattern, the second is the replacement (in this case, nothing), and the third is the string to check. The \b word boundaries keep tag names from being stripped out of the middle of ordinary words.
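A small sketch of that idea (the tag list and sample text are assumptions based on the question's example):

```python
import re

# Hypothetical list of tag names to strip from the bracketless text.
tags = ["div", "section", "aside", "span"]
pattern = re.compile(r"\b(?:" + "|".join(tags) + r")\b")

text = "click internet options div on the right"
cleaned = pattern.sub("", text)
# Collapse the doubled space left behind by the removal.
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
print(cleaned)  # click internet options on the right
```

Building the alternation from a list keeps the pattern maintainable as the tag list grows.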
Let's use the word technology for my example.
I want to search all the text on a webpage, find every element whose text contains the word "technology", and print only the contents of those elements. Please help me figure this out.
words = soup.body.get_text()
for word in words:
    i = word.soup.find_all("technology")
    print(i)
You should use the search by text, which can be accomplished using the text argument (renamed to string in modern BeautifulSoup versions), either via a function with a substring-in-string check:
for element in soup.find_all(text=lambda text: text and "technology" in text):
    print(element.get_text())
Or, via a regular expression pattern:
import re
for element in soup.find_all(text=re.compile("technology")):
    print(element.get_text())
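A self-contained sketch of the text-search approach (the sample HTML is invented; string= is the modern spelling of the text= argument):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p>We invest in technology every year.</p>
<p>Nothing relevant here.</p>
<span>technology is everywhere</span>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Collect every text node containing the word, in document order.
hits = [t.strip() for t in
        soup.find_all(string=lambda t: t and "technology" in t)]
print(hits)
```

The lambda guards against empty text nodes before doing the substring check.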
Since you are looking for data inside of an 'HTML structure' and not a typical data structure, you are going to have to nearly write an HTML parser for this job. Python doesn't normally know that "some string here" relates to another string wrapped in brackets somewhere else.
There may be a library for this, but I have a feeling that there isn't :(
I need to match different script tags, which look for example like this:
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
    jQuery("#gift_cards").tooltip({ effect: \'slide\'});
});
</script>
<script>dasdfsfsdf</script>
Also I need to get the tags only, and the src content in separate groups.
I created this regex:
(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)
This is not matching the last script tag.
What's wrong with it?
EDIT:
Removing the \B does match all the script tags, but then I do not get the contents of the src attribute in a separate group. What I need to do is handle a group of script tags of two categories:
One with an src attribute holding the path to the actual script
A second without an src attribute, containing normal inline JavaScript
I need to remove the script opening and closing tags but keep the content inside the tag.
If it's of the first type, I still need to remove the tags but keep the path in a separate table.
Hope that clarifies it a bit more.
As iCodez' link so entertainingly shows, HTML should not be parsed by regex, as HTML is not a regular language. Instead, try using a parser such as BeautifulSoup. Make sure you also install lxml and html5lib as well for best performance and access to all the features.
pip install lxml html5lib beautifulsoup4
should do the trick.
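For instance, a sketch of how BeautifulSoup separates the question's two categories of script tag (using the sample markup from the question, with the inline body trimmed):

```python
from bs4 import BeautifulSoup

html = """
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">jQuery(document).ready(function() {});</script>
<script>dasdfsfsdf</script>
"""

soup = BeautifulSoup(html, "html.parser")

srcs, inline = [], []
for script in soup.find_all("script"):
    if script.get("src"):              # first category: external path
        srcs.append(script["src"])
    else:                              # second category: inline JavaScript
        inline.append(script.get_text())

print(srcs)
print(inline)
```

No regex needed: the parser strips the tags and hands back the src path or the inline body directly.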
Granted, I agree with all the remarks about not parsing HTML with regexes, and granted, I myself indulge in such evil practice when I'm confident that the documents I will process are regular enough; still, try removing the \B: in my test it matches all three scripts.
What is this "non-boundary" for, by the way? I'm not sure I understand why you inserted it. If it is necessary for some reason I don't grasp, please tell me and we'll try to find another way.
Edit:
In order to retain the src content, try:
(<\s*?script[\s\S]*?(?:(?:src=[\'"](.*?)[\'"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)
This works for me; check it against your other samples.
Note that your first [\s\S]*? already matches everything up to the > when there is no src attribute, so the second one only makes sense when src is present and you want to match other possible attributes.
For giggles, here's a super-simple way that I figured out by complete accident (as a JS string which would be fed to the RegExp constructor):
'src=(=|=")' + yourPathHere + '[^<]<\/script>'
where yourPathHere has had its forward slashes escaped; so, as a pure RE, something like:
/src=(=|=")\/scripts\/someFolder\/script.js[^<]<\/script>/
which I'm using in a gulp task whilst I'm trying to figure out gulp streams :[]
How can I find all words except the ones inside tags, using the re module?
I know how to find something, but how do I do it the opposite way? I want to search for every word except everything inside tags, and the tags themselves.
So far I managed this:
import re

f = open(filename, 'r')
data = re.findall(r"<.+?>", f.read())
Well, it prints everything inside the <> tags, but how do I make it find every word except those inside the tags?
I tried ^ at the start of a pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that appear inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything inside <> tags, and the tags themselves?
So, for the string 'hello 123 <b>Bold</b> <p>end</p>', re.findall would return:
['hello', '123', 'Bold', 'end']
Using a regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern:
\b\w+\b(?![^<]*>)
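Checking that pattern against the example string from the question:

```python
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'

# A word followed by '>' before any '<' is inside a tag, so the
# negative lookahead rejects it.
words = re.findall(r'\b\w+\b(?![^<]*>)', s)
print(words)  # ['hello', '123', 'Bold', 'end']
```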
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
from BeautifulSoup import BeautifulSoup  # in BeautifulSoup 4: from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string)
text = "".join(soup.findAll(text=True))
From there, you can get the list of words with split:
words = text.split()
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.
I want to extract data matched by a regex like this:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found the related question extract contents of regex,
but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
    ^^^^^^^^^^^         ^^^^^^^^^^^^^         ^^^^^^^         ^^^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (Only the first line is the actual pattern; the line of '^'s and the line naming the groups are just there to help you see the capture groups as specified by the parentheses.)
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. There are many potential surprises that will lurk in your input HTML, that are perfectly working HTML, but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
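As a point of comparison, a BeautifulSoup sketch of the same extraction (the sample row is made up, deliberately using upper-case tags to show the parser coping where the regex would not):

```python
from bs4 import BeautifulSoup

# A row matching the pattern: word, float, int, float.
html = "<TR><TD>Alpha</TD><td>1.5</td><td>42</td><td>3.25</td></TR>"
soup = BeautifulSoup(html, "html.parser")

# The parser normalizes tag-name case, so find_all("td") sees
# both <TD> and <td>.
fields = [td.get_text() for td in soup.find_all("td")]
print(fields)  # ['Alpha', '1.5', '42', '3.25']
```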
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular expression string is left after the replacement is probably not what you want to parse with: it's not going to automatically match everything it matched before, just without the <td> tags!