I am using BeautifulSoup to scrape data from different websites.
I am trying to scrape the source, but not all of it: just the substring that is important to me.
For example, in this item, I would like to pick just the string between / and .png (which in this case is "nyt") and to save it in a list.
<image width="185" height="26"
xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt-logo-185x26.svg" src="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png" border="0"></image>
I have been trying several regular expressions, like re.search('[a-z]*.png', src).group(0), but nothing works well.
Can anyone tell me what would be the right way to scrape that info?
If you want to find the name of the png inside of the src attribute you can use this regular expression:
src=\s*(\"|\')[^"']+?([^/]+?)\.png\1
You will have to capture the second group in Python in this case.
Here is the explanation:
src=\s* the literal "src=" followed by any number of optional spaces
(\"|\') a group matching either a double or a single quote
[^"']+? anything that is not a double or single quote (non-greedy)
([^/]+?) anything that is not a forward slash (non-greedy)
\.png the literal ".png"
\1 a back reference to the first group (\"|\')
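Here is a minimal sketch of how this might be wired up in Python (the markup is taken from the question; the variable names are just for illustration):
import re

tag = ('<image width="185" height="26" '
       'src="https://a1.nyt.com/assets/shell/20160613-034030/'
       'images/foundation/logos/nyt.png" border="0"></image>')

# Group 1 is the opening quote, group 2 is the file name before ".png".
pattern = re.compile(r"""src=\s*("|')[^"']+?([^/]+?)\.png\1""")
match = pattern.search(tag)
if match:
    print(match.group(2))  # prints: nyt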
I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>
I need to extract: ABCDE
Could anybody please help me with the regular expression that I should use?
Building on this answer, https://stackoverflow.com/a/40908001/11450166:
(?<=(<A>))[A-Za-z]+(?=(<\/A>))
That expression works fine, supposing that your tags are <A> and </A>.
This other one matches your input exactly:
(?<=(>))[A-Za-z]+(?=(<\/A>))
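For instance, a quick sketch of the second expression in Python, dropping the inner capture groups since they are not needed here (the sample string comes from the question):
import re

text = '">ABCDE</A></td><td>'
# Lookbehind for ">" and lookahead for "</A>" keep only the text itself.
print(re.findall(r'(?<=>)[A-Za-z]+(?=</A>)', text))  # ['ABCDE']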
You can try using this regular expression in your specific example:
/">(.*)<\/A><\/td><td>/g
Tested on string:
Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum
extracts:
">ABCDE</A></td><td>
Then it's all a matter of extracting the substring from each match in any programming language. Here that means removing the first 2 characters and the last 13 characters from each matching string, so that you extract ABCDE only.
I also tried:
/">([^<]*)<\/A><\/td><td>/g
It has the same effect, but it won't include matches that contain additional HTML code. As far as I understand it, ([^<]*) is a negated set that won't match < characters in that region, so it won't catch other tag elements inside it. This can give you finer control when you're searching for specific text and need to filter out nested HTML code.
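In Python, using the capture group with re.findall sidesteps the character trimming entirely, since findall returns the group's contents directly (the test string is the one above):
import re

text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'
print(re.findall(r'">([^<]*)</A></td><td>', text))  # ['ABCDE']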
I am trying to parse a website for
<a  href="...">blahblahblah</a>
<a href="...">I DONT CARE ABOUT THIS EITHER</a>
<a  href="...">blahblahblah</a>
(there are many of these, and I want all of them in some tokenized form). The problem is that "a  href" actually has two spaces, not just one (there are some that are "a href" with one space that I do NOT want to retrieve), so using tree.xpath('//a/@href') doesn't quite work. Does anyone have any suggestions on what to do?
Thanks!
This code works as expected:
from lxml import etree

url = "file:///path/to/file.html"  # can be an http URL too
doc = etree.parse(url)
print(doc.xpath('//a/@href')[0])
Edit: as far as I know, it's not possible to do what you want with lxml. You can use a regex instead.
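For example, a minimal sketch of the regex route, run on the raw HTML text rather than the parsed tree (the file path is hypothetical, and the pattern assumes double-quoted href values):
import re

raw = open("/path/to/file.html").read()
# Exactly two literal spaces between "a" and "href" -- one-space anchors won't match.
hrefs = re.findall(r'<a  href="([^"]+)"', raw)
print(hrefs)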
I don't know about lxml, but you can definitely use BeautifulSoup: find all <a> tags on the page, then loop over them checking whether each one matches your regex pattern, and if it matches, scrape the url.
"(there are some that are "a href" with one space that I do NOT want to retrieve)"
I think this means that you only want to locate elements where there is more than one space between the a and the href. XML allows any amount of whitespace between a tag name and an attribute (spaces, tabs, and newlines are all allowed). That whitespace is discarded by the time the text is parsed and the document tree is created. lxml and XPath work with Node objects in the document tree, not with the original text that was parsed to build the tree.
One option is to use regular expressions to find the text you want. But really, since this is perfectly valid XML/HTML, why bother to remove a few spaces?
Use an XPath expression to find all the nodes, then iterate through those nodes looking for a match. You can obtain a string representation of a node with:
etree.tostring(node)
For further reference: http://lxml.de/tutorial.html#elements-carry-attributes-as-a-dict
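A rough sketch of that loop (the file path is hypothetical; note that tostring() re-serializes the tag from the tree, so the original spacing between the tag name and its attributes is not preserved in the output):
from lxml import etree

doc = etree.parse("/path/to/file.html", etree.HTMLParser())
for node in doc.xpath("//a"):
    print(etree.tostring(node))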
So I'm just experimenting, trying to parse the web using Python, and I thought I would try to make a script that searches for my favorite links to watch shows online. I'm now trying to have my program search through sidereel.com for a good link to my desired show and return the links to me. I know that the site saves the links in the following format:
'watch-freeseries.mu' then some long string that I need to ignore, followed by '14792088'
So what I need to be able to do is find this string in the txt file of the site and return to me only the 8 numbers at the end of the string. I'm not sure how I can get to the numbers, and I need them because they are the link number. Any help would be much appreciated.
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile(r"watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the start of the expected expression. Escape any possible special characters by preceding them with \.
.*? - Match any number of characters: . means any character, and * means it can repeat any number of times. The ? performs a non-greedy match, so that matches will not overlap if two or more urls show up in the same string.
(\d{8}) - Match and save the last 8 digits
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.
I want to extract data using a regex like this:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found the related question extract contents of regex, but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
    ^^^^^^^^^^^         ^^^^^^^^^^^^^         ^^^^^^^         ^^^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (Use just the first line; the line with the '^'s and the one naming the groups are only there to help you see the capture groups as specified by the parentheses.)
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on; there is a fuller runnable sketch below. But you should know that using re's to crack HTML source is one of the paths toward madness. Many potential surprises lurk in your input HTML that are perfectly valid HTML, but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
" " spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
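That said, if you do stick with re, here is a minimal runnable sketch using the full pattern from above with re.finditer, since the question mentions iterating (the row data is made up for illustration):
import re

html = ("<td>Alpha</td><td>1.5</td><td>42</td><td>3.14</td>"
        "<td>Beta</td><td>2.25</td><td>7</td><td>0.5</td>")
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td>"
                         r"<td>([\d]+)</td><td>([\d]+.[\d]+)</td>")
for match in dataPattern.finditer(html):
    print(match.groups())
# ('Alpha', '1.5', '42', '3.14')
# ('Beta', '2.25', '7', '0.5')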
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular-expression string is left after this, it probably won't do what you want: it's not going to automatically match everything it matched before, just without the <td> tags!
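To see why, a quick demonstration on the pattern string from the question:
s = r'<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>'
print(s.replace('<td>', '').replace('</td>', ''))
# [a-zA-Z]+[\d]+.[\d]+[\d]+[\d]+.[\d]+  -- the pieces run together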
I am using BeautifulSoup to find all href tags.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see the ???? in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are. Without it, some punctuation gets interpreted in special ways by the RE engine; for example, the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would also match, e.g., 'abc123zcom' (and many other strings of a similar nature).
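A quick sketch of how that looks in practice (the HTML string is made up for illustration):
import re
from bs4 import BeautifulSoup

html = '<a href="http://abc123.com/x">one</a> <a href="/other">two</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.findAll('a', href=re.compile(re.escape('abc123.com')))
print([a['href'] for a in links])  # ['http://abc123.com/x']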
"abc123" should give you what you want
if that doesn't work, than BS is probably using re.match in which case you would want ".*abc123.*"
If you want all the links whose href contains 'abc123', you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))