Regular expression to extract info from HTML file - python

I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>
I need to extract: ABCDE
Could anybody please help me with the regular expression that I should use?

Leaning on this, https://stackoverflow.com/a/40908001/11450166
(?<=(<A>))[A-Za-z]+(?=(<\/A>))
That expression works if your tag is exactly <A>...</A>.
This variant matches the form of your input:
(?<=(>))[A-Za-z]+(?=(<\/A>))
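A minimal Python sketch of the second expression, assuming the sample string from the question (the variable names are illustrative). The redundant inner groups are dropped, and Python's re module accepts the lookbehind because it has a fixed width.

```python
import re

# Sample input, taken from the question
text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum'

# Lookbehind for ">" and lookahead for "</A>": only the letters are matched
match = re.search(r'(?<=>)[A-Za-z]+(?=</A>)', text)
extracted = match.group(0) if match else None
print(extracted)
```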

You can try using this regular expression in your specific example:
/">(.*)<\/A><\/td><td>/g
Tested on string:
Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum
extracts:
">ABCDE</A></td><td>
Then it's all a matter of extracting the substring from each match using any programming language. This can be done by removing the first 2 characters (">) and the last 13 characters (</A></td><td>) from the matched string, which leaves only ABCDE. Alternatively, read capture group 1, which already holds just the text matched by (.*).
I also tried:
/">([^<]*)<\/A><\/td><td>/g
It has the same effect, but it won't match when there is additional HTML between "> and </A>. As far as I understand it, [^<] is a negated character class that matches any character except <, so the capture can't run across other tags in that region. This is useful for finer control if you're searching for specific text and need to exclude nested HTML.
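In Python the capture group can be read directly, which avoids counting characters. A short sketch using the test string above (variable names are illustrative); the slicing variant is shown only to confirm that both routes give the same text.

```python
import re

text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'
pattern = r'">([^<]*)</A></td><td>'

# findall returns the contents of group 1 for every match
captured = re.findall(pattern, text)

# Equivalent slicing of the full match: drop 2 leading and 13 trailing characters
full = re.search(pattern, text).group(0)
sliced = full[2:-13]

print(captured, sliced)
```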

Related

Regular Expression to find substring in html image src

I am using beautifulsoup to scrape different data in websites.
I am trying to scrape the source, but not all the source, just the substring which is important for me.
For example, in this item, I would like to pick just the string between / and .png (which in this case is "nyt") and to save it in a list.
<image width="185" height="26"
xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt-logo-185x26.svg" src="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png" border="0"></image>
I have been trying with several regular expressions like re.search('[a-z]*.png',src).group(0) but nothing works well.
Can anyone tell me the right way to scrape that info?
If you want to find the name of the png inside of the src attribute you can use this regular expression:
src=\s*(\"|\')[^"']+?([^/]+?)\.png\1
In Python, read the second capture group to get the name.
Click on the pythex link to try it out.
Here is the explanation:
src=\s* literal to find all "src=" literals followed by any number of optional spaces
(\"|\') group with either a double or single quote.
[^"']+? anything that is not a double or single quote (non greedy).
([^/]+?) anything that is not a forward slash (non greedy).
\.png literal ".png"
\1 back reference to the first group (\"|\')
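The breakdown above can be tried in Python roughly as follows. This is a sketch that assumes the tag has already been obtained as a string (for example from BeautifulSoup); the names SRC_PNG and tag are illustrative.

```python
import re

# Same expression as in the answer, written without the unnecessary escapes on the quotes
SRC_PNG = re.compile(r'''src=\s*("|')[^"']+?([^/]+?)\.png\1''')

# The element from the question, as a plain string
tag = (
    '<image width="185" height="26" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt-logo-185x26.svg" '
    'src="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png" '
    'border="0"></image>'
)

names = []
match = SRC_PNG.search(tag)
if match:
    # Group 1 is the quote character; group 2 is the file name without ".png"
    names.append(match.group(2))
print(names)
```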

python xpath remove unicode chars

I have this text in html page
<div class="phone-content">
‪050 2836142‪
</div>
I am using XPath to extract the value inside that div like this:
normalize-space(.//div[@class='fieldset-content']/span[@class='listing-reply-phone']/div[@class='phone-content']/text())
I got this result:
"\u202a050 2836142\u202a"
Does anyone know how to tell XPath in Python to remove those Unicode characters?
If you're looking for an XPath solution: to remove all characters but those from a given set, you can use two nested translate(...) calls following this pattern:
translate($string, translate($string, ' 0123456789', ''), '')
This will remove all characters that are not the space character or a digit. You will have to replace both occurrences of $string by the complete XPath expression to fetch that string.
It might be more reasonable though to apply that outside XPath using more advanced string manipulation features. Those of XPath 1.0 are very limited.
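Doing the cleanup in Python after the XPath call could look like the sketch below. It assumes the XPath result is already a Python string; the first option mirrors the nested translate() pattern, the second removes only Unicode format characters such as U+202A.

```python
import re
import unicodedata

# The value returned by the XPath expression in the question
raw = "\u202a050 2836142\u202a"

# Option 1: keep only digits and spaces, like the nested translate() calls
digits_only = re.sub(r"[^0-9 ]", "", raw)

# Option 2: drop characters in Unicode category "Cf" (format), which includes U+202A
no_format = "".join(ch for ch in raw if unicodedata.category(ch) != "Cf")

print(digits_only, no_format)
```

Option 2 is the gentler choice when the field may legitimately contain characters such as "+" or "-".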

Scraping data with Python LXML XPath

I am trying to parse a website for
blahblahblah
I DONT CARE ABOUT THIS EITHER
blahblahblah
(there are many of these, and I want all of them in some tokenized form). The problem is that "a href" actually has two spaces, not just one (there are some that are "a href" with one space that I do NOT want to retrieve), so using tree.xpath('//a/@href') doesn't quite work. Does anyone have any suggestions on what to do?
Thanks!
This code works as expected:
from lxml import etree
file = "file:///path/to/file.html" # can be a http URL too
doc = etree.parse(file)
# print the href attribute of the first <a> element
print(doc.xpath('//a/@href')[0])
Edit : it's not possible AFAIK to do what you want with lxml.
You can use a regex instead.
I don't know about lxml, but you can definitely use BeautifulSoup: find all <a> elements on the page, then loop over them and check whether each <a href=...> matches your regex pattern. If it matches, scrape the URL.
"(there are some that are "a href" with one space that I do NOT want to retrieve)"
I think this means that you only want to locate elements where there is more than one space between the a and the href. XML allows any amount of whitespace between tag name and attribute (spaces, tabs, new lines are all allowed). The whitespace is discarded by the time the text is parsed and the document tree is created. LXML and XPATH are working with Node objects in the Document tree, not the original text that was parsed to make the tree.
One option is to use regular expressions to find the text you want. But really, since this is perfectly valid XML/HTML, why bother to remove a few spaces?
Use an XPath expression to find all the nodes, then iterate through them looking for a match. You can obtain a string representation of a node with:
etree.tostring(node)
For further reference: http://lxml.de/tutorial.html#elements-carry-attributes-as-a-dict
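Because the parser discards the whitespace between the tag name and the attribute, the regex option has to run on the raw text rather than on the parsed tree. A rough sketch, assuming double-quoted href values and literal space characters; the sample markup and URLs are made up for illustration.

```python
import re

# Exactly two spaces between "<a" and "href"; assumes double-quoted attribute values
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')

# Hypothetical raw page source
raw_html = (
    '<a  href="http://example.com/wanted">keep</a>'
    '<a href="http://example.com/ignored">skip</a>'
)

hrefs = TWO_SPACE_HREF.findall(raw_html)
print(hrefs)
```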

Python Regex 'not' to identify pattern within <a></a>

I am having trouble writing a Python regex that does 'not' match a certain pattern when it occurs within href tags.
My aim is to replace all occurrences of DSS[a-z]{2}[0-9]{2} with an href link as shown below, but without replacing the same pattern where it occurs inside href tags.
Present Regex:
replaced = re.sub("[^http://*/s](DSS[a-z]{2}[0-9]{2})", "\\1", input)
I need to add this new regex using an OR operator to the existing one I have
EDIT:
I am trying to use regex just for a simple operation. I want to replace occurrences of the pattern anywhere in the HTML using a regex, except where they occur within <a></a>.
The answer to any question having regexp and HTML in the same sentence is here.
In Python, the best HTML parser is indeed Beautiful Soup.
If you want to persist with regexp, you can try a negative lookbehind to avoid anything preceded by a ". At your own risk.
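A sketch of that negative lookbehind, under loud assumptions: the link target used in the replacement is hypothetical (the original replacement string is not shown), and existing links are assumed to have the code immediately after the opening quote of href. A code appearing as the visible text of an existing link would still be rewritten, which is part of the risk.

```python
import re

# (?<!") refuses a match when the character just before the code is a double quote
CODE = re.compile(r'(?<!")(DSS[a-z]{2}[0-9]{2})')

def link_codes(html):
    # Hypothetical replacement: the code itself is used as the href value
    return CODE.sub(r'<a href="\1">\1</a>', html)

sample = 'See DSSab12 and <a href="DSScd34">link</a>.'
result = link_codes(sample)
print(result)
```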

Python regex: how to extract inner data from regex

I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found a related question, "extract contents of regex",
but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (Only the first line is the pattern; the line with the '^'s and the one naming the groups are just there to help you see the capture groups as specified by the parentheses.)
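import re
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
# compiled patterns offer search(), match() and finditer(); there is no find() method
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)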
and so on. But you should know that using regular expressions to crack HTML source is one of the paths toward madness. Many things can lurk in your input HTML that are perfectly valid HTML but will easily defeat your regex:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
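For the iteration the question asks about, finditer() walks over every match of the grouped pattern. A minimal sketch, assuming made-up sample rows; note that the dots are escaped here so they match only a literal ".", which the original pattern does not do.

```python
import re

# Grouped pattern from above, with the decimal points escaped
ROW = re.compile(
    r"<td>([a-zA-Z]+)</td><td>(\d+\.\d+)</td><td>(\d+)</td><td>(\d+\.\d+)</td>"
)

# Hypothetical input with two table rows
html = (
    "<tr><td>apple</td><td>1.50</td><td>3</td><td>4.50</td></tr>"
    "<tr><td>pear</td><td>0.75</td><td>10</td><td>7.50</td></tr>"
)

# One tuple of four captured strings per matched row
rows = [m.groups() for m in ROW.finditer(html)]
for name, price, count, total in rows:
    print(name, price, count, total)
```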
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas, however: this is of limited use in the context of real HTML, as others have pointed out.
Further, be aware that whatever regular expression (string) is left probably won't parse what you want; without the <td> tags it will not automatically match what it matched before.
