How can I parse HTML code with "html written" URL in Python? - python

I am starting to program in Python, and I have read a couple of posts saying that I should use an HTML parser to get a URL out of text rather than re.
I have the source code, which I got from page.read() using urllib's urlopen.
Now, my problem is that the parser strips the url part out of the text.
Also, if I read correctly, after var = page.read(), var is stored as a string?
How can I tell it to give me the text between two "tags"? The URL always sits between flv= and ;, so it doesn't start with href (which is what the parsers look for), and it doesn't contain http:// either.
I have read many posts, but they all seem to look for href in the code.
Do I have it all completely wrong?
Thank you!
Thank you!

You could consider implementing your own search / grab. In pseudocode, it would look a little like this:
find location of 'flv=' in HTML = location_start
find location of ';' in HTML = location_end
grab everything in between: HTML[location_start : location_end]
You should be able to implement this in python.
Good luck!
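The pseudocode above can be sketched directly in Python with str.find and slicing; the html string here is a made-up stand-in for the page source from page.read():

```python
# Assumes the page source is already in a string named `html`.
html = 'some player config: flv=http://example.com/video.flv;more stuff'

start = html.find('flv=') + len('flv=')   # index just past 'flv='
end = html.find(';', start)               # first ';' after that point
url = html[start:end]
print(url)  # http://example.com/video.flv
```

This avoids a parser entirely, which fits here because the URL lives in plain text rather than in an href attribute.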

Related

How do I get a specific string from a string between two specified pieces of information

I apologize for the confusing title. I looked around and I know how to get a string between two specified characters, but I am unsure how to get a string between a phrase and a character, such as src="the information i want". In this case I want my starting point to be src=" and my endpoint to be the first " after the start point. How would I go about specifying these parameters in the get method?
Below is the output of what I am asking for help with. Rather than manually copy and paste the second URL, I want to assign that string to a variable to automate the process.
>>> %Run myProject.py
enter URL
https://www.instagram.com/p/CAYGHWFFp-x/
<video class="tWeCl" playsinline="" poster="https://scontent-iad3-1.cdninstagram.com/v/t51.2885-15/e35/100101005_584997515466659_2719890114744519125_n.jpg?_nc_ht=scontent-iad3-1.cdninstagram.com&_nc_cat=111&_nc_ohc=DI3B3wg_vaQAX_MvEcQ&oh=06b611ef41299d4f0278467fb1d74e94&oe=5EC66079"
preload="none" src="https://scontent-iad3-1.cdninstagram.com/v/t50.2886-16/98205256_176119867089312_5443572653160790508_n.mp4?_nc_ht=scontent-iad3-1.cdninstagram.com&_nc_cat=100&_nc_ohc=JtZXc2HiQ9kAX_097NE&oe=5EC68ACC&oh=ac92032cb89fa1dfbcb5f2fa9016c9ba" type="video/mp4"></video>
enter the URL
Thank you so much!
You can use Beautiful Soup to parse this content. Then you can look for video elements, and read their src attribute.
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
for video in soup.find_all('video'):
    print(video.get('src'))
Output
https://scontent-iad3-1.cdninstagram.com/v/t50.2886-16/98205256_176119867089312_5443572653160790508_n.mp4?_nc_ht=scontent-iad3-1.cdninstagram.com&_nc_cat=100&_nc_ohc=JtZXc2HiQ9kAX_097NE&oe=5EC68ACC&oh=ac92032cb89fa1dfbcb5f2fa9016c9ba
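If you only need that one attribute and don't want a parser dependency, a regex between src=" and the next " also works. A minimal sketch, with a shortened made-up html string standing in for the real page source:

```python
import re

# Assumes the page source is in a string named `html`.
html = '<video poster="thumb.jpg" preload="none" src="https://example.com/clip.mp4" type="video/mp4"></video>'

# Capture everything between src=" and the next "
match = re.search(r'\bsrc="([^"]+)"', html)
if match:
    print(match.group(1))  # https://example.com/clip.mp4
```

For anything more complicated than a single attribute, though, the Beautiful Soup version above is the more robust choice.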

Losing data when scraping with Python?

UPDATE (4/10/2018):
I found that the information isn't available in the page source at all, which means I have to use Selenium.
UPDATE:
I played around with this problem a bit more. Instead of running soup, I took pageH, decoded it into a string, and wrote it out to a text file. I found that '{{ optionTitle }}' and '{{priceFormat (showPrice, session.currency)}}' come from a template section stated separately in the HTML file, which I THINK means I was just looking in the wrong place. I am still unsure, but that's what I think.
So now I have a new question. After looking at the text file, I realize that the information I need is not even in pageH. At the place where it should be, the file instead says:
<bread-crumbs :location="location" :product-name="product.productName"></bread-crumbs>
<product-info ref="productInfo" :basic="product" :location="location" :prod-info="prodInfo"></product-info>
What does this mean?/Is there a way to get through this to get to the information?
ORIGINAL QUESTION:
I am trying to collect the names/prices of products from a website. I am unsure whether the data is being lost by the html parser or by BeautifulSoup, but once I do get to the position I want, what is returned instead of the specific name/price is '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}'. After I get the url using pageH = urllib.request.urlopen(), the code that gives this result is:
pageS = soup(pageH, "html.parser")
pageB = pageS.body
names = pageB.findAll("h4")
optionTitle = names[3].get_text()
optionPrice = names[5].get_text()
Because this didn't work, I tried going about it a different way and looked for more specific tags, but the section of the code that mattered just does not show. It completely disappears. Is there something I can do to get the specific names/prices or is this a security measure that I cannot work through?
The {{ }} syntax is client-side templating (Angular or Vue; the :location="..." bindings in your update are Vue-style). The values are filled in by JavaScript in the browser, so they never appear in the raw HTML. Try Requests-HTML to do the rendering (by calling render()) and read the content afterwards. Example shown below:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
# '<time>25</time>'
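Rendering is not always necessary: many template-driven pages ship the real data as JSON inside a script tag, so it can be worth grepping the raw source before reaching for render() or Selenium. A hedged sketch, where the variable name window.__INITIAL_STATE__ and the page content are made up for illustration:

```python
import json
import re

# Made-up page source: the visible HTML is a template, but the data
# is embedded as JSON in a <script> tag.
page = '<script>window.__INITIAL_STATE__ = {"product": {"productName": "Widget", "price": 9.99}};</script>'

# Grab the JSON object assigned to the state variable and parse it.
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*\})\s*;', page)
if match:
    data = json.loads(match.group(1))
    print(data['product']['productName'])  # Widget
```

If no such embedded JSON exists, then the data really is fetched at runtime and a rendering approach (Requests-HTML or Selenium) is the way to go.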

find some value in javascript, in the response form

I have a URL, www.example.com/test.
When I visit this URL with RoboBrowser, I find some js in the response, containing something like this:
var token = _.unescape("<input name="__RequestVerificationToken" type="hidden" value="wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2" />");
aw.antiforgeryToken[$(token).attr('name')] = $(token).val();
I want to get 'wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2'
I tried this
browser=RoboBrowser()
browser.open('https://www.example.com/test')
result=browser.find('script',{'name':'__RequestVerificationToken'})
This returns None.
So how can I do this?
Thanks!
browser.find works on the HTML tree, and since the value you want sits inside a JS call, it can't reach it.
So the other options are:
regex (a bit hardcoded, in my opinion)
Find the script whose text contains the token, then pull out the value string, i.e. 'wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2', with a regex.
lxml.html (XPath)
The other way, which I MAY prefer, is lxml.html (from lxml import html is one and the same thing).
Here is a bit of a sketch of it:
import lxml.html
data = lxml.html.fromstring(parsedData)
stuff = data.xpath('XPATH to your data')
You can find more here: Can I parse xpath using python, selenium and lxml? and have a look in the docs as well.
I hope I was helpful.
Cheers.
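For the regex route, a minimal sketch, assuming the raw response body is in a string named response_text (the token value here is shortened for illustration):

```python
import re

# Assumes the raw response body is in a string named `response_text`.
response_text = 'var token = _.unescape("<input name="__RequestVerificationToken" type="hidden" value="wi5U8xXi-shortened" />");'

# Capture the value="..." that follows the __RequestVerificationToken name.
match = re.search(r'__RequestVerificationToken[^>]*value="([^"]+)"', response_text)
if match:
    print(match.group(1))  # wi5U8xXi-shortened
```

This works directly on browser.response.text (or whatever holds the raw page), since no HTML/JS parsing is needed just to pull one quoted string.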

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is an easy way to copy all the source of a url, but that's not my task. I need to save exactly the text (just what a webbrowser user would copy) to a *.txt file.
Is parsing the html source unavoidable, or is there a better way?
I think it is impossible without parsing at all. You could use HtmlParser http://docs.python.org/2/library/htmlparser.html and just keep the data between tags, but you will most likely get many elements you don't want.
Getting exactly the same as [Ctrl-C] would be very difficult without parsing, because of things like style="display: hidden;" which hide text; handling that would require fully parsing the html, javascript and css of both the document and its resource files.
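A minimal sketch of that HtmlParser approach (in Python 3 the module is html.parser); it only keeps text nodes and skips script/style content, so it is a rough approximation of what a browser user would copy, not an exact one:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

p = TextExtractor()
p.feed('<html><body><p>Hello</p><script>var x=1;</script><p>World</p></body></html>')
print(' '.join(p.parts))  # Hello World
```

It won't honor CSS visibility or layout, but for many pages it gets close enough to "all the visible text".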
Parsing is required. I don't know if there's a library method for this exactly. A simple regex:
import re
text = re.sub(r"<[^>]+>", " ", html)
This requires many improvements, but it's a starting point.
With python, the BeautifulSoup module is great for parsing HTML and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary html pages. Some of the pages (which I have no control over) have malformed html or scripts, which makes this difficult. Also, I'm on a shared hosting environment, so I can install python libs, but I can't install just anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed html pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
soup = BeautifulSoup.BeautifulSoup(s)
# remove comments
comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
[comment.extract() for comment in comments]
# remove scripts
for script in soup.findAll('script'):
    script.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter html by trying to find 1st instance of "<body" and replace everything prior to that, with "<html><head></head>"
# try beautifulsoup again with new html
If BeautifulSoup still does not work, I resort to a heuristic: look at the first and last characters of each line (to see whether it looks like a line of code, e.g. starts with < or ends with ;), take a sample of the line, and check whether its tokens are english words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning machines), and of course write it as well.
Any advice, tools or strategies would be most welcome. Also, I realize the latter part is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains some small amount of actual English text.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
Well, it depends on how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did:
# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
u''.join( unicode( tag ) for tag in BeautifulSoup( oldhtml ).body.contents ))
# use html2text to turn it into text
text = html2text( newhtml )
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
