I am building a screen clipping app.
So far:
I can get the html mark up of the part of the web page the user has selected including images and videos.
I then send them to a server to process the html with BeautifulSoup to sanitize the html and convert all relative paths if any to absolute paths
Now I need to render the part of the page. But I have no way to render the styling. Is there any library to help me in this matter or any other way in python ?
One way would be to fetch the whole webpage with urllib2 and remove the parts of the body I don't need and then render it.
But there must be a more pythonic way :)
Note: I don't want a screenshot. I am trying to render proper html with styling.
Thanks :)
Download the complete webpage, extract the style elements and the stylesheet link elements and download the files referenced the latter. That should give you the CSS used on the page.
Related
I am trying to scrape the addresses of Dunkin' locations using this website: https://www.dunkindonuts.com/en/locations?location=10001. However, when trying to access the list of each Dunkin' on the web page, it shows up as comment. How do I access the list? I've never done web scraping before.
Here's my current code, I'm expecting a list of Dunkin' stores which I can then extract the addresses from.
requests.get() will return the raw HTML for a web page. This is only the beginning of the journey when you view this page in the browser. Your browser will parse that HTML to create the DOM. It will load other resources, such as images and scripts from other files. Then it will execute those scripts. In the modern web, those scripts will modify the DOM to give the page that you finally see in the browser. requests alone doesn't give you all that.
One solution is to use a library that loads the HTML into a browser and does all of the magic. selenium is one such library.
I am currently scraping a website for work to be able to sort the data locally, however when I do this the code seems to be incomplete, and I feel may be changing whilst I scroll on the website to add more content. Can this happen ? And if so, how can I ensure I am able to scrape the whole website for processing?
I only currently know some python and html for web scraping, looking into what other elements may be affecting this issue (javascript or ReactJS etc).
I am expecting to get a list of 50 names when scraping the website, but it only returns 13. I have downloaded the whole HTML file to go through it and none of the other names seem to exist in the file, i.e. why I think the file may be changing dynamically
Yes, the content of the HTML can be dynamic, and Javascript loading should be the most essential . For Python, scrapy+splash maybe a good choice to get started.
Depending on how the data is handled, you can have different methods to handle dyamic content HTML
i'm working to extract main contents from web page without removing anything like image in python, yet most library just gives me back the text itself or cleaned dom elements.
i need dom elements themselves that contain main contents of article including image.
Is there any library for that purpose?
Thanks
If you mean getting whole dom node with img src="" then i believe beautifulsoup4 can do that.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
but with actual image i dont know you have to make separate request for image.
Or you can use selenium https://pypi.python.org/pypi/selenium, It will use your browser (Firefox, Chrome) so can do anything with extracting web contents
I want to be able to get the list of all URLs that a browser will do a GET request for when we try to open a page. For eg: if we try to open cnn.com, there are multiple URLs within the first HTTP response which the browser recursively requests for.
I'm not trying to render a page but I'm trying to obtain a list of all the urls that are requested when a page is rendered. Doing a simple scan of the http response content wouldn't be sufficient as there could potentially be images in the css which are downloaded. Is there anyway I can do this in python?
It's likely that you'll have to render the page (not necessarily display it though) to be sure you're getting a complete list of all resources. I've used PyQT and QtWebKit in similar situations. Especially when you start counting resources included dynamically with javascript, trying to parse and load pages recursively with BeautifulSoup just isn't going to work.
Ghost.py is an excellent client to get you started with PyQT. Also, check out the QWebView docs and the QNetworkAccessManager docs.
Ghost.py returns a tuple of (page, resources) when opening a page:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
resources includes all of the resources loaded by the original URL as HttpResource objects. You can retrieve the URL for a loaded resource with resource.url.
I guess you will have to create a list of all known file extensions that you do NOT want, and then scan the content of the http response, checking with "if substring not in nono-list:"
The problem is all href's ending with TLDs, forwardslashes, url-delivered variables and so on, so i think it would be easier to check for stuff you know you dont want.
Background:-
Am using a JS editor on my site. Now when I copy paste external text content on it, there are a few invalid/incomplete html tags that get pasted. ( But it is not made visible on the editor)
Problem:-
Now when this data is posted, the alignment of the entire page gets screwed. How can I detect and change incomplete html tags if any. Should I use a html parser for this purpose ?
As you can see the edit and delete buttons have come out of the div. (The description or data has been copy pasted)
You could try the fantastic Beautiful Soup library for all your parsing needs. Buy now!