Kindly have a short look here: https://www.cbp.gov/contact/find-broker-by-port/4901. I am trying to scrape the list of all brokers, port-wise. My question concerns the approach to take when multiple clicks (forward/back) are needed to arrive at one or more data items. Could you point me to some reading material on this, or any other solution you deem fit? Many thanks.
You can use Selenium to automate multiple clicks (forward/back) as needed, and also to locate specific data items.
Below is a very good example:
[1] https://selenium-python.readthedocs.io/getting-started.html
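For illustration, a minimal Selenium sketch of the click-forward/go-back pattern on that page; the CSS selectors below are placeholders, not the page's real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.cbp.gov/contact/find-broker-by-port/4901")

# Placeholder selector for the per-port links; re-find them on every pass,
# because going back invalidates the old element references.
num_links = len(driver.find_elements(By.CSS_SELECTOR, "a.port-link"))
for i in range(num_links):
    link = driver.find_elements(By.CSS_SELECTOR, "a.port-link")[i]
    link.click()                                      # "forward" click
    for cell in driver.find_elements(By.CSS_SELECTOR, "td"):
        print(cell.text)                              # placeholder: broker data cells
    driver.back()                                     # return to the port list

driver.quit()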
Update: Another approach, if the website is static, is to use requests with BeautifulSoup. Here is an example: https://medium.com/@itylergarrett.tag/learning-web-scraping-with-python-requests-beautifulsoup-936e6445312
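If the data really is present in the static HTML (an assumption worth checking with "view source"), a requests + BeautifulSoup version could be as small as this sketch:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.cbp.gov/contact/find-broker-by-port/4901")
soup = BeautifulSoup(resp.text, "html.parser")

# Placeholder: dump every table cell; narrow the selector to the real broker table.
for cell in soup.find_all("td"):
    print(cell.get_text(strip=True))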
I know it's a broad question, but I'm looking for ideas on how to go about this. I'm not looking for the exact coded answer, just a rough game plan of how to approach it!
I'm trying to scrape a blog site to check for new blog posts, and if so, to return the URL of that particular blog post.
There are two parts to this question, namely:
1. Finding out if the website has been updated
2. Finding what the difference is (the new content)
I'm wondering what approaches I could take to do this. I have been using Selenium for quite a bit, and I am aware that with the Selenium driver I could check for (1) with driver.page_source.
Is there a better way to do both 1 and 2 together, and if possible even across various different blog sites (i.e. is it possible to write more general code that applies to various blogs at once, rather than a custom script for each one)?
Bonus: Is there a way to do a "diff" on the before and after of the page source to see the difference and extract the necessary information from there?
Thanks so much in advance!
If you're looking for a way to know if pages have been added or deleted, you can either look at the site's sitemap.xml directly or build yourself a copy of one. If they do not have a sitemap.xml, you can crawl the site's menu and navigation and build your own from that. Sitemap files have a 'last modified' entry. If you know the interval you are scraping on, you can quickly calculate whether the change occurred within that interval. This is good for site-wide changes.
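A rough sketch of the sitemap idea (the sitemap URL and the 24-hour interval are assumptions; adjust them for the real site and your schedule):

import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

SITEMAP_URL = "https://example-blog.com/sitemap.xml"        # assumption
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)   # your scraping interval

root = ET.fromstring(requests.get(SITEMAP_URL).content)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    if not lastmod:
        continue
    ts = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)   # date-only lastmod values parse as naive
    if ts > cutoff:
        print("Changed since last run:", loc)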
Alternatively, you can check the page's response headers to determine its last modified time. Apply the same interval check as with the sitemap and go from there.
You can always check the Last-Modified value in the website's response headers:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
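A minimal sketch of the header check (the URL is a placeholder, and not every server sends this header):

import requests
from email.utils import parsedate_to_datetime

resp = requests.head("https://example-blog.com/some-post", allow_redirects=True)
last_modified = resp.headers.get("Last-Modified")
if last_modified:
    print("Page last modified at:", parsedate_to_datetime(last_modified))
else:
    print("No Last-Modified header; fall back to hashing or diffing the page body.")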
I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I am specifying the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ changes for each different skill.
I want to know how I can input a regular expression or similar code into my input URL so that I can read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.
You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
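A hedged sketch of that approach; the index URL and the link pattern are assumptions about the site's layout, not verified against it:

import re
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://www.alexaskillstore.com/Business-Leadership-Series/"   # assumption
BASE = "https://www.alexaskillstore.com/Business-Leadership-Series/{key}"

soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")

# Pull product keys (e.g. B078LNGS5T) out of any links that contain them.
keys = set()
for a in soup.find_all("a", href=True):
    m = re.search(r"/Business-Leadership-Series/([A-Z0-9]{10})", a["href"])
    if m:
        keys.add(m.group(1))

urls = [BASE.format(key=k) for k in sorted(keys)]
print(urls)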
My question is more of a generic one rather than about a single piece of code. I'm scraping some websites and have been doing so with Selenium + BeautifulSoup 4 (the sites are JavaScript-based).
Up until now I didn't think of any other way, but browsing through threads I started to come across element locators such as
element(by.id("id"));
element(by.css("#id"));
element(by.xpath("//*[@id='id']"))
So my question is: is it really necessary to get the plain text and use find_all to locate the info you need, when you can do the same with XPath or CSS selectors? I mean, what is the difference in terms of coding?
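For comparison, here is roughly how the same lookup reads in each style (a sketch; the URL and the id are placeholders):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")   # placeholder URL

# Locating straight through the driver, no extra parsing step:
el_css = driver.find_element(By.CSS_SELECTOR, "#id")
el_xpath = driver.find_element(By.XPATH, "//*[@id='id']")

# Handing the rendered HTML to BeautifulSoup and searching it:
soup = BeautifulSoup(driver.page_source, "html.parser")
el_bs = soup.find(id="id")

driver.quit()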
And another thing: in terms of speed, which way is faster? Or more robust, for that matter?
Many thanks in advance
I was just wondering if it is possible to scrape information from this website that is contained in a Flash file (http://www.tomtom.com/lib/doc/licensing/coverage/).
I am trying to get all the text from the different components of this website.
Can anyone suggest a good starting point in Python, or any simpler method?
I believe the following blog post answers your question well. The author had the same need, to scrape Flash content using Python, and ran into the same problem. He realized that he just needed to instantiate a browser (even an in-memory one that never displays to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
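The same render-then-scrape idea with today's tooling would look roughly like the sketch below, using Selenium with a headless browser instead of the pywebkitgtk setup from the post (note: this only helps if the content is exposed as HTML/JavaScript; a pure Flash object will not render this way):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")     # render in memory, no visible window
driver = webdriver.Chrome(options=opts)

driver.get("http://www.tomtom.com/lib/doc/licensing/coverage/")
rendered_html = driver.page_source      # whatever the browser ended up rendering
print(rendered_html[:500])

driver.quit()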
I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need a list of all the pages (dynamic or static) within that domain; your Python program cannot know that automatically. You must obtain this knowledge elsewhere, either by following links or by looking at the website's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to send a regular expression to the web server. Web servers do not support that kind of request, so it fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, decides which of them to follow, then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside the domain, and getting banned from the web server for spamming it with thousands of requests.
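A minimal sketch of such a spider (Python 3, requests + BeautifulSoup; xyz.com is the placeholder domain from the question):

import time
import urllib.parse
import requests
from bs4 import BeautifulSoup

START = "http://www.xyz.com/"
DOMAIN = urllib.parse.urlparse(START).netloc

seen = set()
queue = [START]

while queue:
    url = queue.pop()
    if url in seen:
        continue                                  # skip loops and duplicate links
    seen.add(url)

    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    # ... extract whatever data you need from `soup` here ...

    for a in soup.find_all("a", href=True):
        nxt = urllib.parse.urljoin(url, a["href"]).split("#")[0]
        if urllib.parse.urlparse(nxt).netloc == DOMAIN:   # stay inside the domain
            queue.append(nxt)

    time.sleep(1)                                 # be polite; don't hammer the server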
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
Its CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in. There is no need to get links recursively yourself; it asynchronously handles all the heavy lifting for you. Just specify your domain, your search terms, and how deep you want it to search, e.g. the whole site.
http://doc.scrapy.org/en/latest/index.html
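A minimal CrawlSpider sketch (xyz.com is again just a placeholder; save it as e.g. site_spider.py and run with "scrapy runspider site_spider.py -o pages.json"):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["xyz.com"]            # keeps the crawl inside the domain
    start_urls = ["http://www.xyz.com/"]

    # Follow every in-domain link and hand each downloaded page to parse_page.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}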