Scraping multiple web pages with the same starting URL string in Python

I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I specify the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ changes for each skill.
I want to know how I can supply a regular expression or similar code as my input URL so that I am able to read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.

You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
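A minimal sketch of that index-scraping approach, assuming a hypothetical listing page on the same site that links to each skill; the index URL and the key pattern are assumptions, so adapt them to whatever index pages the site actually exposes:

import re
import requests
from bs4 import BeautifulSoup

BASE = "https://www.alexaskillstore.com"
PREFIX = "/Business-Leadership-Series/"

# Fetch an index page (assumed to exist) and collect the product keys from its links.
index_html = requests.get(BASE + "/Business-Leadership-Series").text
soup = BeautifulSoup(index_html, "html.parser")

keys = set()
for a in soup.find_all("a", href=True):
    match = re.match(re.escape(PREFIX) + r"([A-Z0-9]+)$", a["href"])
    if match:
        keys.add(match.group(1))

# Generating the final URLs is then simple string substitution.
urls = [BASE + PREFIX + key for key in keys]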

Related

How to find the correct URL when you have made some choices on the web page?

I'm very new to web scraping. Using XPath selectors, I am trying to extract data from this web page: https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml
The problem is that whenever you change the date or the power plant name, the URL does not change, so when you fetch the response you always get the same (wrong) answer. Is there a way to find the correct URL, or anything else related to the HTML markup, that would help?
For a scraping operation like this, you'll need to do a bit more than just load the document and then grab the content. The document in question relies on JavaScript to load new information from some other resource after the user has defined a particular set of parameters and updated the form.
After loading the document, you'll need to define your search parameters. You can do this via JavaScript injection or via your browser's console. For example, to define the value of the first date field, you could use:
document.querySelectorAll('#j_idt199 input')[1].value = "Some/New/Date";
Repeat this process for the other fields you wish to define in your search, and then run the following code to programmatically execute your search:
document.querySelector('#j_idt199 button').click();
After that, you can either grab the information you want using plain JS query selectors, or you can implement a scraping library like artoo.js to help you interpret the data and export it.
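If you'd rather drive all of this from Python than the browser console, here is a minimal Selenium sketch of the same steps; the '#j_idt199' selectors are copied from the answer above and may change between page loads, so treat them as assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml")

# Set the first date field via JavaScript injection, as described above.
driver.execute_script(
    "document.querySelectorAll('#j_idt199 input')[1].value = arguments[0];",
    "Some/New/Date",
)

# Programmatically execute the search.
driver.execute_script("document.querySelector('#j_idt199 button').click();")

# Once the page has updated, grab the results with ordinary selectors
# (the table selector here is an assumption about the result markup).
rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")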

Filling forms on different websites using Selenium and Python

I'm a beginner to Python and trying to start my very first project, which revolves around creating a program to automatically fill in pre-defined values in forms on various websites.
Currently, I'm struggling to find a way to identify web elements using the text shown on the website. For example, website A's email field shows "Email:" while another website might show "Fill in your email". In such cases, finding the element using ID or name would not be possible (unless I write a different set of code for each website) as they vary from website to website.
So my question is: is it possible to write code that will scan all the fields, check the text, and then fill in the values based on the text associated with each field?
It is possible if you know the markup of the page and can write code to parse it. In that case you should use XPath, lxml, Beautiful Soup, Selenium, etc. You can find many tutorials on Google or YouTube; just search for "python scraping".
But if you want to write a program that can understand a random page on a random site and work out what it should do, that is very difficult: a complex task that would involve machine learning. I'd guess it is not a task for beginners.
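For the simpler label-matching case, here is a minimal Selenium sketch; the label variants, the URL, and the assumption that the matching input follows its label in document order are all illustrative:

from selenium import webdriver
from selenium.webdriver.common.by import By

EMAIL_LABELS = ["Email:", "Fill in your email"]  # example label texts to scan for

driver = webdriver.Chrome()
driver.get("https://example.com/signup")  # placeholder URL

email_field = None
for text in EMAIL_LABELS:
    # Find a label containing the text, then the first input after it.
    matches = driver.find_elements(
        By.XPATH,
        "//label[contains(normalize-space(.), '%s')]/following::input[1]" % text,
    )
    if matches:
        email_field = matches[0]
        break

if email_field:
    email_field.send_keys("user@example.com")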

The HTML code I scrape seems to be incomplete in comparison to the full website. Could the HTML be changing dynamically?

I am currently scraping a website for work to be able to sort the data locally. However, the HTML I get back seems to be incomplete, and I suspect it may change as I scroll on the website, with more content being added. Can this happen? And if so, how can I ensure I am able to scrape the whole website for processing?
I currently only know some Python and HTML for web scraping, and I am looking into what else may be causing this issue (JavaScript, ReactJS, etc.).
I am expecting to get a list of 50 names when scraping the website, but it only returns 13. I have downloaded the whole HTML file to go through it, and none of the other names appear in the file, which is why I think the page may be changing dynamically.
Yes, HTML content can be dynamic, and JavaScript loading is the most likely cause here. For Python, scrapy + splash may be a good choice to get started.
Depending on how the data is loaded, there are different methods for handling dynamic HTML content.
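One browser-driven approach, sketched with Selenium under the assumption that the missing names are appended by JavaScript as you scroll (infinite scroll); the URL and the .name selector are placeholders:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/people")  # placeholder URL

# Keep scrolling to the bottom until the page height stops growing,
# i.e. the JavaScript has no more content to append.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".name")]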

Parsing a webpage for indexing

I am trying to understand and optimize the logic for indexing a site. I am new to the HTML/JS side of things, so I am learning as I go. While indexing a site, I recursively go deeper into it based on the links on each page. One problem is that pages have repeating URLs and text, like the header and footer. For the URLs, I keep a list of URLs I have already processed. Is there something I can do to identify the text that repeats on each page? I hope my explanation is clear enough.
I currently have the code (in Python) to get a list of useful URLs for the site. Now I am trying to index the content of those pages. Is there a preferred logic to identify or skip the repeating text (headers, footers, other blurb)? I am using BeautifulSoup plus the requests module.
I am not quite sure if this is what you are hoping for, but Readability is a popular service that parses just the "useful" content from a page. It is the service integrated into Safari for iOS.
It intelligently extracts the worthwhile content of the page while ignoring things like the footer, header, ads, etc.
There are open source ports for python/ruby/php and probably other languages.
https://github.com/buriy/python-readability
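A short sketch of that Python port in use (the article URL is a placeholder):

import requests
from readability import Document

html = requests.get("https://example.com/article").text  # placeholder URL
doc = Document(html)

print(doc.title())    # the detected page title
print(doc.summary())  # HTML of just the main content, minus header/footer/ads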

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name, and I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
I don't know why you would expect domain.com/(.*) to work. You need a list of all the pages (dynamic or static) within that domain; your Python program cannot know that automatically. You must obtain this knowledge from elsewhere, either by following links or by looking at the website's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
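A minimal sketch of such a spider with requests and BeautifulSoup; the page cap is an arbitrary politeness limit, and a real crawler should also respect robots.txt and rate limits:

from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    domain = urlparse(start_url).netloc
    to_visit, visited = [start_url], set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue  # avoid loops and duplicate links to the same page
        visited.add(url)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain:  # stay inside the domain
                to_visit.append(link)
    return visited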
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
Its CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in: there is no recursive link-gathering to write yourself. It asynchronously handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html
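A minimal CrawlSpider sketch of that suggestion; the spider name, domain, and extracted fields are placeholders:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["xyz.com"]          # keeps the crawl inside the domain
    start_urls = ["http://www.xyz.com/"]
    rules = (
        # Follow every in-domain link and hand each page to parse_page.
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

You can run a standalone spider like this with scrapy runspider spider.py -o pages.json.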
