Web Scraping in R / Python - python

I need to extract data from https://eservices.dha.gov.ae/DHASearch/UIPages/ProfessionalSearch.aspx?PageLang=En. I need five columns: "name", "gender", "Titles", "Hospital Name", and "Contact details". The "Titles" info is shown only when you click on a name. Another problem I am facing is extracting info from multiple pages. In total, there are 10071 records, and I need the info for all of them. Currently I am using the rvest package in R, but it throws an error. See the code below:
library(rvest)
session = html_session("https://eservices.dha.gov.ae/DHASearch/UIPages/ProfessionalSearch.aspx")
form = html_form(session)[[1]]
Error : Subscript out of bounds
I am open to a solution in Python, although I am a novice with beautifulsoup. Any help would be highly appreciated!

If you have the right to scrape all this personal information, the best approach is probably Selenium in Python with a web driver: navigate the pages by invoking the JavaScript function that each pagination link calls, and pull the page source after each call. This is probably your best bet, since the data is loaded through JavaScript calls.
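The approach above can be sketched roughly as follows. This is only a sketch under assumptions: it supposes the pager is a standard ASP.NET WebForms grid whose links call `__doPostBack(target, "Page$N")`, and the `postback_target` value is hypothetical and would have to be read from the pager links in the browser's dev tools. The `driver` argument is expected to be a Selenium WebDriver that has already loaded the first results page.

```python
import time


def collect_page_sources(driver, total_pages, postback_target,
                         timeout=10.0, poll=0.5):
    """Return the HTML of pages 1..total_pages as a list of strings.

    Assumes the site pages through results with the ASP.NET WebForms
    function __doPostBack(target, "Page$N"); that is an assumption about
    the site, not something confirmed by the question.
    """
    sources = [driver.page_source]  # page 1 is already loaded by the caller
    for page in range(2, total_pages + 1):
        previous = driver.page_source
        # Trigger the same JavaScript call a click on the pager link would make.
        driver.execute_script(
            "__doPostBack(arguments[0], arguments[1]);",
            postback_target,
            f"Page${page}",
        )
        # Crude wait: poll until the page source differs from the previous page.
        deadline = time.monotonic() + timeout
        while driver.page_source == previous:
            if time.monotonic() > deadline:
                raise TimeoutError(f"page {page} did not load in {timeout}s")
            time.sleep(poll)
        sources.append(driver.page_source)
    return sources
```

Each returned string can then be handed to an HTML parser to pull out the table rows. Polling on a changed page source is a deliberately simple stand-in for a proper explicit wait on a specific element.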

Related

I'm trying to scrape stock data; the GET request works but returns no data

First of all,
Thank you for your help.
I'm trying to create a stock data scraper based on the Frankfurt website.
I'm trying to get the historical prices section of the website.
I inspected the elements with chrome and I found an API in the network section.
Here is the response and preview on the network section.
I tried to use the code below :
When I run my script :
The code response seems good, but there is nothing in return.
I also tried to use BeautifulSoup, but I can't find where the data is located.
Here is what I tried :
Thank you for your time!
The request returns nothing because the data you want to scrape is rendered by JavaScript.
So first check whether the page data is rendered by JavaScript; if it is, try Selenium or Puppeteer to get the data.
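One simple way to run that check, as a minimal sketch: copy a value you can see in the browser (a price, a date) and test whether it appears in the raw response body. The helper names `fetch_raw_html` and `is_in_raw_html` are made up for this illustration, and the User-Agent header is just a generic placeholder.

```python
import urllib.request


def fetch_raw_html(url, timeout=30):
    """Download the page exactly as the server sends it (no JavaScript runs)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_in_raw_html(raw_html, expected_text):
    """Return True if a value seen in the browser is present in the raw HTML."""
    return expected_text.casefold() in raw_html.casefold()
```

If `is_in_raw_html(fetch_raw_html(url), "some value from the page")` is false while the value is clearly visible in the browser, the data is most likely filled in by JavaScript after the initial load.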

Scraping data from a webpage based on VIEWSTATES

I'm attempting to scrape the details of all documents on this link.
The problem I'm facing is that the site is built with ASP.NET and the ViewStates aren't letting me access the data directly. I tried a mixture of BeautifulSoup, Scrapy and Selenium, but to no avail. The data consists of 12782 documents whose pdf download link I need to extract from the page that redirects from each entry of the returned results on the aforementioned page.
The site also has an API here, but the catch is that it only returns 2000 data points at any given time, so getting all ~12k data points that way is out of the question.
Can someone help me with ANY ONE of the following:
Create a scraper to get the pdf links
Generate a query to get all the data from the API
Any recurrence relation that helps me generate links to get the queries for the API
Using the requests section in the API to get all the records at the same time delivered to your email
Ideally, a solution in python would be great, but if you can help me get a csv file of all the links, that would also work. Thanks in advance!
I ended up solving the problem by using the request functionality which was located here.
It took in a particular query and my email address and sent me the entire data dump I needed. From that data dump, I could use all the pdf links.

Url request does not parse every information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but I am not that familiar with websites/HTML. So I would appreciate it if you explained the website-related info as if you were talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
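from selenium import webdriver
# Start a Firefox session driven by Selenium
browser = webdriver.Firefox()
browser.get("http://example.com")
# page_source holds the HTML of the page as currently loaded in the browser
html_source = browser.page_source
# Close the browser when done
browser.quit()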
With Selenium, you can also use an XPath to locate the data you want (to "extract the price information from this website"); you can see a tutorial on that here. Alternatively, once you have the HTML, you can use a parser such as bs4 to extract the required data.

Suitable Python modules for navigating a website

I am looking for a python module that will let me navigate searchbars, links etc of a website.
For context I am looking to do a little webscraping of this website [https://www.realclearpolitics.com/]
I simply want to take information on each state (polling data etc) in relation to the 2020 election and organize it all in a collection of a database.
Obviously there are a lot of states to go through, and each is on a separate webpage. So I'm looking for a method in Python with which I could quickly navigate the site and take the data from each page, as well as update and add to existing data. Finding a method of quickly navigating links and search bars with my own input would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]
for state in states:
    # code to look up the state in the searchbar of website
    # figures being taken from website etc
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As @LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI via a headless browser, e.g. clicking a button or entering text into a search bar. If your needs aren't that complex, for instance if you just need to quickly fetch the raw content of a web page and process it, then you should use the requests library (a third-party package, installed with pip; it is not part of Python's standard library).
For processing raw content from a crawl, I would recommend beautiful soup.
Hope that helps!
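As a rough illustration of the "fetch the raw content and process it" route, the sketch below collects the links on an index page whose link text matches a state name. To keep it runnable without installing anything, it uses the standard library's `html.parser` in place of Beautiful Soup; the page structure, the example URLs, and the names `LinkCollector` and `links_for_states` are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs from every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((text, urljoin(self.base_url, self._href)))
            self._href = None


def links_for_states(html, base_url, states):
    """Map each state name found as link text to its absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    wanted = {state.casefold() for state in states}
    return {text: url for text, url in collector.links
            if text.casefold() in wanted}
```

The HTML passed in would come from something like `requests.get(index_url).text`; each returned URL can then be fetched and parsed in the same loop over `states`.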

How to scrape dynamic content from a website?

So I'm using scrapy to scrape a data from Amazon books section. But somehow I got to know that it has some dynamic data. I want to know how dynamic data can be extracted from the website. Here's something I've tried so far:
import scrapy
from ..items import AmazonsItem
class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']
    def parse(self, response):
        items = AmazonsItem()
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Now I was using SelectorGadget to select a class which I have to scrape but in case of a dynamic website, it doesn't work.
So how do I scrape a website which has dynamic content?
what exactly is the difference between dynamic and static content?
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
how would I know that data is dynamically created?
So how do I scrape a website which has dynamic content?
there are a few options:
Use Selenium, which allows you to simulate opening a browser, letting the page render, then pull the html source code
Sometimes you can look at the XHR and see if you can fetch the data directly (like from an API)
Sometimes the data is within the <script> tags of the html source. You could search through those and use json.loads() once you manipulate the text into a json format
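The last option can be sketched as below. This is a minimal illustration, not a general solution: it scans each `<script>` body for text that parses as a JSON object, and the function name `extract_json_from_scripts` is made up for the example. Real pages often need a more targeted search, for instance for a specific variable name.

```python
import json
import re

# Capture the body of every <script> element (a simplification; a regex is
# not a full HTML parser).
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def extract_json_from_scripts(html):
    """Return every non-empty JSON object embedded in the page's <script> tags."""
    decoder = json.JSONDecoder()
    found = []
    for body in SCRIPT_RE.findall(html):
        index = body.find("{")
        while index != -1:
            try:
                # raw_decode parses one JSON value starting at `index` and
                # reports where it ended, ignoring any trailing JavaScript.
                value, end = decoder.raw_decode(body, index)
            except json.JSONDecodeError:
                # Not JSON at this brace (ordinary JavaScript); try the next one.
                index = body.find("{", index + 1)
                continue
            if value:  # skip empty objects such as a bare "{}" function body
                found.append(value)
            index = body.find("{", end)
    return found
```

For example, a page containing `<script>window.__DATA__ = {"title": "Book", "price": 199};</script>` would yield a list with that one dictionary, which can then be read like any other Python dict.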
what exactly is the difference between dynamic and static content?
Dynamic means the data is generated from a request after the initial page request. Static means all the data is there at the original call to the site
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
Refer to your first question
how would I know that data is dynamically created?
You'll know it's dynamically created if you see it in the dev tools page source, but not in the html page source you first request. You can also see if the data is generated by additional requests in the dev tool and looking at Network -> XHR
Lastly
Amazon does offer an API to access the data. Try looking into that as well
If you want to load dynamic content, you will need to simulate a web browser. When you make an HTTP request, you will only get the text returned by that request, and nothing more. To simulate a web browser, and interact with data on the browser, use the selenium package for Python:
https://selenium-python.readthedocs.io/
So how do I scrape a website which has dynamic content?
Websites with dynamic content pull their data from their own APIs. That data is not fixed; it may be different if you check again after some time. That does not mean you can't scrape a dynamic website, though: you can use automated testing frameworks like Selenium or Puppeteer.
what exactly is the difference between dynamic and static content?
As I explained in the answer to your first question, static data is fixed and stays the same until the page itself is changed, whereas dynamic data is updated periodically or changes asynchronously.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
For that, you can use libraries like BeautifulSoup in Python and cheerio in Node.js. Their docs are quite easy to understand, and I highly recommend reading them thoroughly.
You can also follow this tutorial
how would I know that data is dynamically created?
While reloading the page open the network tab in chrome dev tools. You will see a lot of APIs are working behind to provide the relevant data according to the page you are trying to access. In that case, the website is dynamic.
So how do I scrape a website which has dynamic content?
To scrape the dynamic content from websites, we are required to let the web page load completely, so that the data can be injected into the page.
What exactly is the difference between dynamic and static content?
Content in static websites is fixed content that is not processed on the server and is directly returned by using prebuild source code files.
Dynamic websites load the contents by processing them on the server side in runtime. These sites can have different data every time you load the page, or when the data is updated.
How would I know that data is dynamically created?
You can open the Dev Tools and go to the Network tab. Once you refresh the page there, look for XHR requests or requests to APIs. If such requests exist, the site is dynamic; otherwise it is static.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
To extract the dynamic content from the websites we can use Selenium (python - one of the best options) :
Selenium - an automated browser simulation framework
You can load the page, and use the CSS selector to match the data on the page. Following is an example of how you can use it.
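import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6")
# Crude fixed wait to give the page time to render
time.sleep(4)
# find_elements_by_css_selector was removed in Selenium 4; use find_elements with By
titles = driver.find_elements(By.CSS_SELECTOR, ".a-size-medium.a-color-base.a-text-normal")
print(titles[0].text)
driver.quit()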
In case you don't want to use Python, there are other open-source options like Puppeteer and Playwright, as well as complete scraping platforms such as Bright Data that have built-in capabilities to extract dynamic content automatically.
