Can't Scrape Dynamically Loaded HTML Table in an Aspx Website

Can't Scrape Dynamically Loaded HTML Table in an Aspx Website - python

I am trying to scrape some data from the Arizona Medical Board. I search for Anesthesiology in the specialty dropdown list and I find that the table (with the links to the profiles I want to scrape) are dynamically loaded into the website. I notice when hitting the 'specialty search' button, a POST request is made to the server and the html table is actually returned from the server. I have tried simulating this post request to see if I get receive this html table and then try to parse it with bs4. Is this possible, and if so, am I even on the right track?
I have tried to included the form data I found in the network tab of the developer tools but I am not sure if this is the right data, or if I am forgetting some data here or in the header.
Please let me know if I need to clarify, I understand this may not be worded the best. Thank you!
import requests
# import re
import formdata
session = requests.Session()
url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/public/WebVerificationSearch.aspx?q=azmd&t=20220622123512"
headers = {'User-Agent': 'My-Agent-Placeholder'}
res = session.get(url, headers=headers)
print("Response: {}".format(res))
payload = {
"__VIEWSTATE": formdata.state,
"__VIEWSTATEGENERATOR": formdata.generator,
"__EVENTVALIDATION" : formdata.validation,
"ctl00%24ContentPlaceHolder1%24Name": 'rbName1',
"ctl00%24ContentPlaceHolder1%24Name": "rbName1",
"ctl00%24ContentPlaceHolder1%24txtLastName" : '',
"ctl00%24ContentPlaceHolder1%24txtFirstName" : '',
"ctl00%24ContentPlaceHolder1%24License": "rbLicense1",
"ctl00%24ContentPlaceHolder1%24txtLicNum": '',
"ctl00%24ContentPlaceHolder1%24Specialty": "rbSpecialty1",
"ctl00%24ContentPlaceHolder1%24ddlSpecialty": '12155',
"ctl00%24ContentPlaceHolder1%24ddlCounty": '15910',
"ctl00%24ContentPlaceHolder1%24txtCity": '',
"__EVENTTARGET": "ctl00%24ContentPlaceHolder1%24btnSpecial",
"__EVENTARGUMENT": ''
}
# params = {"q": "azmd",
# "t": "20220622123512"}
# #url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/Public/Results.aspx"
res = session.post(url, data=payload, headers=headers)
print("Post response: {}".format(res))
print(res.text)
# res = requests.get('https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/Public/Results.aspx', headers=headers)

Try:
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://azbomprod.azmd.gov/GLSuiteWeb/Clients/AZBOM/public/WebVerificationSearch.aspx?q=azmd&t=20220622082816"
with requests.session() as s:
soup = BeautifulSoup(s.get(url, headers=headers).content, "html.parser")
data = {}
for inp in soup.select("input"):
data[inp.get("name")] = inp.get("value", "")
data["ctl00$ContentPlaceHolder1$Name"] = "rbName1"
data["ctl00$ContentPlaceHolder1$License"] = "rbLicense1"
data["ctl00$ContentPlaceHolder1$Specialty"] = "rbSpecialty1"
data["ctl00$ContentPlaceHolder1$ddlSpecialty"] = "12155"
data["ctl00$ContentPlaceHolder1$ddlCounty"] = "15910"
data["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$btnSpecial"
data["__EVENTARGUMENT"] = ""
soup = BeautifulSoup(
s.post(url, data=data, headers=headers).content, "html.parser"
)
for row in soup.select("tr:has(a)"):
name = row.select("td")[-1].text
link = row.a["href"]
print("{:<35} {}".format(name, link))
Prints:
Abad-Pelsang, Elma A. https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1620623&licID=121089&licType=1
Abadi, Bilal Ibrahim https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1755530&licID=525771&licType=1
Abbasian, Mohammad https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1635449&licID=492537&licType=1
Abdel-Al, Naglaa Z. https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1637612&licID=175204&licType=1
Abedi, Babak https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1641219&licID=169009&licType=1
Abel, Martin D. https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1624271&licID=510929&licType=1
Abenstein, John P. https://azbomprod.azmd.gov/glsuiteweb/clients/azbom/Public/Profile.aspx?entID=1622930&licID=502482&licType=1
...and so on.

Related

Why doesn't the post request return back any data

I want to get my results from my college's website with python, I typed this script:
import requests
import time
from bs4 import BeautifulSoup
# Make a request to the website
url = 'http://app1.helwan.edu.eg/Commerce/HasasnUpMlist.asp'
response = requests.get(url)
# Parse the response and create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Find the input field we need to fill with our ID
input_field = soup.find('input', {'name': 'x_st_settingno', 'id': 'x_st_settingno'})
input_field['value'] = 8936 # Fill in our ID
# Find the submit button and click it
submit_button = soup.find('input', {'name': 'Submit', 'id': 'Submit'})
data = {input_field['name']: input_field['value'], submit_button['name']: submit_button['value']}
response2 = requests.post(url, data=data)
# Parse the response and create a BeautifulSoup object
soup2 = BeautifulSoup(response2.text, 'html.parser')
print(`soup.find('form:nth-of-type(2) table tbody tr:first-of-type td b font')`)
But it always returns None. I do not know why?
The print(soup.find('form:nth-of-type(2) table tbody tr:first-of-type td b font')) part is just the head of the table that contains the link to my results, If I looked for the link it returns None as well.
What am I doing wrong, I am not good with web scraping I just started learning, I hope you can help me guys.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0'
}
def main(url):
with requests.Session() as req:
req.headers.update(headers)
params = {
"Submit": "%C8%CD%CB",
"x_dep": "",
"x_gro": "",
"x_sec": "",
"x_st_name": "",
"x_st_settingno": "8936",
"z_dep": "=",
"z_gro": "=",
"z_sec": "LIKE",
"z_st_name": "LIKE",
"z_st_settingno": "="
}
r = req.get(url, params=params)
soup = BeautifulSoup(r.content, 'lxml')
res = urljoin(url, soup.select_one('.ewTableRow span.aspmaker a')[
'href'])
r = req.get(res)
df = pd.read_html(r.content)
print(df)
main('http://app1.helwan.edu.eg/Commerce/HasasnUpMlist.asp')

Python - Scrapping Woocommerce does not bring text from price

i am working in a price update control between the web from my work and the Tango database (our management/administration system).
Because of that, i have to scrap prices from our web site iwth Python. But
i am having troubles while scraping woocommerce price text. I tried to scrape with requests html and with BeautifulSoup libraries but both brings (direct from source) the "bdi" price text as $0.00:
For example: https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg
Script de requests_html:
from requests_html import HTMLSession
import csv
import time
link = 'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg'
s = HTMLSession()
r = s.get(link)
#print(r.text)
title = r.html.find('h1', first=True).full_text
price = r.html.find('span.woocommerce-Price-amount.amount bdi')[0].full_text
print(price)
price = r.html.find('span.woocommerce-Price-amount.amount bdi')[1].full_text
print(price)
Result:
$0.00
$0.00
Script de BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg")
soup = BeautifulSoup(page.text, "html.parser")
print(soup)
Result:
<span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>0.00</bdi>
PS: i noticed that when the full web site is download it brings all the data and prices (not $0.00), so i do not know why are the libraries failling.
<div class="woocommerce-variation-price"><span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>325.54</bdi></span> <small class="woocommerce-price-suffix">( IVA incluido )</small></span></div>
Thanks you very much!

You can do it with Selenium. But i show you how to do it with json and bs4.
First we need product id:
def get_id(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
data_product_id = soup.find('form', class_='variations_form').get('data-product_id')
return data_product_id
Then with this ID, we can get price:
def get_price(product_id, payload):
url = "https://hierroscasanova.com.ar/?wc-ajax=get_variation"
payload = f"{payload}&product_id={product_id}"
headers = {
'accept': '*/*',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
response = requests.request("POST", url, headers=headers, data=payload)
json_data = json.loads(response.text)
return json_data['display_price']
Now remains to prepare the parameters for the link, and we can check:
attribute_pa_medida = '1=3'
attribute_pa_espesor = '2-85'
attribute_pa_unidad = 'kg'
attributes = f'attribute_pa_medida-{attribute_pa_medida}&attribute_pa_espesor={attribute_pa_espesor}&attribute_pa_unidad={attribute_pa_unidad}'
url = f'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?{attributes}'
print(get_price(get_id(url), attributes))
UPD full code:
import requests
import json
from bs4 import BeautifulSoup
def get_id(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
data_product_id = soup.find('form', class_='variations_form').get('data-product_id')
return data_product_id
def get_price(product_id, payload):
url = "https://hierroscasanova.com.ar/?wc-ajax=get_variation"
payload = f"{payload}&product_id={product_id}"
headers = {
'accept': '*/*',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
response = requests.request("POST", url, headers=headers, data=payload)
json_data = json.loads(response.text)
return json_data['display_price']
attribute_pa_medida = '1=3'
attribute_pa_espesor = '2-85'
attribute_pa_unidad = 'kg'
attributes = f'attribute_pa_medida-{attribute_pa_medida}&attribute_pa_espesor={attribute_pa_espesor}&attribute_pa_unidad={attribute_pa_unidad}'
url = f'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?{attributes}'
print(get_price(get_id(url), attributes))

How to scrape website tables with varying values depending on the selections

I am trying to scrape:
https://id.investing.com/commodities/gold-historical-data
table from 2010-2020, but the problem is the link between the default date and the date that I chose is still the same. So how can I tell python to scrape data from 2010-2020? please help me I'm using python 3.
This is my code:
import requests, bs4
url = 'https://id.investing.com/commodities/gold-historical-data'
headers = {"User-Agent":"Mozilla/5.0"}
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')
print(soup)
with open('emasfile.csv','w') as csv:
for row in tables[1].find_all('tr'):
line = ""
for td in row.find_all(['td', 'th']):
line += '"' + td.text + '",'
csv.write(line + '\n')

This page uses JavaScript with AJAX to get data from
https://id.investing.com/instruments/HistoricalDataAjax
It sends POST requests with extra data - start date and end date ("st_date", "end_date")
You can try to use 01/01/2010, 12/31/2020 but I used for-loop to get every year separatelly.
I get all information from DevTool (tab 'Network') in Chrome/Firefox.
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://id.investing.com/instruments/HistoricalDataAjax'
payload = {
"curr_id": "8830",
"smlID": "300004",
"header": "Data+Historis+Emas+Berjangka",
"st_date": "01/30/2020",
"end_date": "12/31/2020",
"interval_sec": "Daily",
"sort_col": "date",
"sort_ord": "DESC",
"action":"historical_data"
}
headers = {
#"Referer": "https://id.investing.com/commodities/gold-historical-data",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
"X-Requested-With": "XMLHttpRequest"
}
fh = open('output.csv', 'w')
csv_writer = csv.writer(fh)
for year in range(2010, 2021):
print('year:', year)
payload["st_date"] = f"01/01/{year}"
payload["end_date"] = f"12/31/{year}"
r = requests.post(url, data=payload, headers=headers)
#print(r.text)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table')
for row in table.find_all('tr')[1:]: # [1:] to skip header
row_data = [item.text for item in row.find_all('td')]
print(row_data)
csv_writer.writerow(row_data)
fh.close()

How do I search within a website using the 'requests' module?

I want to search for different company names on the website. Website link: https://www.firmenwissen.de/index.html
On this website, I want to use the search engine and search companies. Here is the code I am trying to use:
from bs4 import BeautifulSoup as BS
import requests
import re
companylist = ['ABEX Dachdecker Handwerks-GmbH']
url = 'https://www.firmenwissen.de/index.html'
payloads = {
'searchform': 'UFT-8',
'phrase':'ABEX Dachdecker Handwerks-GmbH',
"mainSearchField__button":'submit'
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
html = requests.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'html.parser')
link_list= []
links = soup.findAll('a')
for li in links:
link_list.append(li.get('href'))
print(link_list)
This code should bring me the next page with company information. But unfortunately, it returns only the home page. How can I do this?

Change your initial url you are doing search for. Grab the appropriate hrefs only and add to a set to ensure no duplicates (or alter selector to return only one match if possible); add those items to a final set for looping to ensure only looping required number of links. I have used Session on assumption you will repeat for many companies.
Iterate over the set using selenium to navigate to each company url and extract whatever info you need.
This is an outline.
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver
d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH','SUCHMEISTEREI GmbH']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}
finalLinks = set()
## searches section; gather into set
with requests.Session() as s:
for company in companyList:
payloads = {
'searchform': 'UFT-8',
'phrase':company,
"mainSearchField__button":'submit'
}
html = s.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'lxml')
companyLinks = {baseUrl + item['href'] for item in soup.select("[href*='firmeneintrag/']")}
# print(soup.select_one('.fp-result').text)
finalLinks = finalLinks.union(companyLinks)
for item in finalLinks:
d.get(item)
info = d.find_element_by_css_selector('.yp_abstract_narrow')
address = d.find_element_by_css_selector('.yp_address')
print(info.text, address.text)
d.quit()
Just the first links:
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver
d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH','SUCHMEISTEREI GmbH', 'aktive Stuttgarter']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}
finalLinks = []
## searches section; add to list
with requests.Session() as s:
for company in companyList:
payloads = {
'searchform': 'UFT-8',
'phrase':company,
"mainSearchField__button":'submit'
}
html = s.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'lxml')
companyLink = baseUrl + soup.select_one("[href*='firmeneintrag/']")['href']
finalLinks.append(companyLink)
for item in set(finalLinks):
d.get(item)
info = d.find_element_by_css_selector('.yp_abstract_narrow')
address = d.find_element_by_css_selector('.yp_address')
print(info.text, address.text)
d.quit()

How to crawl ASP.NET web applications

I'm writing an application which needs to communicate with a ASP.NET website that doesn't have an API.
I'm using Python 3.5.2 with Requests 2.18.4 to achieve my purpose.
The problem is the site is using _dopostback() so I achieved my goal using Selenium but it was slow and it opened a browser for every request that I wanted to make.
First I analyzed the POST data sent to the website for login using BurpSuite. The contained some hidden fields from the webpage HTML and the username, password, captcha and a radio button.
A successful login attempt will return 302 HTTP response code and redirect the browser to the profile page.
I extract the hidden fields and get the captcha image and enter it manually myself and POST the data to the website but it returns 200 response telling an error occurred.
This is the Code:
import re
import requests
from urllib.request import urlretrieve
from collections import OrderedDict
from bs4 import BeautifulSoup as BS
from PIL import Image
def get_and_show_captcha(session):
soup = BS(current_response.text, "html.parser")
image_url = soup.find("img", {"id": 'RadCaptcha1_CaptchaImageUP'})["src"]
Image.open(urlretrieve(base_url + image_url)[0]).show()
base_url = 'http://example.com/'
section_login = 'login.aspx'
proxies = {
'http': 'http://127.0.0.1:8080',
}
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
'DNT': '1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Upgrade-Insecure-Requests': '1',
}
session = requests.Session()
session.headers.update(headers)
r = session.get('http://example.com/login.aspx', proxies=proxies)
login_post_dict = OrderedDict()
login_post_dict['__EVENTTARGET'] = ''
login_post_dict['__EVENTARGUMENT'] = ''
login_post_dict['__VIEWSTATE'] = ''
login_post_dict['__VIEWSTATEGENERATOR'] = ''
login_post_dict['__PREVIOUSPAGE'] = ''
login_post_dict['__EVENTVALIDATION'] = ''
login_post_dict['txtUserName'] = 'USERNAME'
login_post_dict['txtPassword'] = 'PASSWORD'
get_and_show_captcha(session=session)
capthca = input('CAPTCHA: ')
login_post_dict['rcTextBox1'] = capthca
login_post_dict['RadCaptcha1_ClientState'] = ''
login_post_dict['SOMERADIO'] = 'RADIOBUTTON'
login_post_dict['btn_enter'] = 'ENTER'
login_post_dict['txtUserName2'] = ''
login_post_dict['txt_email'] = ''
soup = BS(r.text, 'html.parser')
capture_id = re.compile("id=\"(\w*)\"")
capture_value = re.compile("value=\"(.*)\"")
for html_input in soup.find_all('input', attrs={'type': 'hidden'}):
for hidden_id in capture_id.findall(str(html_input)):
if hidden_id != r"RadCaptcha1_ClientState":
login_post_dict[hidden_id] = ''
for hidden_id_value in capture_value.findall(str(html_input)):
if hidden_id_value is '':
continue
login_post_dict[hidden_id] = hidden_id_value
session.headers.update({'Referer': base_url + section_login})
print(login_post_dict)
session.post(base_url + section_login, data=login_post_dict, proxies=proxies)
I send the script data through BurpSuit in order to see whats being sent exactly.
Any solution?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can't Scrape Dynamically Loaded HTML Table in an Aspx Website - python

Related

Why doesn't the post request return back any data

Python - Scrapping Woocommerce does not bring text from price

How to scrape website tables with varying values depending on the selections

How do I search within a website using the 'requests' module?

How to crawl ASP.NET web applications

Categories

Resources