Web scraping with urllib.request - data is not refreshing - python

I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out of date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd
def url_get_contents(url):
    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()
link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]
while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ', p.tables[0])
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")
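Since the requests go through a proxy, one possible cause (an assumption; nothing in the question confirms it) is that the proxy or another intermediate cache is serving a stored copy of the page. A minimal sketch that asks caches for a fresh copy by sending Cache-Control and Pragma request headers follows. build_no_cache_request is a hypothetical helper name, and whether the proxy honours these headers depends on its configuration:

```python
import urllib.request

# Hypothetical helper, not part of the original code.
def build_no_cache_request(url):
    """Build a Request whose headers ask proxies and caches for a fresh copy."""
    headers = {
        "Cache-Control": "no-cache",  # HTTP/1.1 caches: revalidate with the origin
        "Pragma": "no-cache",         # older HTTP/1.0 caches
    }
    return urllib.request.Request(url=url, headers=headers)

def url_get_contents(url):
    """Fetch the page body as bytes and close the connection afterwards."""
    with urllib.request.urlopen(build_no_cache_request(url)) as f:
        return f.read()
```

If the data is still stale with these headers, the caching is probably happening somewhere that ignores them, and the proxy settings themselves would need checking.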

Related

How to scrape multi page website with python? [duplicate]

This question already has answers here:
Web scraping with Python [closed]
(10 answers)
Closed last month.
I need to scrape the following table: https://haexpeditions.com/advice/list-of-mount-everest-climbers/
How to do it with python?
The site uses this API to fetch the table data, so you could request it from there.
(I used cloudscraper because it is easier than working out the right set of request headers to avoid a 406 error response. The try...except...print approach, instead of building tableData = [dict(...) for row in api_req.json()] directly, shows what went wrong if the request fails, without raising an error that would stop the program.)
import cloudscraper
api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=1084&target_action=get-all-data&default_sorting=old_first&ninja_table_public_nonce=2491a56a39&chunk_number=0'
api_req = cloudscraper.create_scraper().get(api_url)
try:
    # Parse the JSON body once and report how many rows came back
    jData = api_req.json()
    jdMsg = f'- {len(jData)} rows from'
except Exception as e:
    jData, jdMsg = [], f'failed to get data - {e} \nfrom'
print(api_req.status_code, api_req.reason, jdMsg, api_req.url)
# Flatten each row: keep the 'value' items and add the 'options' items with an '_options' suffix
tableData = [dict([(k, v) for k, v in row['value'].items()] + [
    (f'{k}_options', v) for k, v in row['options'].items()
]) for row in jData]
At this point tableData is a list of dictionaries but you can build a DataFrame from it with pandas and save it to a CSV file with .to_csv.
import pandas
pandas.DataFrame(tableData).set_index('number').to_csv('list_of_mount_everest_climbers.csv')
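To illustrate the flattening step without calling the API, here is a small sketch on a made-up row. The keys and values shown are hypothetical; only the value/options layout is taken from the code above, and flatten_row is an illustrative name:

```python
# Made-up sample in the shape the answer's code expects:
# each row has a 'value' dict and an 'options' dict.
sample_rows = [
    {"value": {"number": "1", "name": "Example Climber"},
     "options": {"color": "red"}},
]

def flatten_row(row):
    """Merge the 'value' items with the 'options' items (option keys get a suffix)."""
    flat = dict(row["value"])
    flat.update({f"{k}_options": v for k, v in row["options"].items()})
    return flat

table_data = [flatten_row(row) for row in sample_rows]
print(table_data)
```

Each resulting dictionary becomes one DataFrame row, with the '_options' keys as extra columns.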
The API URL can be either copied from the browser network logs or extracted from the script tag containing it in the source HTML of the page.
The shorter way would be to just split the HTML string:
import cloudscraper
pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
pg_req = cloudscraper.create_scraper().get(pg_url)
# Take the text between '"data_request_url":"' and the next double quote
api_url = pg_req.text.split('"data_request_url":"', 1)[-1].split('"')[0]
# Drop the backslashes that escape the slashes inside the embedded JSON
api_url = api_url.replace('\\', '')
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\napi_url:', api_url)
However, it's a little risky in case "data_request_url":" appears in any other context in the HTML aside from the one that we want. So, another way would be to parse with bs4 and json.
import json
import cloudscraper
from bs4 import BeautifulSoup
pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
sel = 'div.footer.footer-inverse>div.bottom-bar+script[type="text/javascript"]'
api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php...' ## will be updated
pg_req = cloudscraper.create_scraper().get(pg_url)
jScript = BeautifulSoup(pg_req.content).select_one(sel)
try:
    # The script body looks like "<name> = {...}": parse the part after the first '='
    sjData = json.loads(jScript.get_text().split('=',1)[-1].strip())
    api_url = sjData['init_config']['data_request_url']
    auMsg = f'\napi_url: {api_url}'
except Exception as e:
    auMsg = f'\nfailed to extract API URL - {type(e)} {e}'
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, auMsg)
(I would consider the second method more reliable even though it's a bit longer.)
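The parsing step of the second method can be tried offline on a made-up script body. The variable name example_config and the example.com URL below are hypothetical; the init_config and data_request_url keys are the ones used in the code above. The sketch also shows that json.loads turns the escaped \/ sequences back into plain slashes, so no replace call is needed:

```python
import json

# Made-up script text in the "<name> = {json}" shape the answer assumes.
script_text = r'window.example_config = {"init_config": {"data_request_url": "https:\/\/example.com\/api?action=x&id=1"}}'

def extract_api_url(text):
    """Split on the first '=' only, so '=' characters inside the URL are preserved."""
    payload = text.split("=", 1)[-1].strip()
    return json.loads(payload)["init_config"]["data_request_url"]

print(extract_api_url(script_text))
```

Passing 1 as the maxsplit argument matters here because the URL itself contains '=' characters.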

How to get a link on a website using Python that updates dynamically?

I am trying to download the most recent zip file from the ERCOT website (https://www.ercot.com/mp/data-products/compliance-and-disclosure/?id=NP3-965-ER). However, the link to the zip file has a doclookup id that changes every time, and the id is populated dynamically. I have tried using BeautifulSoup to get the link, but because the page content is loaded dynamically it does not return any links. Any feedback or solutions will be appreciated.
Using the exposed API:
import json
import pandas as pd
import pendulum
import requests
def get_document_id(type_id: int) -> int:
    url = (
        "https://www.ercot.com/misapp/servlets/IceDocListJsonWS?"
        f"reportTypeId={type_id}&"
        f"_={pendulum.now().format('X')}"
    )
    with requests.Session() as request:
        response = request.get(url, timeout=10)
    # Raise an exception on an HTTP error status instead of continuing
    response.raise_for_status()
    data = json.loads(response.text)
    # Flatten the document list and take the DocID of the first entry
    return pd.json_normalize(data=data["ListDocsByRptTypeRes"], record_path="DocumentList").head(1)["Document.DocID"].squeeze()
id_number = get_document_id(13052)
print(id_number)
869234127
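For readers without pandas, the same lookup can be sketched with plain dictionaries. The payload below is a made-up sample: its nesting (ListDocsByRptTypeRes, then DocumentList, then Document, then DocID) is inferred from the json_normalize call above, the second DocID is invented, and first_doc_id is a hypothetical helper:

```python
import json

# Made-up sample payload; the real response may contain more fields.
sample_payload = json.loads("""
{"ListDocsByRptTypeRes": {"DocumentList": [
    {"Document": {"DocID": "869234127"}},
    {"Document": {"DocID": "869000000"}}
]}}
""")

def first_doc_id(payload):
    """Return the DocID of the first listed document (mirrors head(1) above)."""
    docs = payload["ListDocsByRptTypeRes"]["DocumentList"]
    return docs[0]["Document"]["DocID"]

print(first_doc_id(sample_payload))
```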

How to print selected text from JSON file using Python

I'm new to python and have undertaken my first project to automate something for my role (I'm in the network space, so forgive me if this is terrible!).
I'm required to download a .json file from the below link:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
My script goes through and retrieves the manual download link.
The reason I'm getting the URL in this way, is that the download link changes every fortnight when MS update the file.
My preference is to extract the "addressPrefixes" contents from the names of "AzureCloud.australiacentral", "AzureCloud.australiacentral2", "AzureCloud.australiaeast", "AzureCloud.australiasoutheast".
I then want to strip out the " and , characters.
Each of the subnet ranges should then reside on a new line and be placed in a text file.
If I perform the below, I'm able to get the output that I am wanting.
Am I correct in thinking that I can use a for loop to achieve this? If so, would it be better to use a Python dictionary as opposed to using JSON formatted output?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Script to check Azure IPs
# Import Modules for script
import requests
import re
import json
import urllib.request
search = r'https://download.*?\.json'
ms_dl_centre = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
requests_get = requests.get(ms_dl_centre)
json_url_search = re.search(search, requests_get.text)
json_file = json_url_search.group(0)
with urllib.request.urlopen(json_file) as url:
    contents = json.loads(url.read().decode())
    print(json.dumps(contents['values'][1]['properties']['addressPrefixes'], indent = 0)) #use this to print contents from json entry 1
I'm not convinced that using re to parse HTML is a good idea. BeautifulSoup is more suited to the task. Upon inspection of the HTML response I note that there's a span element of class file-link-view1 that seems to uniquely identify the URL to the JSON download. Assuming that to be a robust approach (i.e. Microsoft don't change the way the download URL is presented) then this is how I'd do it:-
import requests
from bs4 import BeautifulSoup
namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]
baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'
with requests.Session() as session:
    response = session.get(baseurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # The span with class file-link-view1 wraps the link to the JSON download
    downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
    response = session.get(downloadurl)
    response.raise_for_status()
    data = response.json()
    for n in data['values']:
        if n['name'] in namelist:
            print(n['name'])
            for ap in n['properties']['addressPrefixes']:
                print(ap)
@andyknight, thanks for your direction. I'd upvote you, but as I'm a noob the site doesn't permit me to do so.
I've taken your Python script as a basis and added some additional components.
I removed the print statement for the region name, as the .txt file is referenced by a firewall, which is looking for IP addresses only.
I've added try/except/else to part of the script to identify whether there is ever an error reaching the URL, or some other unspecified error. I've used logging to send an email based on the status of the script: if an exception is thrown I get an email with traceback information, otherwise I receive an email advising that the script was successful.
This writes out the specific prefixes for the AU regions into a .txt file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import logging
import logging.handlers
from bs4 import BeautifulSoup
smtp_handler = logging.handlers.SMTPHandler(mailhost=("sanitised.smtp[.]xyz", 25),
                                            fromaddr="UpdateIPs@sanitised[.]xyz",
                                            toaddrs="FriendlyAdmin@sanitised[.]xyz",
                                            subject=u"Check Azure IP Script completion status.")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.addHandler(smtp_handler)
namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]
baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'
with requests.Session() as session:
    response = session.get(baseurl)
    try:
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
        response = session.get(downloadurl)
        response.raise_for_status()
        json = response.json()
        for n in json['values']:
            if n['name'] in namelist:
                for ap in n['properties']['addressPrefixes']:
                    with open('Check_Azure_IPs.txt', 'a') as file:
                        file.write(ap + "\n")
    except requests.exceptions.HTTPError as e:
        logger.exception(
            "URL is no longer valid, please check the URL that's defined in this script with MS, as this may have changed.\n\n")
    except Exception as e:
        logger.exception("Unknown error has occurred, please review script")
    else:
        logger.info("Script has run successfully! Azure IPs have been updated.")
Please let me know if you think there is a better way to handle this, otherwise this is marked as answered. I appreciate your help greatly!
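One possible refinement, offered as a sketch rather than a tested change to the script above: because the file is opened in append mode inside the loop, each run adds to whatever the file already holds, so repeated runs may accumulate duplicate prefixes. Collecting the prefixes first and writing the file once in 'w' mode avoids that. The sample data and the helper names collect_prefixes and write_prefixes are hypothetical; only the values/name/properties/addressPrefixes layout comes from the code above:

```python
import os
import tempfile

NAMELIST = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

def collect_prefixes(data, names):
    """Return the address prefixes of every entry whose name is in names."""
    wanted = set(names)
    prefixes = []
    for entry in data["values"]:
        if entry["name"] in wanted:
            prefixes.extend(entry["properties"]["addressPrefixes"])
    return prefixes

def write_prefixes(path, prefixes):
    """Overwrite the file with one prefix per line."""
    with open(path, "w") as fh:
        fh.write("".join(p + "\n" for p in prefixes))

# Made-up sample using documentation-only address ranges.
sample = {"values": [
    {"name": "AzureCloud.australiaeast",
     "properties": {"addressPrefixes": ["192.0.2.0/24", "198.51.100.0/24"]}},
    {"name": "Some.other.region",
     "properties": {"addressPrefixes": ["203.0.113.0/24"]}},
]}

out_path = os.path.join(tempfile.mkdtemp(), "Check_Azure_IPs.txt")
write_prefixes(out_path, collect_prefixes(sample, NAMELIST))
write_prefixes(out_path, collect_prefixes(sample, NAMELIST))  # a second run overwrites
```

In the real script, the dictionary returned by response.json() would take the place of sample.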

Can I execute python code on a list simultaneous instead of sequential?

First of all, thank you for taking the time to read this post. I should say up front that I'm very new to programming and am looking for advice on solving a problem.
I'm trying to create a script that checks whether the content of an HTML page has changed, so that I can monitor certain website pages. I found a script and altered it to go through a list of URLs, checking whether each page has changed. The problem is that it checks the pages sequentially: it goes through the list one URL at a time, while I want the URLs checked in parallel. I'm also using a while loop to keep checking, because a page still has to be monitored after a change has been detected. Rather than explain further, here is the code:
import requests
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
url = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
i = 0
response = urlopen(url[i]).read()
currentHash = hashlib.sha224(response).hexdigest()
while True:
    try:
        response = urlopen(url[i]).read()
        currentHash = hashlib.sha224(response).hexdigest()
        print('checking')
        time.sleep(10)
        response = urlopen(url[i]).read()
        newHash = hashlib.sha224(response).hexdigest()
        i += 1
        if newHash == currentHash:
            continue
        else:
            print('Change detected')
            print(url[i])
            time.sleep(10)
            continue
    except Exception as e:
        i = 0
        print('resetting increment')
        continue
What you want to do is called multi-threading.
Conceptually this is how it works:
import hashlib
import time
from urllib.request import urlopen
import threading
# Define a function for the thread
def f(url):
    initialHash = None
    while True:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        # Remember the hash from the first fetch and compare later fetches to it
        if not initialHash:
            initialHash = currentHash
        if currentHash != initialHash:
            print('Change detected')
            print(url)
        time.sleep(10)
        continue
    return
# Create one thread per URL
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,))
    t.start()
Running example of the OP's code using a thread pool executor
Code
import concurrent.futures
import time
import hashlib
from urllib.request import urlopen
def check_change(url):
    '''
    Checks for a change in web page contents by comparing current to previous hash
    '''
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(10)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash != currentHash:
            return "Change to:", url
        else:
            return None
    except Exception as e:
        return "Error", e, url
page_urls = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
while True:
    # Using ThreadPoolExecutor as a context manager ensures threads are cleaned up properly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            # Output result of each thread upon its completion
            url = future_to_url[future]
            try:
                status = future.result()
                if status:
                    print(*status)
                else:
                    print(f'No change to: {url}')
            except Exception as exc:
                print('Site %r generated an exception: %s' % (url, exc))
    time.sleep(10) # Wait 10 seconds before rechecking sites
Output
Change to: https://www.google.com
Change to: https://www.google.be
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
...
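The executor version above compares two fetches taken 10 seconds apart within one round, so a change that happens between rounds is never compared. A sketch of one alternative, assuming it is acceptable to keep the last hash per URL in a dictionary, follows. The fetch function is passed in as a parameter so the example runs offline with a fake fetcher; check_round, page_hash and fake_fetch are hypothetical names:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def page_hash(content: bytes) -> str:
    return hashlib.sha224(content).hexdigest()

def check_round(urls, fetch, last_hashes):
    """Fetch all URLs in parallel; return those whose hash differs from the previous round."""
    with ThreadPoolExecutor(max_workers=5) as executor:
        contents = list(executor.map(fetch, urls))  # results come back in the order of urls
    changed = []
    for url, content in zip(urls, contents):
        new_hash = page_hash(content)
        old_hash = last_hashes.get(url)
        if old_hash is not None and old_hash != new_hash:
            changed.append(url)
        last_hashes[url] = new_hash  # remember for the next round
    return changed

# Offline demonstration with a fake fetcher instead of urlopen
pages = {"https://a.example": b"v1", "https://b.example": b"v1"}

def fake_fetch(url):
    return pages[url]

hashes = {}
first = check_round(list(pages), fake_fetch, hashes)   # first round: nothing to compare yet
pages["https://b.example"] = b"v2"                      # simulate a page change
second = check_round(list(pages), fake_fetch, hashes)
print(first, second)
```

In real use, fetch could be a function that returns urlopen(url).read(), with a time.sleep between calls to check_round. Pages that embed timestamps or tokens will still report a change on every round, as the output above suggests.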

Unable to extract the table from API using python

I am trying to extract a table using an API but I am unable to do so. I am fairly sure that I am not using it correctly, and any help would be appreciated.
This is what is mentioned on the website. I want to extract the Latest_full_data table.
This is my code to get the table, but I am getting an error:
import urllib
import requests
import urllib.request
locu_api = 'api_Key'
def locu_search(query):
    api_key = locu_api
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    response = urllib.request.urlopen(url).read()
    json_obj = str(response, 'utf-8')
    datanew = json.loads(json_obj)
    return datanew
When I run print(datanew), I get the error below. Update: even if I change the last line to return datanew, the error is still the same.
name 'datanew' is not defined
I had the same issues with urllib before. If possible, try to use requests; in my opinion it is a better designed library. It can also parse JSON with a single method call, so there is no need to do it over several lines. Sample code here:
import requests
api_key = 'api_Key'
def locu_search():
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    return requests.get(url).json()
locu_search()
Edit:
The endpoint that you are calling might not be the correct one. I think you are looking for the following one:
import requests
api_key = 'your_api_key_here'
def locu_search(dataset_code):
    url = f'https://www.quandl.com/api/v3/datasets/WIKI/{dataset_code}/metadata.json?api_key={api_key}'
    req = requests.get(url)
    return req.json()
data = locu_search("FB")
This will return all the metadata for a company, in this case Facebook.
Maybe it doesn't apply to your specific problem, but what I normally do is the following:
import json
import requests
def get_values(url):
    response = requests.get(url).text
    values = json.loads(response)
    return values
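The NameError in the question most likely comes from variable scope rather than from urllib: datanew is a local variable inside locu_search, so outside the function it only exists if the return value is assigned to a name. A minimal offline sketch, where parse_response and the sample JSON bytes are hypothetical stand-ins for the real API call:

```python
import json

def parse_response(raw_bytes):
    """Decode and parse a JSON body; datanew exists only inside this function."""
    datanew = json.loads(str(raw_bytes, "utf-8"))
    return datanew

# Assign the return value to a name in the calling scope
result = parse_response(b'{"database": {"name": "example"}}')
print(result)

# Referring to the function's local name from outside raises NameError
try:
    print(datanew)
except NameError as err:
    print("NameError:", err)
```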
