How to scrape a multi-page website with Python? [duplicate] - python

This question already has answers here:
Web scraping with Python [closed]
(10 answers)
Closed last month.
I need to scrape the following table: https://haexpeditions.com/advice/list-of-mount-everest-climbers/
How can I do it with Python?

The site uses this API to fetch the table data, so you could request it from there.
(I used cloudscraper because it's easier than figuring out the exact set of requests headers needed to avoid a 406 error response. The try...except...print approach, instead of doing tableData = [dict(...) for row in api_req.json()] directly, makes it easier to see what went wrong in case of an error, without raising anything that would break program execution.)
import cloudscraper

api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=1084&target_action=get-all-data&default_sorting=old_first&ninja_table_public_nonce=2491a56a39&chunk_number=0'
api_req = cloudscraper.create_scraper().get(api_url)

# parse the JSON body, but don't let a parsing failure crash the script
try: jData, jdMsg = api_req.json(), f'- {len(api_req.json())} rows from'
except Exception as e: jData, jdMsg = [], f'failed to get data - {e} \nfrom'
print(api_req.status_code, api_req.reason, jdMsg, api_req.url)

# flatten each row's "value" and "options" sub-dictionaries into one flat dict per row
tableData = [dict([(k, v) for k, v in row['value'].items()] + [
    (f'{k}_options', v) for k, v in row['options'].items()
]) for row in jData]
At this point tableData is a list of dictionaries, but you can build a DataFrame from it with pandas and save it to a CSV file with .to_csv:
import pandas

pandas.DataFrame(tableData).set_index('number').to_csv('list_of_mount_everest_climbers.csv')
The API URL can be either copied from the browser network logs or extracted from the script tag containing it in the source HTML of the page.
The shorter way would be to just split the HTML string:
import cloudscraper

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
pg_req = cloudscraper.create_scraper().get(pg_url)

# take everything after "data_request_url":" up to the next quote, then strip the escaping
api_url = pg_req.text.split('"data_request_url":"', 1)[-1].split('"')[0]
api_url = api_url.replace('\\', '')
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\napi_url:', api_url)
However, it's a little risky in case "data_request_url":" appears in any other context in the HTML aside from the one that we want. So, another way would be to parse with bs4 and json.
import cloudscraper
from bs4 import BeautifulSoup
import json

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
sel = 'div.footer.footer-inverse>div.bottom-bar+script[type="text/javascript"]'
api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php...' ## will be updated

pg_req = cloudscraper.create_scraper().get(pg_url)
jScript = BeautifulSoup(pg_req.content, 'html.parser').select_one(sel)
try:
    # the script tag holds "var ... = {...}" - keep only the JSON after the "="
    sjData = json.loads(jScript.get_text().split('=', 1)[-1].strip())
    api_url = sjData['init_config']['data_request_url']
    auMsg = f'\napi_url: {api_url}'
except Exception as e: auMsg = f'\napi_url: failed to extract API URL - {type(e)} {e}'
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, auMsg)
(I would consider the second method more reliable even though it's a bit longer.)

Related

List of all US ZIP Codes using uszipcode

I've been trying to fetch all US zip codes for a web scraping project for my company.
I'm trying to use the uszipcode library to do it automatically rather than scraping them manually from the website I'm interested in, but I can't figure it out.
This is my manual attempt:
from bs4 import BeautifulSoup
import requests

url = 'https://www.unitedstateszipcodes.org'
headers = {'User-Agent': 'Chrome/50.0.2661.102'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
all_zipcodes = []

# Extract all state links
for data in soup.find_all('div', class_='state-list'):
    for a in data.find_all('a'):
        if a is not None:
            hrefs.append(a.get('href'))
hrefs.remove(None)
def get_zipcode_list():
    """
    get_zipcode_list gets the GET response from the web archives server using CDX API
    :return: CDX API output in json format.
    """
    for state in hrefs:
        state_url = url + state
        state_page = requests.get(state_url, headers=headers)
        states_soup = BeautifulSoup(state_page.text, 'html.parser')
        div = states_soup.find(class_='list-group')
        for a in div.findAll('a'):
            if str(a.string).isdigit():
                all_zipcodes.append(a.string)
    return all_zipcodes
This takes a lot of time, and I'd like to know how to do the same thing more efficiently using uszipcode.
You may try searching by the empty pattern '':
from uszipcode import SearchEngine

s = SearchEngine()
l = s.by_pattern('', returns=1000000)
print(len(l))
More details are in the docs and in their basic tutorial.
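If you want the matches as a table rather than a list of objects, a rough sketch combining by_pattern with pandas (relying on the same .to_dict() method the next answer uses; whether the empty pattern really returns every record depends on the uszipcode version):

from uszipcode import SearchEngine
import pandas as pd

s = SearchEngine()
results = s.by_pattern('', returns=1000000)         # empty pattern -> match everything
df = pd.DataFrame([z.to_dict() for z in results])   # one row per zip code record
df.to_csv('all_us_zipcodes.csv', index=False)
print(df.shape)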
from uszipcode import SearchEngine
import pandas as pd

engine = SearchEngine()
allzips = {}
for i in range(100000):  # get zipcode info for every possible 5-digit combination
    zipcode = str(i).zfill(5)
    try: allzips[zipcode] = engine.by_zipcode(zipcode).to_dict()
    except: pass

# Convert dictionary to DataFrame
allzips = pd.DataFrame(allzips).T.reset_index(drop=True)
Since zip codes are only five digits, you can iterate up to 100,000 and keep the ones that don't raise an error. This gives you a DataFrame with all the stored information for each valid zip code.
US zip codes follow the pattern [0-9]{5}(?:-[0-9]{4})?, so you can simply check a candidate with the re module:
import re

regex = r"[0-9]{5}(?:-[0-9]{4})?"
zipcode = "12345-6789"           # example candidate
if re.match(regex, zipcode):     # note: re.match(pattern, string)
    print("match")
else:
    print("not a match")
You can download the list of zip codes from the official source and then parse it, if it's for one-time use and you don't need the other metadata associated with each zip code that uszipcode provides.
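For the one-time-use route, the parsing step could look roughly like this, assuming you have already downloaded the list as a CSV; the filename and the zip column name below are placeholders for whatever the actual download contains:

import csv

zip_codes = []
with open('zip_code_database.csv', newline='') as f:   # placeholder filename
    for row in csv.DictReader(f):
        zip_codes.append(row['zip'])                   # placeholder column name

print(len(zip_codes), 'zip codes loaded')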
uszipcode also has a more comprehensive database, which is quite big and should have all the data you need:
from uszipcode import SearchEngine

zipSearch = SearchEngine(simple_zipcode=False)
allZipCodes = zipSearch.by_pattern('', returns=200000)
print(len(allZipCodes))

Parsing JSON in Python 3, get email from API

I'm trying to write a little program that gets the emails (and other things in the future) from an API. But I'm getting "TypeError: list indices must be integers or slices, not str" and I don't know what to do about it. I've been looking at other questions here, but I still don't get it. I might be a bit slow when it comes to this.
I've also been watching some tutorials and done the same as them, but I'm still getting different errors. I run Python 3.5.
Here is my code:
from urllib.request import urlopen
import json, re

# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)

# This should put the response from the API in a dict
result = r.read().decode('utf-8')
data = json.loads(result)

# This should get all the names from the dict
for name in data['name']:  # TypeError here.
    print(name)
I know that I could regex the text and get the result that I want.
Code for that:
from urllib.request import urlopen
import re

url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
result = r.read().decode('utf-8')
f = re.findall(r'"email": "(\w+\S\w+)', result)
print(f)
But that seems like the wrong way to do this.
Can someone please help me understand what I'm doing wrong here?
data is a list of dicts; that's why you get a TypeError when you index it with a string.
The way to go is something like this:
for item in data:  # item is {"name": "foo", "email": "foo@mail..."}
    print(item['name'])
    print(item['email'])
@PiAreSquared's comment is correct; just a bit more explanation here:
from urllib.request import urlopen
import json, re
# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
# This puts the response from the API into a Python object (here, a list)
result = r.read().decode('utf-8')
data = json.loads(result)

# your data is a list of elements
# and each element is a dict object, so you can loop over the data
# to get each dict element, and then access the keys and values as you wish
# see below for some examples
for element in data:
    name = element['name']
    email = element['email']

# if you want to get all names, you should do
names = [element['name'] for element in data]
# same to get all emails
emails = [element['email'] for element in data]

Unable to extract the table from API using python

I am trying to extract a table using an API, but I am unable to do so. I am pretty sure that I am not using it correctly, and any help would be appreciated.
I am trying to extract a table from this API but can't figure out the right way to do it. This is what is described on the website; I want to extract the Latest_full_data table.
This is my code to get the table, but I am getting an error:
import urllib
import requests
import urllib.request
import json

locu_api = 'api_Key'

def locu_search(query):
    api_key = locu_api
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    response = urllib.request.urlopen(url).read()
    json_obj = str(response, 'utf-8')
    datanew = json.loads(json_obj)
    return datanew
When I do print(datanew), I get the error below. Update: even if I change it to return datanew, the error is still the same.
This is the error I am getting:
name 'datanew' is not defined
I had the same issues with urllib before. If possible, try to use requests; it's a better designed library in my opinion. Also, it can read JSON with a single function, so there is no need to run it through multiple lines. Sample code here:
import requests

api_key = 'api_Key'

def locu_search():
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    return requests.get(url).json()

locu_search()
Edit:
The endpoint that you are calling might not be the correct one. I think you are looking for the following one:
import requests
api_key = 'your_api_key_here'
def locu_search(dataset_code):
    url = f'https://www.quandl.com/api/v3/datasets/WIKI/{dataset_code}/metadata.json?api_key={api_key}'
    req = requests.get(url)
    return req.json()
data = locu_search("FB")
This will return all the metadata for a company, in this case Facebook.
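If you want to look around in what came back before picking fields out of it, you can pretty-print the parsed JSON; this follows on from the snippet above and makes no assumption about the exact layout of the response:

import json

print(list(data.keys()))                  # top-level keys of the response
print(json.dumps(data, indent=2)[:500])   # first part of the payload, pretty-printed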
Maybe it doesn't apply to your specific problem, but what I normally do is the following:
import requests
import json

def get_values(url):
    response = requests.get(url).text
    values = json.loads(response)
    return values
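Used with the dataset-metadata URL from the previous answer, that helper would be called like this (api_key is whatever key you registered with Quandl):

api_key = 'your_api_key_here'
url = f'https://www.quandl.com/api/v3/datasets/WIKI/FB/metadata.json?api_key={api_key}'
metadata = get_values(url)
print(metadata)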

Scraping json content from a site ordered in pages

I'm trying to scrape a site. When I run the following code without region_id=[any number from 1 to 32] I get a [500], but if I set region_id=1 I only get the first page by default (in the URL it is pagina=&); pages go up to 500. Is there a command or parameter for retrieving every page (every possible value of pagina=) while avoiding for loops?
import requests
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre="
resp = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
data = resp.json()
Even without a for loop, you are still going to need iteration. You could do it with recursion or with map as I've done below, but the iteration is still there. This solution has the advantage that everything is a generator, so a URL is only formatted, requested, checked, and converted to JSON when you ask all_data for that page. I added a filter to make sure you get a valid response before trying to extract the JSON. It still makes every request sequentially, but you could replace map with a parallel implementation quite easily (a sketch of that follows the code below).
import requests
from itertools import product, starmap
from functools import partial

def is_valid_resp(resp):
    return resp.status_code == requests.codes.ok

def get_json(resp):
    return resp.json()

# There's a .format hiding on the end of this really long url,
# with {} in appropriate places
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format

regions = range(1, 33)
pages = range(1, 501)

urls = starmap(url, product(regions, pages))
moz_get = partial(requests.get, headers={'User-Agent': 'Mozilla/5.0'})
responses = map(moz_get, urls)
valid_responses = filter(is_valid_resp, responses)
all_data = map(get_json, valid_responses)
# all_data is a generator that will give you each page's json.
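The parallel swap mentioned above can be as small as replacing the built-in map with a thread pool's map; a minimal sketch building on the names defined in the snippet above (the worker count is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:
    # rebuild the url generator; the one above may already be exhausted
    parallel_responses = pool.map(moz_get, starmap(url, product(regions, pages)))
    parallel_data = [get_json(resp) for resp in parallel_responses if is_valid_resp(resp)]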

How to scrape data from JSON/Javascript of web page?

I'm new to Python; I just got started with it today.
My system environment is Python 3.5 with some libraries on Windows 10.
I want to extract football player data from the site below as a CSV file.
Problem: I cannot extract the data from soup.find_all('script')[17] into my expected CSV format. How can I extract that data the way I want?
My code is shown below.
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')  # not sure if I need to use lxml
soup.find_all('script')[17]  # My target data is in the 17th script tag
My expected output would be similar to this
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
As @Josiah Swain said, it's not going to be pretty. For this sort of thing it's usually recommended to use JS, as it can understand what you have.
That said, Python is awesome, and here is your solution!
# Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

# And one more
import json

# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')

# Store the script
script = soup.find_all('script')[17]

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
    .replace('squad.register_players($.parseJSON(\'', '') \
    .replace('\'));', '')

# Extract out that useful info
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]

print('position,slot_position,slug')
for line in data:
    print(','.join(line))
The result I get for copying and pasting this into python is:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:
# ... All that previous code
script = soup.find_all('script')[17]
allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));', '')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        print(player['position'], player['data']['slot_position'], player['data']['slug'])
So my understanding is that BeautifulSoup is better for HTML parsing, but here you are trying to parse JavaScript nested in the HTML.
So you have two options:
1. Simply create a function that takes the result of soup.find_all('script')[17], loops over the string, searches it manually for the data, and extracts it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it even easier (see the small sketch after this list). This may not be the best approach, but if you are new to Python you might want to do it just for practice.
2. Use the json library, as in the answer above. This is probably the better way to do it.
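A tiny illustration of the ast.literal_eval idea from option 1, using a made-up string shaped like the player dicts in the question's expected output:

import ast

# made-up example string; in practice you would slice this out of the script tag's text
raw = "{'position': 'ST', 'data': {'slot_position': 'ST', 'slug': 'paulo-henrique'}}"
player = ast.literal_eval(raw)   # evaluates literals only, unlike eval()
print(player['position'], player['data']['slug'])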
