Python Web Scraping not working - python

I'm new to Python and am trying to do some web scraping. I'm trying to get things like Deck Name, Username, Elixir Cost, and Card from a website about the game Clash Royale. I am taking the data and then sending it into a folder called "Data" in my project directory. The files are being created fine but I am getting empty brackets [] in each .json file. I don't know what I am doing wrong. Any help would be greatly appreciated. Thanks! Code is below:
from bs4 import BeautifulSoup
import requests
import uuid
import json
import os.path
from multiprocessing.dummy import Pool as Threadpool
def getdata(url):
    save_path=r'/Users/crazy4byu/PycharmProjects/Final/Data'
    clashlist=[]
    html = requests.get(url).text
    soup = BeautifulSoup(html,'html5lib')
    clash = soup.find_all('div',{'class':'row result'})
    for clashr in clash:
        clashlist.append(
            {
                'Deck Name':clashr.find('a').text,
                'User':clashr.find('td',{'class':'user center'}).text,
                'Elixir Cost':clashr.find('span',{'class':'elixir_cost'}).text,
                'Card':clashr.find('span',{'class':None}).text
            }
        )
    decks = soup.find_all('div',{'class':' row result'})
    for deck in decks:
        clashlist.append(
            {
                'Deck Name':clashr.find('a').text,
                'User':clashr.find('td',{'class':'user center'}).text,
                'Elixir Cost':clashr.find('span',{'class':'elixir_cost'}).text,
                'Card':clashr.find('span',{'class':None}).text
            }
        )
    with open(os.path.join(save_path,'data_'+str(uuid.uuid1())+'.json'),'w') as outfile:
        json.dump(clashlist,outfile)
if '__main__' == __name__:
    urls=[]
    urls.append(r'http://clashroyaledeckbuilder.com/clashroyale/deckViewer/highestRated')
    for i in range(20,990,10):
        urls.append(r'http://clashroyaledeckbuilder.com/clashroyale/deckViewer/highestRated'+str(i))
    pool = Threadpool(25)
    pool.map(getdata, urls)
    pool.close()
    pool.join()

Related

Web scraping with urllib.request - data is not refreshing

I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out of date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd
def url_get_contents(url):
    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()
link='https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page=p.tables[0]
while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ',p.tables[0] )
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")

Python BS4 unwrap() scraped xml data

I'm a journalist working on a project that uses web scraping to pull data from the county jail site. I'm still teaching myself Python and am trying to get a list of charges and the bail that was assigned for each charge. The site uses XML, and I've been able to pull the data for charges and bail and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in a few places and can't figure out how to use it. I'd really like to do this in the code rather than run a find and replace in the spreadsheet.
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
url="https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)
if response.status_code == 200:
    print("Connecting to jail website:")
    print("Connected - Response code:", response)
    print("Scraping Started at ", datetime.now())
    soup = BeautifulSoup(xml.content, 'lxml')
    charges = soup.find_all('ol')
    bail_amt = soup.find_all('ob')
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile, delimiter=',')
        chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])
CSV File
"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...
There is no need to use the unwrap() function; you just need to access the text within an element. I suggest you search on <of>, which encloses both the <ol> and <ob> entries. Doing this keeps your lists of ol and ob entries from getting out of sync, as not all entries have an ob.
Try the following:
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)
if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')
        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
            chargesbail.writerow([of.ol.text, ob])
Which would give you an output CSV file starting:
BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
The code of.ob.text is shorthand for: from the of, find the first ob entry and then return the text contained inside it. In other words:
of.find('ob').get_text()
To only write rows when both are present, you could change it to:
for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])
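The idea of walking each of element and reading its ol and ob children together can be tried without touching the network. The snippet below is only a sketch, not the code from the answer: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, and the XML fragment, the roster root tag, and the helper name charge_rows are made up for the example. Only the of, ol and ob tag names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the roster: each <of> holds one <ol>
# (charge) and, only sometimes, one <ob> (bail).
SAMPLE_XML = """
<roster>
  <of><ol>BREAKING AND OR ENTERING (F)</ol></of>
  <of><ol>POSS STOLEN GOODS/PROP (F)</ol><ob>5000</ob></of>
  <of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>
</roster>
"""

def charge_rows(xml_text, require_bail=False):
    """Return [charge, bail] pairs, one per <of> element."""
    rows = []
    for offence in ET.fromstring(xml_text).iter("of"):
        # findtext() returns None when the child is missing, hence the "or".
        charge = (offence.findtext("ol") or "").strip()
        bail = (offence.findtext("ob") or "").strip()
        if require_bail and not bail:
            continue  # skip charges that have no bail amount
        rows.append([charge, bail])
    return rows

rows = charge_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Because the charge and the bail are read from the same parent, a missing ob simply becomes an empty string and cannot shift later rows out of step. Passing require_bail=True mirrors the variant that writes a row only when both values are present.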

How to print selected text from JSON file using Python

I'm new to Python and have undertaken my first project to automate something for my role (I'm in the network space, so forgive me if this is terrible!).
I'm required to download a .json file from the link below:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
My script goes through and retrieves the manual download link.
The reason I'm getting the URL in this way, is that the download link changes every fortnight when MS update the file.
My preference is to extract the "addressPrefixes" contents from the names of "AzureCloud.australiacentral", "AzureCloud.australiacentral2", "AzureCloud.australiaeast", "AzureCloud.australiasoutheast".
I'm then wanting to strip out characters of " & ','.
Each of the subnet ranges should then reside on a new line and be placed in a text file.
If I perform the below, I'm able to get the output that I am wanting.
Am I correct in thinking that I can use a for loop to achieve this? If so, would it be better to use a Python dictionary as opposed to using JSON formatted output?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Script to check Azure IPs
# Import Modules for script
import requests
import re
import json
import urllib.request
search = r'https://download.*?\.json'
ms_dl_centre = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
requests_get = requests.get(ms_dl_centre)
json_url_search = re.search(search, requests_get.text)
json_file = json_url_search.group(0)
with urllib.request.urlopen(json_file) as url:
    contents = json.loads(url.read().decode())
print(json.dumps(contents['values'][1]['properties']['addressPrefixes'], indent = 0)) #use this to print contents from json entry 1
I'm not convinced that using re to parse HTML is a good idea. BeautifulSoup is more suited to the task. Upon inspection of the HTML response I note that there's a span element of class file-link-view1 that seems to uniquely identify the URL to the JSON download. Assuming that to be a robust approach (i.e. Microsoft don't change the way the download URL is presented) then this is how I'd do it:-
import requests
from bs4 import BeautifulSoup
namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]
baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'
with requests.Session() as session:
    response = session.get(baseurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
    response = session.get(downloadurl)
    response.raise_for_status()
    json = response.json()
    for n in json['values']:
        if n['name'] in namelist:
            print(n['name'])
            for ap in n['properties']['addressPrefixes']:
                print(ap)
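On the question of a for loop versus a dictionary: once the JSON has been parsed it already is a structure of plain Python dicts and lists, so a loop over the 'values' list is enough. The sketch below shows that on a tiny, made-up stand-in for the downloaded file. The sample data, the documentation-only address ranges, and the helper name prefixes_for are assumptions for illustration; only the key names come from the code above.

```python
# Hypothetical, trimmed-down stand-in for the downloaded file; the real one
# has many more entries. The address ranges are documentation-only examples.
sample = {
    "values": [
        {"name": "AzureCloud.australiaeast",
         "properties": {"addressPrefixes": ["192.0.2.0/24", "198.51.100.0/24"]}},
        {"name": "AzureCloud.westeurope",
         "properties": {"addressPrefixes": ["203.0.113.0/24"]}},
        {"name": "AzureCloud.australiacentral",
         "properties": {"addressPrefixes": ["2001:db8::/32"]}},
    ]
}

def prefixes_for(data, names):
    """Collect the address prefixes of every entry whose name is wanted."""
    wanted = set(names)
    found = []
    for entry in data["values"]:
        if entry["name"] in wanted:
            found.extend(entry["properties"]["addressPrefixes"])
    return found

prefixes = prefixes_for(
    sample, ["AzureCloud.australiacentral", "AzureCloud.australiaeast"]
)
# One range per line, with no quotes or commas, ready to write to a text file.
text = "\n".join(prefixes) + "\n"
print(text, end="")
```

Joining the list with newlines produces the one-range-per-line layout directly, so there is no need to strip quotes or commas out of json.dumps output afterwards.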
@andyknight, thanks for your direction. I'd upvote you, but as I'm a noob, it doesn't permit me to do so.
I've taken the basis of your Python script and added some additional components.
I removed the print statement for the region name in the .txt file, as this file is referenced by a firewall, which is looking for IP addresses.
I've added try/except/else to a portion of the script to identify whether there is ever an error reaching the URL, or some other unspecified error. I've leveraged logging to send an email based on the status of the script. If an exception is thrown, I get an email with traceback information; otherwise I receive an email advising that the script was successful.
This writes out the specific prefixes for AU regions into a .txt file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import logging
import logging.handlers
from bs4 import BeautifulSoup
smtp_handler = logging.handlers.SMTPHandler(mailhost=("sanitised.smtp[.]xyz", 25),
                                            fromaddr="UpdateIPs@sanitised[.]xyz",
                                            toaddrs="FriendlyAdmin@sanitised[.]xyz",
                                            subject=u"Check Azure IP Script completion status.")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.addHandler(smtp_handler)
namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]
baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'
with requests.Session() as session:
    response = session.get(baseurl)
    try:
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
        response = session.get(downloadurl)
        response.raise_for_status()
        json = response.json()
        for n in json['values']:
            if n['name'] in namelist:
                for ap in n['properties']['addressPrefixes']:
                    with open('Check_Azure_IPs.txt', 'a') as file:
                        file.write(ap + "\n")
    except requests.exceptions.HTTPError as e:
        logger.exception(
            "URL is no longer valid, please check the URL that's defined in this script with MS, as this may have changed.\n\n")
    except Exception as e:
        logger.exception("Unknown error has occurred, please review script")
    else:
        logger.info("Script has run successfully! Azure IPs have been updated.")
Please let me know if you think there is a better way to handle this, otherwise this is marked as answered. I appreciate your help greatly!

start_requests on scrapy.Spider does not seem to work

I hope you can give me some hints with my problem here.
I'm trying to obtain ordered data from a txt source. The code works fine up to the point where I print the data from the txt source, so the file is being read. But once I start a loop that reads each line from the txt file and crawls it, and I "print(origdato)" to check whether it is working, nothing is printed.
Maybe it is the loop, maybe it is the request from the spider; I really don't know.
Could you please help me?
Here is the code:
# packages
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector
import json
import datetime
# scraper class
class myfile(scrapy.Spider):
    # scraper name
    name= 'whatever'
    base_url = 'https://www.whatever.com/'
    headers = {'...'
    }
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
    }
    current_page = 2
    origdatos= []
    def __init__(self):
        content = ''
        with open('origdatos.txt', 'r') as f:
            for line in f.read():
                content += line
        # parse content
        self.origdatos= content.split('\n')
        # print(self.origdatos) // Works fine up to here
        # crawler
        def start_requests(self):
            self.current_page = 2
            # loop over datos
            for origdato in self.origdatos:
                print(origdato) # This print does not show any data, so it appears the loop does not work properly, maybe
# driver
if __name__ == '__main__':
    # run scraper
    process = CrawlerProcess()
    process.crawl(myfile)
    process.start()
Maybe this is a formatting issue with your code, if it is formatted as displayed in your question. Try unindenting the start_requests method in your code and see if that fixes the problem.
The following should work as well:
import scrapy
from scrapy.crawler import CrawlerProcess
class myfile(scrapy.Spider):
    name = 'whatever'
    def __init__(self):
        with open('origdatos.txt', 'r') as f:
            self.origdatos = f.readlines()
    def start_requests(self):
        for origdato in self.origdatos:
            print(origdato)
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(myfile)
    process.start()
However, this will still produce an error at the end of execution, because start_requests is supposed to return an iterable.
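To illustrate that last point without requiring Scrapy: a method that only prints returns None, while one that uses yield returns a generator, which is an iterable. In the sketch below, load_lines, build_urls and the base URL are hypothetical names, and io.StringIO stands in for the text file. In a real spider the yield line would produce scrapy.Request objects rather than plain strings.

```python
import io

def load_lines(handle):
    """Return the non-empty lines of a text file, without trailing newlines."""
    return [line.strip() for line in handle if line.strip()]

def build_urls(items, base_url="https://www.example.com/"):
    """Generator: yields one URL string per item."""
    for item in items:
        # In a spider this line would yield a scrapy.Request for the URL.
        yield base_url + item

# io.StringIO stands in for open('origdatos.txt') so the sketch is self-contained.
fake_file = io.StringIO("alpha\nbeta\n\ngamma\n")
origdatos = load_lines(fake_file)
urls = list(build_urls(origdatos))

for url in urls:
    print(url)
```

Stripping each line also removes the trailing newline that readlines() leaves on every entry, and skipping blank lines avoids building a request for an empty value.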

Webscrape multithread python 3

I have been writing a simple web scraping program to learn how to code. I made it work, but I wanted to see how to make it faster. How could I add multi-threading to this program? All the program does is open the stock symbols file and search online for the price of each stock.
Here is my code
import urllib.request
import urllib
from threading import Thread
symbolsfile = open("Stocklist.txt")
symbolslist = symbolsfile.read()
thesymbolslist = symbolslist.split("\n")
i=0
while i<len (thesymbolslist):
    theurl = "http://www.google.com/finance/getprices?q=" + thesymbolslist[i] + "&i=10&p=25m&f=c"
    thepage = urllib.request.urlopen(theurl)
    # read the correct character encoding from the `Content-Type` response header
    charset_encoding = thepage.info().get_content_charset()
    # apply encoding
    thepage = thepage.read().decode(charset_encoding)
    print(thesymbolslist[i] + " price is " + thepage.split()[len(thepage.split())-1])
    i= i+1
If you are just applying a function to each item in a list, I recommend multiprocessing.Pool.map(function, list).
https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20map#multiprocessing.pool.Pool.map
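A minimal sketch of that suggestion, under some assumptions: fetch_price is a hypothetical stand-in that returns a placeholder string instead of calling the network, the symbols are made up, and the thread-backed ThreadPool from multiprocessing.pool is used rather than the process-based Pool, since waiting on HTTP responses is I/O-bound work.

```python
from multiprocessing.pool import ThreadPool
import time

def fetch_price(symbol):
    """Hypothetical stand-in for the download-and-parse step of one symbol."""
    time.sleep(0.05)  # pretend to wait on the network
    return symbol + " price is unknown (placeholder)"

symbols = ["AAA", "BBB", "CCC", "DDD"]  # made-up symbols

# map() calls fetch_price once per symbol across 4 worker threads and
# returns the results in the same order as the input list.
with ThreadPool(4) as pool:
    results = pool.map(fetch_price, symbols)

for line in results:
    print(line)
```

In the original program, the body of the while loop would move into a function like fetch_price, and the list of symbols read from the file would be passed to map.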
You can use asyncio. It's quite a neat package that can also help you with scraping. I have created a small snippet showing how to integrate with LinkedIn using asyncio, but you can adapt it to your needs quite easily.
import asyncio
import requests
def scrape_first_site():
    url = 'http://example.com/'
    response = requests.get(url)
def scrape_another_site():
    url = 'http://example.com/other/'
    response = requests.get(url)
loop = asyncio.get_event_loop()
tasks = [
    loop.run_in_executor(None, scrape_first_site),
    loop.run_in_executor(None, scrape_another_site)
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
Since the default executor is a ThreadPoolExecutor, each task runs in a separate thread. You can use a ProcessPoolExecutor if you'd like to run the tasks in processes rather than threads (for example, because of GIL-related issues).
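On recent Python versions the same idea is usually written with asyncio.run rather than by managing the event loop by hand. The following is a sketch under stated assumptions: slow_fetch is a hypothetical blocking function standing in for requests.get, and fetch_all is a made-up helper name.

```python
import asyncio
import time

def slow_fetch(url):
    """Hypothetical blocking call standing in for requests.get(url)."""
    time.sleep(0.05)
    return "fetched " + url

async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default executor (a thread pool).
    futures = [loop.run_in_executor(None, slow_fetch, url) for url in urls]
    # gather() waits for all of them and keeps the input order.
    return await asyncio.gather(*futures)

pages = asyncio.run(
    fetch_all(["http://example.com/", "http://example.com/other/"])
)
for page in pages:
    print(page)
```

Unlike the snippet above, this version collects the return values, so each function can hand its result back instead of discarding the response.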
