Python BS4 unwrap() scraped xml data

I'm a journalist working on a project that uses web scraping to pull data from the county jail site. I'm still teaching myself Python and am trying to get a list of charges and the bail that was assigned for each charge. The site uses XML, and I've been able to pull the data for charges and bail and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in a few places and can't figure out its usage. I'd really like to handle this in the code rather than run a find-and-replace in the spreadsheet.
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)

if response.status_code == 200:
    print("Connecting to jail website:")
    print("Connected - Response code:", response)
    print("Scraping Started at ", datetime.now())
    soup = BeautifulSoup(xml.content, 'lxml')
    charges = soup.find_all('ol')
    bail_amt = soup.find_all('ob')
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile, delimiter=',')
        chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])
The resulting CSV file:
"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...

There is no need to use the unwrap() function; you just need to access the text within each element. I suggest you search on <of>, which sits above both the <ol> and <ob> entries. Doing this avoids your lists of ol and ob entries getting out of sync, as not all entries have an ob.
Try the following:
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"

print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')
        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
            chargesbail.writerow([of.ol.text, ob])
Which would give you an output CSV file starting:
BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
The code of.ob.text is shorthand for "from the of, find the first ob entry and return the text contained inside", or equivalently:
of.find('ob').get_text()
To only write rows when both are present, you could change it to:
for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])

Related

Web scraping with urllib.request - data is not refreshing

I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out-of-date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd

def url_get_contents(url):
    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]

while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ', p.tables[0])
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")

How to print selected text from JSON file using Python

I'm new to Python and have undertaken my first project to automate something for my role (I'm in the network space, so forgive me if this is terrible!).
I'm required to download a .json file from the link below:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
My script goes through and retrieves the manual download link.
The reason I'm getting the URL this way is that the download link changes every fortnight when MS updates the file.
I want to extract the "addressPrefixes" contents for the names "AzureCloud.australiacentral", "AzureCloud.australiacentral2", "AzureCloud.australiaeast" and "AzureCloud.australiasoutheast".
I then want to strip out the " and , characters.
Each of the subnet ranges should then sit on its own line in a text file.
If I run the code below, I'm able to get output close to what I'm after.
Am I correct in thinking that I can use a for loop to achieve this? If so, would it be better to use a Python dictionary as opposed to using JSON formatted output?
# Script to check Azure IPs
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Import modules for script
import requests
import re
import json
import urllib.request

search = 'https://download.*?\.json'
ms_dl_centre = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"

requests_get = requests.get(ms_dl_centre)
json_url_search = re.search(search, requests_get.text)
json_file = json_url_search.group(0)

with urllib.request.urlopen(json_file) as url:
    contents = json.loads(url.read().decode())
    print(json.dumps(contents['values'][1]['properties']['addressPrefixes'], indent=0))  # use this to print contents from json entry 1
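As a hedged aside on the dictionary-versus-JSON question above: json.loads() already returns ordinary Python objects, so contents is a dict and contents['values'] is a list of dicts that a plain for loop can filter directly. For example (the name-prefix check here is just for illustration):

# Sketch only: filter the already-parsed Python objects by name
for entry in contents['values']:
    if entry['name'].startswith('AzureCloud.australia'):
        for prefix in entry['properties']['addressPrefixes']:
            print(prefix)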
I'm not convinced that using re to parse HTML is a good idea; BeautifulSoup is better suited to the task. On inspecting the HTML response, I note that there's a span element of class file-link-view1 that seems to uniquely identify the URL of the JSON download. Assuming that to be a robust approach (i.e. Microsoft doesn't change the way the download URL is presented), this is how I'd do it:
import requests
from bs4 import BeautifulSoup

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
    response = session.get(downloadurl)
    response.raise_for_status()
    json = response.json()

for n in json['values']:
    if n['name'] in namelist:
        print(n['name'])
        for ap in n['properties']['addressPrefixes']:
            print(ap)
@andyknight, thanks for your direction. I'd upvote you, but as I'm a noob the site doesn't permit me to do so.
I've taken the basis of your Python script and added some additional components.
I removed the print statement for the region name in the .txt file, as this file is referenced by a firewall, which is looking for IP addresses.
I've added try/except/else around a portion of the script to identify whether there is ever an error reaching the URL, or some other unspecified error. I've leveraged logging to send an email based on the status of the script: if an exception is thrown I get an email with traceback information, otherwise I receive an email advising that the script was successful.
This writes out the specific prefixes for AU regions into a .txt file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import logging
import logging.handlers
from bs4 import BeautifulSoup

smtp_handler = logging.handlers.SMTPHandler(mailhost=("sanitised.smtp[.]xyz", 25),
                                            fromaddr="UpdateIPs#sanitised[.]xyz",
                                            toaddrs="FriendlyAdmin#sanitised[.]xyz",
                                            subject=u"Check Azure IP Script completion status.")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.addHandler(smtp_handler)

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    try:
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
        response = session.get(downloadurl)
        response.raise_for_status()
        json = response.json()
        for n in json['values']:
            if n['name'] in namelist:
                for ap in n['properties']['addressPrefixes']:
                    with open('Check_Azure_IPs.txt', 'a') as file:
                        file.write(ap + "\n")
    except requests.exceptions.HTTPError as e:
        logger.exception(
            "URL is no longer valid, please check the URL that's defined in this script with MS, as this may have changed.\n\n")
    except Exception as e:
        logger.exception("Unknown error has occurred, please review script")
    else:
        logger.info("Script has run successfully! Azure IPs have been updated.")
Please let me know if you think there is a better way to handle this, otherwise this is marked as answered. I appreciate your help greatly!
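One small, optional refinement to the snippet above: open the output file once, outside the loops, so it is not reopened for every prefix. The inner block would then look like this (same logic, just restructured):

# Open the output file once; everything else is unchanged
with open('Check_Azure_IPs.txt', 'a') as file:
    for n in json['values']:
        if n['name'] in namelist:
            for ap in n['properties']['addressPrefixes']:
                file.write(ap + "\n")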

Web scraping script for Anki

I'm very new to programming. I'm learning Python to speed up my language learning with Anki, and I wanted to create a web scraping script for Anki to create cards more quickly. Here is my code (it's not the final product; I eventually want to learn how to send the output to a CSV file so I can then import it into Anki):
from bs4 import BeautifulSoup
import requests
#get data from user
input("Type word ")
#get page
page = requests.get("https://fr.wiktionary.org/wiki/", params=word)
#make bs4 object
soup = BeautifulSoup(page.content, 'html.parser')
#find data from soup
IPA=soup.find(class_='API')
partofspeech=soup.find(class_='ligne-de-forme')
#open file
f=open("french.txt", "a")
#print text
print (IPA.text)
print (partofspeech.text)
#write to file
f.write(IPA.text)
f.write(partofspeech.text)
#close file
f.close()
It only returns the "word of the day" from Wiktionnaire and not the user's input. Any ideas?
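A hedged guess at the cause, based on the code above: the result of input() is never assigned to a variable, and Wiktionnaire expects the word in the URL path rather than as a query parameter, so the request always lands on the default page. A minimal sketch of that fix, assuming the class names used above are otherwise correct:

from bs4 import BeautifulSoup
import requests

# Keep the user's input and put it in the URL path
word = input("Type word ")
page = requests.get("https://fr.wiktionary.org/wiki/" + word)

soup = BeautifulSoup(page.content, 'html.parser')
IPA = soup.find(class_='API')
partofspeech = soup.find(class_='ligne-de-forme')
if IPA and partofspeech:
    print(IPA.text, partofspeech.text)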
You can follow the approach below:
(1) Read something in French and note the words or sentences you want to learn on paper.
(2) Write these words/sentences into a {text, json, markdown, ...} file.
(3) Read these words with Python using standard file I/O.
(4) Use anki-connect, which runs a web server to interface with your Anki account.
(5) Write a Python script to HTTP-post your input word and scrape the answer on deepl.com, for example.
(6) Combine these tools to add a learning session to Anki in one command.
(7) Happy learning!
Some code
Anki-connect
# https://github.com/FooSoft/anki-connect
# https://github.com/FooSoft/anki-connect/blob/master/actions/decks.md
import json
import urllib.request

def request(action, **params):
    return {'action': action, 'params': params, 'version': 6}

def invoke(action, **params):
    requestJson = json.dumps(request(action, **params)).encode('utf-8')
    response = json.load(urllib.request.urlopen(urllib.request.Request('http://localhost:8765', requestJson)))
    if len(response) != 2:
        raise Exception('response has an unexpected number of fields')
    if 'error' not in response:
        raise Exception('response is missing required error field')
    if 'result' not in response:
        raise Exception('response is missing required result field')
    if response['error'] is not None:
        raise Exception(response['error'])
    return response['result']

invoke('createDeck', deck='english-to-french')
result = invoke('deckNames')
print(f'got list of decks: {result}')

invoke('deleteDecks', decks=['english-to-french'], cardsToo=True)
result = invoke('deckNames')
print(f'got list of decks: {result}')
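To turn a scraped word/translation pair into an actual card, the same invoke() helper can call anki-connect's addNote action. This is only a sketch; the field names assume the default 'Basic' note type:

# Hedged sketch: add a single note to the deck created above
note = {
    'deckName': 'english-to-french',
    'modelName': 'Basic',
    'fields': {'Front': 'lascivious', 'Back': 'lascive'},
    'tags': ['auto-scraped'],
}
note_id = invoke('addNote', note=note)
print(f'created note: {note_id}')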
Web-scraping with scrapy
import scrapy

CODES = {
    'fr': 'french',
    'en': 'english'
}
URL_BASE = "https://www.linguee.com/%s-%s/translation/%s.html"

# these urls can come from another data file
# def get_data_from_file(filepath: string):
#     with open('data.json', 'r') as f:
#         lines = f.readlines()
#
#     return [URL_BASE % (CODES['fr'], CODES['en'], line) for line in lines]

URLS = [
    URL_BASE % (CODES['fr'], CODES['en'], 'lascive')
]

class BlogSpider(scrapy.Spider):
    name = 'linguee_spider'
    start_urls = URLS

    def parse(self, response):
        for span in response.css('span.tag_lemma'):
            yield {'word': span.css('a.dictLink ::text').get()}
        for div in response.css('div.translation'):
            for span in div.css('span.tag_trans'):
                yield {'translation': span.css('a.dictLink ::text').get()}
Shell script wrapping it all up
#!/bin/bash
# setup variables
DATE=$(date +"%Y-%m-%d-%H-%M")
SCRIPT_FILE="/path/to/folder/script.py"
OUTPUT_FILE="/path/to/folder/data/${DATE}.json"
echo "Running --- ${SCRIPT_FILE} --- at --- ${DATE} ---"
# activate virtualenv and run scrapy
source /path/to/folder/venv/bin/activate
scrapy runspider ${SCRIPT_FILE} -o ${OUTPUT_FILE}
echo "Saved results into --- ${OUTPUT_FILE} ---"
# reading data from scrapy output and creating an Anki card using anki-connect
python create_anki_card.py

Use item name stored in old for loop, inside a new for loop

This program I'm working on is going to search through multiple paths (located in a JSON list) of a URL and find one that's not being used (404 page).
The problem: I want to print the path when I come across a 404 (i.e. when I find an error div), but I can't figure out a way to do so, since the loop variable name seems unreachable at that point.
### Libraries ###
from bs4 import BeautifulSoup
import grequests
import requests
import json
import time
### User inputs ###
namelist = input('Your namelist: ')
print('---------------------------------------')
result = input('Output file: ')
print('---------------------------------------')
### Scrape ###
names = json.loads(open(namelist + '.json').read())
reqs = (grequests.get('https://steamcommunity.com/id/' + name) for name in names)
resp=grequests.imap(reqs, grequests.Pool(10))
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class':"error_ctn"})
    if (findelement):
        print(name)
    else:
        print('trying')
I think you can do this by modifying where your for loop is located. I'm not familiar with the libraries you're using so I've left a comment where you might need to modify the code further, but something along these lines should work:
names = json.loads(open(namelist + '.json').read())

for name in names:
    req = grequests.get('https://steamcommunity.com/id/' + name)
    # May need to modify this line since only passing one req, so are assured of only one response
    resp = grequests.imap(req, grequests.Pool(10))
    # There should only be one response now.
    soup = BeautifulSoup(resp.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class':"error_ctn"})
    if (findelement):
        print(name)
    else:
        print('trying')
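If you would rather keep the requests concurrent, a hedged alternative is grequests.map, which returns responses in the same order as the requests, so they can be zipped back to their names. A sketch, assuming the rest of the setup is unchanged:

names = json.loads(open(namelist + '.json').read())
reqs = [grequests.get('https://steamcommunity.com/id/' + name) for name in names]

# map() preserves input order, so each response lines up with its name
for name, r in zip(names, grequests.map(reqs, size=10)):
    if r is None:  # request failed
        continue
    soup = BeautifulSoup(r.text, 'lxml')
    if soup.find('div', attrs={'class': 'error_ctn'}):
        print(name)
    else:
        print('trying')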

Python Web scraping: Too slow in execution: How to Optimize for speed

I have written a web scraping program in python. It is working correctly but takes 1.5 hrs to execute. I am not sure how to optimize the code.
The logic of the code: every country has many ASNs, each with a client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856), then using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                # print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                # print(asn_code[2:])
                # print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                # print(registry)
                # flag re.S make the '.' special character match any character at all, including a newline;
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    # print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
# print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:
- you blend into the background noise, and
- you don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner.
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN.
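A minimal sketch of the PyASN idea; it assumes you have already built a local routing snapshot with pyasn's utility scripts (e.g. pyasn_util_download.py and pyasn_util_convert.py producing something like ipasn_db.dat):

import pyasn

# Look up ASN data offline from the local routing snapshot
asndb = pyasn.pyasn('ipasn_db.dat')

asn, prefix = asndb.lookup('8.8.8.8')  # ASN announcing this IP, plus the matching prefix
print(asn, prefix)

print(asndb.get_as_prefixes(asn))      # all prefixes currently announced by that ASN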
I think what you need is to run the scraping in multiple processes, which can be done with Python's multiprocessing package. (Multi-threading is limited by the GIL (Global Interpreter Lock) for CPU-bound work, although since scraping is mostly network I/O a thread pool can also help.) There are plenty of examples of how to do this. Here are some:
Multiprocessing Spider
Speed up Beautiful soup scraper
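A rough, hedged sketch of the multiprocessing idea, fetching several ASN pages in parallel; the parsing from get_each_sites() is omitted and the ASN list here is a stand-in for the values scraped from the country pages:

import urllib.request
from multiprocessing import Pool

SITE = 'https://ipinfo.io'

def fetch(asn):
    # Fetch one ASN page and return its raw HTML; parsing would go here
    with urllib.request.urlopen(SITE + '/AS' + asn) as resp:
        return asn, resp.read().decode('utf-8', errors='replace')

if __name__ == '__main__':
    asns = ['2856', '3356', '15169']   # normally collected from the country pages
    with Pool(processes=4) as pool:
        for asn, html in pool.map(fetch, asns):
            print(asn, len(html))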
