Python web scraping, program won't start

I'm a Java and C# developer learning Python at the moment, specifically for web scraping. When I try to start my script (just by double-clicking it), it won't open: the terminal appears for a few milliseconds and then closes. What mistake did I make?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
product_all_pages = []
for i in range(1,15):
    response = requests.get("https://www.bol.com/nl/s/?page={i}&searchtext=hand+sanitizer&view=list")
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    body = parser.body
    producten = body.find_all(class_="product-item--row js_item_root")
    product_all_pages.extend(producten)
len(product_all_pages)
price = float(product_all_pages[1].meta.get('content'))
productname = product_all_pages[1].find(class_="product-title--inline").a.getText()
print(price)
print(productname)
productlijst = []
for item in product_all_pages:
if item.find(class_="product-prices").getText() == '\nNiet leverbaar\n':
price = None
else:
price = float(item.meta['content'])
product = item.find(class_="product-title--inline").a.getText()
productlijst.append([product, price])
print(productlijst[:3])
df = pd.DataFrame(productlijst, columns=["Product", "price"])
print(df.shape)
df["price"].describe()

Try running your code from the command line; then you can see the debugging output. Your code throws an AttributeError because the response contains no usable data. The problem is that the URL is never formatted: you didn't make it an f-string, so the literal text {i} is sent instead of the page number. This should work:
response = requests.get(f"https://www.bol.com/nl/s/?page={i}&searchtext=hand+sanitizer&view=list")
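For completeness, here is a minimal sketch of the corrected loop. It is the same code as in the question, with only the f prefix added so that {i} is replaced by the page number:
product_all_pages = []
for i in range(1, 15):
    # the f prefix makes Python substitute the loop variable into {i}
    response = requests.get(f"https://www.bol.com/nl/s/?page={i}&searchtext=hand+sanitizer&view=list")
    parser = BeautifulSoup(response.content, 'html.parser')
    producten = parser.body.find_all(class_="product-item--row js_item_root")
    product_all_pages.extend(producten)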

Related

Web scraping with urllib.request - data is not refreshing

I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out-of-date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd
def url_get_contents(url):
    # making a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]

while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ', p.tables[0])
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")

Python BS4 unwrap() scraped xml data

I'm a journalist working on a project using web scraping to pull data from the county jail site. I'm still teaching myself Python and am trying to get a list of charges and the bail that was assigned for each charge. The site uses XML, and I've been able to pull the data for charges and bail and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in a few places and can't seem to figure out its usage. I'd really like to do this in the code and not just have to run a find-and-replace in the spreadsheet.
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
url="https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)
if response.status_code == 200:
    print("Connecting to jail website:")
    print("Connected - Response code:", response)
    print("Scraping Started at ", datetime.now())

soup = BeautifulSoup(xml.content, 'lxml')
charges = soup.find_all('ol')
bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])
CSV File
"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...
There is no need to use the unwrap() function, you just need to access the text within an element. I suggest you search on <of> which is above both the <ol> and <ob> entries. Doing this will avoid your lists of ol and ob entries getting out of sync as not all entries have an ob.
Try the following:
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)
if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')
        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
            chargesbail.writerow([of.ol.text, ob])
Which would give you an output CSV file starting:
BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
The code of.ob.text is shorthand for: from the <of>, find the first <ob> entry and return the text contained inside it, i.e.:
of.find('ob').get_text()
To only write rows when both are present, you could change it to:
for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])

List of all US ZIP Codes using uszipcode

I've been trying to fetch all US ZIP codes for a web scraping project for my company.
I'm trying to use the uszipcode library to do it automatically rather than scraping them manually from the website I'm interested in, but I can't figure it out.
This is my manual attempt:
from bs4 import BeautifulSoup
import requests
url = 'https://www.unitedstateszipcodes.org'
headers = {'User-Agent': 'Chrome/50.0.2661.102'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
all_zipcodes = []
# Extract all
for data in soup.find_all('div', class_='state-list'):
    for a in data.find_all('a'):
        if a is not None:
            hrefs.append(a.get('href'))
hrefs.remove(None)

def get_zipcode_list():
    """
    get_zipcode_list gets the GET response from the web archives server using CDX API
    :return: CDX API output in json format.
    """
    for state in hrefs:
        state_url = url + state
        state_page = requests.get(state_url, headers=headers)
        states_soup = BeautifulSoup(state_page.text, 'html.parser')
        div = states_soup.find(class_='list-group')
        for a in div.findAll('a'):
            if str(a.string).isdigit():
                all_zipcodes.append(a.string)
    return all_zipcodes
This takes a lot of time, and I would like to know how to do the same thing more efficiently using uszipcode.
You may try searching with an empty pattern '':
from uszipcode import SearchEngine

s = SearchEngine()
l = s.by_pattern('', returns=1000000)
print(len(l))
More details are in the docs and in their basic tutorial.
from uszipcode import SearchEngine
import pandas as pd

engine = SearchEngine()
allzips = {}
for i in range(100000):  # Get zipcode info for every possible 5-digit combination
    zipcode = str(i).zfill(5)
    try:
        allzips[zipcode] = engine.by_zipcode(zipcode).to_dict()
    except:
        pass

# Convert dictionary to DataFrame
allzips = pd.DataFrame(allzips).T.reset_index(drop=True)
Since ZIP codes are only 5 digits, you can iterate up to 100k and see which ZIP codes don't return an error. This solution gives you a DataFrame with all the stored information for each saved ZIP code.
The regex for a US ZIP code is [0-9]{5}(?:-[0-9]{4})?, so you can simply check it with the re module:
import re
regex = r"[0-9]{5}(?:-[0-9]{4})?"
if re.match(regex, zipcode):
    print("match")
else:
    print("not a match")
You can download the list of ZIP codes from the official source and then parse it, if it's for one-time use and you don't need any of the other metadata associated with each ZIP code like the one uszipcode provides.
uszipcode also has another database which is quite big and should have all the data you need.
from uszipcode import SearchEngine
zipSearch = SearchEngine(simple_zipcode=False)
allZipCodes = zipSearch.by_pattern('', returns=200000)
print(len(allZipCodes))
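If you only need the plain ZIP code strings rather than the full records, you could then pull them out of the result list; this assumes the returned objects expose a zipcode attribute, which is how the library's Zipcode records are laid out:
all_zip_strings = [z.zipcode for z in allZipCodes]
print(all_zip_strings[:10])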

Python Web scraping: Too slow in execution: How to Optimize for speed

I have written a web scraping program in Python. It works correctly but takes 1.5 hours to execute, and I am not sure how to optimize the code.
The logic of the code is that every country has many ASNs with the client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856) and using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json
url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'
def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # flag re.S makes the '.' special character match any character at all, including a newline
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    #print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings
main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:
you blend into the background noise
you don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN.
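As a rough sketch of the PyASN route, assuming an IPASN database file has already been built with pyasn's download/convert utilities (the file name below is just a placeholder):
import pyasn

# 'ipasn_db.dat' is a placeholder for a database file built with pyasn's utilities
asndb = pyasn.pyasn('ipasn_db.dat')

# origin ASN and BGP prefix for a given IP
asn, prefix = asndb.lookup('8.8.8.8')
print(asn, prefix)

# all prefixes announced by that ASN
print(asndb.get_as_prefixes(asn))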
I think what you need is to run the scraping in multiple processes. This can be done with Python's multiprocessing package, since multi-threaded programs don't get true parallelism in Python because of the GIL (Global Interpreter Lock). There are plenty of examples of how to do this. Here are some:
Multiprocessing Spider
Speed up Beautiful soup scraper
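For illustration, a minimal sketch of that multiprocessing idea applied to the code above, fetching each country page in a separate worker process. It reuses url_to_soup, find_pages and get_each_sites from the question; the pool size of 8 is an arbitrary choice:
from multiprocessing import Pool

def scrape_country(link):
    # scrape a single country page and return its ASN mappings
    return get_each_sites([link])

if __name__ == '__main__':
    main_page = url_to_soup(url)
    country_links = find_pages(main_page)

    asn_mappings = {}
    with Pool(processes=8) as pool:
        # each worker processes one country link at a time
        for partial in pool.map(scrape_country, country_links):
            asn_mappings.update(partial)

    print(asn_mappings)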

How to connect to internet & load html using python/urllib in ubuntu?

I am new to programming and I'm having a problem with the code I have posted below.
After running the program, it just shows a traceback error.
import urllib
proxies = {'http' : 'http://proxy:80'}
urlopener = urllib.FancyURLopener(proxies)
htmlpage = urlopener.open('http://www.google.com')
data = htmlpage.readlines()
print data
You need to indent your Python properly and replace readlines() with read():
import urllib
proxies = {'http' : 'http://proxy:80'}
urlopener = urllib.FancyURLopener(proxies)
htmlpage = urlopener.open('http://www.google.com')
data = htmlpage.read()
print data
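As a side note, FancyURLopener and the bare print statement are Python 2 idioms. On a current Python 3 install the rough equivalent, going through the same placeholder proxy from the question, would be something like:
import urllib.request

# route HTTP requests through the proxy ('http://proxy:80' is the placeholder from the question)
proxy_handler = urllib.request.ProxyHandler({'http': 'http://proxy:80'})
opener = urllib.request.build_opener(proxy_handler)

with opener.open('http://www.google.com') as htmlpage:
    data = htmlpage.read()

print(data)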
