I've been trying to fetch all US zip codes for a web scraping project for my company.
I'm trying to use the uszipcode library to do it automatically rather than scraping them manually from the website I'm interested in, but I can't figure it out.
This is my manual attempt:
from bs4 import BeautifulSoup
import requests

url = 'https://www.unitedstateszipcodes.org'
headers = {'User-Agent': 'Chrome/50.0.2661.102'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
all_zipcodes = []

# Extract all state page links
for data in soup.find_all('div', class_='state-list'):
    for a in data.find_all('a'):
        if a is not None:
            hrefs.append(a.get('href'))
hrefs.remove(None)
def get_zipcode_list():
    """
    Visit each state page and collect every zip code listed on it.
    :return: list of all zip codes as strings.
    """
    for state in hrefs:
        state_url = url + state
        state_page = requests.get(state_url, headers=headers)
        states_soup = BeautifulSoup(state_page.text, 'html.parser')
        div = states_soup.find(class_='list-group')
        for a in div.findAll('a'):
            if str(a.string).isdigit():
                all_zipcodes.append(a.string)
    return all_zipcodes
This takes a lot of time, and I would like to know how to do the same thing more efficiently using uszipcode.
You can try searching by the empty pattern '':
from uszipcode import SearchEngine

s = SearchEngine()
l = s.by_pattern('', returns=1000000)
print(len(l))
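As a short follow-up (assuming the default simple zipcode objects, which expose a .zipcode attribute), you can pull just the 5-digit strings out of the result list:
from uszipcode import SearchEngine

s = SearchEngine()
results = s.by_pattern('', returns=1000000)
# Each result is a Zipcode object; keep only the 5-digit strings
all_zipcodes = [z.zipcode for z in results]
print(len(all_zipcodes))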
More details are in the docs and in their basic tutorial.
from uszipcode import SearchEngine
import pandas as pd

engine = SearchEngine()
allzips = {}
for i in range(100000):  # Get zipcode info for every possible 5-digit combination
    zipcode = str(i).zfill(5)
    try:
        allzips[zipcode] = engine.by_zipcode(zipcode).to_dict()
    except AttributeError:  # by_zipcode returns None for codes that don't exist
        pass

# Convert dictionary to DataFrame
allzips = pd.DataFrame(allzips).T.reset_index(drop=True)
Since zip codes are only 5 digits, you can iterate up to 100k and keep the ones that don't raise an error. This gives you a DataFrame with all the stored information for each valid zip code.
The regex for a US zip code is [0-9]{5}(?:-[0-9]{4})?.
You can check it with the re module:
import re

regex = r"[0-9]{5}(?:-[0-9]{4})?"
zipcode = "12345-6789"  # example value

if re.match(regex, zipcode):
    print("match")
else:
    print("not a match")
You can download the list of zip codes from the official source and parse it yourself if it's for one-time use and you don't need the extra metadata that uszipcode provides for each zip code.
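If you go that route, parsing the downloaded file is straightforward with pandas. A minimal sketch, where the file name and the 'zip' column are assumptions that depend on which list you download:
import pandas as pd

# 'zip_code_database.csv' and the 'zip' column name are assumptions;
# adjust them to match the file you actually downloaded
df = pd.read_csv('zip_code_database.csv', dtype={'zip': str})
all_zipcodes = df['zip'].tolist()
print(len(all_zipcodes))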
uszipcode also has a comprehensive database, which is quite big and should have all the data you need:
from uszipcode import SearchEngine

zipSearch = SearchEngine(simple_zipcode=False)
allZipCodes = zipSearch.by_pattern('', returns=200000)
print(len(allZipCodes))
The program I'm working on searches through multiple paths (stored in a JSON list) of a URL and finds ones that aren't being used (404 pages).
The problem: I want to print the path when I come across a 404 (i.e. when I find an error div), but I can't figure out how, since the loop item name seems to be unreachable.
### Libraries ###
from bs4 import BeautifulSoup
import grequests
import requests
import json
import time

### User inputs ###
namelist = input('Your namelist: ')
print('---------------------------------------')
result = input('Output file: ')
print('---------------------------------------')

### Scrape ###
names = json.loads(open(namelist + '.json').read())
reqs = (grequests.get('https://steamcommunity.com/id/' + name) for name in names)
resp = grequests.imap(reqs, grequests.Pool(10))

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class': "error_ctn"})
    if findelement:
        print(name)
    else:
        print('trying')
I think you can do this by modifying where your for loop is located. I'm not familiar with the libraries you're using so I've left a comment where you might need to modify the code further, but something along these lines should work:
names = json.loads(open(namelist + '.json').read())

for name in names:
    req = grequests.get('https://steamcommunity.com/id/' + name)
    # Only one request is passed here, so we are assured of only one response;
    # grab it straight out of the returned list (you may need to tweak this).
    resp = grequests.map([req], size=1)[0]
    soup = BeautifulSoup(resp.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class': "error_ctn"})
    if findelement:
        print(name)
    else:
        print('trying')
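If you want to keep the concurrency of the original approach instead, one option (a sketch; it reuses the namelist variable from the question and relies on grequests.map returning responses in the same order as the requests) is to send everything at once and zip the responses back to their names:
import json
import grequests
from bs4 import BeautifulSoup

names = json.loads(open(namelist + '.json').read())
reqs = [grequests.get('https://steamcommunity.com/id/' + name) for name in names]

# map() preserves request order, so zipping with names tells us which
# profile each response belongs to; failed requests come back as None
for name, r in zip(names, grequests.map(reqs, size=10)):
    if r is None:
        continue
    soup = BeautifulSoup(r.text, 'lxml')
    if soup.find_all('div', attrs={'class': "error_ctn"}):
        print(name)
    else:
        print('trying')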
I am trying to extract a table using an API, but I am unable to do so. I am pretty sure that I am not using it correctly, and any help would be appreciated.
Specifically, I am trying to extract a table from this API but can't figure out the right way to do it. This is what is mentioned on the website; I want to extract the Latest_full_data table.
This is my code to get the table, but I am getting an error:
import json
import urllib
import urllib.request
import requests

locu_api = 'api_Key'

def locu_search(query):
    api_key = locu_api
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    response = urllib.request.urlopen(url).read()
    json_obj = str(response, 'utf-8')
    datanew = json.loads(json_obj)
    return datanew
When I do print(datanew), I get the error below. Update: even if I change it to return datanew, the error is still the same.
This is the error:
name 'datanew' is not defined
I had the same issues with urllib before. If possible, try to use requests; it's a better designed library in my opinion. It can also parse JSON with a single call, so there is no need to run the response through multiple lines. Sample code:
import requests

api_key = 'api_Key'

def locu_search():
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    return requests.get(url).json()

locu_search()
Edit:
The endpoint that you are calling might not be the correct one. I think you are looking for the following one:
import requests

api_key = 'your_api_key_here'

def locu_search(dataset_code):
    url = f'https://www.quandl.com/api/v3/datasets/WIKI/{dataset_code}/metadata.json?api_key={api_key}'
    req = requests.get(url)
    return req.json()

data = locu_search("FB")
This will return all the metadata for a company, in this case Facebook.
Maybe it doesn't apply to your specific problem, but what I normally do is the following:
import json
import requests

def get_values(url):
    response = requests.get(url).text
    values = json.loads(response)
    return values
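For example, a hypothetical call against the metadata endpoint from the answer above (the URL and api_key value are placeholders):
api_key = 'your_api_key_here'
url = 'https://www.quandl.com/api/v3/datasets/WIKI/FB/metadata.json?api_key=' + api_key
metadata = get_values(url)
print(metadata)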
I have written a web scraping program in Python. It works correctly but takes 1.5 hours to execute, and I am not sure how to optimize the code.
The logic is: every country has many ASNs with the client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856) and using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'


def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup


def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages


def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                # print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                # print(asn_code[2:])
                # print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                # print(registry)
                # flag re.S makes the '.' special character match any character at all, including a newline
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    # print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings


main_page = url_to_soup(url)
country_links = find_pages(main_page)
# print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:
- you blend into the background noise
- you don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN, as sketched below.
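A minimal sketch of the PyASN side, assuming you have installed pyasn and built a local routing-table snapshot with its bundled download/convert utilities (the file name here is an assumption):
import pyasn

# 'ipasn_db.dat' is a local IP-to-ASN snapshot built beforehand with the
# pyasn utility scripts; lookups then run offline, so no load on ipinfo.io
asndb = pyasn.pyasn('ipasn_db.dat')
asn, prefix = asndb.lookup('8.8.8.8')
print(asn, prefix)  # e.g. Google's ASN and the announcing prefix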
I think what you need is to run the scraping in multiple processes. This can be done with Python's multiprocessing package, since multithreaded programs don't get true parallelism in Python because of the GIL (Global Interpreter Lock). There are plenty of examples of how to do this; here are some, followed by a rough sketch:
Multiprocessing Spider
Speed up Beautiful soup scraper
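A rough sketch of that idea (a sketch only: the worker just downloads pages, the parsing from the question stays in the parent process, and the placeholder asn_urls list stands in for the URLs your existing code builds):
import multiprocessing
import urllib.request

SITE = 'https://ipinfo.io'

def fetch(url):
    # Download one ASN page and return the URL together with its raw HTML
    with urllib.request.urlopen(url) as resp:
        return url, resp.read().decode('utf-8', errors='replace')

if __name__ == '__main__':
    # Build this list the same way as in the question,
    # i.e. SITE + '/AS' + current_asn for every ASN found on the country pages
    asn_urls = [SITE + '/AS2856']  # placeholder
    with multiprocessing.Pool(processes=8) as pool:
        for url, html in pool.imap_unordered(fetch, asn_urls):
            # feed html into the existing regex/BeautifulSoup parsing here
            print(url, len(html))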
So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results.
I spent many hours trying to get PyGoogle to work, but later found out Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the Google search results:
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
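Since the question asks for five URLs, a small extension of the same idea (still a sketch; Google's result markup changes often, so the cite selector is an assumption) collects the first five cite elements instead of just one:
import requests
from bs4 import BeautifulSoup

keyword = "Facebook"  # enter your keyword here
search = "https://www.google.co.uk/search?q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div', {'id': 'search'})

# Take the first five result URLs shown in <cite> tags
urls = [cite.text for cite in container.find_all("cite")[:5]]
print(urls)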
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious though--so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:
import google

for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Here is a link to the xgoogle library, which does the same.
I tried something similar to get the top links, counting how often the search term occurs on each linked page. I have added the code snippet for your reference:
import operator
import urllib
# This line will import the GoogleSearch and SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
# read user input
yourword = raw_input()

try:
    # This will perform a google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    # get google search results
    results = gs.get_results()
    source = ''
    # loop through all results to get each link and its content
    for res in results:
        # res.url.encode('utf8') gives the url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        # the line above opens the url; the line below reads the content of that web page
        source = myurl.read()
        # This line will count occurrences of the entered keyword in the web page
        count = source.count(yourword)
        # We store the result in a dictionary: for each url, its occurrence count
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
# sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])
I have the following sample code where I download a PDF from the European Parliament website for a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"


def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if new_link == None:
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)
            localfile = open(name_pdf, "w")
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()

            soup = BeautifulSoup(pdf_html)


page = range(1, 2)        # can be set to 400 to get every document for a given year
year = range(1999, 2000)  # can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name="byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y, p)
        else:
            print "%s %s Writing dossier..." % (y, p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse the text for information about the legislative procedure. Can anyone explain how this can be done?
Thomas
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
It's not exactly magic. I suggest
downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.
For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
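For example, a minimal sketch using the pdftotext command-line utility (part of Poppler/Xpdf; it has to be installed and on your PATH):
import subprocess

def pdf_to_text(pdf_path, txt_path):
    # Shell out to pdftotext, which writes the extracted text to txt_path
    subprocess.call(["pdftotext", pdf_path, txt_path])
    with open(txt_path) as f:
        return f.read()

# e.g. text = pdf_to_text(name_pdf, name_pdf.replace(".pdf", ".txt"))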