I'm very new to programming and am learning Python to speed up my language learning with Anki. I wanted to create a web scraping script for Anki so I can create cards quicker. Here is my code (it's not the final product; I eventually want to learn how to write to a CSV file so I can then import it into Anki):
from bs4 import BeautifulSoup
import requests
#get data from user
input("Type word ")
#get page
page = requests.get("https://fr.wiktionary.org/wiki/", params=word)
#make bs4 object
soup = BeautifulSoup(page.content, 'html.parser')
#find data from soup
IPA=soup.find(class_='API')
partofspeech=soup.find(class_='ligne-de-forme')
#open file
f=open("french.txt", "a")
#print text
print (IPA.text)
print (partofspeech.text)
#write to file
f.write(IPA.text)
f.write(partofspeech.text)
#close file
f.close()
It only returns the "word of the day" from Wiktionnaire and not the user's input. Any ideas?
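Two small things jump out in the snippet above: the return value of input() is never assigned, and params= builds a query string (?...) rather than putting the word into the URL path, so the request ends up on the bare /wiki/ page, which shows the word of the day. A minimal sketch of the likely fix, keeping your selectors and assuming the plain word can simply be appended to the /wiki/ path:
from bs4 import BeautifulSoup
import requests

# store the user's input -- previously the result of input() was discarded
word = input("Type word ")

# Wiktionary expects the word in the URL path, not as a query string
page = requests.get(f"https://fr.wiktionary.org/wiki/{word}")
soup = BeautifulSoup(page.content, 'html.parser')

ipa = soup.find(class_='API')
part_of_speech = soup.find(class_='ligne-de-forme')

print(ipa.text)
print(part_of_speech.text)

# append to the output file
with open("french.txt", "a") as f:
    f.write(ipa.text)
    f.write(part_of_speech.text)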
Beyond that quick fix, you could take the following broader approach:
(1) Read something in French and note the words or sentences you want to learn on paper.
(2) Write these words/sentences into a {text, json, markdown, ...} file.
(3) Read these words with Python using file I/O.
(4) Use anki-connect, which runs a web server to interface with your Anki account.
(5) Write a Python script to HTTP POST your input word and scrape the answer on deepl.com, for example.
(6) Combine these tools to add a learning session to Anki in one command.
(7) Happy learning!
Some code
Anki-connect
# https://github.com/FooSoft/anki-connect
# https://github.com/FooSoft/anki-connect/blob/master/actions/decks.md
import json
import urllib.request

def request(action, **params):
    return {'action': action, 'params': params, 'version': 6}

def invoke(action, **params):
    requestJson = json.dumps(request(action, **params)).encode('utf-8')
    response = json.load(urllib.request.urlopen(urllib.request.Request('http://localhost:8765', requestJson)))
    if len(response) != 2:
        raise Exception('response has an unexpected number of fields')
    if 'error' not in response:
        raise Exception('response is missing required error field')
    if 'result' not in response:
        raise Exception('response is missing required result field')
    if response['error'] is not None:
        raise Exception(response['error'])
    return response['result']
invoke('createDeck', deck='english-to-french')
result = invoke('deckNames')
print(f'got list of decks: {result}')
invoke('deleteDecks', decks=['english-to-french'], cardsToo=True)
result = invoke('deckNames')
print(f'got list of decks: {result}')
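To actually create a card rather than just a deck, anki-connect's addNote action can be called through the same invoke() helper. A short sketch, assuming the stock "Basic" note type (field names will differ for a custom note type):
# (re)create the target deck -- the snippet above deleted it again -- and add one note
invoke('createDeck', deck='english-to-french')
invoke('addNote', note={
    'deckName': 'english-to-french',
    'modelName': 'Basic',
    'fields': {'Front': 'lascive', 'Back': 'lascivious'},
    'tags': ['french'],
})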
Web-scraping with scrapy
import scrapy

CODES = {
    'fr': 'french',
    'en': 'english'
}

URL_BASE = "https://www.linguee.com/%s-%s/translation/%s.html"

# these urls can come from another data file
# def get_data_from_file(filepath: str):
#     with open(filepath, 'r') as f:
#         lines = f.readlines()
#
#     return [URL_BASE % (CODES['fr'], CODES['en'], line.strip()) for line in lines]

URLS = [
    URL_BASE % (CODES['fr'], CODES['en'], 'lascive')
]

class BlogSpider(scrapy.Spider):
    name = 'linguee_spider'
    start_urls = URLS

    def parse(self, response):
        for span in response.css('span.tag_lemma'):
            yield {'word': span.css('a.dictLink ::text').get()}

        for div in response.css('div.translation'):
            for span in div.css('span.tag_trans'):
                yield {'translation': span.css('a.dictLink ::text').get()}
Shell script wrapping it all up
#!/bin/bash
# setup variables
DATE=$(date +"%Y-%m-%d-%H-%M")
SCRIPT_FILE="/path/to/folder/script.py"
OUTPUT_FILE="/path/to/folder/data/${DATE}.json"
echo "Running --- ${SCRIPT_FILE} --- at --- ${DATE} ---"
# activate virtualenv and run scrapy
source /path/to/folder/venv/bin/activate
scrapy runspider ${SCRIPT_FILE} -o ${OUTPUT_FILE}
echo "Saved results into --- ${OUTPUT_FILE} ---"
# reading data from scrapy output and creating an Anki card using anki-connect
python create_anki_card.py
I'm a journalist working on a project using web scraping to pull data from the county jail site. I'm still teaching myself Python and am trying to get a list of charges and the bail that was assigned for each charge. The site uses XML, and I've been able to pull the data for charges and bail and write it to a CSV file, but I'm having trouble using the unwrap() function to remove tags. I've tried it out in a few places and can't seem to figure out its usage. I'd really like to do this in the code and not just have to run a find-and-replace in the spreadsheet.
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"

xml = requests.get(url)
response = requests.get(url)

if response.status_code == 200:
    print("Connecting to jail website:")
    print("Connected - Response code:", response)
    print("Scraping Started at ", datetime.now())

soup = BeautifulSoup(xml.content, 'lxml')

charges = soup.find_all('ol')
bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])
CSV File
"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...
There is no need to use the unwrap() function; you just need to access the text within each element. I suggest you search on <of>, which sits above both the <ol> and <ob> entries. Doing this will avoid your lists of ol and ob entries getting out of sync, as not all entries have an ob.
Try the following:
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"

print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')

        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
            chargesbail.writerow([of.ol.text, ob])
Which would give you an output CSV file starting:
BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
The code of.ob.text is shorthand for: from the <of>, find the first <ob> entry and return the text contained inside, i.e.:
of.find('ob').get_text()
To only write rows when both are present, you could change it to:
for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])
I'm new to Python and have undertaken my first project to automate something for my role (I'm in the network space, so forgive me if this is terrible!).
I'm required to download a .json file from the link below:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
My script goes through and retrieves the manual download link.
The reason I'm getting the URL in this way is that the download link changes every fortnight when MS updates the file.
My preference is to extract the "addressPrefixes" contents for the names "AzureCloud.australiacentral", "AzureCloud.australiacentral2", "AzureCloud.australiaeast" and "AzureCloud.australiasoutheast".
I then want to strip out the " and , characters.
Each of the subnet ranges should then reside on a new line and be placed in a text file.
If I perform the below, I'm able to get the output that I want.
Am I correct in thinking that I can use a for loop to achieve this? If so, would it be better to use a Python dictionary as opposed to using JSON formatted output?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Script to check Azure IPs

# Import modules for script
import requests
import re
import json
import urllib.request

search = r'https://download.*?\.json'
ms_dl_centre = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"

requests_get = requests.get(ms_dl_centre)
json_url_search = re.search(search, requests_get.text)
json_file = json_url_search.group(0)

with urllib.request.urlopen(json_file) as url:
    contents = json.loads(url.read().decode())

print(json.dumps(contents['values'][1]['properties']['addressPrefixes'], indent=0))  # use this to print contents from json entry 1
I'm not convinced that using re to parse HTML is a good idea. BeautifulSoup is more suited to the task. Upon inspection of the HTML response, I note that there's a span element of class file-link-view1 that seems to uniquely identify the URL to the JSON download. Assuming that to be a robust approach (i.e. Microsoft don't change the way the download URL is presented), then this is how I'd do it:
import requests
from bs4 import BeautifulSoup

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']

    response = session.get(downloadurl)
    response.raise_for_status()
    json = response.json()

    for n in json['values']:
        if n['name'] in namelist:
            print(n['name'])
            for ap in n['properties']['addressPrefixes']:
                print(ap)
@andyknight, thanks for your direction. I'd upvote you, but as I'm a noob it won't let me do so.
I've taken the basis of your python script and added in some additional components.
I removed the print statement for the region name in the .txt file, as this file is referenced by a firewall, which is looking for IP addresses.
I've added Try/Except/Else for a portion of the script to identify if there is ever an error reaching the URL, or some other unspecified error. I've leveraged logging to send an email based on the status of the script: if an exception is thrown I get an email with traceback information, otherwise I receive an email advising the script was successful.
This writes out the specific prefixes for AU regions into a .txt file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import logging
import logging.handlers
from bs4 import BeautifulSoup

smtp_handler = logging.handlers.SMTPHandler(mailhost=("sanitised.smtp[.]xyz", 25),
                                            fromaddr="UpdateIPs#sanitised[.]xyz",
                                            toaddrs="FriendlyAdmin#sanitised[.]xyz",
                                            subject=u"Check Azure IP Script completion status.")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.addHandler(smtp_handler)

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    try:
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
        response = session.get(downloadurl)
        response.raise_for_status()
        json = response.json()
        for n in json['values']:
            if n['name'] in namelist:
                for ap in n['properties']['addressPrefixes']:
                    with open('Check_Azure_IPs.txt', 'a') as file:
                        file.write(ap + "\n")
    except requests.exceptions.HTTPError as e:
        logger.exception(
            "URL is no longer valid, please check the URL that's defined in this script with MS, as this may have changed.\n\n")
    except Exception as e:
        logger.exception("Unknown error has occurred, please review script")
    else:
        logger.info("Script has run successfully! Azure IPs have been updated.")
Please let me know if you think there is a better way to handle this, otherwise this is marked as answered. I appreciate your help greatly!
I have written a web scraping program in python. It is working correctly but takes 1.5 hrs to execute. I am not sure how to optimize the code.
The logic of the code: every country has many ASNs, each with a client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856).
I'm using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    #bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # flag re.S make the '.' special character match any character at all, including a newline;
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    #print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that
You blend into the background noise
You don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN.
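A rough sketch of that idea with pyasn, assuming you have already downloaded and converted a routing snapshot with the pyasn utility scripts (the file names here are placeholders):
import pyasn

# the data file is built beforehand, e.g. with:
#   pyasn_util_download.py --latest
#   pyasn_util_convert.py --single <rib-file> ipasn.dat
asndb = pyasn.pyasn('ipasn.dat')

# all prefixes currently announced by AS2856, with no HTTP requests at all
prefixes = asndb.get_as_prefixes(2856)
print(len(prefixes), "prefixes announced by AS2856")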
I think what you need is to run the scraping in multiple processes. This can be done using the Python multiprocessing package, since multithreaded programs in Python are limited by the GIL (Global Interpreter Lock). There are plenty of examples of how to do this; here are some, followed by a rough sketch:
Multiprocessing Spider
Speed up Beautiful soup scraper
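As a sketch of the idea, the helper below is a stand-in for the per-ASN scraping done in get_each_sites() in the question; the pool size and ASN list are illustrative only:
from multiprocessing import Pool
import urllib.request

SITE = 'https://ipinfo.io'

def fetch_asn_page(asn_number):
    # download one ASN page; the regex parsing from the question would go here
    with urllib.request.urlopen(SITE + '/AS' + asn_number) as resp:
        return asn_number, resp.read().decode('utf-8', errors='replace')

if __name__ == '__main__':
    asn_numbers = ['2856', '3356', '15169']  # would come from the country pages
    with Pool(processes=8) as pool:
        for asn, html in pool.imap_unordered(fetch_asn_page, asn_numbers):
            print(asn, len(html))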
I am trying to make a web crawler using Python. I am borrowing this code from the book Programming Collective Intelligence by Toby Segaran. Since the code from the book was outdated, I made some necessary changes, but the program still doesn't execute as expected. Here is my code:
import urllib
from urllib import request
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import bs4

# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])

class crawler:
    # Initialize the crawler with the name of database
    def __init__(self,dbname):
        pass

    def __del__(self): pass

    def dbcommit(self):
        pass

    # Auxilliary function for getting an entry id and adding
    # it if it's not present
    def getentryid(self,table,field,value,createnew=True):
        return None

    # Index an individual page
    def addtoindex(self,url,soup):
        print('Indexing %s' % url)

    # Extract the text from an HTML page (no tags)
    def gettextonly(self,soup):
        return None

    # Separate the words by any non-whitespace character
    def separatewords(self,text):
        return None

    # Return true if this url is already indexed
    def isindexed(self,url):
        return False

    # Add a link between two pages
    def addlinkref(self,urlFrom,urlTo,linkText):
        pass

    # Starting with a list of pages, do a breadth
    # first search to the given depth, indexing pages
    # as we go
    def crawl(self,pages,depth=2):
        pass

    # Create the database tables
    def createindextables(self):
        pass

    def crawl(self,pages,depth=2):
        for i in range(depth):
            newpages=set( )
            for page in pages:
                try:
                    c=request.urlopen(page)
                except:
                    print("Could not open %s" % page)
                    continue
                soup=BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        url=urljoin(page,link['href'])
                        if url.find("'")!=-1: continue
                        url=url.split('#')[0] # remove location portion
                        if url[0:4]=='http' and not self.isindexed(url):
                            newpages.add(url)
                        linkText=self.gettextonly(link)
                        self.addlinkref(page,url,linkText)

                self.dbcommit( )

            pages=newpages

pagelist=['http://google.com']
#pagelist=['file:///C:/Users/admin/Desktop/abcd.html']
crawler=crawler('')
crawler.crawl(pagelist)
The only output I get is:
"Indexing http://google.com"
"Indexing http://google.com"
press any key to continue...
Every time I put another link in the page list, I get the same output: "Indexing xyz", where xyz is whatever link I put in pagelist. I also tried making an HTML file with lots of <a> tags, but it didn't work either.
The problem is in your line links=soup('a'). If you want to find specific elements, you should use BeautifulSoup's find_all()/select() methods with appropriate filters (cf. the bs4 documentation).
I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if new_link == None:
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "w")
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] =["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)

page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years
for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
                response2 = br.follow_link(link).read()
                soup2 = BeautifulSoup(response2)
                get_pdf(soup2)
In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?
Thomas
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
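For completeness, a minimal sketch of the flow described above (the file name is just an example, and the document is parsed with load() before querying):
import pdfquery

pdf = pdfquery.PDFQuery("report.pdf")   # or pass a file-like object
pdf.load()                              # parse the PDF into the internal lxml tree
print(pdf.pq(':contains("Report")').text())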
It's not exactly magic. I suggest
downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.
For text extraction, command-line utilities give you a number of possibilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
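A Python 3 sketch of those three steps, assuming the poppler pdftotext utility is installed (the URL and paths are placeholders):
import os
import subprocess
import tempfile
import urllib.request

pdf_link = "http://www.europarl.europa.eu/example.pdf"  # placeholder URL

tmpdir = tempfile.mkdtemp()
pdf_path = os.path.join(tmpdir, "report.pdf")
txt_path = os.path.join(tmpdir, "report.txt")

# 1. download the PDF to a temp directory
urllib.request.urlretrieve(pdf_link, pdf_path)

# 2. call out to an external program to extract the text
subprocess.call(["pdftotext", pdf_path, txt_path])

# 3. read the extracted text back in
with open(txt_path) as f:
    text = f.read()
print(text[:200])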