Parsing HTML using lxml in Python

I'm trying to parse the Oxford Dictionary website in order to obtain the etymology of a given word.

import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")

I cannot work out how to obtain the string of text I need. I know some lines of code are missing from what I copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.

You don't want to do web scraping here, especially since practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}

url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))

Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string, you need to format the HTML so you can see the structure. I used the HTML formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see that the definitions were in span elements with the 'ind' class attribute.
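If you'd rather not paste into a web formatter, lxml can pretty-print the document for you; a minimal sketch, reusing the root parsed above:

from lxml import etree

# Pretty-print the parsed tree so the nesting and class attributes are
# easy to read while hunting for the right XPath.
print(etree.tostring(root, pretty_print=True).decode())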

Related

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file hosted on GitHub, so I and anyone else can access the data from other devices.
The first thing I tried was just to open every single page on the website and get all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is called with a URL and scrapes it. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
    "Console": "/neo-geo-aes",
    "Call ID": "62815",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
    "Console": "/neo-geo-cd",
    "Call ID": "62817",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
    "Console": "/neo-geo-pocket-color",
    "Call ID": "62578",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
    "Console": "/playstation",
    "Call ID": "62580",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution; here's the code in question:
import re
import requests
import json

## The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

## Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

## For each item (number of consoles) find games
for i in range(len(dataUrl)):
    ## Make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ## Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ## For every item in list (items per console)
    out_list = []
    for i in range(len(urlOne)):
        ## Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ## Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
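For what it's worth, one way to get valid JSON here (a minimal sketch, assuming the scraping loops above): append every record to a single top-level list and write the file once, in 'w' mode, after all loops finish. Calling json.dump() repeatedly on a file opened in 'a' mode is exactly what produces the back-to-back }{ objects shown above.

import json

all_items = []
# ... inside the scraping loops, instead of dumping per console:
#     all_items.append(json_file_content)

# One dump at the end produces a single valid JSON array.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)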

Why is BeautifulSoup(...).find(...) returning None?

I have a problem with my code (I use bs4):
elif 'temperature' in query:
    speak("where?")
    miejsce = takecommand().lower()
    search = (f"Temperature in {miejsce}")
    url = (f'https://www.google.com/search?q={search}')
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:

temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'

Could you help me, please?
data.find("div", class_="BNeawe") didn't return anything, so I believe Google changed how it displays weather since you last ran this code successfully.
If you search 'Weather in {place}' yourself, then right-click the weather widget and choose Inspect Element (browser dependent), you can look at where the data sits in the page and see which class it is under.
It appears it was previously under the BNeawe class.
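Whatever selector you end up with, it's worth guarding against find() returning None so the script fails gracefully instead of raising AttributeError; a minimal sketch, reusing the names from the question:

# find() returns None when no matching element exists, so check before
# touching .text.
div = data.find("div", class_="BNeawe")
if div is not None:
    speak(f"In {search} there is {div.text}")
else:
    speak("Sorry, I couldn't find the temperature.")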
elif "temperature" in query or "temperatures" in query:
    search = "Temperature in New York"
    url = f"https://www.google.com/search?q={search}:"
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"Currently, the temperature in your region is {temp}")
Try this one; you were experiencing your problem in line 5, which is '(r.text , "html.parser")'.
Try to avoid these comma-space mistakes in the code...
Best practice would be to use an API (Google / weather) directly. If you want to scrape, try to avoid selecting your elements by class, because classes are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/search?q=temperature"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.5'}, cookies={'CONSENT': 'YES+'})
soup = BeautifulSoup(response.text, 'html.parser')

for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break

Find multiple tags in BeautifulSoup4 and insert them into one string

I have some code that gets your Pastebin data:
import urllib.request
import urllib.parse

def user_key():
    user_key_data = {'api_dev_key': 'my-dev-key',
                     'api_user_name': 'my-username',
                     'api_user_password': 'my-password'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_login.php',
                                 urllib.parse.urlencode(user_key_data).encode('utf-8'),
                                 timeout=7)
    return req.read().decode()

def user_pastes():
    data = {'api_dev_key': 'my_dev_key',
            'api_user_key': user_key(),
            'api_option': 'list'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_post.php',
                                 urllib.parse.urlencode(data).encode('utf-8'),
                                 timeout=7)
    return req.read().decode()
Every paste has a unique HTML tag, e.g. url, title, paste key, etc., and the above code will print these out per paste.
I made some code that only takes certain tags: the paste url, the paste title, and the paste key.
my_pastes = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')
for paste in soup.findAll(['paste_url', 'paste_title', 'paste_key']):
    my_pastes.append(paste.text)
print(my_pastes)
What I want is to join the url, title, and key of each paste together into one string.
I tried using the .join method, but it only joins the individual characters (might not make sense, but you'll see when you try it).
Unrelated to the problem: once they're joined, I'll split them again and put them in a PyQt5 table.
So this is kind of an answer, but I'm still looking for simpler code:
title = []
key = []
url = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')

for paste_title in soup.findAll('paste_title'):
    title.append(paste_title.text)
for paste_key in soup.findAll('paste_key'):
    key.append(paste_key.text)
for paste_url in soup.findAll('paste_url'):
    url.append(paste_url.text)

for i in range(len(title)):
    print(title[i], key[i], url[i])
Maybe from this answer you'll get an idea of what I want to achieve, since the post was kind of confusing and I can't really express what I want.
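A simpler version of the same idea, sketched on top of the three lists above: zip them so each paste's fields stay together, then join each triple into one string:

# zip() pairs up the i-th title, key, and url, i.e. the fields that
# belong to the same paste; ' '.join then makes one string per paste.
joined = [' '.join(fields) for fields in zip(title, key, url)]
for line in joined:
    print(line)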

Got KeyError when getting a URL as text using Python

I'm trying to get data and export it to CSV. I have a main URL page and a second main URL page, for which I have imported the following:
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse, parse_qs
import csv

def get_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf-8')
    return mainpage

mainpage = get_page('www.website1.com')
mainpage_parser = BeautifulSoup(mainpage, 'html.parser')
secondpage = get_page('www.website2.com')
secondpage_parser = BeautifulSoup(secondpage, 'html.parser')
The patterns of the data are the same, such as Title and Address; thus, the code I use is "find" or "find_all" on each class, for example:

try:
    name = page_parser.find("h1", {"class": "xxx"}).find("a").get_text()
    print(name)
except:
    print(name)

This worked.
However, I couldn't get the "lat" and "lon" from the URL in this HTML element:

<img class="aaa" alt="map" data-track-id="static-map" width="97" height="142" src="https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true">

The code I'm trying to use to get the latitude and longitude is:

for gps in secondpage_parser.find_all('img', {"class": "aaa"}, src=True):
    parsed_url = urlparse(gps['src'])
    mykeys = ['lat', 'lon']
    gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]
    print(gpslocation)
But it raises a KeyError on the "gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]" line: "KeyError: 'lat'".
I would like to know which part is wrong here, or how I should fix it. Please help.
This URL has no query string, but it does have parameters (see what the difference is between URL parameters and query strings). So when you try to parse the query string, you get an empty dictionary, hence the KeyError.
"https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true"
# ^--- semicolon, not question mark
Result of print(parsed_url):

ParseResult(
    scheme='https',
    netloc='www.website.com',
    path='/aaaaaaa',
    params='height=284&lat=18.111&lon=98.111&level=15&returnImage=true',
    query='',
    fragment='')
The key here is to parse the parameters. To fix your code, change parsed_url.query to parsed_url.params:
gpslocation = [parse_qs(parsed_url.params)[k][0] for k in mykeys]
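Putting it together, a self-contained sketch using the example URL from this answer:

from urllib.parse import urlparse, parse_qs

src = "https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true"
parsed_url = urlparse(src)

# The key/value pairs sit in .params (after the ';'), not in .query.
params = parse_qs(parsed_url.params)
gpslocation = [params[k][0] for k in ['lat', 'lon']]
print(gpslocation)  # ['18.111', '98.111']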

How can I get some data from a YouTube video? (Python)

I want to know how I can get some data from a YouTube video, like the views, thumbnails, or comments it has. I have been looking at Google's API but I can't understand it.
Thank you!
A different approach would be to use urllib2 (Python 2) to get the HTML code from the page and then filter it.
import urllib2

source = 'https://www.youtube.com/watch?v=wDjeBNv6ip0'
response = urllib2.urlopen(source)
html = response.read()  # Done, you have the whole HTML file in one gigantic string.
After that, all you have to do is filter it as you would any string.
Getting the number of views, for instance:
wordBreak = ['<', '>']
html = list(html)

i = 0
while i < len(html):
    if html[i] in wordBreak:
        html[i] = ' '
    i += 1
# The block above is just to make the html.split() easier.

html = ''.join(html)
html = html.split()

dataSwitch = False
numOfViews = ''
for element in html:
    if element == '/div':
        dataSwitch = False
    if dataSwitch:
        numOfViews += str(element)
    if element == 'class="watch-view-count"':
        dataSwitch = True

print(numOfViews)
>>> 45.608.212 views
This was a simple example of getting the number of views, but you can do the same for everything on the page, including the number of comments, likes, the content of the comments themselves, etc.
I think this is the part you are looking for (source):
def get_video_localization(youtube, video_id, language):
    results = youtube.videos().list(
        part="snippet",
        id=video_id,
        hl=language
    ).execute()

    localized = results["items"][0]["snippet"]["localized"]
localized will now contain title, description, etc.
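If views, comments, or thumbnails are what you're after specifically, here is a minimal sketch using the YouTube Data API v3 via google-api-python-client (YOUR_API_KEY is a placeholder; create a real key in the Google Cloud console):

from googleapiclient.discovery import build

# Build a YouTube Data API v3 client with an API key (placeholder below).
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

# 'statistics' carries view/comment counts; 'snippet' carries thumbnails.
response = youtube.videos().list(
    part="snippet,statistics",
    id="wDjeBNv6ip0"
).execute()

video = response["items"][0]
print(video["statistics"]["viewCount"])                  # view count
print(video["statistics"].get("commentCount"))           # comment count (absent if disabled)
print(video["snippet"]["thumbnails"]["default"]["url"])  # thumbnail URL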
