Finding information on a website without an external module

I am creating a program in Python that lets you search for a TV show or movie and returns the following from IMDb:
the title, year, rating, age rating, and synopsis.
I want to use no external modules at all, only the ones that come with Python 3.4.
I know I will have to use urllib, but I do not know where to go from there.
How would I do this?

This is an example taken from here:
import json
from urllib.parse import quote
from urllib.request import urlopen
def search(title):
    API_URL = "http://www.omdbapi.com/?r=json&s=%s"
    title = title.encode("utf-8")
    url = API_URL % quote(title)
    data = urlopen(url).read().decode("utf-8")
    data = json.loads(data)
    if data.get("Response") == "False":
        print(data.get("Error", "Unknown error"))
    return data.get("Search", [])
Then you can do:
>>> search("Idiocracy")
[{'Year': '2006', 'imdbID': 'tt0387808', 'Title': 'Idiocracy'}]

This may be more complicated than necessary, but:
I read the page's HTML source, find where the information I want sits, and then extract it.
import urllib.request
def search(title):
    html = urllib.request.urlopen("http://www.imdb.com/find?q="+title).read().decode("utf-8")
    # Link of the first search result
    f = html.find("<td class=\"result_text\"> <a href=\"", 0) + 34
    openlink = ""
    while html[f] != "\"":
        openlink += html[f]
        f += 1
    html = urllib.request.urlopen("http://www.imdb.com"+openlink).read().decode("utf-8")
    # Title and year
    f = html.find("<meta property='og:title' content=\"", 0) + 35
    titleyear = ""
    while html[f] != "\"":
        titleyear += html[f]
        f += 1
    # Rating
    f = html.find("title=\"Users rated this ", 0) + 24
    rating = ""
    while html[f] != "/":
        rating += html[f]
        f += 1
    # Short description
    f = html.find("<meta name=\"description\" content=\"", 0) + 34
    shortdescription = ""
    while html[f] != "\"":
        shortdescription += html[f]
        f += 1
    print(titleyear, rating, shortdescription)
    return (titleyear, rating, shortdescription)
search("friends")
The number added to f has to be exactly right: it is the length of the string you are searching for, because find() returns the position of the first character of the match.
It looks bad. Is there a simpler way to do it?
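One possibly simpler route that still uses only the standard library is html.parser.HTMLParser, which removes the need to count characters by hand. The following is a minimal sketch, assuming the page exposes the title and description in <meta> tags as the code above expects. MetaCollector, extract_meta and SAMPLE_HTML are names made up for this illustration, and the sample markup is a stand-in, not real IMDb output.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags into a dict keyed by their property/name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # og:* tags use "property"; classic tags such as description use "name".
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]


def extract_meta(html):
    """Parse an HTML string and return its meta tags as a dict."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><head>
<meta property='og:title' content="Friends (TV Series 1994-2004)">
<meta name="description" content="Follows the lives of six friends.">
</head><body></body></html>
"""

meta = extract_meta(SAMPLE_HTML)
print(meta["og:title"])
print(meta["description"])
```

In a real program the string passed to extract_meta would be the decoded result of urllib.request.urlopen(...).read(). The parser copes with either quote style around attribute values, which the find()-based approach has to hard-code.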

Related

How to get all emails from a page individually

I am trying to get all emails from a specific page and separate them into an individual variable or even better a dictionary. This is some code.
import requests
import re
import json
from bs4 import BeautifulSoup
page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I put this together quickly so that it would run on its own. The output from the example website would be something like
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on the page individually, how would I do that?

To get just the email address as a string, you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
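Since the question also asks for a dictionary, here is a small sketch of one way to go from href strings to a list and then to a dict. It works on plain strings, so it does not depend on BeautifulSoup; emails_from_hrefs and the sample hrefs are hypothetical names and data chosen for illustration.

```python
def emails_from_hrefs(hrefs):
    """Return the addresses found in a list of href strings, in order."""
    prefix = "mailto:"
    emails = []
    for href in hrefs:
        if not href.lower().startswith(prefix):
            continue  # skip ordinary links
        # Drop the scheme and any "?subject=..." parameters.
        address = href[len(prefix):].partition("?")[0].strip()
        if address and address not in emails:
            emails.append(address)
    return emails


# Hypothetical hrefs, shaped like the values email_link.get("href") returns.
hrefs = [
    "mailto:k@example.net",
    "mailto:kolyma@example.net?subject=Hello",
    "http://www.example.net/about",
    "mailto:k@example.net",
]

emails = emails_from_hrefs(hrefs)
by_index = dict(enumerate(emails, start=1))
print(emails)
print(by_index)
```

Keying the dictionary by position is only one choice; any key that suits the program (for example the link text) would work the same way.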

Based on your expected output, you can use the unwrap function of BeautifulSoup:
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # This will print the whole element along with tag
    # k

UnicodeEncodeError when typing Chinese into url

import urllib.request
import bs4
key_word = input('What is the good you are searching for?')
price_low_limit = input('What are the lowest price restrictions?')
price_high_limit = input('What are the highest price restrictions?')
url_jd = 'https://search.jd.com/search?keyword={}&enc=utf-8&qrst=2&rt=1&stop=1&vt=2&wq={}&ev=exprice_{}-{}%5E&uc=0#J_searchWrap'.format(key_word, key_word, price_low_limit, price_high_limit)
response = urllib.request.urlopen(url_jd)
text = response.read().decode()
html = bs4.BeautifulSoup(text, 'html.parser')
total_item_j = []
for information in html.find_all('div', {'class': "gl-i-wrap"}):
    for a in information.find_all('a', limit=1):
        a_title = a['title']
        a_href = a['href']
    for prices in information.find_all('i', limit=1):
        a = prices.text
    item_j = {}
    item_j['price'] = float(a)
    item_j['name'] = a_title
    item_j['url'] = a_href
    total_item_j.append(item_j)
print(total_item_j)
This is a school project. I want to use this program to extract the prices of the goods I search for. The code currently works for English input in Python 3.7. However, if I search in Chinese, for example '巧克力' (chocolate), it raises a UnicodeEncodeError. Please help me out.

You just want to ensure your string is encoded correctly. If you change key_word to:
key_word = u'巧克力'.encode('utf-8')
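You'll find the UnicodeEncodeError goes away. Note that in Python 3, formatting a bytes object into the URL inserts its b'...' representation, so the keyword still has to be URL-escaped for the search itself to be correct.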
So your code would look like:
import urllib.request
import bs4
key_word = input('What is the good you are searching for?')
key_word = key_word.encode('utf-8')
...
More on unicode in python here

If you look at the stack trace, you'll see something like:
# Non-ASCII characters should have been eliminated earlier
--> 983 self._output(request.encode('ascii'))
ASCII encoding will fail for the characters in your key_word variable.
They should be URL-escaped first. Do that with:
key_word = urllib.parse.quote_plus(key_word)
And then prepare the url_jd string.
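A short sketch of that escaping step, runnable without network access. URL_TEMPLATE is shortened from the question's URL for readability, and build_search_url is a hypothetical helper name; only urllib.parse.quote_plus comes from the answer above.

```python
from urllib.parse import quote_plus

# Shortened version of the question's URL (an assumption made for readability).
URL_TEMPLATE = "https://search.jd.com/search?keyword={}&enc=utf-8&wq={}"


def build_search_url(key_word):
    """Percent-encode the keyword as UTF-8 before placing it in the URL."""
    escaped = quote_plus(key_word)
    return URL_TEMPLATE.format(escaped, escaped)


url = build_search_url("巧克力")
print(url)

# The URL is now pure ASCII, so the encode('ascii') step seen in the
# stack trace has nothing left to reject.
url.encode("ascii")
```

quote_plus also turns spaces into +, which matters for multi-word searches; urllib.parse.urlencode would be an alternative when the whole query string is built from a dict.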

urlopen for loop with beautifulsoup

New user here. I'm starting to get the hang of Python syntax but keep getting thrown off by for loops. I understand each scenario I've read on SO so far (and my previous examples), but can't seem to come up with one for my current scenario.
I am playing around with BeautifulSoup to extract features from app stores as an exercise.
I created a list of both GooglePlay and iTunes urls to play around with.
list = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
"https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
"https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
"https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
"https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
"https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
"https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
"https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
"https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
"https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
"https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}
To test out beautifulsoup (bs in my code), I used one app for each store:
gptest = bs(urllib.urlopen("https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en"))
ios = bs(urllib.urlopen("https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8"))
I found an app's category on iTunes using:
print ios.find(itemprop="applicationCategory").get_text()
...and on Google Play:
print gptest.find(itemprop="genre").get_text()
With this newfound confidence, I wanted to try to iterate through my entire list and output these values, but then I realized I suck at for loops...
Here's my attempt:
def opensite():
    for item in list:
        bs(urllib.urlopen())
for item in list:
    try:
        if "itunes.apple.com" in row:
            print "Category:", opensite.find(itemprop="applicationCategory").get_text()
        else if "play.google.com" in row:
            print "Category", opensite.find(itemprop="genre").get_text()
    except:
        pass
Note: Ideally I'd be passing a csv (called "sample" with one column "URL") so I believe my loop would start with
for row in sample.URL:
but I figured it was more helpful to show you a list rather than deal with a data frame.
Thanks in advance!

from __future__ import print_function
# Support Python 2 and 3
try:
    from urllib import urlopen
except ImportError:
    from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
for line in open('urls.dat'):  # Read urls from file line by line
    doc = bs(urlopen(line.strip()), 'html5lib')  # Strip \n from url, open it and parse
    if 'apple.com' in line:
        prop = 'applicationCategory'
    elif 'google.com' in line:
        prop = 'genre'
    else:
        continue
    print(doc.find(itemprop=prop).get_text())
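The part of this loop that tends to trip people up is choosing the right property per URL inside a single for loop. Below is a sketch of just that decision, with the download and parsing left out so it runs offline. ITEMPROP_BY_HOST and itemprop_for are hypothetical names; the two property names are the ones given in the question.

```python
from urllib.parse import urlsplit

# Maps a store's host name to the itemprop that holds the category,
# using the two property names given in the question.
ITEMPROP_BY_HOST = {
    "itunes.apple.com": "applicationCategory",
    "play.google.com": "genre",
}


def itemprop_for(url):
    """Return the itemprop name to look up for this URL, or None."""
    host = urlsplit(url).hostname
    return ITEMPROP_BY_HOST.get(host)


urls = [
    "https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
    "https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
    "https://www.example.com/not-a-store",
]

for url in urls:
    prop = itemprop_for(url)
    if prop is None:
        continue  # unknown site: skip it, like the `else: continue` branch
    print(url, "->", prop)
```

Each pass through the loop handles exactly one URL: decide the property, then (in the full program) fetch, parse and print for that same URL before moving on.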

Try this for reading urls from list:
from bs4 import BeautifulSoup as bs
import urllib2
import requests
list = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
"https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
"https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
"https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
"https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
"https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
"https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
"https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
"https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
"https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
"https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}
def opensite():
    for item in list:
        bs(urllib2.urlopen(item),"html.parser")
        source = requests.get(item)
        text_new = source.text
        soup = bs(text_new, "html.parser")
        try:
            if "itunes.apple.com" in item:
                print item,"Category:",soup.find('span',{'itemprop':'applicationCategory'}).text
            elif "play.google.com" in item:
                print item,"Category:", soup.find('span',{'itemprop':'genre'}).text
        except:
            pass
opensite()
It will print like
https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en Category: Role Playing
https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en Category: Role Playing
https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en Category: Role Playing
https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en Category: Role Playing
https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en Category: Role Playing
https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en Category: Role Playing
https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8 Category: Games

For loops with user input on Python

Hello, I'm learning how to parse HTML with BeautifulSoup. I would like to know whether it is possible to use user input in a for loop, such as:
for (user input) in A
Here A is a list of links, and the user chooses which link to follow by entering input.
I then use urllib to open that link and repeat the process.

You can use something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# urls is your list of links (A in the question)
choice = ''
for url in urls:
    print('Go to {}?'.format(url))
    decision = input('Y/n ')
    if decision == 'Y':
        choice = url
        break
if choice:
    r = urlopen(choice).read()
    soup = BeautifulSoup(r, 'lxml')
    # do something else

It wasn't exactly clear to me if you really wanted to "open" the link in a browser, so I included some code to do that. Is this maybe what you wanted from "digit a position"?
tl;dr
print("Which URL would you like to open?"
      " (Please select an option between 1-{})".format(len(A)))
for index, link in enumerate(A):
    print(index+1, link)
Full:
from bs4 import BeautifulSoup
import requests
import webbrowser
A = [
    'https://www.google.com',
    'https://www.stackoverflow.com',
    'https://www.xkcd.com',
]
print("Which URL would you like to open?"
      " (Please select an option between 1-{})".format(len(A)))
for index, link in enumerate(A):
    print(index+1, link)
_input = input()
try:
    option_index = int(_input) - 1
except ValueError:
    print("{} is not a valid choice.".format(_input))
    raise
try:
    selection = A[option_index]
except IndexError:
    print("{} is not a valid choice.".format(_input))
    raise
webbrowser.open(selection)
response = requests.get(selection)
html_string = response.content
# Do parsing...
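The validation step can also be pulled into a small function so it is easy to test without typing at a prompt. This is a sketch under that assumption; parse_choice is a hypothetical helper, and it returns None instead of re-raising.

```python
def parse_choice(raw, options):
    """Turn the user's 1-based answer into an item of options, or None."""
    try:
        index = int(raw) - 1
    except ValueError:
        return None  # not a number
    if 0 <= index < len(options):
        return options[index]
    return None  # out of range (this also rejects 0 and negative numbers)


A = [
    'https://www.google.com',
    'https://www.stackoverflow.com',
    'https://www.xkcd.com',
]

print(parse_choice("2", A))
print(parse_choice("0", A))
print(parse_choice("abc", A))
```

The explicit range check matters because a plain A[int(raw) - 1] accepts an answer of 0: the index becomes -1, which Python treats as the last item rather than raising IndexError.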

Thanks for your help. I arrived at a solution.
I created two variables: count = int(input()) and position = int(input()).
I use count in a for loop, for _ in range(count), so the process repeats as many times as the user wants (4 in this assignment).
I use position (predefined as 3 in this assignment) as an index into a list of all the URLs. So to open the URL at position 3 I have:
url = links[position-1] (minus 1 because the user enters 3, but list indices start at 0: 0, 1, 2, ...)
Then I can use urllib.request.urlopen(url).read()
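Put together, the approach described here might look like the sketch below. To keep it runnable offline, the step "download the page and collect its links" is passed in as a function; follow_links, get_links and FAKE_SITE are all hypothetical names, and FAKE_SITE is invented data standing in for real pages.

```python
def follow_links(start_links, count, position, get_links):
    """Follow the link at 1-based `position`, `count` times.

    get_links(url) must return the list of links found on that page.
    Returns the URLs visited, in order.
    """
    links = start_links
    visited = []
    for _ in range(count):
        url = links[position - 1]  # the user counts from 1, lists from 0
        visited.append(url)
        links = get_links(url)
    return visited


# Hypothetical stand-in for fetching a page and extracting its links.
FAKE_SITE = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
}

print(follow_links(["a", "b", "c"], count=4, position=3, get_links=FAKE_SITE.get))
```

In the real program get_links would call urllib.request.urlopen(url).read(), parse the result, and return the href of every link on the page.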

BeautifulSoup Python script no longer working for mining a simple field

The script used to work but no longer does, and I can't figure out why. I am trying to go to the link and extract and print the religion field. According to Firebug, the religion entry sits inside a 'tbody' and then a 'td' tag. But now the script finds None when searching for these tags. I also looked at the parsed document with 'print Soup_FamSearch' and couldn't see any of the 'tbody' and 'td' tags that appear in Firebug.
Please let me know what I am missing.
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize
FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()
tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')
for trTag in trTags:
    tdTags_label = trTag.find('td', class_='result-label ')
    if tdTags_label:
        tdTags_label_string = tdTags_label.get_text(strip=True)
        if tdTags_label_string == 'Religion: ':
            print trTag.find('td', class_='result-value ')

Find the Religion: label by text and get the next td sibling:
soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>>
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Methodist
Then you can wrap this in a reusable function:
def get_field_value(soup, field):
    return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)
print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')
Prints:
Methodist
Canadian
Ontario
