python print() doesn't output what I expect

I made a small web-crawler in one function, upso_final.
If I print(upso_final()), I get 15 lists that include title, address, and phone number. However, I want to print only the title, so I made the variable title a global string. When I print it, I get only one title, the last one from the run. I want to get all 15 titles.
from __future__ import unicode_literals
import requests
from scrapy.selector import Selector
import scrapy
import pymysql

def upso_final(page=1):
    def upso_from_page(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        global title, address, phone
        title = sel.css('h1::text').extract()
        address = sel.css('address::text').extract()
        phone = sel.css('.mt1::text').extract()
        return {
            'title': title,
            'address': address,
            'phone': phone
        }

    def upso_list_from_listpage(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        upso_list = sel.css('.title_list::attr(href)').extract()
        return upso_list

    def fetch_page(url):
        r = requests.get(url)
        return r.text

    list_url = "http://yp.koreadaily.com/list/list.asp?page={0}&bra_code=LA&cat_code=L020502&strChar=&searchField=&txtAddr=&txtState=&txtZip=&txtSearch=&sort=N".format(page)
    upso_lists = upso_list_from_listpage(list_url)
    upsos = [upso_from_page(url) for url in upso_lists]
    return upsos

upso_final()
print(title, address, phone)

The basic problem is that you're confused about passing values back from a function.
upso_from_page finds each of the 15 records in turn, placing the desired information in the global variables (generally a bad design). However, the only time you print any results is after you've found all 15. Since your logic has each record overwriting the previous one, you print only the last one you found.
It appears that upso_final accumulates the list and returns it, but you ignore that return value. Instead, try this in your main program:
upso_list = upso_final()
for upso in upso_list:
    print(upso)
This should give you a 3-item dictionary for each upso record; from there, you can learn the referencing and format to your taste.
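For example, to print only the titles, index into each dictionary (a small sketch; each value is the list that .extract() returned, so join it if you want a plain string):
for upso in upso_list:
    print(' '.join(upso['title']))  # or print(upso['title']) to keep the list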
An alternative solution is to print each record as you find it, from within upso_from_page, but your overall design suggests that's not what you want.

Related

Python function returning None despite having return statement

The function below is returning None despite having a return statement. This seems like a simple problem, but I am unable to figure out the solution as a Python beginner. The findurls function works perfectly, but the second function, murls, seems to have a problem.
import requests
from bs4 import BeautifulSoup

# (headers is defined elsewhere in the original script)
def findurls(url):
    s = requests.get(url, headers=headers)
    txt = BeautifulSoup(s.text, 'lxml')
    page = []
    for link in txt.findAll('a'):
        page.append(link.get('href'))
    return s, page

def murls(page):
    match = ['contact','contact us','contact-us','Contact Us','Contact us', 'Contact', 'Contact US','contactus','ContactUS','ContactUs']
    matching = [n for n in match if any(n in i for i in page)]
    return matching

details = murls(findurls("https://www.genre.com/"))
print(details)
The output generated by the findurls function is as follows:
['https://globalpage-prod.webex.com/join', 'http://www.genre.com/clientlogin/?c=n', 'http://www.genre.com/?c=n', '#nav', '#', 'https://www.genre.com/reinsurance-solutions/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/na/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/international/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/property-engineering-marine/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/auto-motor/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/surety-bond/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/casualty/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/na/?c=n', 'https://www.genre.com/reinsurance-solutions/lifehealth/international/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/property-engineering-marine/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/auto-motor/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/surety-bond/?c=n', 'https://www.genre.com/reinsurance-solutions/property-casualty/casualty/?c=n', 'https://www.genre.com/knowledge/?c=n', 'https://www.genre.com/knowledge/all/?c=n', 'https://www.genre.com/knowledge/publications/?c=n', 'https://www.genre.com/knowledge/blog/?c=n', 'https://www.genre.com/knowledge/multimedia/?c=n', 'https://www.genre.com/knowledge/all/?c=n', 'https://www.genre.com/knowledge/publications/?c=n', 'https://www.genre.com/knowledge/blog/?c=n', 'https://www.genre.com/knowledge/multimedia/?c=n', 'https://www.genre.com/contactus/?c=n', 'https://www.genre.com/careers/?c=n', 'https://www.genre.com/careers/job-posting/?c=n', 'https://www.genre.com/careers/recent-graduates/?c=n', 'https://www.genre.com/careers/internships/?c=n', 'https://www.genre.com/careers/job-posting/?c=n', 'https://www.genre.com/careers/recent-graduates/?c=n', 'https://www.genre.com/careers/internships/?c=n', 'https://www.genre.com/aboutus/?c=n', 'https://www.genre.com/aboutus/meet-genre/?c=n', 'https://www.genre.com/aboutus/senior-management-team/?c=n', 'https://www.genre.com/aboutus/financial-info/?c=n', 'https://www.genre.com/aboutus/press-releases/?c=n', 'https://www.genre.com/aboutus/privacy-at-genre/?c=n', 'https://www.genre.com/aboutus/meet-genre/?c=n', 'https://www.genre.com/aboutus/senior-management-team/?c=n', 'https://www.genre.com/aboutus/financial-info/?c=n', 'https://www.genre.com/aboutus/press-releases/?c=n', 'https://www.genre.com/aboutus/privacy-at-genre/?c=n', '/knowledge/blog/wildfire-season-is-here-underwriting-factors-and-tools-for-the-wildfire-peril-en.html', '/knowledge/blog/wildfire-season-is-here-underwriting-factors-and-tools-for-the-wildfire-peril-en.html', 'https://www.genre.com/knowledge/blog/wildfire-season-is-here-underwriting-factors-and-tools-for-the-wildfire-peril-en.html', 'https://www.genre.com/knowledge/blog/contributors/marc-dahling.html?contributorTabSearch=blogPosts', '/knowledge/blog/what-does-the-us-supreme-courts-recent-lgbtq-ruling-mean-for-businesses-and-epli-en.html', '/knowledge/blog/what-does-the-us-supreme-courts-recent-lgbtq-ruling-mean-for-businesses-and-epli-en.html', '/knowledge/blog/individual-disability-in-the-us-behind-the-numbers-en.html', 
'/knowledge/blog/individual-disability-in-the-us-behind-the-numbers-en.html', 'https://www.genre.com/knowledge/blog/individual-disability-in-the-us-behind-the-numbers-en.html', 'https://www.genre.com/knowledge/blog/contributors/steve-woods.html?contributorTabSearch=blogPosts', '/knowledge/publications/cmchina20-1-en.html', '/knowledge/publications/cmchina20-1-en.html', 'https://www.genre.com/knowledge/publications/cmchina20-1-en.html', 'https://www.genre.com/knowledge/blog/contributors/frank-wang.html?contributorTabSearch=blogPosts', '/knowledge/blog/contributors/', '/contactus/', 'https://cta-redirect.hubspot.com/cta/redirect/525060/3d7afa2a-d966-40c4-860a-07709aacf6cd', '#tab1', '#tab2', '#tab3', '#tab1', '/knowledge', 'https://www.genre.com/knowledge/publications/uwfocus20-1-luckmann-en.html', 'https://www.genre.com/knowledge/publications/uwfocus20-1-luckmann-en.html', 'https://www.genre.com/knowledge/blog/contributors/annika-luckmann.html', 'https://www.genre.com/knowledge/publications/uwfocus20-1-luckmann-en.html', 'https://www.genre.com/knowledge/blog/contributors/tim-fletcher.html', "javascript:trackRecommentedBlog('https://www.genre.com/knowledge/blog/riots-and-civil-commotion-disquieting-times-ahead-en.html')", "javascript:trackRecommentedBlog('https://www.genre.com/knowledge/blog/riots-and-civil-commotion-disquieting-times-ahead-en.html')", 'https://www.genre.com/knowledge/blog/contributors/tim-eppert.html', "javascript:trackRecommentedBlog('https://www.genre.com/knowledge/blog/changes-in-cancer-classification-how-do-they-impact-critical-illness-insurance-en.html')", "javascript:trackRecommentedBlog('https://www.genre.com/knowledge/blog/changes-in-cancer-classification-how-do-they-impact-critical-illness-insurance-en.html')", 'https://twitter.com/Gen_Re', '#tab2', '/reinsurance-solutions/#tab=-1', '/reinsurance-solutions/lifehealth/na/', '/reinsurance-solutions/lifehealth/na/', '/reinsurance-solutions/lifehealth/na/', '/reinsurance-solutions/lifehealth/international/', '/reinsurance-solutions/lifehealth/international/', '/reinsurance-solutions/lifehealth/international/', '/reinsurance-solutions/#tab=0', '/reinsurance-solutions/property-casualty/auto-motor/', '/reinsurance-solutions/property-casualty/auto-motor/', '/reinsurance-solutions/property-casualty/auto-motor/', '/reinsurance-solutions/property-casualty/casualty/', '/reinsurance-solutions/property-casualty/casualty/', '/reinsurance-solutions/property-casualty/casualty/', '/reinsurance-solutions/property-casualty/property-engineering-marine/', '/reinsurance-solutions/property-casualty/property-engineering-marine/', '/reinsurance-solutions/property-casualty/property-engineering-marine/', '/reinsurance-solutions/property-casualty/surety-bond/', '/reinsurance-solutions/property-casualty/surety-bond/', '/reinsurance-solutions/property-casualty/surety-bond/', '#tab3', 'https://www.genre.com/knowledge/blog/contributors/sandra-mitic.html', 'https://www.genre.com/knowledge/blog/contributors/sandra-mitic.html', 'https://www.genre.com/knowledge/blog/contributors/roman-hannig.html', 'https://www.genre.com/knowledge/blog/contributors/roman-hannig.html', '/careers/', '/terms/', '/sitemap/', '/imprint/', '/aboutus/privacy-at-genre/', 'http://www.genre.com/?c=n', 'http://www.linkedin.com/company/gen-re', 'https://twitter.com/Gen_Re', 'https://www.youtube.com/user/GenRePerspective/playlists', 'http://www.slideshare.net/genreperspective', 'https://www.genre.com/reinsurance-solutions/', 
'https://www.genre.com/reinsurance-solutions/lifehealth/', 'https://www.genre.com/reinsurance-solutions/lifehealth/na/', 'https://www.genre.com/reinsurance-solutions/lifehealth/international/', 'https://www.genre.com/reinsurance-solutions/property-casualty/', 'https://www.genre.com/reinsurance-solutions/property-casualty/property-engineering-marine/', 'https://www.genre.com/reinsurance-solutions/property-casualty/auto-motor/', 'https://www.genre.com/reinsurance-solutions/property-casualty/surety-bond/', 'https://www.genre.com/reinsurance-solutions/property-casualty/casualty/', 'https://www.genre.com/knowledge/', 'https://www.genre.com/knowledge/all/', 'https://www.genre.com/knowledge/publications/', 'https://www.genre.com/knowledge/blog/', 'https://www.genre.com/knowledge/multimedia/', 'http://knowledge.genre.com/subscribe?utm_campaign=Subscription%20Management%20Center&utm_medium=footer&utm_source=website', 'https://www.genre.com/contactus/', 'mailto:Genre_Feedback_EN#genre.com?subject=Reg: Gen Re Website Feedback', 'https://www.genre.com/careers/', 'https://www.genre.com/careers/job-posting/', 'https://www.genre.com/careers/recent-graduates/', 'https://www.genre.com/careers/internships/', 'https://www.genre.com/aboutus/', 'https://www.genre.com/aboutus/meet-genre/', 'https://www.genre.com/aboutus/senior-management-team/', 'https://www.genre.com/aboutus/financial-info/', 'https://www.genre.com/aboutus/press-releases/', 'https://www.genre.com/aboutus/privacy-at-genre/'])
Whereas when I use both functions together, it produces the output below, an empty list:
[]
Thanks !!
findurls returns two objects
return s, page
but murls only wants one, page.
Option 1: split the call across two lines so you can choose which value to pass to murls.
s, page = findurls("https://www.genre.com/")
details = murls(page)
print(details)
Option 2: use indexing to pick out the second item from the tuple.
details = murls(findurls("https://www.genre.com/")[1])
print(details)
The problem is in how you call murls. It expects a page (the list of links), but you are passing it the whole (response, page) tuple that findurls returns, so none of your match strings can be found the way the comprehension checks them, and matching ends up empty. That is why you get an empty list: unpack or index the tuple before calling murls, as shown above.

Confused regarding the usage of __init__ & self in a class

I first wrote the necessary code to get the information I wanted from the internet, and it works. But now I'm trying to make the code look a bit nicer, so I want to put it into functions inside a class. But I'm a bit confused when it comes to the usage of self and __init__. Currently, the code isn't working as I want: it isn't adding the information to my dictionary.
As I understand it, you have to add self as a parameter to every function you create in a class. But I don't think I'm using __init__ in a correct way.
from bs4 import BeautifulSoup
import requests

# Importing data from Nasdaq
page_link = "https://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet"
page_response = requests.get(page_link, timeout=1000)
page_content = BeautifulSoup(page_response.content, "lxml")

# Creating class that gathers essential stock information
class CompanySheet:
    # creating dictionary to store stock information
    def __init__(self):
        self.stockInfo = {
            "ticker": "",
            "sharePrice": "",
            "assets": "",
            "liabilities": "",
            "shareholderEquity": ""
        }

    def ticker(self):
        # Finding ticker
        self.tickerSymbol = page_content.find("div", attrs={"class": "qbreadcrumb"})
        self.a_TickerList = self.tickerSymbol.findAll("a")
        self.a_TickerList = (self.a_TickerList[2].text)
        # Adding ticker to dictionary
        self.stockInfo["ticker"] = self.a_TickerList
        print(self.a_TickerList)

    def share(self):
        # Finding share price
        self.sharePrice = page_content.findAll("div", attrs={"id": "qwidget_lastsale"})
        self.aSharePrice = (self.sharePrice[0].text)
        # Transforming share price to desired format
        self.aSharePrice = str(self.aSharePrice[1:]).replace(',', '')
        self.aSharePrice = float(self.aSharePrice)
        # Adding results to dictionary
        self.stockInfo["sharePrice"] = self.aSharePrice

    """
    def assets(self):
        # Finding total assets
        totalAssets = page_content.findAll("tr", attrs={"class": "net"})[1]
        td_assetList = totalAssets.findAll("td")
        tdAssets = (td_assetList[22].text)
        # Transforming total assets to desired format
        tdAssets = str(tdAssets[1:]).replace(',', '')
        tdAssets = float(tdAssets)
        # Adding results to dictionary
        self.stockInfo["assets"] = tdAssets

    def liabilites(self):
        # Finding total liabilities
        totalLiabilities = page_content.findAll("tr", attrs={"class": "net"})[3]
        td_liabilityList = totalLiabilities.findAll("td")
        tdLiabilities = (td_liabilityList[24].text)
        # Transforming total liabilities to desired format
        tdLiabilities = str(tdLiabilities[1:]).replace(',', '')
        tdLiabilities = float(tdLiabilities)
        # Adding results to dictionary
        self.stockInfo["liabilities"] = tdLiabilities

    def equity(self):
        # Finding shareholder equity
        netEquity = page_content.findAll("tr", attrs={"class": "net"})[4]
        td_equityList = netEquity.findAll("td")
        tdShareholderEquity = (td_equityList[24].text)
        # Transforming shareholder equity to desired format
        tdShareholderEquity = str(tdShareholderEquity[1:]).replace(',', '')
        tdShareholderEquity = float(tdShareholderEquity)
        # Adding results to dictionary
        self.stockInfo["shareholderEquity"] = tdShareholderEquity
    """

companySheet = CompanySheet()
print(companySheet.stockInfo)
All I want the code to do is for each function to parse its information into my dictionary. I then want to access it outside of the class. Can someone help clarify how I can use __init__ in this scenario, or do I even have to use it?
__init__ is the constructor, which is called when an instance of the class is created, whereas self refers to that instance and is used to access the methods and attributes of a Python class.
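As a quick illustration of the two (a toy example, not from the question):
class Counter:
    def __init__(self):      # runs once, when Counter() is called
        self.count = 0       # self.<name> attaches data to this instance

    def increment(self):     # self gives the method access to that data
        self.count += 1

c = Counter()    # __init__ fires here
c.increment()
print(c.count)   # 1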
In your code, firstly change:
_init_(self) to __init__(self)
Then, in the methods:
def share(self):
    # Finding share price
    sharePrice = page_content.findAll("div", attrs={"id": "qwidget_lastsale"})
    self.aSharePrice = (sharePrice[0].text)
    # Transforming share price to desired format
    self.aSharePrice = str(self.aSharePrice[1:]).replace(',', '')
    self.aSharePrice = float(self.aSharePrice)
    # Adding results to dictionary
    self.stockInfo["sharePrice"] = self.aSharePrice
Similarly, in all the remaining methods, access the variable through the self keyword.
Now, you also need to call the methods which are updating your dictionary.
So, after you have created the object, call the methods through the object and then print the dictionary, like this:
companySheet = CompanySheet()
companySheet.share()
print(companySheet.stockInfo)
That should work!
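If you want the other fields filled in as well, the pattern is the same: call each method on the instance before printing (including the commented-out methods, once they're fixed the same way):
companySheet = CompanySheet()
companySheet.ticker()  # fills stockInfo["ticker"]
companySheet.share()   # fills stockInfo["sharePrice"]
print(companySheet.stockInfo)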

monitoring a text site (json) using python

I'm working on a program to grab variant IDs from this website:
https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json
I'm using this code:
import json
import requests
import time

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)

for id in reqJson['product']:
    name = (id['title'])
    print(name)
I don't know what to do here in order to grab the names of the items. If you visit the link you will see that the name is under 'title'. If you could help me with this, that would be awesome.
I get the error message "TypeError: string indices must be integers", so I'm not too sure what to do.
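The traceback itself comes from the loop: reqJson['product'] is a dict, so iterating it yields its key strings, and id['title'] then indexes a string with a string. A minimal sketch of the likely fix, assuming the standard Shopify product JSON shape ({"product": {"title": ..., "variants": [...]}}):
# reqJson['product'] is a dict; iterating a dict yields its keys (strings),
# and 'somestring'['title'] raises "string indices must be integers".
# Index the dict directly instead (field names assume Shopify's layout):
product = reqJson['product']
print(product['title'])                     # the product name
for variant in product.get('variants', []):
    print(variant['id'], variant['title'])  # the variant IDs being sought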
Your biggest problem right now is that you are adding items to the list before you're checking if they're in it, so everything is coming back as in the list.
Looking at your code right now, I think what you want to do is combine things into a single for loop.
Also, as a heads up, you shouldn't use a variable name like list, as it shadows the built-in Python type list().
list = []  # You really should change this to something else

def check_endpoint():
    endpoint = ""
    req = requests.get(endpoint)
    reqJson = json.loads(req.text)
    for id in reqJson['threads']:  # For each id in threads list
        PID = id['product']['globalPid']  # Get current PID
        if PID in list:
            print('checking for new products')
        else:
            title = (id['product']['title'])
            Image = (id['product']['imageUrl'])
            ReleaseType = (id['product']['selectionEngine'])
            Time = (id['product']['effectiveInStockStartSellDate'])
            send(title, PID, Image, ReleaseType, Time)
            print('added {} to database'.format(PID))
            list.append(PID)  # Add PID to the list

def main():
    while True:
        check_endpoint()
        time.sleep(20)

if __name__ == "__main__":
    main()

Parsing already parsed results with BeautifulSoup

I have a question about using Python and BeautifulSoup.
My end program basically fills out a form on a website and brings back the results, which I will eventually output to an lxml file. I'll be taking the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS and I want to get a list for every city, all in some Excel documents.
Here is my code, I put it on pastebin:
http://pastebin.com/bZJfMp2N
MY RESULTS ARE ALMOST GOOD :D except now I'm getting "           355" for my "correct value" instead of "355", for example. I want to parse that and show only the number; you will see what I mean when you run this in Python.
However, anything I have tried does NOT work. There is no way I can parse that values_2 variable, because the results are a bs4.element.ResultSet when I think I need to parse a string. Sorry if I am a noob, I am still learning and have worked very long on this program.
Would anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and I can't parse lists? How would I go about doing this?
Here is the code:
__author__ = 'kennytruong'
#THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup

URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"

#Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n')  #strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)

INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',')  #strips the whitespaces and starts a new list item at every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)

COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)

HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)

def get_premiums(location, coverage_type, coverage_amt, home_age):
    formEntries = {'location': location,
                   'coverageType': coverage_type,
                   'coverageAmount': coverage_amt,
                   'homeAge': home_age}
    inputData = urllib.parse.urlencode(formEntries)
    inputData = inputData.encode('utf-8')
    request = urllib.request.Request(URL, inputData)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    soup = BeautifulSoup(responseData, "html.parser")
    parseResults = soup.find_all('tr', {'valign': 'top'})
    for eachthing in parseResults:
        parse_me = eachthing.text
        name = re.findall(r'[A-z].+', parse_me)  #find me all the words that start with a cap, as many and it doesn't matter what kind.
                                                 # the . for any character and + to signify 1 or more of it.
        values = re.findall(r'\d{1,10}', parse_me)  #find me any digits, however many #'s long as long as btwn 1 and 10
        values_2 = eachthing.find_all('div', {'align': 'right'})
        print('raw code for this part:\n', eachthing, '\n')
        print('here is the name: ', name[0], values)
        print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
        print('but here is the correct values - ', values_2)  #NEEDA STRIP THESE VALUES
        # print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
        # values_3 = re.split(r'\d', values_2)
        # print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
        print('\n\n')

def main():
    for location in LOCATIONS:  #seems to be looping the variable location in LOCATIONS - each location is one area
        print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
        get_premiums(location, "HOMEOWNERS", "150000", "New")  #calls function get_premiums and passes parameters

if __name__ == "__main__":  #this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
    main()
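One plausible way to clean values_2: it is a bs4 ResultSet of Tag objects rather than a string, but each Tag has get_text(), whose strip=True option trims the surrounding whitespace. A minimal sketch, meant to slot into the for eachthing loop above:
# Each item in values_2 is a bs4 Tag; get_text(strip=True) removes the
# surrounding whitespace, turning '           355' into '355'.
cleaned = [div.get_text(strip=True) for div in values_2]
print('stripped values -', cleaned)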

Same code used in multiple functions but with minor differences - how to optimize?

This is the code of a Udacity course, and I changed it a little. Now, when it runs, it asks me for a movie name and the trailer would open in a pop up in a browser (that's another part, which is not shown).
As you can see, this program has a lot of repetitive code: the functions extract_name, movie_poster_url and movie_trailer_url share much of the same code. Is there a way to get rid of the repetition but keep the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup

name = raw_input("Enter movie name:- ")
global movie_name

def extract_html(name):
    url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))
    return page

def extract_name(page):
    start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
    start_url = page.find('>', start_link-140)
    start_url1 = page.find('>', start_link-140)
    end_url = page.find(' - IMDb</a>', start_link-140)
    name_of_movie = page[start_url1+1:end_url]
    return extract_char(name_of_movie)

def extract_char(name_of_movie):
    name_array = []
    for words in name_of_movie:
        word = words.strip('</b>,')
        name_array.append(word)
    return ''.join(name_array)

def movie_poster_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
    start_link = page.find('"Poster":')
    start_url = page.find('"', start_link+9)
    end_url = page.find('"', start_url+1)
    poster_url = page[start_url+1:end_url]
    return poster_url

def movie_trailer_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another website name' + movie_name + " trailer").read()
    start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
    start_url = page.find('"', start_link+110)
    end_url = page.find('" ', start_url+1)
    trailer_url1 = page[start_url+1:end_url]
    trailer_url = "www.youtube.com" + trailer_url1
    return trailer_url

page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:
def find_page(url, name, find, offset):
    movie_name, seperator, tail = name.partition(' (')
    page = urllib.urlopen(url.format(movie_name)).read()
    start_link = page.find(find)
    start_url = page.find('"', start_link+offset)
    end_url = page.find('" ', start_url+1)
    return page[start_url+1:end_url]

def movie_poster_url(name_of_movie):
    return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)

def movie_trailer_url(name_of_movie):
    trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
    return "www.youtube.com" + trailer_url
It definitely won't run faster (there is extra work to do to "switch" between the functions), but the performance difference is probably negligible.
For your second question: profiling is not a technique or method, it's "finding out what's bad" in your code:
Profiling is a form of
dynamic program analysis that measures, for example, the space
(memory) or time complexity of a program, the usage of particular
instructions, or the frequency and duration of function calls.
(wikipedia)
So it's not something that speeds up your program, it's a word for things you do to find out what you can do to speed up your program.
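In Python, the standard library's cProfile module is the usual starting point; a minimal sketch, profiling one representative call from the question's code:
import cProfile

# Run one call under the profiler and sort the report by cumulative time
# to see which functions dominate the runtime.
cProfile.run("movie_poster_url(movie_name)", sort="cumulative")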
Going really quickly here because I am a super newb, but I can see the repetition; what I would do is figure out the (mostly) repeating blocks of code shared by all 3 functions, then figure out where they differ, and write a new function that takes the differences as arguments. So, for instance:
def extract(tarString, delim, startDiff, endDiff):
    start_link = page.find(tarString)
    start_url = page.find(delim, start_link+startDiff)
    end_url = page.find(delim, start_url+endDiff)
    return page[start_url+1:end_url]
Then, in your poster, trailer, etc. functions, just call this extract function with the appropriate arguments for each case. For instance, the poster would call:
poster_url = extract(tarString='"Poster":', delim='"', startDiff=9, endDiff=1)
I can see you've got another answer already, and it's very likely written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.
