I'm writing a program to find song lyrics. The program is nearly done, but I have a little problem with a bs4 data type. My question is: how do I extract plain text from the lyric variable at the end of the code below?
import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search('songmeanings\.com\/[^image].*?\/"', page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:])  # this line is used to remove the " at the end of the match
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content, "lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print(lyric)
and here is the output:
[\n\t\t\t\t\t\tPlease could you stop the noise,\nI'm trying to get some rest\nFrom all the unborn chicken voices in my head\nWhat's that?\nWhat's that?\n\nWhen I am king, you will be first against the wall\nWith your opinion which is of no consequence at all\nWhat's that?\nWhat's that?\n\nAmbition makes you look pretty ugly\nKicking and squealing Gucci little piggy\nYou don't remember\nYou don't remember\nWhy don't you remember my name?\nOff with his head, man\nOff with his head, man\nWhy don't you remember my name?\nI guess he does\n\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height,\nRain down, rain down\nCome on rain down on me\n\nThat's it, sir\nYou're leaving\nThe crackle of pigskin\nThe dust and the screaming\nThe yuppies networking\nThe panic, the vomit\nThe panic, the vomit\nGod loves his children,\nGod loves his children, yeah!\nEdit Lyrics\nEdit Wiki\nAdd Video\n ]
Add the following line of code:
lyric = ''.join([tag.text for tag in lyric])
after
lyric = Soup.findAll(attrs={"lyric-box"})
and you'll get output something like:
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...
Another approach: first trim the leading and trailing [] by doing stringvar[1:-1], then call linevar.strip() on each line, which will strip off all that whitespace.
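For example, here is a small sketch of that string-based approach (the sample string below just stands in for str(lyric) from the question):

# `lyric_str` stands in for str(lyric) -- the bracketed blob shown in the question.
lyric_str = "[\n\t\t\t\t\t\tPlease could you stop the noise,\nI'm trying to get some rest\n ]"
trimmed = lyric_str[1:-1]                                  # drop the leading '[' and trailing ']'
lines = [line.strip() for line in trimmed.splitlines()]   # strip the tabs/extra whitespace on each line
print('\n'.join(line for line in lines if line))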
For those who like the idea: with some small changes, my code finally looks like this :)
import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO

def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToPage)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR, "lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern, tab_content)
    """returns first mached item without double quote
    at the beginning and at the end of the string"""
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToSong)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR, "lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)

link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
I have no experience whatsoever in coding but wanted to get this code snippet here working:
import re
import sys
import json
import GeoIP
import urllib
import string
import requests

gi = GeoIP.open("GeoLiteCity.dat",GeoIP.GEOIP STANDARD)
r = requests.get('http://lichess.org/stream', stream=True)
buff = ''
pattern = re.compile(sys.argv[1] + '.{30}')
for content in r.iter content():
    if content:
        buff = buff + content
        if len(buff) > 1000:
            result keys = re.findall(pattern, buff)
            for el in result keys:
                result = string.split(el)
                print result[0], result[1], result[2][:-8], gi.record by addr(result[2][:-8])['country name'],
                gi.record by addr(result[2][:-8])['region name'], gi.record by addr(result[2][:-8])['city']
            buff = buff[-30:]
The interpreter tells me there is invalid syntax on line 9, where it says STANDARD.
I found this code while looking for a way to find out the IP address of a user based on the ID of a game on a chess site called lichess.org. I sort of expect some changes will be necessary, given that this code was posted 7 years ago and lichess has changed certain things.
The OP of the thread where I found this additionally gave this advice:
usage: getip.py owlc08je
where getip.py is your script name and "owlc08je" is the ID of a game. If someone makes a move in that game, their IP, country, and city are printed to the console.
However, it does not work.
Thanks in advance
Edited code with underscores and changes:
import re
import sys
import json
import GeoIP
import urllib
import string
import requests

gi = GeoIP.open("GeoLiteCity.dat", GeoIP.GEOIP_STANDARD)
r = requests.get('http://lichess.org/stream', stream=True)
buff = ''
pattern = re.compile(sys.argv[1] + '.{30}')
for content in r.iter_content():
    if content:
        buff = buff + content
        if len(buff) > 1000:
            result_keys = re.findall(pattern, buff)
            for el in result_keys:
                result = string.split(el)
                print(result[0], result[1], result[2][:-8], gi.record_by_addr(result[2][:-8])['country_name'],
                      gi.record_by_addr(result[2][:-8])['region_name'], gi.record_by_addr(result[2][:-8])['city'])
            buff = buff[-30:]
I think you are missing an underscore between GEOIP and STANDARD.
Replacing Line 9 with this should probably solve the issue:
gi = GeoIP.open("GeoLiteCity.dat",GeoIP.GEOIP_STANDARD)
EDIT:
As mentioned in one of the comments, if there are other places where underscores have been left out, restoring those as well should solve the issue.
I signed up to this website just to ask this question as I have been searching for hours over multiple days and haven't found anything.
I am trying to, within 10 seconds, scrape the 2-3 characters from 5 websites, combine them, and paste them into a box.
I have a rough idea of what I would need, but no idea how to go about this.
I believe I want to assign the scraped contents of each website to variables and then print the combination of those variables for me to copy and paste.
I'm not an expert by any means in Python, so if possible, a copy/pasteable script would be great.
The websites are:
https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
Keeping this up now only because I cannot take it down. Thank you to those who have helped, I hope this helps someone else.
Sorry for being dumb
Thing is, I've done the code and tried it. It works, but that isn't the answer to the question. Getting the characters from the links and putting them together doesn't work. I've tried many things and I am still working it out myself. My advice, work it out yourself. It's a lot more rewarding and will probably help for future parts of the competition. Also, if you ever think about removing all of the 'a's from the code, that doesn't work either. I tried.
To answer your Stack Overflow question, here is the code (you need to install the 'requests' Python module first):
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
print(page1text + page2text + page3text + page4text + page5text)
But this method doesn't answer challenge 14.
I know the answer to the question, but instead of giving the code to complete it, I'll tell you one of the ways you might find it, as I completed that question myself.
When you asked this question, you completely forgot to mention that there was a sixth link: https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=%3Cclock%20pts%3E
Notice that at the end of that hyperlink it says 'clock pts', whereas all the other links have something like clock-pt1 or clock-pt4. What if 'clock pts' refers to all of the different links at once? In other words, you have to create a string out of all the previous links you've been given and replace the 'clock pts' in the string section of the hyperlink WITH the string you made from the separate links, which would then give you the code to complete the level.
Below is the code I used to get the answer. It requires the requests module, in case you want to use it. (Also, I'm not 100% certain it will work every time: since the challenge is based on a timer, the program may not get all the strings in time before the clock changes, so make sure to run the program right after the timer has reset.)
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
code=(page1text + page2text + page3text + page4text + page5text)
page6= "https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string="+code
page6_content = requests.get(page6)
print(page6_content.text)
I have done something very similar with just as poor results at the end. I did, however, leave this running for a while and noticed that the clock follows a pattern. Some time ago the clock read all as "aaaaaaaaaaaaaaa", then "aBaa1aafaa2aa3a" and "aDaafaaHaajaala". I'm going to wait for a full list and try suggesting the next clock sequence in the final URL. I'll get back to you if this works; just something to think about.
Also, for help installing modules I suggest:
https://programminghistorian.org/lessons/installing-python-modules-pip
&
https://docs.python.org/3/installing/index.html
import requests

abc = ""
while 1 == 1:
    page1 = requests.get('your first link')
    page2 = requests.get('your second link')
    page3 = requests.get('your third link')
    page4 = requests.get('your fourth link')
    page5 = requests.get('your fifth link')
    text = page1.text+page2.text+page3.text+page4.text+page5.text
    # abc1 = "the verify link except clock pts is replaced with "+"text>" so the end looks like this :string=<"+text+">"
    abc1 = text
    if abc1 != abc:
        print(abc1)
        abc = abc1
Edit
The clock runs in 15-minute cycles with 90 codes altogether. I'm not sure how this helps as of yet, but I'm just posting ideas. I had to make some changes to get the codes to output cleanly, and here is my improved version (this is very messy, sorry):
import requests

abc = ""
page1 = requests.get('your first link')
page2 = requests.get('your second link')
page3 = requests.get('your third link')
page4 = requests.get('your fourth link')
page5 = requests.get('your fifth link')
while 1 == 1:
    page12 = requests.get('your first link')
    page22 = requests.get('your second link')
    page32 = requests.get('your third link')
    page42 = requests.get('your fourth link')
    page52 = requests.get('your fifth link')
    if page1.text != page12.text and page2.text != page22.text and page3.text != page32.text and page4.text != page42.text and page5.text != page52.text:
        text = page12.text+page22.text+page32.text+page42.text+page52.text
        abc1 = text
        # abc1 = * your url for verification with * string=<"+text+">"
        if abc1 != abc:
            print(abc1)
            abc = abc1
        page1 = page12
        page2 = page22
        page3 = page32
        page4 = page42
        page5 = page52
Final edit
I had spent so long going down the path of figuring out how they made the task, and I was doing way too much work. When submitting the final URL, your solution should replace the whole <clock pts> section and NOT go inside the <>, so yours should look like https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=*this is an identifier*&string=*the string you get*
I completed the challenge. I used an Excel spreadsheet with functions to get all the little code pieces from every clock cycle and put them together to make one code every 10 seconds. Sorry if that doesn't make sense; I'm not sure how to explain it. Then I pasted this into the end of the "validation link" to replace the <clock pts> at the end of the URL. I had to do this very fast, before the clock reset. Very stressful, haha. Eventually I did this in time and it gave me the code. I hope this helps.
But you'll have to figure out how to get all the codes together in under 10 seconds by yourself, otherwise this is basically cheating, right?
I'm trying to use the script below to go through a list of URLs and, for each URL, find the date and the location of the race. I am getting an IndexError (list index out of range), but I know that the lists I'm iterating over are all the same length, so these errors don't make sense. Also, when running through PyCharm I get the IndexErrors at different points than when running through the terminal. I wasn't going to post here, but I'm seriously confused and wondering if anyone else can replicate what I'm seeing and has an explanation of what I'm missing. Here's the code and the list:
import urllib.request
from bs4 import BeautifulSoup

with open('hk_pages.txt', 'r') as urls:
    starting_list = urls.read().split()

for url in starting_list:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")

    # Track
    tracksoup = str(soup.findAll("td", {"class": "racingTitle"}))
    tracklist = tracksoup.split('>')
    track = tracklist[1][:2]

    # Date
    datesoup = str(soup.findAll("td", {"class": "tdAlignL number13 color_black"}))
    datelist = datesoup.split()
    date = datelist[6]

    print(date)
    print(track)
    print("**************************************************")
Here's the list of urls:
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150906
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150909
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150913
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150916
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150919
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150923
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20150928
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151001
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151004
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151007
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151010
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151014
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151017
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151018
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151022
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151025
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151031
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151101
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151103
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151107
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151108
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151111
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151114
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151118
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151121
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151125
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151129
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151202
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151206
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151209
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151213
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151216
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151219
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151223
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20151227
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160101
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160106
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160109
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160113
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160117
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160120
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160124
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160131
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160203
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160206
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160210
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160214
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160217
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160221
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160224
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160227
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160228
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160302
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160305
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160306
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160309
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160313
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160316
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160319
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160320
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160323
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160326
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160328
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160331
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160402
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160403
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160406
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160409
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160410
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160413
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160416
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160417
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160420
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160424
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160427
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160501
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160504
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160507
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160511
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160514
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160518
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160522
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160529
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160601
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160604
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160605
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160609
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160612
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160614
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160615
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160616
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160618
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160619
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160622
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160626
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160701
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160706
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160710
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160903
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160907
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160911
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160918
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160921
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160925
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20160928
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161001
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161002
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161005
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161008
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161012
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161015
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161016
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161019
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161022
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161023
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161026
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161029
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161030
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161101
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161102
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161105
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161106
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161109
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161112
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161116
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161118
http://racing.hkjc.com/racing/info/meeting/ResultsAll/English/Local/20161120
# Track
tracksoup = str(soup.findAll("td", {"class": "racingTitle"}))
tracklist = tracksoup.split('>')
track = tracklist[1][:2]
The problem is that you cannot usefully call str([item, item, ...]): soup.findAll returns a list (a ResultSet), so if you convert it with str(), the output will be the string
'[item,item...]'
which is not what you want.
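As a hedged illustration of working with the ResultSet directly instead of calling str() on it: the HTML fragment below is made up, and only the "racingTitle" selector is taken from the question. Checking the list before indexing it is also what avoids the IndexError on pages where the cell is missing.

from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one results page (illustrative only).
html = '<table><tr><td class="racingTitle">ST Sunday, 6 September 2015</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

title_cells = soup.findAll("td", {"class": "racingTitle"})  # a ResultSet (list of Tags), not a string
if title_cells:                                             # guard against pages missing the cell
    track = title_cells[0].get_text(strip=True)[:2]         # work with the Tag's text directly
    print(track)                                            # -> ST
else:
    print("racingTitle cell not found on this page")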
I have a question with using python and beautifulsoup.
My end result program basically fills out a form on a website and brings me back the results which I will eventually output to an lxml file. I'll be taking the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS and I want to get a list for every city all into some excel documents.
Here is my code, I put it on pastebin:
http://pastebin.com/bZJfMp2N
MY RESULTS ARE ALMOST GOOD :D except now I'm getting 355 for my "correct value" instead of 355, for example. I want to parse that and only show the number; you will see what I mean when you run this in Python.
However, nothing I have tried works: I can't parse that values_2 variable because the results are a bs4.element.ResultSet, and I think I need to parse a string. Sorry if I am a noob, I am still learning and have worked very long on this program.
Would anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and I can't parse lists? How would I go about doing this?
Here is the code:
__author__ = 'kennytruong'
# THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup

URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"

# Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n')  # strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)

INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',')  # strips the whitespaces and starts a newline of the list every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)

COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)

HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)

def get_premiums(location, coverage_type, coverage_amt, home_age):
    formEntries = {'location': location,
                   'coverageType': coverage_type,
                   'coverageAmount': coverage_amt,
                   'homeAge': home_age}
    inputData = urllib.parse.urlencode(formEntries)
    inputData = inputData.encode('utf-8')
    request = urllib.request.Request(URL, inputData)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    soup = BeautifulSoup(responseData, "html.parser")
    parseResults = soup.find_all('tr', {'valign': 'top'})
    for eachthing in parseResults:
        parse_me = eachthing.text
        name = re.findall(r'[A-z].+', parse_me)  # find me all the words that start with a cap, as many and it doesn't matter what kind.
                                                 # the . for any character and + to signify 1 or more of it.
        values = re.findall(r'\d{1,10}', parse_me)  # find me any digits, however many #'s long as long as btwn 1 and 10
        values_2 = eachthing.find_all('div', {'align': 'right'})
        print('raw code for this part:\n', eachthing, '\n')
        print('here is the name: ', name[0], values)
        print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
        print('but here is the correct values - ', values_2)  # NEEDA STRIP THESE VALUES
        # print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
        # values_3 = re.split(r'\d', values_2)
        # print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
        print('\n\n')

def main():
    for location in LOCATIONS:  # seems to be looping the variable location in LOCATIONS - each location is one area
        print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
        get_premiums(location, "HOMEOWNERS", "150000", "New")

if __name__ == "__main__":  # this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
    main()
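For reference, here is a minimal sketch (not from the original thread) of one way to turn the values_2 ResultSet into plain numbers: each item is a Tag, and .get_text() gives an ordinary string that re can parse. The sample row below is made up; only the find_all('div', {'align': 'right'}) selector comes from the question.

import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for one <tr valign="top"> row returned by the survey page.
sample_row = '<tr valign="top"><td>Some Insurance Co</td>' \
             '<td><div align="right">355</div></td><td><div align="right">1,000</div></td></tr>'
row = BeautifulSoup(sample_row, 'html.parser')

values_2 = row.find_all('div', {'align': 'right'})      # same selector as in the question
numbers = []
for div in values_2:
    text = div.get_text(strip=True).replace(',', '')    # Tag -> plain string, drop thousands separators
    match = re.search(r'\d+', text)
    if match:
        numbers.append(match.group())

print(numbers)  # -> ['355', '1000']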
This is code from a Udacity course, and I changed it a little. When it runs, it asks me for a movie name and the trailer opens in a pop-up in a browser (that's another part, which is not shown).
As you can see, this program has a lot of repetitive code in it, the functions extract_name, movie_poster_url and movie_trailer_url have kind of the same code. Is there a way to get rid of the same code being repeated but have the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup

name = raw_input("Enter movie name:- ")
global movie_name

def extract_html(name):
    url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))
    return page

def extract_name(page):
    start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
    start_url = page.find('>', start_link-140)
    start_url1 = page.find('>', start_link-140)
    end_url = page.find(' - IMDb</a>', start_link-140)
    name_of_movie = page[start_url1+1:end_url]
    return extract_char(name_of_movie)

def extract_char(name_of_movie):
    name_array = []
    for words in name_of_movie:
        word = words.strip('</b>,')
        name_array.append(word)
    return ''.join(name_array)

def movie_poster_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
    start_link = page.find('"Poster":')
    start_url = page.find('"', start_link+9)
    end_url = page.find('"', start_url+1)
    poster_url = page[start_url+1:end_url]
    return poster_url

def movie_trailer_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another website name' + movie_name + " trailer").read()
    start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
    start_url = page.find('"', start_link+110)
    end_url = page.find('" ', start_url+1)
    trailer_url1 = page[start_url+1:end_url]
    trailer_url = "www.youtube.com" + trailer_url1
    return trailer_url

page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:
def find_page(url, name, find, offset):
    movie_name, seperator, tail = name.partition(' (')
    page = urllib.urlopen(url.format(movie_name)).read()
    start_link = page.find(find)
    start_url = page.find('"', start_link+offset)
    end_url = page.find('" ', start_url+1)
    return page[start_url+1:end_url]
def movie_poster_url(name_of_movie):
    return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)

def movie_trailer_url(name_of_movie):
    trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
    return "www.youtube.com" + trailer_url
It definitely won't run faster (there is extra work to do to "switch" between the functions), but the performance difference is probably negligible.
For your second question: profiling is not a technique for speeding anything up by itself; it's how you find out what is slow in your code:
Profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. (Wikipedia)
So it's not something that speeds up your program; it's the work you do to find out what you can do to speed up your program.
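For instance (a generic sketch using the standard library's cProfile module, not something from the original code), you could profile one of the slow-looking calls like this:

import cProfile

# Runs one call under the profiler and sorts the report by cumulative time,
# so the slowest parts (almost certainly the network requests) show up first.
cProfile.run('movie_trailer_url(movie_name)', sort='cumulative')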
Going really quickly here because I am a super newbie, but I can see the repetition. What I would do is figure out the (mostly) repeating block of code shared by all 3 functions, then figure out where they differ, and write a new function that takes the differences as arguments. So, for instance:
def extract(page, tarString, delim, startDiff, endDiff):
    # page is the HTML string the original functions were searching through
    start_link = page.find(tarString)
    start_url = page.find(delim, start_link+startDiff)
    end_url = page.find(delim, start_url+endDiff)
    url_out = page[start_url+1:end_url]
    return url_out
Then, in your poster, trailer, etc. functions, just call this extract function with the appropriate arguments for each case. For instance, the poster function would call
poster_url = extract(page, tarString='"Poster":', delim='"', startDiff=9, endDiff=1)
I can see you've got another answer already and it's very likely it's written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.