Web scraping multiple sites in Python

I signed up to this website just to ask this question as I have been searching for hours over multiple days and haven't found anything.
I am trying to, within 10 seconds, scrape the 2-3 characters from 5 websites, combine them, and paste them into a box.
I have a rough idea of what I would need, but no idea how to go about this.
I believe I want to assign the scraped contents of each website to a variable, and then print the combination of those variables for me to copy and paste.
I'm not an expert by any means in Python, so if possible, a copy/pasteable script would be great.
The websites are:
https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
Keeping this up now only because I cannot take it down. Thank you to those who have helped, I hope this helps someone else.
Sorry for being dumb

Thing is, I've written the code and tried it. It runs, but that isn't the answer to the question: just getting the characters from the links and putting them together doesn't work. I've tried many things and I am still working it out myself. My advice: work it out yourself. It's a lot more rewarding and will probably help with future parts of the competition. Also, if you ever think about removing all of the 'a's from the code, that doesn't work either. I tried.
To answer your Stack Overflow question, here is the code (you need to install the 'requests' Python module first):
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
print(page1text + page2text + page3text + page4text + page5text)
But this method doesn't answer challenge 14.

I know the answer to the question, but instead of giving the code to complete it, I'll tell you one of the ways you might find it, as I completed that question myself.
When you asked this question, you completely forgot to mention that there was a sixth link: https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=%3Cclock%20pts%3E
Notice that at the end of that hyperlink it says 'clock pts', whereas all the other links have had something like clock-pt1 or clock-pt4. What if 'clock pts' refers to all of the different links at once? In other words, you build a string out of all the previous links you've been given, replace the '<clock pts>' part of the string parameter in the hyperlink with the string you made from the separate links, and that gives you the code to complete the level.
Below is the code I used to get the answer. It requires the requests module, in case you want to use it. (Also, I'm not 100% certain it will work every time: since the challenge is based on a timer, the program may not fetch all the strings before the clock changes, so make sure to run it just after the timer has reset.)
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
code=(page1text + page2text + page3text + page4text + page5text)
page6= "https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string="+code
page6_content = requests.get(page6)
print(page6_content.text)
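For what it's worth, the same approach can be written more compactly with a loop, and with the assembled string URL-encoded before it is sent. This is only a sketch: the verify tokens below are the (long-expired) ones from the question and are placeholders for whatever your own session gives you, and stripping whitespace from each fragment is an assumption about what the pages return.
import requests
from urllib.parse import quote

BASE = "https://assess.joincyberdiscovery.com/challenge-files"
VERIFY = "4VjvSgWQQ8yhhiYD9cePtg%3D%3D"  # placeholder: use the verify token from your own session

# Fetch the five clock fragments in order and join them into one string.
code = "".join(
    requests.get("{}/clock-pt{}?verify={}".format(BASE, i, VERIFY)).text.strip()
    for i in range(1, 6)
)

# Substitute the combined string for the <clock pts> placeholder in the get-flag URL.
flag_url = "{}/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string={}".format(BASE, quote(code))
print(requests.get(flag_url).text)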

I have done something very similar, with just as poor results at the end. I did, however, leave this running for a while and notice that the clock follows a pattern. Some time ago the clock read all as "aaaaaaaaaaaaaaa", then "aBaa1aafaa2aa3a" and "aDaafaaHaajaala". I'm going to wait for a full list and try suggesting the next clock sequence in the final URL. I'll get back to you if this works; just something to think about.
Also, for help installing modules I suggest:
https://programminghistorian.org/lessons/installing-python-modules-pip
and
https://docs.python.org/3/installing/index.html
import requests

abc = ""
while 1 == 1:
    page1 = requests.get('your first link')
    page2 = requests.get('your second link')
    page3 = requests.get('your third link')
    page4 = requests.get('your fourth link')
    page5 = requests.get('your fifth link')
    text = page1.text + page2.text + page3.text + page4.text + page5.text
    # abc1 = the verify link, except <clock pts> is replaced with the text, so the end looks like: string=<" + text + ">
    abc1 = text
    if abc1 != abc:
        print(abc1)
        abc = abc1
Edit
The clock runs in 15-minute cycles with 90 codes altogether (one code every 10 seconds). I'm not sure how this helps as of yet, but I'm just posting ideas. I had to make some changes to get the codes to output cleanly, and here is my improved version (this is very messy, sorry):
import requests

abc = ""
page1 = requests.get('your first link')
page2 = requests.get('your second link')
page3 = requests.get('your third link')
page4 = requests.get('your fourth link')
page5 = requests.get('your fifth link')
while 1 == 1:
    page12 = requests.get('your first link')
    page22 = requests.get('your second link')
    page32 = requests.get('your third link')
    page42 = requests.get('your fourth link')
    page52 = requests.get('your fifth link')
    if page1.text != page12.text and page2.text != page22.text and page3.text != page32.text and page4.text != page42.text and page5.text != page52.text:
        text = page12.text + page22.text + page32.text + page42.text + page52.text
        abc1 = text
        # abc1 = your URL for verification, with string=<" + text + ">
        if abc1 != abc:
            print(abc1)
            abc = abc1
        page1 = page12
        page2 = page22
        page3 = page32
        page4 = page42
        page5 = page52
Final edit
I had spent so long going down the path of figuring out how the task was made and doing way too much work. When submitting the final URL, put your solution in as a replacement for the <clock pts> section, NOT inside the <>, so yours should look like https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=*this is an identifier*&string=*the string you get*

I completed the challenge. I used an Excel spreadsheet with functions to get all the little code fragments from every clock cycle and put them together to make one code every 10 seconds. Sorry if that doesn't make sense; I'm not sure how to explain it. Then I pasted this into the end of the "validation link" to replace the <clock pts> at the end of the URL. I had to do this very fast before the clock reset. Very stressful, haha. Eventually I did it in time and it gave me the code. I hope this helps.
But you'll have to figure out how to get all the codes together in under 10 seconds by yourself, otherwise this is basically cheating, right?

Related

beautifulsoup (webscraping) not updating variables when HTML text has changed

I am new to Python and I can't understand why this isn't working, but I've narrowed down the issue to one line of code.
The purpose of this bot is to scrape HTML from a website (using BeautifulSoup) and post to Discord when the text changes. I use FC2 and FR2 (flightcategory2 and flightrestrictions2) as memory variables for the code to check against every time it runs. If they're the same, the code waits for _ minutes and checks again; if they're different, it posts.
However, when running this code, the variables flightCategory and flightRestrictions change the first time the code runs, but for some reason stop changing when the HTML text on the website changes. The line in question is in this if block:
if 1 == 1:  # using 1==1 so this loop constantly runs for testing, otherwise I have it set for a time
    flightCategory, flightRestrictions = und.getInfo()
In debugging mode the code IS run, but the variables don't update, and I am confused as to why they would update the first time the code is run but not on subsequent runs. This line is critical to the operation of my code.
Here's an abbreviated version of the code to make it easier to read. I'd appreciate any help.
import time
import requests
from bs4 import BeautifulSoup

FC2 = 0
FR2 = 0
flightCategory = ""
flightRestrictions = ""

class UND:
    def __init__(self):
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")
    def getFlightCategory(self):  # Takes the appropriate html text and sets it to a variable
        flightCategoryClass = self.soup.find(class_="auto-style1b")
        return flightCategoryClass.get_text()
    def getRestrictions(self):  # Takes the appropriate html text and sets it to a variable
        flightRestrictionsClass = self.soup.find(class_="auto-style4")
        return flightRestrictionsClass.get_text()
    def getInfo(self):
        return self.getFlightCategory(), self.getRestrictions()

und = UND()
while 1 == 1:
    if 1 == 1:  # using 1==1 so this loop constantly runs for testing, otherwise I have it set for a time
        flightCategory, flightRestrictions = und.getInfo()  # scrape the html from the web
        if flightCategory == FC2 and flightRestrictions == FR2:  # if previous check is the same as this check then skip posting
            pass  # Do Something
        elif flightCategory != FC2 or flightRestrictions != FR2:  # if any variable has changed since the last time
            FC2 = flightCategory  # set the comparison variable to equal the variable
            FR2 = flightRestrictions
            if flightRestrictions == "Manager on Duty:":  # if this is seen only output category
                pass  # Do Something
            elif flightRestrictions != "Manager on Duty:":
                pass  # Do Something
    else:
        print("Outside Time")
    time.sleep(5)  # Wait _ seconds. This would be set for 30 min but for testing it is 5 seconds.
According to your code, you're only sending a request to http://sof.aero.und.edu when you're creating an instance of the UND class.
Therefore, the soup attribute of your instance is never updated during the loop and you keep on getting outdated values.
You could make it work with the following logic:
class UND:
    def __init__(self):
        pass
    def scrape(self):
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")
    ## SOME CODE

und = UND()
while 1 == 1:
    und.scrape()  # We scrape the website at the beginning of each loop iteration
    ## SOME OTHER CODE
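Putting the two pieces together, here is a minimal runnable sketch of the fix. The URL and the two CSS classes come from the question; the change-detection loop around them is only an assumption about how you might wire it up.
import time
import requests
from bs4 import BeautifulSoup

class UND:
    def scrape(self):
        # Re-fetch and re-parse the page on every call, so the data is fresh.
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")

    def getInfo(self):
        category = self.soup.find(class_="auto-style1b").get_text()
        restrictions = self.soup.find(class_="auto-style4").get_text()
        return category, restrictions

und = UND()
previous = None
while True:
    und.scrape()                 # fresh request each iteration
    current = und.getInfo()
    if current != previous:      # only react when something has changed
        print(current)
        previous = current
    time.sleep(5)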

python print() doesn't output what I expect

I made a small web-crawler in one function, upso_final.
If I print(upso_final()), I get 15 lists that include title, address, and phone #. However, I want to print out only the titles, so I made the variable title a global string. When I print it, I get only 1 title, the last one in the run. I want to get all 15 titles.
from __future__ import unicode_literals
import requests
from scrapy.selector import Selector
import scrapy
import pymysql

def upso_final(page=1):
    def upso_from_page(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        global title, address, phone
        title = sel.css('h1::text').extract()
        address = sel.css('address::text').extract()
        phone = sel.css('.mt1::text').extract()
        return {
            'title': title,
            'address': address,
            'phone': phone
        }
    def upso_list_from_listpage(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        upso_list = sel.css('.title_list::attr(href)').extract()
        return upso_list
    def fetch_page(url):
        r = requests.get(url)
        return r.text
    list_url = "http://yp.koreadaily.com/list/list.asp?page={0}&bra_code=LA&cat_code=L020502&strChar=&searchField=&txtAddr=&txtState=&txtZip=&txtSearch=&sort=N".format(page)
    upso_lists = upso_list_from_listpage(list_url)
    upsos = [upso_from_page(url) for url in upso_lists]
    return upsos

upso_final()
print(title, address, phone)
The basic problem is that you're confused about passing values back from a function.
upso_from_page finds each of the 15 records in turn, placing the desired information in the global variables (generally a bad design). However, the only time you print any results is after you've found all 15. Since your logic has each record overwriting the previous one, you print only the last one you found.
It appears that upso_final accumulates the list and returns it, but you ignore that return value. Instead, try this in your main program:
upso_list = upso_final()
for upso in upso_list:
    print(upso)
This should give you a 3-item dictionary for each upso record; from there, you can learn the referencing and format it to your taste.
An alternate solution is to print each record as you find it, from within upso_from_page, but your overall design suggests that's not what you want.
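Since the original goal was to print only the titles, here is a small follow-up sketch, assuming the dictionaries returned by upso_final keep the 'title' key shown in the question (each value being the list of strings that .extract() produced):
upso_list = upso_final()
for upso in upso_list:
    print(''.join(upso['title']))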

issue extracting html page's string using bs4

I'm writing a program to find song lyrics. The program is almost done, but I have a little problem with the bs4 data type.
My question is: how do I extract plain text from the lyric variable at the end?
import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search('songmeanings\.com\/[^image].*?\/"', page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:])  # this line is used to remove a " at the end of the line
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content, "lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print(lyric)
and here is the output:
[\n\t\t\t\t\t\tPlease could you stop the noise,\nI'm trying to get some rest\nFrom all the unborn chicken voices in my head\nWhat's that?\nWhat's that?\n\nWhen I am king, you will be first against the wall\nWith your opinion which is of no consequence at all\nWhat's that?\nWhat's that?\n\nAmbition makes you look pretty ugly\nKicking and squealing Gucci little piggy\nYou don't remember\nYou don't remember\nWhy don't you remember my name?\nOff with his head, man\nOff with his head, man\nWhy don't you remember my name?\nI guess he does\n\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height,\nRain down, rain down\nCome on rain down on me\n\nThat's it, sir\nYou're leaving\nThe crackle of pigskin\nThe dust and the screaming\nThe yuppies networking\nThe panic, the vomit\nThe panic, the vomit\nGod loves his children,\nGod loves his children, yeah!\nEdit Lyrics\nEdit Wiki\nAdd Video\n ]
Add the following line of code:
lyric = ''.join([tag.text for tag in lyric])
after
lyric = Soup.findAll(attrs={"lyric-box"})
You'll get output something like:
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...
First trim the leading and trailing [] by doing stringvar[1:-1], then on each line call linevar.strip(), which will strip off all that whitespace.
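A literal sketch of that suggestion, assuming lyric is the ResultSet printed in the question:
raw = str(lyric)        # the ResultSet rendered as a string, i.e. the "[ ... ]" output above
raw = raw[1:-1]         # trim the leading '[' and the trailing ']'
lines = [line.strip() for line in raw.splitlines() if line.strip()]
print('\n'.join(lines))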
For those who like the idea: with some little changes, my code finally looks like this :)
import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO

def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToPage)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR, "lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern, tab_content)
    """returns the first matched item without the double quote
    at the beginning and at the end of the string"""
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToSong)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR, "lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)

link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)

Parsing already parsed results with BeautifulSoup

I have a question about using Python and BeautifulSoup.
My end result program basically fills out a form on a website and brings me back the results which I will eventually output to an lxml file. I'll be taking the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS and I want to get a list for every city all into some excel documents.
Here is my code, I put it on pastebin:
http://pastebin.com/bZJfMp2N
MY RESULTS ARE ALMOST GOOD :D except now I'm getting '            355' (with a lot of leading whitespace) as my "correct value" instead of just '355', for example. I want to parse that and only show the number; you will see when you put this into Python.
However, anything I have tried does NOT work. There is no way I can parse that values_2 variable, because the results are a bs4.element.ResultSet when I think I need to parse a string. Sorry if I am a noob, I am still learning and have worked very long on this program.
Would anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and I can't parse lists? How would I go about doing this?
Here is the code:
__author__ = 'kennytruong'
#THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup

URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"

#Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n')  #strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)

INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',')  #strips the whitespaces and starts a newline of the list every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)

COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)

HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)

def get_premiums(location, coverage_type, coverage_amt, home_age):
    formEntries = {'location': location,
                   'coverageType': coverage_type,
                   'coverageAmount': coverage_amt,
                   'homeAge': home_age}
    inputData = urllib.parse.urlencode(formEntries)
    inputData = inputData.encode('utf-8')
    request = urllib.request.Request(URL, inputData)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    soup = BeautifulSoup(responseData, "html.parser")
    parseResults = soup.find_all('tr', {'valign': 'top'})
    for eachthing in parseResults:
        parse_me = eachthing.text
        name = re.findall(r'[A-z].+', parse_me)  #find me all the words that start with a cap, as many and it doesn't matter what kind.
                                                 #the . for any character and + to signify 1 or more of it.
        values = re.findall(r'\d{1,10}', parse_me)  #find me any digits, however many #'s long as long as btwn 1 and 10
        values_2 = eachthing.find_all('div', {'align': 'right'})
        print('raw code for this part:\n', eachthing, '\n')
        print('here is the name: ', name[0], values)
        print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
        print('but here is the correct values - ', values_2)  #NEEDA STRIP THESE VALUES
        # print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
        # values_3 = re.split(r'\d', values_2)
        # print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
        print('\n\n')

def main():
    for location in LOCATIONS:  #seems to be looping the variable location in LOCATIONS - each location is one area
        print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
        get_premiums(location, "HOMEOWNERS", "150000", "New")  #calls function get_premiums and passes parameters

if __name__ == "__main__":  #this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
    main()
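One way to get just the numbers out of values_2 is to call get_text(strip=True) on each matched <div>; that is standard BeautifulSoup and removes the surrounding whitespace. A minimal sketch of the lines you could add inside the loop in get_premiums, right after values_2 is collected:
        clean_values = [div.get_text(strip=True) for div in values_2]  # e.g. ['355', ...]
        print('cleaned values -', clean_values)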

youtube data api comment paging

I'm struggling a little bit with the syntax for iterating through all comments on a YouTube video. I'm using Python and have found little documentation on the GetYouTubeVideoCommentFeed() function.
What I'm really trying to do is search all comments of a video for an instance of a word and increase a counter (eventually the comment will be printed out). It functions for the 25 results returned, but I need to access the rest of the comments.
import gdata.youtube
import gdata.youtube.service

video_id = 'hMnk7lh9M3o'
yt_service = gdata.youtube.service.YouTubeService()
comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)

counter = 0
for comment_entry in comment_feed.entry:
    comment = comment_entry.content.text
    if comment.find('hi') != -1:
        counter = counter + 1
        print "hi: "
        print counter
I tried to set the start_index of GetYouTubeVideoCommentFeed() in addition to the video_id but it didn't like that.
Is there something I'm missing?
Thanks!
Steve
Here's the code snippet for the same:
# Comment feed URL
comment_feed_url = "http://gdata.youtube.com/feeds/api/videos/%s/comments"

''' Get the comment feed of a video given a video_id '''
def WriteCommentFeed(video_id, data_file):
    url = comment_feed_url % video_id
    comment_feed = yt_service.GetYouTubeVideoCommentFeed(uri=url)
    try:
        while comment_feed:
            for comment_entry in comment_feed.entry:
                print comment_entry.id.text
                print comment_entry.author[0].name.text
                print comment_entry.title.text
                print comment_entry.published.text
                print comment_entry.updated.text
                print comment_entry.content.text
            comment_feed = yt_service.Query(comment_feed.GetNextLink().href)
    except:
        pass
Found out how to do it. Instead of passing a video_id to the GetYouTubeVideoCommentFeed function, you can pass it a URL. You can iterate through the comments by changing the URL parameters.
There must be an API limitation though; I can only access the last 1000 comments on the video.
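For reference, "changing the URL parameters" would look roughly like the sketch below, reusing video_id and yt_service from the question above. The start-index and max-results query parameters belonged to the old GData API, which has since been retired, so treat this purely as a historical sketch in the same Python 2 style as the snippets above.
# Page through the comments 25 at a time by rebuilding the feed URL each pass.
comment_feed_url = "http://gdata.youtube.com/feeds/api/videos/%s/comments?start-index=%d&max-results=25"

start_index = 1
while True:
    url = comment_feed_url % (video_id, start_index)
    comment_feed = yt_service.GetYouTubeVideoCommentFeed(uri=url)
    if not comment_feed.entry:
        break
    for comment_entry in comment_feed.entry:
        print comment_entry.content.text
    start_index += 25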
