beautifulsoup (webscraping) not updating variables when HTML text has changed - python

I am new to Python and I can't understand why this isn't working, but I've narrowed down the issue to one line of code.
The purpose of this bot is to scrape HTML from a website (using BeautifulSoup) and post to Discord when the text changes. I use FC2 and FR2 (flightcategory2 and flightrestrictions2) as memory variables for the code to check against every time it runs. If they're the same, the code waits for _ minutes and checks again; if they're different, it posts the change.
However, when running this code, the variables flightCategory and flightRestrictions change the first time the code runs, but for some reason stop changing when the HTML text on the website changes. The line in question is inside this if block:
if 1 == 1:  # using 1==1 so this branch always runs for testing; otherwise I have it set for a time window
    flightCategory, flightRestrictions = und.getInfo()
In debugging mode, the line IS executed, but the variables don't update, and I am confused as to why they would update the first time the code runs but not on subsequent runs. This line is critical to the operation of my code.
Here's an abbreviated version of the code to make it easier to read. I'd appreciate any help.
import time

import requests
from bs4 import BeautifulSoup

FC2 = 0
FR2 = 0
flightCategory = ""
flightRestrictions = ""

class UND:
    def __init__(self):
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")

    def getFlightCategory(self):  # takes the appropriate HTML text and returns it
        flightCategoryClass = self.soup.find(class_="auto-style1b")
        return flightCategoryClass.get_text()

    def getRestrictions(self):  # takes the appropriate HTML text and returns it
        flightRestrictionsClass = self.soup.find(class_="auto-style4")
        return flightRestrictionsClass.get_text()

    def getInfo(self):
        return self.getFlightCategory(), self.getRestrictions()

und = UND()

while 1 == 1:
    if 1 == 1:  # using 1==1 so this branch always runs for testing; otherwise it is gated on a time window
        flightCategory, flightRestrictions = und.getInfo()  # scrape the HTML from the web
        if flightCategory == FC2 and flightRestrictions == FR2:  # if the previous check is the same as this check, skip posting
            pass  # Do Something
        elif flightCategory != FC2 or flightRestrictions != FR2:  # if any variable has changed since the last check
            FC2 = flightCategory  # set the comparison variable to equal the new value
            FR2 = flightRestrictions
            if flightRestrictions == "Manager on Duty:":  # if this is seen, only output the category
                pass  # Do Something
            elif flightRestrictions != "Manager on Duty:":
                pass  # Do Something
    else:
        print("Outside Time")
    time.sleep(5)  # Wait _ seconds. This would be set for 30 min, but for testing it is 5 seconds.

According to your code, you're only sending a request to http://sof.aero.und.edu when you're creating an instance of the UND class.
Therefore, the soup attribute of your instance is never updated during the loop and you keep on getting outdated values.
You could make it work with the following logic:
class UND:
    def __init__(self):
        pass

    def scrape(self):
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")

    ## SOME CODE

und = UND()
while 1 == 1:
    und.scrape()  # we scrape the website at the beginning of each loop iteration
    ## SOME OTHER CODE
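Putting the two together, a minimal sketch of the corrected loop might look like this (the selectors are taken from the question; the Discord-posting step is replaced by a print, since that part isn't shown):

import time

import requests
from bs4 import BeautifulSoup

class UND:
    def scrape(self):
        # re-fetch the page so self.soup reflects the site's current HTML
        page = requests.get("http://sof.aero.und.edu")
        self.soup = BeautifulSoup(page.content, "html.parser")

    def getInfo(self):
        category = self.soup.find(class_="auto-style1b").get_text()
        restrictions = self.soup.find(class_="auto-style4").get_text()
        return category, restrictions

und = UND()
FC2, FR2 = None, None  # last values seen

while True:
    und.scrape()  # fresh request on every iteration
    flightCategory, flightRestrictions = und.getInfo()
    if (flightCategory, flightRestrictions) != (FC2, FR2):
        FC2, FR2 = flightCategory, flightRestrictions
        print(flightCategory, flightRestrictions)  # post to Discord here instead
    time.sleep(5)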

Related

monitoring a text site (json) using python

I'm working on a program to grab the variant ID from this website:
https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json
I'm using this code:
import json
import requests
import time

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['product']:
    name = (id['title'])
    print(name)
I don't know what to do here in order to grab the name of the item. If you visit the link, you will see that the name is under 'title'. If you could help me with this, that would be awesome.
I get the error message "TypeError: string indices must be integers", so I'm not too sure what to do.
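For what it's worth, in that JSON the top-level 'product' key holds a single object rather than a list, so iterating over it yields its string keys, and indexing a string with 'title' raises exactly that TypeError. A minimal sketch of direct access, assuming the standard Shopify product-JSON layout:

import json
import requests

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
reqJson = json.loads(requests.get(endpoint).text)
print(reqJson['product']['title'])  # 'product' is a dict, not a list, so index it directly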
Your biggest problem right now is that you are adding items to the list before checking whether they're already in it, so everything comes back as already in the list.
Looking at your code right now, I think what you want to do is combine things into a single for loop.
Also, as a heads-up, you shouldn't use a variable name like list, as it shadows the built-in Python type list().
import json
import time

import requests

list = []  # you really should change this name to something else


def check_endpoint():
    endpoint = ""
    req = requests.get(endpoint)
    reqJson = json.loads(req.text)
    for id in reqJson['threads']:  # for each id in the threads list
        PID = id['product']['globalPid']  # get the current PID
        if PID in list:
            print('checking for new products')
        else:
            title = id['product']['title']
            Image = id['product']['imageUrl']
            ReleaseType = id['product']['selectionEngine']
            Time = id['product']['effectiveInStockStartSellDate']
            send(title, PID, Image, ReleaseType, Time)  # send() is defined elsewhere in your code
            print('{} added to database'.format(PID))
            list.append(PID)  # add the PID to the list
    return


def main():
    while True:
        check_endpoint()
        time.sleep(20)
    return


if __name__ == "__main__":
    main()
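As a side note on the design, a set gives O(1) membership checks and avoids shadowing the built-in. A minimal sketch of that variant, assuming the same JSON layout (the endpoint and send() are elided or defined elsewhere, as in the original):

seen_pids = set()  # hypothetical name; any non-builtin name works

def check_endpoint(endpoint):
    reqJson = json.loads(requests.get(endpoint).text)
    for thread in reqJson['threads']:
        pid = thread['product']['globalPid']
        if pid not in seen_pids:
            seen_pids.add(pid)
            product = thread['product']
            send(product['title'], pid, product['imageUrl'],
                 product['selectionEngine'],
                 product['effectiveInStockStartSellDate'])
            print('{} added to database'.format(pid))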

Web scraping multiple sites in python

I signed up to this website just to ask this question as I have been searching for hours over multiple days and haven't found anything.
I am trying to, within 10 seconds, scrape the 2-3 characters from 5 websites, combine them, and paste them into a box.
I have a rough idea of what I would need, but no idea how to go about this.
I believe I want to assign the scraped contents from each website to variables, and then have it print the combination of those variables for me to copy and paste.
I'm not an expert by any means in Python, so if possible, a copy/pasteable script would be great.
The websites are:
https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
Keeping this up now only because I cannot take it down. Thank you to those who have helped; I hope this helps someone else.
Sorry for being dumb.
Thing is, I've done the code and tried it. It works, but it isn't the answer to the question: getting the characters from the links and putting them together doesn't work. I've tried many things, and I am still working it out myself. My advice: work it out yourself. It's a lot more rewarding and will probably help with future parts of the competition. Also, if you ever think about removing all of the 'a's from the code, that doesn't work either. I tried.
To answer your Stack Overflow question, here is the code (you need to install the 'requests' Python module first):
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
print(page1text + page2text + page3text + page4text + page5text)
But this method doesn't answer challenge 14.
I know the answer to the question, but instead of giving the code to complete it, I'll tell you one of the ways you might find it, as I completed that question myself.
When you asked this question, you completely forgot to mention that there was a sixth link: https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=%3Cclock%20pts%3E
Notice that the end of that hyperlink says 'clock pts', whereas all the other links have something like clock-pt1 or clock-pt4. What if 'clock pts' refers to all of the different links at once? That is, you have to build a string out of all the previous links you've been given, then replace the 'clock pts' in the string section of the hyperlink with the string you made from the separate links, which would then give you the code to complete the level.
Below is the code I used to get the answer; it requires the requests module, in case you want to use it. (Also, I'm not 100% certain it will work every time: since the challenge is based on a timer, the program may not get all the strings before the clock changes, so make sure to run it right after the timer has reset.)
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text
code=(page1text + page2text + page3text + page4text + page5text)
page6= "https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string="+code
page6_content = requests.get(page6)
print(page6_content.text)
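The same logic can be written more compactly with a loop over the five URLs. A sketch under the same assumptions (the verify tokens are copied from the answer above and are likely session-specific, so they may have expired):

import requests

BASE = "https://assess.joincyberdiscovery.com/challenge-files/"
VERIFY = "4VjvSgWQQ8yhhiYD9cePtg%3D%3D"  # token from the answer above

# fetch the five clock fragments and join them in order
code = "".join(
    requests.get("{}clock-pt{}?verify={}".format(BASE, n, VERIFY)).text
    for n in range(1, 6)
)

flag_url = (BASE + "get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=" + code)
print(requests.get(flag_url).text)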
I have done something very similar, with just as poor results at the end. I did, however, leave this running for a while and noticed that the clock follows a pattern. Some time ago the clock read all as "aaaaaaaaaaaaaaa", then "aBaa1aafaa2aa3a" and "aDaafaaHaajaala". I'm going to wait for a full list and try suggesting the next clock sequence in the final URL. I'll get back to you if this works; just something to think about.
Also, for help installing modules, I suggest:
https://programminghistorian.org/lessons/installing-python-modules-pip
&
https://docs.python.org/3/installing/index.html
import requests

abc = ""
while 1 == 1:
    page1 = requests.get('your first link')
    page2 = requests.get('your second link')
    page3 = requests.get('your third link')
    page4 = requests.get('your fourth link')
    page5 = requests.get('your fifth link')
    text = page1.text + page2.text + page3.text + page4.text + page5.text
    # abc1 = "the verify link except clock pts is replaced with "+"text>" so the end looks like this :string=<"+text+">"
    abc1 = text
    if abc1 != abc:
        print(abc1)
        abc = abc1
Edit
The clock runs in 15-minute cycles with 90 codes altogether. I'm not sure how this helps as of yet, but I'm just posting ideas. I had to make some changes to get the codes to output cleanly; here is my improved version (this is very messy, sorry):
import requests

abc = ""
page1 = requests.get('your first link')
page2 = requests.get('your second link')
page3 = requests.get('your third link')
page4 = requests.get('your fourth link')
page5 = requests.get('your fifth link')
while 1 == 1:
    page12 = requests.get('your first link')
    page22 = requests.get('your second link')
    page32 = requests.get('your third link')
    page42 = requests.get('your fourth link')
    page52 = requests.get('your fifth link')
    if page1.text != page12.text and page2.text != page22.text and page3.text != page32.text and page4.text != page42.text and page5.text != page52.text:
        text = page12.text + page22.text + page32.text + page42.text + page52.text
        abc1 = text
        # abc1 = * your url for verification with * string=<"+text+">"
        if abc1 != abc:
            print(abc1)
            abc = abc1
        page1 = page12
        page2 = page22
        page3 = page32
        page4 = page42
        page5 = page52
Final edit
I had spent so long going down the path of figuring out how they made the task that I was doing way too much work. When submitting the final URL, include your solution as a replacement for the <clock pts> section, and NOT inside the <>, so yours should look like: https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=*this is an identifier*&string=*the string you get*
I completed the challenge. I used an Excel spreadsheet with functions to get all the little code fragments from every clock cycle and put them together to make one code every 10 seconds. Sorry if that doesn't make sense; I'm not sure how to explain it. Then I pasted this into the end of the "validation link" to replace the <clock pts> at the end of the URL. I had to do this very fast, before the clock reset. Very stressful, haha. Then eventually I did it in time and it gave me the code. I hope this helps.
But you'll have to figure out how to get all the codes together in under 10 seconds by yourself; otherwise this is basically cheating, right?

python print() doesn't output what I expect

I made a small web-crawler in one function, upso_final.
If I print(upso_final()), I get 15 results that include title, address, and phone number. However, I want to print out only the title, so I made the variable title a global string. When I print it, I get only one title, the last one in the run. I want to get all 15 titles.
from __future__ import unicode_literals
import requests
from scrapy.selector import Selector
import scrapy
import pymysql


def upso_final(page=1):
    def upso_from_page(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        global title, address, phone
        title = sel.css('h1::text').extract()
        address = sel.css('address::text').extract()
        phone = sel.css('.mt1::text').extract()
        return {
            'title': title,
            'address': address,
            'phone': phone
        }

    def upso_list_from_listpage(url):
        html = fetch_page(url)
        sel = Selector(text=html)
        upso_list = sel.css('.title_list::attr(href)').extract()
        return upso_list

    def fetch_page(url):
        r = requests.get(url)
        return r.text

    list_url = "http://yp.koreadaily.com/list/list.asp?page={0}&bra_code=LA&cat_code=L020502&strChar=&searchField=&txtAddr=&txtState=&txtZip=&txtSearch=&sort=N".format(page)
    upso_lists = upso_list_from_listpage(list_url)
    upsos = [upso_from_page(url) for url in upso_lists]
    return upsos


upso_final()
print(title, address, phone)
The basic problem is that you're confused about passing values back from a function.
upso_from_page finds each of the 15 records in turn, placing the desired information in the global variables (generally a bad design). However, the only time you print any results is after you've found all 15. Since your logic has each record overwriting the previous one, you print only the last one you found.
It appears that upso_final accumulates the list and returns it, but you ignore that return value. Instead, try this in your main program:
upso_list = upso_final()
for upso in upso_list:
    print(upso)
This should give you a 3-item dictionary for each upso record; from there, you can learn the referencing and format to your taste.
An alternate solution is to print each record as you find it, from within upso_from_page, but your overall design suggests that's not what you want.
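If only the titles are wanted, indexing the returned dictionaries works too. A minimal sketch (note that each 'title' value is itself a list, since it comes from .extract()):

upso_list = upso_final()
for upso in upso_list:
    for t in upso['title']:  # .extract() returns a list of matched strings
        print(t)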

Implementing multiprocessing in a loop scraper and appending the data

I am making a web scraper to build a database. The site I plan to use has index pages, each containing 50 links. The number of pages to be parsed is estimated at around 60K and up, which is why I want to implement multiprocessing.
Here is some pseudo-code of what I want to do:
def harvester(index):
    main = dict()
    ....
    links = foo.findAll('a')
    for link in links:
        main.append(worker(link))
        # or maybe something like: map_async(worker(link))

def worker(url):
    '''this function gathers the data from the given url'''
    return dictionary
Now what I want to do is have a certain number of worker functions gathering data in parallel from different pages. This data would then be appended to a big dictionary located in harvester, or written directly to a CSV file by the worker function.
I'm wondering how I can implement the parallelism. I have done a fair amount of research on using gevent, threading and multiprocessing, but I am not sure how to implement it.
I am also not sure whether appending data to a large dictionary, or writing directly to a CSV using DictWriter, will be stable with that many inputs at the same time.
Thanks
I propose that you split your work into separate workers which communicate via queues.
Here you mostly have I/O wait time (crawling, CSV writing), so threads are enough for the parallelism.
So you can do the following (not tested, just to show the idea):
import csv
import threading
import Queue


class CsvWriter(threading.Thread):
    def __init__(self, resultq):
        super(CsvWriter, self).__init__()
        self.resultq = resultq
        self.writer = csv.DictWriter(open('results.csv', 'wb'),
                                     fieldnames=['your', 'fields'])  # DictWriter needs the field names

    def run(self):
        done = False
        while not done:
            row = self.resultq.get()
            if row != -1:
                self.writer.writerow(row)
            else:
                done = True


class Crawler(threading.Thread):
    def __init__(self, inputq, resultq):
        super(Crawler, self).__init__()
        self.iq = inputq
        self.oq = resultq

    def run(self):
        done = False
        while not done:
            link = self.iq.get()
            if link != -1:
                result = self.extract_data(link)
                self.oq.put(result)
            else:
                done = True

    def extract_data(self, link):
        # crawl and extract what you need and return a dict
        pass


def main():
    linkq = Queue.Queue()
    for url in your_urls:
        linkq.put(url)
    resultq = Queue.Queue()
    writer = CsvWriter(resultq)
    writer.start()
    crawlers = [Crawler(linkq, resultq) for _ in xrange(10)]
    [c.start() for c in crawlers]
    [linkq.put(-1) for _ in crawlers]
    [c.join() for c in crawlers]
    resultq.put(-1)
    writer.join()
This code should work (fix any typos I may have left in) and will exit when all the URLs are finished.
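The code above is Python 2 (Queue, xrange). For what it's worth, here is a minimal Python 3 sketch of the same idea using the standard-library concurrent.futures, which hides the queue plumbing; extract_data and the URL list are placeholders:

import csv
from concurrent.futures import ThreadPoolExecutor


def extract_data(link):
    # crawl and extract what you need, return a dict
    return {'url': link}  # placeholder fields


def main(urls):
    with ThreadPoolExecutor(max_workers=10) as pool:
        rows = list(pool.map(extract_data, urls))  # parallel I/O-bound fetches
    # writing happens in a single thread, so no locking is needed
    with open('results.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['url'])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == '__main__':
    main(['http://example.com'])  # replace with the real URL list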

Same code used in multiple functions but with minor differences - how to optimize?

This is code from a Udacity course, and I changed it a little. Now, when it runs, it asks me for a movie name and the trailer opens in a browser pop-up (that's another part, which is not shown).
As you can see, this program has a lot of repetitive code in it; the functions extract_name, movie_poster_url and movie_trailer_url have more or less the same code. Is there a way to get rid of the repeated code but keep the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup

name = raw_input("Enter movie name:- ")
global movie_name


def extract_html(name):
    url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))
    return page


def extract_name(page):
    start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
    start_url = page.find('>', start_link-140)
    start_url1 = page.find('>', start_link-140)
    end_url = page.find(' - IMDb</a>', start_link-140)
    name_of_movie = page[start_url1+1:end_url]
    return extract_char(name_of_movie)


def extract_char(name_of_movie):
    name_array = []
    for words in name_of_movie:
        word = words.strip('</b>,')
        name_array.append(word)
    return ''.join(name_array)


def movie_poster_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
    start_link = page.find('"Poster":')
    start_url = page.find('"', start_link+9)
    end_url = page.find('"', start_url+1)
    poster_url = page[start_url+1:end_url]
    return poster_url


def movie_trailer_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another website name' + movie_name + " trailer").read()
    start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
    start_url = page.find('"', start_link+110)
    end_url = page.find('" ', start_url+1)
    trailer_url1 = page[start_url+1:end_url]
    trailer_url = "www.youtube.com" + trailer_url1
    return trailer_url


page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:
def find_page(url, name, find, offset):
    movie_name, seperator, tail = name.partition(' (')
    page = urllib.urlopen(url.format(movie_name)).read()
    start_link = page.find(find)
    start_url = page.find('"', start_link+offset)
    end_url = page.find('" ', start_url+1)
    return page[start_url+1:end_url]


def movie_poster_url(name_of_movie):
    return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)


def movie_trailer_url(name_of_movie):
    trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
    return "www.youtube.com" + trailer_url
It definitely won't run faster (there is extra work to do to "switch" between the functions), but the performance difference is probably negligible.
For your second question: profiling is not a technique or method; it's "finding out what's being bad" in your code:
Profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. (Wikipedia)
So it's not something that speeds up your program; it's a word for the things you do in order to find out what you can do to speed up your program.
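For instance, a minimal sketch of profiling one scrape with the standard-library cProfile module (the function and variable names are taken from the question's script, and this assumes they are defined in the same module):

import cProfile

# run one scrape under the profiler; prints call counts and cumulative time per function
cProfile.run("extract_name(extract_html(name))")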
Going really quickly here because I am a super newb, but I can see the repetition. What I would do is figure out the (mostly) repeating blocks of code shared by all three functions, then figure out where they differ, and write a new function that takes the differences as arguments. So, for instance:
def extract(tarString, delim, startDiff, endDiff):
    # 'page' is assumed to already be in scope, as in the original code
    start_link = page.find(tarString)
    start_url = page.find(delim, start_link+startDiff)
    end_url = page.find(delim, start_url+endDiff)
    return page[start_url+1:end_url]
Then, in your poster, trailer, etc. functions, just call this extract function with the appropriate arguments for each case. For instance, the poster would call
poster_url = extract(tarString='"Poster":', delim='"', startDiff=9, endDiff=1)
I can see you've got another answer already, and it's very likely written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.
