I am working on a script to scrape a website. The problem is that it works normally when I run it with the interpreter, but after compiling it (PyInstaller or py2exe) it fails; it appears that both mechanize and requests fail to keep the session alive.
I have hidden my username and password here, but they are set correctly in the compiled code.
import requests
from bs4 import BeautifulSoup as bs
from sys import argv
import re
import logging

url = argv[1]
payload = {"userName": "real_username", "password": "realpassword"}
session = requests.session()
resp = session.post("http://website.net/login.do", data=payload)
if "forgot" in resp.content:
    logging.error("Login failed")
    exit()
resp = session.get(url)
soup = bs(resp.content)
urlM = url[:url.find("?") + 1] + "page=(PLACEHOLDER)&" + \
       url[url.find("?") + 1:]
# Get number of pages
regex = re.compile(r"\|.*\|\sof\s(\d+)")
script = str(soup.findAll("script")[1])
epNum = int(re.findall(regex, script)[0])  # Number of EPs
pagesNum = epNum // 50
links = []
# Get list of links
# If number of EPs > 50, more than one page
if pagesNum == 0:
    links = [url]
else:
    for i in range(1, pagesNum + 2):
        url = urlM.replace("(PLACEHOLDER)", str(i))
        links.append(url)
# Loop over the links and extract info: ID, NAME, START_DATE, END_DATE
raw_info = []
for pos, link in enumerate(links):
    print "Processing page %d" % (pos + 1)
    sp = bs(session.get(link).content)
    table = sp.table.table
    raw_info.extend(table.findAll("td"))
epURL = "http://www.website.net/exchange/viewep.do?operation"\
        "=executeAction&epId="
# Final data extraction
raw_info = map(str, raw_info)
ids = [re.findall(r"\d+", i)[0] for i in raw_info[::4]]
names = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[1::4]]
start_dates = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[2::4]]
end_dates = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[3::4]]
emails = []
eplinks = [epURL + str(i) for i in ids]
print names
The error happens at the epNum variable, which tells me that the HTML page returned is not the one I requested. The script works normally on Linux, both interpreted and compiled, and on Windows as a script, but it fails on Windows when compiled.
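To confirm that, a minimal diagnostic sketch (reusing the session and payload from the script above) could log what the compiled build actually receives, since console output may be lost when running the frozen executable:

# Diagnostic sketch (assumes the same session/payload as above): write the
# login response details to a file, since stdout may be lost when compiled.
logging.basicConfig(filename="scrape_debug.log", level=logging.DEBUG)
resp = session.post("http://website.net/login.do", data=payload)
logging.debug("status code: %s", resp.status_code)
logging.debug("session cookies: %s", session.cookies.get_dict())
logging.debug("first 500 bytes: %r", resp.content[:500])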
The py2exe tutorial mentions that you need MSVCR90.dll; did you check that it is present on the PC?
My code below will open a website, scrape values into an array, and plot them. Notice at the bottom that one can comment out driver.quit(), and when the Python code stops, the webpage of interest is still open. A short time later I would like to restart the Python code and continue reading from the website. My attempt was to print out the value of the driver and jump back to that value without having to open a new page. Once I am on the welcome page it takes a lot of time and effort to get to the desired page, and I would like to avoid that. Look at the third line of code, where I have pasted the value of the driver for the session that is currently open; Python does not like that. Is there a way to continue with that session from Python while it is still open?
import time
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="C:\chromedriver.exe")
driver.get("https://my_example.com/welcome")
# driver = <selenium.webdriver.chrome.webdriver.WebDriver (session="1a636e51f3d40bd9b66996e3d52d945b")>
my_name = ["nm25", "nm26", "nm27", "nm33", "nm38", "nm41", "nm45", ]
data_points = 450
my_file = np.zeros((13, 7, data_points))
x = []
y = []
soup = "chicken"  # Initialization constant
while soup.find(my_name[0]) == -1:  # wait until the first name appears in the page text
    source = driver.page_source
    soup = BeautifulSoup(source, "lxml")
    soup = (soup.get_text())
    time.sleep(5)
tim = time.time()
my_cntr = 0
plt.title("Title")
plt.xlabel(" Time")
plt.ylabel("y axis amplitude")
for i in range(data_points):
    source = driver.page_source
    soup = BeautifulSoup(source, "lxml")  # was "html.parser"
    soup3 = (soup.get_text())
    soup2 = (soup.get_text("**", strip=True))  # adds "**" between values for ease of reading
    x_pos = soup2.find(my_name[0])  # Remove all but first name
    soup2 = soup2[x_pos - 2:]  # Data starts with "**"
    for j in range(len(my_name)):  # Go through all of the names
        for k in range(0, 13):  # The number of values to read per name
            soup2 = soup2[2:]  # Remove **
            x_pos = soup2.find('**')
            if k < 2 or k == 6:
                my_file[k, j, i] = time.time() - tim  # adds time stamp
            else:
                my_file[k, j, i] = soup2[:x_pos]
            soup2 = soup2[x_pos:]
            if (k == 7) and j == 0:
                x.append(my_file[0, 0, i])
                y.append(my_file[7, 0, i])
    my_cntr = my_cntr + 1
    if my_cntr / 20 == int(my_cntr / 20):  # refresh the plot every 20 samples
        plt.plot(x, y)
        plt.pause(0.2)
plt.show()
driver.quit()  # remove this line to leave the browser open
Try getting the session id and saving it. You should probably save the session id in a config.py file, JSON file, txt file, etc. and access it later:
driver = webdriver.Chrome(executable_path="C:\chromedriver.exe")
driver.get("https://my_example.com/welcome")
# do stuff and then save the session id at the end
session_id = driver.session_id

import json
# save the session id in a json file
with open("sample.json", "w") as f:
    f.write(json.dumps({'session_id': session_id}))
Once the code finishes running and the Selenium browser is still open, you should be able to run another .py file that reads the session id you just saved in the JSON file, so you can access the open browser again:
import json

# open the json file
with open('sample.json', 'r') as f:
    session_id = json.load(f)['session_id']

driver.session_id = session_id
# do more stuff
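Note that assigning session_id to a brand-new driver object is usually not enough on its own. A commonly cited workaround, sketched below and not guaranteed to work across all Selenium versions, is to also save the executor URL (driver.command_executor._url) alongside the session id and then attach a Remote driver to it while suppressing the creation of a new session:

from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver

def attach_to_session(executor_url, session_id):
    # Reuse an already-running browser instead of starting a new one.
    # Monkey-patches the newSession command so webdriver.Remote does not
    # spawn a fresh browser, then swaps in the saved session id.
    original_execute = RemoteWebDriver.execute

    def new_command_execute(self, command, params=None):
        if command == "newSession":
            # pretend a session was created and hand back the saved one
            return {"success": 0, "value": None, "sessionId": session_id}
        return original_execute(self, command, params)

    RemoteWebDriver.execute = new_command_execute
    driver = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
    driver.session_id = session_id
    RemoteWebDriver.execute = original_execute  # undo the patch
    return driver

# usage (executor_url and session_id loaded from the saved JSON):
# driver = attach_to_session("http://127.0.0.1:54321", "1a636e51f3d40bd9b66996e3d52d945b")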
In a previous question I got an answer from HedgeHog (How to check for new discounts and send to telegram if changes detected?).
But another question is: how can I get only the new (product) items in the output, and not all of the text that has changed? My feeling is that the output I get is literally anything that changed on the website, not just the newly added discounts.
Here is the code; see the attachment for the output. Thanks again for all the effort.
# Import all necessary packages
import requests, time, difflib, os, re, schedule, cloudscraper
from bs4 import BeautifulSoup
from datetime import datetime

# Define scraper
scraper = cloudscraper.create_scraper()

# Send a message via a telegram bot
def telegram_bot_sendtext(bot_message):
    bot_token = '1XXXXXXXXXXXXXXXXXXXXXXXXXXG5pses8'
    bot_chatID = '-XXXXXXXXXXX'
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()

PrevVersion = ""
FirstRun = True
while True:
    # Download the page with the specified URL
    response = scraper.get("https://").content
    # Url for in the messages to show
    url = "https://"
    # Act like a browser
    #headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    # Parse the downloaded page and check for discount on the page
    soup = BeautifulSoup(response, 'html.parser')

    def get_discounts(soup):
        for d in soup.select('.cept-discount'):
            if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
                return True
            else:
                return False

    # Remove all scripts and styles
    for script in soup(["script", "style"]):
        script.extract()
    discounts = get_discounts(soup)
    soup = soup.get_text()
    # Compare the page text to the previous version and check if there are any discounts in your range
    if PrevVersion != soup and discounts:
        # On the first run - just memorize the page
        if FirstRun == True:
            PrevVersion = soup
            FirstRun = False
            print("Start Monitoring " + url + " " + str(datetime.now()))
        else:
            print("Changes detected at: " + str(datetime.now()))
            OldPage = PrevVersion.splitlines()
            NewPage = soup.splitlines()
            diff = difflib.context_diff(OldPage, NewPage, n=0)
            out_text = "\n".join([ll.rstrip() for ll in '\n'.join(diff).splitlines() if ll.strip()])
            print(out_text)
            OldPage = NewPage
            # Send a message with the telegram bot
            telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url)
            # print('\n'.join(diff))
            PrevVersion = soup
    else:
        print("No Changes " + str(datetime.now()))
    time.sleep(5)
    continue
What happens?
As discussed, your assumption is going in the right direction: all of the changes identified by difflib will be displayed.
It may be possible to adjust the output of difflib, but I am sure that difflib is not strictly necessary for this task.
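For illustration: if you did want to stay with difflib, one option is to filter its output down to only the added lines (the ones difflib prefixes with '+') instead of printing the whole context diff. A minimal, self-contained sketch:

import difflib

old_page = ["item A", "item B"]
new_page = ["item A", "item B", "item C  75% discount"]

# unified_diff marks added lines with a leading '+'; skip the '+++' file header
diff = difflib.unified_diff(old_page, new_page, n=0, lineterm="")
added = [line[1:].strip() for line in diff
         if line.startswith("+") and not line.startswith("+++")]
print(added)  # ['item C  75% discount']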
How to fix?
The first step is to upgrade get_discounts(soup) so that it not only checks whether a discount is in range but also collects information about the item itself, which you can display or operate on later:
def get_discounts(soup):
    discounts = []
    for d in soup.select('.cept-discount'):
        if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
            discounts.append({
                'name': d.find_previous('strong').a.get('title'),
                'url': d.find_previous('strong').a.get('href'),
                'discount': d.text,
                'price': d.parent.parent.select_one('.thread-price').text,
                'bestprice': d.previous_sibling.text
            })
    return discounts
The second step is to check whether there is a new discount; this is close to what difflib does, but more focused:
def compare_discounts(d1: list, d2: list):
    diff = [i for i in d1 + d2 if i not in d1]
    result = len(diff) == 0
    if not result:
        return diff
The last step is to react to the new discounts: if there are any, print their URLs so you can go directly to the offered products.
Note: because we have stored additional information in our list of dicts, you can adjust the printing to output the whole record or only specific attributes.
if newDiscounts:
    # Send a message with the telegram bot
    print('\n'.join([c['url'] for c in newDiscounts]))
    telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url)
Example
import requests, time, difflib, os, re, schedule, cloudscraper
from bs4 import BeautifulSoup
from datetime import datetime

# Define scraper
scraper = cloudscraper.create_scraper()

# Send a message via a telegram bot
def telegram_bot_sendtext(bot_message):
    bot_token = '1XXXXXXXXXXXXXXXXXXXXXXXXXXG5pses8'
    bot_chatID = '-XXXXXXXXXXX'
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()

PrevVersion = ""
PrevDiscounts = []
FirstRun = True

def get_discounts(soup):
    discounts = []
    for d in soup.select('.cept-discount'):
        if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
            discounts.append({
                'name': d.find_previous('strong').a.get('title'),
                'url': d.find_previous('strong').a.get('href'),
                'discount': d.text,
                'price': d.parent.parent.select_one('.thread-price').text,
                'bestprice': d.previous_sibling.text
            })
    return discounts

def compare_discounts(d1: list, d2: list):
    diff = [i for i in d1 + d2 if i not in d1]
    result = len(diff) == 0
    if not result:
        return diff

while True:
    # Download the page with the specified URL
    response = requests.get("https://nl.pepper.com/nieuw").content
    # Url for in the messages to show
    url = "https://nl.pepper.com/nieuw"
    # Parse the downloaded page and check for discount on the page
    soup = BeautifulSoup(response, 'html.parser')
    # Remove all scripts and styles
    for script in soup(["script", "style"]):
        script.extract()
    discounts = get_discounts(soup)
    souptext = soup.get_text()
    # Compare the page text to the previous version and check if there are any discounts in your range
    if PrevVersion != souptext and discounts:
        # On the first run - just memorize the page
        if FirstRun == True:
            PrevVersion = souptext
            PrevDiscounts = discounts
            FirstRun = False
            print("Start Monitoring " + url + " " + str(datetime.now()))
        else:
            print("Changes detected at: " + str(datetime.now()))
            newDiscounts = compare_discounts(PrevDiscounts, discounts)
            if newDiscounts:
                print('\n'.join([c['url'] for c in newDiscounts]))
                # Send a message with the telegram bot
                telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url)
            else:
                print('These are general changes but there are no new discounts available.')
            PrevVersion = souptext
            PrevDiscounts = discounts
    else:
        print("No Changes " + str(datetime.now()))
    time.sleep(10)
    continue
Output
Start Monitoring https://nl.pepper.com/nieuw 2021-12-12 12:28:38.391028
No Changes 2021-12-12 12:28:54.009881
Changes detected at: 2021-12-12 12:29:04.429961
https://nl.pepper.com/aanbiedingen/gigaset-plug-startpakket-221003
No Changes 2021-12-12 12:29:14.698933
No Changes 2021-12-12 12:29:24.985394
No Changes 2021-12-12 12:29:35.271794
No Changes 2021-12-12 12:29:45.629790
No Changes 2021-12-12 12:29:55.917246
Changes detected at: 2021-12-12 12:30:06.184814
These are general changes but there are no new discounts available.
I am a student working on a scraping project, and I am having trouble completing my script because it fills my computer's memory with all of the data it stores.
It currently keeps all of my data in memory until the end, so my solution would be to break the scrape into smaller bits and write the data out periodically, instead of building one big list and writing it all out at the end.
In order to do this, I would need to stop my scroll method, scrape the loaded profiles, write out the data I have collected, and then repeat this process without duplicating my data. It would be appreciated if someone could show me how to do this (see the sketch after the code below). Thank you for your help :)
Here's my current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException

Data = []

driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body")
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
while len(profile_count) < count:  # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
for link in profile_count:  # Calling up links
    temp = link.get_attribute('href')  # temp for
    driver.execute_script("window.open('');")  # open new tab
    driver.switch_to.window(driver.window_handles[1])  # focus new tab
    driver.get(temp)
    # scrape code
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
    except NoSuchElementException:
        AccreditedBy = "N/A"
    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
    except NoSuchElementException:
        Expires = "N/A"
    info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"
    Data.extend(info)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
with open("Spredsheet.txt", "w") as output:
    output.write(','.join(Data))
driver.close()
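Regarding the periodic write-out asked about above, here is a minimal sketch of that idea. It assumes switching from Data.extend(info) to Data.append(info) so that each list entry is one complete profile row; the flush_rows helper and the file name are hypothetical, not part of the original script:

import csv

def flush_rows(rows, path="profiles.csv"):
    # Hypothetical helper: append the buffered rows to a CSV file and clear
    # the in-memory buffer so the list never grows without bound.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)
    del rows[:]

# Inside the profile loop, after building `info` for one profile:
#     Data.append(info)
#     if len(Data) >= 50:     # flush every 50 profiles instead of at the end
#         flush_rows(Data)
# And once more after the loop, to write any remaining rows:
#     if Data:
#         flush_rows(Data)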
Try the below approach using requests and BeautifulSoup. In the script I have used the API URL fetched from the website itself.
For the first iteration it builds the base URL, then writes the headers and the data to the .csv file.
For the second and later iterations it builds the URL again with two extra parameters, start_on_page and show_per_page, where start_on_page starts at 20 and is incremented by 20 on each iteration, and show_per_page is set to 100 to extract 100 records per iteration, and so on until all the data has been dumped into the .csv file.
The script dumps four things: number, name, location and profile URL.
On each iteration the data is appended to the .csv file, so your memory issue is resolved by this approach.
Before running the script, do not forget to set the file_path variable to the system path where you want the .csv file to be created.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
import csv

def scrap_directory_data():
    list_of_credentials = []
    file_path = ''
    file_name = 'credential_list.csv'
    count = 0
    page_number = 0
    page_size = 100
    create_url = ''
    main_url = 'https://directory.bcsp.org/search_results.php?'
    first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
    number_of_records = 0
    csv_headers = ['#', 'Name', 'Location', 'Profile URL']
    while True:
        if count == 0:
            create_url = main_url + first_iteration_url
            print('-' * 100)
            print('1 iteration URL created: ' + create_url)
            print('-' * 100)
        else:
            create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
            print('-' * 100)
            print('Other then first iteration URL created: ' + create_url)
            print('-' * 100)
        page = requests.get(create_url, verify=False)
        extracted_text = bs(page.text, 'lxml')
        result = extracted_text.find_all('tr')
        if len(result) > 0:
            for idx, data in enumerate(result):
                if idx > 0:
                    number_of_records += 1
                    name = data.contents[1].text
                    location = data.contents[3].text
                    profile_url = data.contents[5].contents[0].attrs['href']
                    list_of_credentials.append({
                        '#': number_of_records,
                        'Name': name,
                        'Location': location,
                        'Profile URL': profile_url
                    })
                print(data)
                with open(file_path + file_name, 'a+') as cred_CSV:
                    csvwriter = csv.DictWriter(cred_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                    if idx == 0 and count == 0:
                        print('Writing CSV header now...')
                        csvwriter.writeheader()
                    else:
                        for item in list_of_credentials:
                            print('Writing data rows now..')
                            print(item)
                            csvwriter.writerow(item)
                        list_of_credentials = []
        count += 1
        page_number += 20

scrap_directory_data()
I am trying to extract some information about mtg cards from a webpage with the following program, but I repeatedly retrieve information about the initial page (InitUrl); the crawler is unable to proceed further. I have started to believe that I am not using the correct URLs, or maybe there is a restriction on using urllib that slipped my attention. Here is the code that I have struggled with for weeks now:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4  # depth of pages to be retrieved

query = InitUrl.split("?")[1]

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    print(Url)

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
One of the reasons your code fails is that you don't use cookies. The site seems to require them to allow paging.
A clean and simple way of extracting the data you're interested in would be like this:
import requests
from bs4 import BeautifulSoup
# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
All pages have a next button except the last one, so we use this knowledge to loop until the next button goes away. When it does, meaning the last page has been reached, the button is replaced with an 'li' tag with the class 'next hidden', which only exists on the last page.
Now we're ready to start looping:
page = 1  # set count for start page
keep_paging = True  # use flag to end loop when last page is reached
while keep_paging:
    print("[*] Extracting data for page {}".format(page))
    r = session.get(paging_url.format(page))
    soup = BeautifulSoup(r.text, "html.parser")
    items = soup.select('.iso-item.item-row-view.clearfix')
    for item in items:
        name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
        toughness_element = item.find('div', class_='card-power-toughness')
        try:
            toughness = toughness_element.get_text().strip()
        except:
            toughness = None
        cardtype = item.find('div', class_='cardtype').get_text()
        card_dict = {
            "name": name,
            "toughness": toughness,
            "cardtype": cardtype
        }
        return_list.append(card_dict)
    if soup.select('li.next.hidden'):  # this element only exists if the last page is reached
        keep_paging = False
        print("[*] Scraper is done. Quitting...")
    else:
        page += 1
# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
This will page through until no more pages exist, no matter how many subpages the site has.
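As the final comment hints, a possible follow-up (a sketch, assuming pandas is installed and using an example file name) is to load the collected dicts into a DataFrame and save them to disk:

import pandas as pd

df = pd.DataFrame(return_list)             # columns: name, toughness, cardtype
df.to_csv("mtg_dragons.csv", index=False)  # or df.to_excel("mtg_dragons.xlsx")
print(df.head())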
My point in the comment above was merely that if you encounter an exception in your code, your page count would never increase. That's probably not what you want, which is why I recommended that you learn a little more about the behaviour of the whole try/except/else/finally construct.
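To make that concrete, here is a small standalone example of the try/except/else/finally flow: the else block only runs when the try block did not raise, so a counter incremented there stops advancing as soon as an exception occurs.

pages_crawled = 0
for url in ["good-page", "broken-page"]:
    try:
        if url == "broken-page":
            raise IndexError("no next link found")
    except IndexError:
        print("Stopped on", url)
    else:
        pages_crawled += 1          # skipped whenever the try block raised
    finally:
        print("pages crawled so far:", pages_crawled)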
I am also baffled that the request returns the same reply regardless of the page parameter. As a dirty solution I can offer you to first set the page-size to a number high enough to get all the items that you want (this parameter works, for some reason...):
import re
from math import ceil
import requests
from bs4 import BeautifulSoup as soup

InitUrl = Url = "https://mtgsingles.gr/search"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 2  # depth of pages to be retrieved
query = "dragon"
cardSet = set()

for i in range(1, NumOfPages):
    page_html = requests.get(InitUrl, params={"page": i, "q": query, "page-size": 999})
    print(page_html.url)
    page_soup = soup(page_html.text, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        cardString = card_name + "\n" + cardP_T + "\n" + cardType + "\n"
        cardSet.add(cardString)
        print(cardString)

    NumOfCrawledPages += 1
    print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " + str(len(cards)) + "(cards)\n")
I have recently followed this tutorial (http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/) on web scraping with Python and BeautifulSoup. Unfortunately, the tutorial was written for a 2.x version of Python, while I am using version 3.2.3 so as to focus on learning a language version that is still being developed. My program retrieves, opens, reads and enters the search term on the web page just fine (as far as I can tell), but it never enters the
for result in results
loop, so nothing is collected and printed. I think this may have to do with how I have assigned results, but I am unsure of how to fix it. Here's my code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import string
import sys

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = {'user-agent': user_agent}
url = "http://www.dict.cc"
values = {'s': 'rainbow'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
request = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(request)
the_page = response.read()
pool = BeautifulSoup(the_page)
results = pool.find_all('td', attrs={'class': 'td7n1'})
source = ''
translations = []

for result in results:
    word = ''
    for tmp in result.find_all(text=True):
        word += " " + str(tmp)  # Python 3: str, not unicode()/encode()
    if source == '':
        source = word
    else:
        translations.append((source, word))

for translation in translations:
    print("%s => %s" % (translation[0], translation[1]))