Python: urllib.request.urlopen Is not working properly

I don't know why this code does not work.
When I print the lists "videos" and "search results" to see what is happening, both are empty, so the body of the if statement is never executed.
# Query youtube results based on the songs entered in the text file
def find_video_urls(self, songs):
    videos = list()
    for song in songs:
        self.update_text_widget("\nSong - " + song + " - Querying youtube results ...")
        query_string = urllib.parse.urlencode({"search_query": song})
        with urllib.request.urlopen("http://www.youtube.com/results?" + query_string) as html_content:
            # retrieve all videos that met the song name criteria
            search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
            # only take the top result
            if len(search_results) != 0:
                videos.append("https://www.youtube.com/watch?v=" + search_results[0])
                self.update_text_widget("\nSong - " + song + " - Found top result!")
    return videos

Hi Vigilante, here is how to do it with requests. Maybe you can adapt it to your code.
import re
import requests
def find_video_urls(songs):
    videos = list()
    for song in songs:
        with requests.session() as ses:
            r = ses.get('http://www.youtube.com/results', params={"search_query": song})
            # raw bytes pattern, because r.content is bytes
            search_results = re.findall(rb'(/watch\?v=.{11})"', r.content, re.MULTILINE | re.IGNORECASE | re.DOTALL)
            print(search_results)
            # only take the top result
            if len(search_results) != 0:
                videos.append(b"".join([b'https://www.youtube.com', search_results[0]]))
    return videos
print(find_video_urls(['Eminem - Lose Yourself']))
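The extraction step can be tried offline, without any network request. The following is a minimal sketch under assumptions: `sample_html`, the video IDs in it, and the helper name `extract_video_ids` are made up for illustration, and the real YouTube results page may be structured differently. It pulls the 11-character IDs that follow `/watch?v=` and keeps the first occurrence of each.

```python
import re

# Hypothetical HTML fragment; a real results page may look different.
sample_html = (
    '<a href="/watch?v=AAAAAAAAAAA">first</a>'
    '<a href="/watch?v=BBBBBBBBBBB">second</a>'
    '<a href="/watch?v=AAAAAAAAAAA">first again</a>'
)

def extract_video_ids(html):
    """Return unique 11-character video IDs in order of first appearance."""
    ids = re.findall(r'/watch\?v=([\w-]{11})', html)
    # dict.fromkeys keeps insertion order and drops repeats
    return list(dict.fromkeys(ids))

video_ids = extract_video_ids(sample_html)
# Only take the top result, as in the answer above
top_result = "https://www.youtube.com/watch?v=" + video_ids[0] if video_ids else None
print(video_ids, top_result)
```

If this works on a saved copy of the page but the live request returns no matches, the downloaded HTML itself is probably different from what the browser shows.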

Related

How to output only relevant changes while scraping for new discounts?

In a previous question I got an answer from Hedgehog! (How to check for new discounts and send to telegram if changes detected?)
A follow-up question: how can I get only the new items (products) in the output, and not all the text that changed? My impression is that the output contains literally everything that changed on the website, not only the newly added discounts.
Here is the code. Thanks again for all the effort.
# Import all necessary packages
import requests, time, difflib, os, re, schedule, cloudscraper
from bs4 import BeautifulSoup
from datetime import datetime
# Define scraper
scraper = cloudscraper.create_scraper()
# Send a message via a telegram bot
def telegram_bot_sendtext(bot_message):
    bot_token = '1XXXXXXXXXXXXXXXXXXXXXXXXXXG5pses8'
    bot_chatID = '-XXXXXXXXXXX'
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()
PrevVersion = ""
FirstRun = True
while True:
    # Download the page with the specified URL
    response = scraper.get("https://").content
    # Url for in the messages to show
    url = "https://"
    # Act like a browser
    #headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    # Parse the downloaded page and check for discount on the page
    soup = BeautifulSoup(response, 'html.parser')
    def get_discounts(soup):
        for d in soup.select('.cept-discount'):
            if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
                return True
            else:
                return False
    # Remove all scripts and styles
    for script in soup(["script", "style"]):
        script.extract()
    discounts = get_discounts(soup)
    soup = soup.get_text()
    # Compare the page text to the previous version and check if there are any discounts in your range
    if PrevVersion != soup and discounts:
        # On the first run - just memorize the page
        if FirstRun == True:
            PrevVersion = soup
            FirstRun = False
            print ("Start Monitoring "+url+ ""+ str(datetime.now()))
        else:
            print ("Changes detected at: "+ str(datetime.now()))
            OldPage = PrevVersion.splitlines()
            NewPage = soup.splitlines()
            diff = difflib.context_diff(OldPage,NewPage,n=0)
            out_text = "\n".join([ll.rstrip() for ll in '\n'.join(diff).splitlines() if ll.strip()])
            print (out_text)
            OldPage = NewPage
            # Send a message with the telegram bot
            telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url )
            # print ('\n'.join(diff))
            PrevVersion = soup
    else:
        print( "No Changes "+ str(datetime.now()))
    time.sleep(5)
    continue

What happens?
As discussed, your assumption is right: every change that difflib identifies is displayed.
It may be possible to adjust what difflib reports, but difflib is not strictly necessary for this task.
How to fix?
The first step is to upgrade get_discounts(soup) so that it not only checks whether a discount is in range but also collects information about the item itself, in case you want to display it or operate on it later:
def get_discounts(soup):
    discounts = []
    for d in soup.select('.cept-discount'):
        if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
            discounts.append({
                'name':d.find_previous('strong').a.get('title'),
                'url':d.find_previous('strong').a.get('href'),
                'discount':d.text,
                'price':d.parent.parent.select_one('.thread-price').text,
                'bestprice':d.previous_sibling.text
            })
    return discounts
The second step is to check whether there is a new discount, similar to difflib but more focused:
def compare_discounts(d1: list, d2: list):
    diff = [i for i in d1 + d2 if i not in d1]
    result = len(diff) == 0
    if not result:
        return diff
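To see what this comparison returns, here is a minimal sketch with made-up data. The helper name `find_new_discounts` and the sample entries are hypothetical. It keeps the items of the current list that were not in the previous one, which is the effect of the comprehension above, and it returns an empty list rather than None when nothing is new.

```python
def find_new_discounts(previous, current):
    """Return the entries of `current` that do not appear in `previous`."""
    return [item for item in current if item not in previous]

# Hypothetical sample data in the same shape as the dicts built by get_discounts
previous = [{'name': 'Item A', 'url': 'https://example.com/a', 'discount': '-70%'}]
current = [
    {'name': 'Item A', 'url': 'https://example.com/a', 'discount': '-70%'},
    {'name': 'Item B', 'url': 'https://example.com/b', 'discount': '-80%'},
]

new_items = find_new_discounts(previous, current)
print('\n'.join(item['url'] for item in new_items))
```

Dicts compare by content, so an item counts as "new" as soon as any stored attribute differs, for example when the price text changes.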
The last step is to react to changes in the discounts: if there are new ones, print their URLs so you can go directly to the offered products.
Note: because we stored additional information in our list of dicts, you can adjust the printing to show the whole entry or specific attributes.
if newDiscounts:
    #Send a message with the telegram bot
    print('\n'.join([c['url'] for c in newDiscounts]))
    telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url)
Example
import requests, time, difflib, os, re, schedule, cloudscraper
from bs4 import BeautifulSoup
from datetime import datetime
# Define scraper
scraper = cloudscraper.create_scraper()
# Send a message via a telegram bot
def telegram_bot_sendtext(bot_message):
    bot_token = '1XXXXXXXXXXXXXXXXXXXXXXXXXXG5pses8'
    bot_chatID = '-XXXXXXXXXXX'
    send_text = 'https://api.telegram.org/bot' + bot_token + '/sendMessage?chat_id=' + bot_chatID + '&parse_mode=Markdown&text=' + bot_message
    response = requests.get(send_text)
    return response.json()
PrevVersion = ""
PrevDiscounts = []
FirstRun = True
def get_discounts(soup):
    discounts = []
    for d in soup.select('.cept-discount'):
        if d.text != '' and 65 < int(''.join(filter(str.isdigit, d.text))) < 99:
            discounts.append({
                'name':d.find_previous('strong').a.get('title'),
                'url':d.find_previous('strong').a.get('href'),
                'discount':d.text,
                'price':d.parent.parent.select_one('.thread-price').text,
                'bestprice':d.previous_sibling.text
            })
    return discounts
def compare_discounts(d1: list, d2: list):
    diff = [i for i in d1 + d2 if i not in d1]
    result = len(diff) == 0
    if not result:
        return diff
while True:
    # Download the page with the specified URL
    response = requests.get("https://nl.pepper.com/nieuw").content
    # Url for in the messages to show
    url = "https://nl.pepper.com/nieuw"
    # Parse the downloaded page and check for discount on the page
    soup = BeautifulSoup(response, 'html.parser')
    # Remove all scripts and styles
    for script in soup(["script", "style"]):
        script.extract()
    discounts = get_discounts(soup)
    souptext = soup.get_text()
    # Compare the page text to the previous version and check if there are any discounts in your range
    if PrevVersion != souptext and discounts:
        # On the first run - just memorize the page
        if FirstRun == True:
            PrevVersion = souptext
            PrevDiscounts = discounts
            FirstRun = False
            print ("Start Monitoring "+url+ ""+ str(datetime.now()))
        else:
            print ("Changes detected at: "+ str(datetime.now()))
            newDiscounts = compare_discounts(PrevDiscounts,discounts)
            if newDiscounts:
                print('\n'.join([c['url'] for c in newDiscounts]))
                #Send a message with the telegram bot
                telegram_bot_sendtext("Nieuwe prijsfout op Pepper " + url)
            else:
                print('These are general changes but there are no new discounts available.')
            PrevVersion = souptext
            PrevDiscounts = discounts
    else:
        print( "No Changes "+ str(datetime.now()))
    time.sleep(10)
    continue
Output
Start Monitoring https://nl.pepper.com/nieuw 2021-12-12 12:28:38.391028
No Changes 2021-12-12 12:28:54.009881
Changes detected at: 2021-12-12 12:29:04.429961
https://nl.pepper.com/aanbiedingen/gigaset-plug-startpakket-221003
No Changes 2021-12-12 12:29:14.698933
No Changes 2021-12-12 12:29:24.985394
No Changes 2021-12-12 12:29:35.271794
No Changes 2021-12-12 12:29:45.629790
No Changes 2021-12-12 12:29:55.917246
Changes detected at: 2021-12-12 12:30:06.184814
These are general changes but there are no new discounts available.

Extract specific text from a list in Python

I am trying to extract certain information from a long list of text to display it nicely, but I cannot figure out how exactly to tackle this problem.
My text is as follows:
"(Craw...Crawley\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\n**Hotstage**\n **248236**\n\n\n\n\n\n\n\n\n\n\n\n\n\nCosta Collect...Costa Coffee (Bedf...Bedford\n\n\n\n\n\n\n08:00\n\n\n\n \n\n\n**Hotstage**\n **247962**\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Sheffield Qu...Sheffield\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 247971\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Brentford...BRENTFORD\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 248382\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Newport"
I would like to extract what is highlighted (the parts marked with **).
I think the solution is simple, and maybe I am not storing or extracting the information properly.
This is my code:
from bs4 import BeautifulSoup
import requests
import re
import time
def main():
    url = "http://antares.platinum-computers.com/schedule.htm"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    response.close()
    # Get
    tech_count = 0
    technicians = [] #List to hold technicians names
    xcount = 0
    test = 0
    name_links = soup.find_all('td', {"class": "resouce_on"}) #Get all table data with class name "resource on".
    # iterate through html data and add them to "technicians = []"
    for i in name_links:
        technicians.append(str(i.text.strip())) # append value to dictionary
        tech_count += 1
    print("Found: " + str(tech_count) + " technicians + 1 default unallocated.")
    for t in technicians:
        print(xcount,t)
        xcount += 1
    test = int(input("choose technician: "))
    for link in name_links:
        if link.find(text=re.compile(technicians[test])):
            jobs = []
            numbers = []
            unique_cr = []
            jobs.append(link.parent.text.strip())
            for item in jobs:
                for subitem in item.split():
                    if(subitem.isdigit()):
                        numbers.append(subitem)
            for number in numbers:
                if number not in unique_cr:
                    unique_cr.append(number)
            print ("tasks for technician " + str(technicians[test]) + " are as follows")
            for cr in unique_cr:
                print (jobs)
if __name__ == '__main__':
    main()

It's fairly simple:
myStr = "your complicated text"
words = myStr.split("\n")
niceWords = []
for word in words:
    # keep only the lines that carry the ** markers, without the markers
    if "**" in word:
        niceWords.append(word.replace("**", "").strip())
print(niceWords)
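An alternative to splitting on newlines is a regular expression that captures whatever sits between a pair of ** markers. This is a minimal sketch, assuming the markers always come in pairs on one line; the `text` value is a shortened, hypothetical sample in the same shape as the string in the question.

```python
import re

# Shortened, hypothetical sample in the same shape as the text in the question
text = "Crawley\n\n08:00\n\n**Hotstage**\n **248236**\n\nBedford\n\n08:00\n\n**Hotstage**\n **247962**\n"

# Non-greedy match of anything between a pair of ** markers
highlighted = re.findall(r'\*\*(.+?)\*\*', text)
print(highlighted)

# Pair each label with the number that follows it
pairs = list(zip(highlighted[0::2], highlighted[1::2]))
print(pairs)
```

Entries in the original text that are not wrapped in ** (such as the later "Hotstage" rows) are not matched by this pattern.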

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it picks up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Really, what I would ideally want to scrape is just one of these.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests
def get_source(url):
    return requests.get(url).content
def is_forum_link(self):
    return self.find('special string') != -1
def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")
main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href']);
print('Fetched ' + str(len(forums)) + ' forums')
threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')

This solution assumes that the part to remove from the URL when checking for uniqueness is always of the form "/page-#...". If that is not the case, this solution will not work.
Instead of a list, you can store your URLs in a set, which only keeps unique values. Before adding a URL to the set, remove the last occurrence of "/page-#" (where # is any number) and anything that comes after it.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url);
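The same idea can be checked against the thread links from the question. This is a minimal sketch that swaps the rfind logic for a regular expression; the helper name `strip_page_suffix` is hypothetical, and it rests on the same assumption that the page suffix always looks like "/page-" followed by a number.

```python
import re

# Thread links copied from the question's sample output
links = [
    "threads/my-gap-year-uni-story.13846/",
    "threads/my-gap-year-uni-story.13846/page-9",
    "threads/my-gap-year-uni-story.13846/page-10",
    "threads/my-gap-year-uni-story.13846/page-11",
]

def strip_page_suffix(url):
    """Drop a '/page-<number>' segment and anything after it, keeping the slash."""
    return re.sub(r'/page-\d+.*$', '/', url)

# A set keeps a single copy of each normalized thread URL
unique_threads = {strip_page_suffix(link) for link in links}
print(unique_threads)
```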

Web crawler not able to process more than one webpage

I am trying to extract some information about MTG cards from a webpage with the following program, but I repeatedly retrieve information about the initial page (InitUrl); the crawler is unable to proceed further. I have started to believe that I am not using the correct URLs, or that there is some restriction in urllib that slipped my attention. Here is the code I have been struggling with for weeks:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup
InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4 # depth of pages to be retrieved
query = InitUrl.split("?")[1]
for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    print(Url)
    UClient = uReq(Url) # downloading the url
    page_html = UClient.read()
    UClient.close()
    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

One of the reasons your code fails is that you don't use cookies. The site seems to require them to allow paging.
A clean and simple way of extracting the data you're interested in would be like this:
import requests
from bs4 import BeautifulSoup
# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
All pages have a next button except the last one, so we loop until the next button goes away. When it does, meaning the last page has been reached, the button is replaced with a 'li' tag with the class 'next hidden', which only exists on the last page.
Now we're ready to start looping:
page = 1 # set count for start page
keep_paging = True # use flag to end loop when last page is reached
while keep_paging:
    print("[*] Extracting data for page {}".format(page))
    r = session.get(paging_url.format(page))
    soup = BeautifulSoup(r.text, "html.parser")
    items = soup.select('.iso-item.item-row-view.clearfix')
    for item in items:
        name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
        toughness_element = item.find('div', class_='card-power-toughness')
        try:
            toughness = toughness_element.get_text().strip()
        except AttributeError: # find() returned None, the card has no power/toughness
            toughness = None
        cardtype = item.find('div', class_='cardtype').get_text()
        card_dict = {
            "name": name,
            "toughness": toughness,
            "cardtype": cardtype
        }
        return_list.append(card_dict)
    if soup.select('li.next.hidden'): # this element only exists if the last page is reached
        keep_paging = False
        print("[*] Scraper is done. Quitting...")
    else:
        page += 1
# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
This will keep paging until no more pages exist, no matter how many subpages the site has.
My point in the comment above was merely that if your code hits an exception, your page count never increases. That's probably not what you want, which is why I recommended learning a little more about how the whole try-except-else-finally construct behaves.
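The order in which the clauses run can be seen in a small, self-contained sketch. The function names `run`, `ok` and `fails` are made up for illustration. The else clause runs only when the try body raised nothing, which is why a counter incremented there stays put after an exception, while finally runs in both cases.

```python
def run(step):
    """Record which clauses run, depending on whether `step` raises."""
    trace = []
    try:
        trace.append("try")
        step()
    except IndexError:
        trace.append("except")   # only when step() raised IndexError
    else:
        trace.append("else")     # only when step() raised nothing
    finally:
        trace.append("finally")  # always
    return trace

def ok():
    return [1, 2, 3][0]

def fails():
    return [][0]  # raises IndexError

print(run(ok))     # ['try', 'else', 'finally']
print(run(fails))  # ['try', 'except', 'finally']
```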

I am also baffled by the request returning the same reply and ignoring the page parameter. As a dirty solution, I can suggest setting the page size to a number high enough to get all the items you want (this parameter works for some reason...).
import re
from math import ceil
import requests
from bs4 import BeautifulSoup as soup
InitUrl = Url = "https://mtgsingles.gr/search"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 2 # depth of pages to be retrieved
query = "dragon"
cardSet=set()
for i in range(1, NumOfPages):
    page_html = requests.get(InitUrl,params={"page":i,"q":query,"page-size":999})
    print(page_html.url)
    page_soup = soup(page_html.text, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        cardString=card_name + "\n" + cardP_T + "\n" + cardType + "\n"
        cardSet.add(cardString)
        print(cardString)
    NumOfCrawledPages += 1
    print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " +str(len(cards)) +"(cards)\n")

Scrapy, how to limit time per domain?

I have been searching for an answer, and although several similar questions have been asked on this forum, none has been answered. One answer says that it is possible to stop a spider after a certain time, but that is not suitable for me because I usually launch 10 websites per spider. So my challenge is that I have a spider for 10 websites and I would like to limit the time to 20 seconds per domain, in order to avoid getting stuck at some webshop. How can I do it?
In general, I crawl 2000 company websites, and in order to finish in one day I divide them into 200 groups of 10 websites and launch 200 spiders in parallel. That may be amateurish, but it is the best that I know. The computer almost freezes because the spiders consume all the CPU and memory, but the next day I have the results. What I am looking for is employment webpages on companies' websites. Does anyone have a better idea of how to crawl 2000 websites? If there is a webshop among the websites, the crawl could take days, which is why I would like to limit the time per domain.
Thank you in advance.
Marko
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt
# Url whitelist.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw:
url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)
# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw:
tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)
# Look for occupations in url.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url:
occupations_url = occ_url.read().replace('\n', '').split(",")
occupations_url = map(str.strip, occupations_url)
# Look for occupations in tab.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab:
occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8')
occupations_tab = occupations_tab.replace('Ŕ', 'č')
occupations_tab = occupations_tab.replace('L', 'č')
occupations_tab = occupations_tab.replace('Ő', 'š')
occupations_tab = occupations_tab.replace('Ü', 'š')
occupations_tab = occupations_tab.replace('Ä', 'ž')
occupations_tab = occupations_tab.replace('×', 'ž')
occupations_tab = occupations_tab.replace('\n', '').split(",")
occupations_tab = map(str.strip, occupations_tab)
#Join url whitelist and occupations.
url_whitelist_occupations = url_whitelist + occupations_url
#Join tab whitelist and occupations.
tab_whitelist_occupations = tab_whitelist + occupations_tab
#base = open("G:/myVE/vacancies/bazni.txt", "w")
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w")
class JobSpider(scrapy.Spider):
#Name of spider
name = "jobs"
#start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0]
#print urls
#start_urls = map(str.strip, urls)
#Start urls
start_urls = ["http://www.alius.si"]
print "\nSpletna stran ", start_urls, "\n"
#Result of the programme is this list of job vacancies webpages.
jobs_urls = []
def parse(self, response):
#Theoretically I could save the HTML of webpage to be able to check later and see how it looked like
# at the time of downloading. That is important for validation, because it is easier to look at nice HTML webpage instead of naked text.
# but I have to write a pipeline http://doc.scrapy.org/en/0.20/topics/item-pipeline.html
response.selector.remove_namespaces()
#print "response url" , str(response.url)
#Take url of response, because we would like to stay on the same domain.
parsed = urlparse(response.url)
#Base url.
#base_url = get_base_url(response).strip()
base_url = parsed.scheme+'://'+parsed.netloc
#print "base url" , str(base_url)
#If the urls grows from seeds, it's ok, otherwise not.
if base_url in self.start_urls:
#print "base url je v start"
#base.write(response.url+"\n")
#net1 = parsed.netloc
#Take all urls, they are marked by "href" or "data-link". These are either webpages on our website either new websites.
urls_href = response.xpath('//@href').extract()
urls_datalink = response.xpath('//@data-link').extract()
urls = urls_href + urls_datalink
#print "povezave na tej strani ", urls
#Loop through all urls on the webpage.
for url in urls:
#Test all new urls. NE DELA
#print "url ", str(url)
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
if not (url.startswith("http")):
#Povežem delni url z baznim url.
url = urljoin(base_url,url).strip()
#print "new url ", str(url)
new_parsed = urlparse(url)
new_base_url = new_parsed.scheme+'://'+new_parsed.netloc
#print "new base url ", str(new_base_url)
if new_base_url in self.start_urls:
#print "yes"
url = url.replace("\r", "")
url = url.replace("\n", "")
url = url.replace("\t", "")
url = url.strip()
#Remove anchors '#', that point to a section on the same webpage, because this is the same webpage.
#But we keep question marks '?', which mean, that different content is pulled from database.
if '#' in url:
index = url.find('#')
url = url[:index]
if url in self.jobs_urls:
continue
#Ignore ftp and sftp.
if url.startswith("ftp") or url.startswith("sftp"):
continue
#Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
#net2 = urlparse(url).netloc
#test.write("lokacija novega url "+ str(net2)+"\n")
#if net2 != net1:
# continue
#test.write("ni ista lokacija, nadaljujemo\n")
#If the last character is slash /, I remove it to avoid duplicates.
if url[len(url)-1] == '/':
url = url[:(len(url)-1)]
#If url includes characters like %, ~ ... it is LIKELY NOT to be the one I are looking for and I ignore it.
#However in this case I exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['%', '~',
#slike
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#dokumenti
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#glasba in video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#stiskanje in drugo
'.zip', '.rar', '.css', '.flv', '.xml',
'.ZIP', '.RAR', '.CSS', '.FLV', '.XML',
#Twitter, Facebook, Youtube
'://twitter.com', '://mobile.twitter.com', 'www.twitter.com',
'www.facebook.com', 'www.youtube.com',
#Feeds, RSS, arhiv
'/feed', '=feed', '&feed', 'rss.xml', 'arhiv'
]):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
#url_xpath = url
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
#if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
#We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
#tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
tabs = response.xpath('//a[@href="%s"]/text()' % url).extract()
# Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
# That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
tabs = [tab.encode('utf-8') for tab in tabs]
tabs = [tab.replace('\t', '') for tab in tabs]
tabs = [tab.replace('\n', '') for tab in tabs]
tab_empty = True
for tab in tabs:
if tab != '':
tab_empty = False
if tab_empty == True:
tabs = []
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
# Starting keyword_url is zero, then we add keywords as we find them in url.
keyword_url = ''
#for keyword in url_whitelist:
for keyword in url_whitelist_occupations:
if keyword in url:
keyword_url = keyword_url + keyword + ' '
# a) If we find at least one keyword in url, we continue.
if keyword_url != '':
#1. Tabs are empty.
if tabs == []:
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls :
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print "Zaposlitvena podstran ", url
#We return the item.
yield item
#2. There are texts in tabs, one or more.
else:
#For the same partial url several texts are possible.
for tab in tabs:
#We search for keywords in tabs.
keyword_url_tab = ''
#for key in tab_whitelist:
for key in tab_whitelist_occupations:
if key in tab:
keyword_url_tab = keyword_url_tab + key + ' '
# If we find some keywords in tabs, then we have found keywords in both url and tab and we can save the url.
if keyword_url_tab != '':
# keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab. So we add initial keyword_url.
keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
#We found url that includes one of the magic words and also the tab includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = keyword_url_tab
#item["keyword_tab"] = ' '
print "Zaposlitvena podstran ", url
#We return the item.
yield item
#We haven't found any keywords in tabs, but url is still good, because it contains some keywords, so we save it.
else:
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print "Zaposlitvena podstran ", url
#We return the item.
yield item
# b) If keyword_url = empty, there are no keywords in url, but perhaps there are keywords in tabs. So we check tabs.
else:
for tab in tabs:
keyword_tab = ''
#for key in tab_whitelist:
for key in tab_whitelist_occupations:
if key in tab:
keyword_tab = keyword_tab + key + ' '
if keyword_tab != '':
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = keyword_tab
print "Zaposlitvena podstran ", url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH set in settings.py.
yield Request(url, callback = self.parse)
#else:
#non_base.write(response.url+"\n")

Just use scrapyd to schedule 2000 single-website crawls. Set max_proc = 10 [1] to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT [2] to 20 to run every spider for 20 seconds. Stop using Windows natively, because it's a pain. I've observed Scrapy and scrapyd running faster inside a VM than natively on Windows. I might be wrong, so cross-check for yourself, but I have a strong feeling that an Ubuntu 14.04 VirtualBox image on Windows will be faster. Your crawl will take about 2000 * 20 / 10 = 4000 seconds, i.e. roughly 67 minutes.
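The time estimate can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming every crawl uses its full timeout and ignoring scheduling overhead; the function name `estimate_crawl_seconds` is made up for illustration.

```python
import math

def estimate_crawl_seconds(sites, timeout_s, parallel):
    """Upper bound: sites run in batches of `parallel`, each capped at `timeout_s`."""
    batches = math.ceil(sites / parallel)
    return batches * timeout_s

total = estimate_crawl_seconds(sites=2000, timeout_s=20, parallel=10)
print(total, "seconds, about", round(total / 60), "minutes")
```

Crawls that finish before the timeout shorten the total, so the real run may be faster than this bound.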
