How to extract a URL's domain name in Python (no imports) - python

I'm currently doing a school mini-project where we have to write a program that extracts the domain name from a few given URLs and puts those ending in .uk (i.e. websites from the United Kingdom) in a list.
A couple of specifications:
We cannot import any modules.
We can ignore URLs that don't start with either "http://" or "https://".
I was originally just going to do:
uksites = []
file = open('urlfile.txt','r')
urllist = file.read().splitlines()
for url in urllist:
    if "http://" in url:
        domainstart = url.find("http://") + len("http://")
    elif "https://" in url:
        domainstart = url.find("https://") + len("https://")
    else:
        continue  # ignore URLs without an http(s) scheme
    domainend = url.find("/", domainstart)
    if domainend >= 0:
        domain = url[domainstart:domainend]
    else:
        domain = url[domainstart:]
    if domain[-3:] == ".uk":
        uksites.append(url)
But then our professor warned us that not all domain names will be terminated by a "/" (for example, in one of the URLs in the test file we were given, the domain ends with ":").
Is this the only other valid character that can mark the end of a domain name, or are there more?
If so, how could I approach this?
The test file is pretty short; it contains a few "links" (some apparently aren't real sites):
http://google.com.au/
https://google.co.uk/
https://00.auga.com/shop/angel-haired-dress/
http://applesandoranges.co.uk:articles/best-seasonal-fruits-for-your-garden/
https://www.youtube.com/watch?v=GVbG35DeMto
http://docs.oracle.com/en/java/javase/11/docs/api/
https://www.instagram.co.uk/posts/hjgjh42324/

This should work:
def compute():
    ret = []
    hs = ['https:', 'http:']
    domains = ['uk']
    file = open('urlfile.txt','r')
    urllist = file.read().splitlines()
    for url in urllist:
        url_t = url.split('/')
        # drop the scheme ("http:" or "https:") if present
        if url_t[0] in hs:
            url_t = (url_t[2]).split('.')
        else:
            url_t = (url_t[0]).split('.')
        # get the top-level domain, cutting off anything after ':'
        if ':' in url_t[-1]:
            url_t = (url_t[-1].split(':'))[0]
        else:
            url_t = url_t[-1]
        # verify it
        if url_t in domains:
            ret.append(url)
    return ret
if __name__ == '__main__':
    print(compute())

Analyze the structure of the URL first.
A URL contains a few characters that cannot appear just anywhere, e.g. the dot ".".
In the URLs given here, the last dot therefore marks the start of the country code, which ends the domain. (This does not hold in general: a path such as /index.html also contains a dot.)
The easiest solution would be to split each URL at the last dot with .rsplit(".", 1), take the first two letters after the split, and re-attach them to the first part. I chose a different approach: I scan the second part for non-alphanumeric characters, because the country code is always followed by a special (non-alphanumeric) character, so the second part can be split again there.
s = '''http://google.com.au/
https://google.co.uk/
https://00.auga.com/shop/angel-haired-dress/
http://applesandoranges.co.uk:articles/best-seasonal-fruits-for-your-garden/
https://www.youtube.com/watch?v=GVbG35DeMto
http://docs.oracle.com/en/java/javase/11/docs/api/
https://www.instagram.co.uk/posts/hjgjh42324/'''
s_splitted = s.split("\n")
for raw_url in s_splitted:
    a,b = raw_url.rsplit(".", 1)
    c = "".join([x if x.isalnum() else " " for x in b ]).split(" ",1)[0]
    url = a+"."+c
    if url.endswith(".uk"):
        print(url)
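On the question of which other characters can end the domain: in a URL, the host part after the "//" runs up to the first "/", "?", "#" or ":" (a colon normally introduces a port number). Here is a minimal sketch without imports along those lines. The helper names extract_domain and uk_sites are made up for illustration, the ftp:// entry is an invented example of a URL to ignore, and the sketch does not handle user info (user@host) or IPv6 literals.

```python
def extract_domain(url):
    """Return the host part of an http(s) URL, or None for other schemes."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            rest = url[len(scheme):]
            break
    else:
        return None  # not an http(s) URL: ignore it
    # The host ends at the first of these delimiters (or at end of string).
    end = len(rest)
    for delimiter in "/:?#":
        position = rest.find(delimiter)
        if position != -1 and position < end:
            end = position
    return rest[:end]


def uk_sites(urls):
    """Keep the URLs whose domain ends in .uk."""
    result = []
    for url in urls:
        domain = extract_domain(url)
        if domain is not None and domain.endswith(".uk"):
            result.append(url)
    return result


urls = [
    "http://google.com.au/",
    "https://google.co.uk/",
    "http://applesandoranges.co.uk:articles/best-seasonal-fruits-for-your-garden/",
    "https://www.youtube.com/watch?v=GVbG35DeMto",
    "ftp://example.co.uk/",  # invented entry: wrong scheme, so it is skipped
]
print(uk_sites(urls))
```

Checking the domain itself, rather than the whole URL, avoids false matches when ".uk" only appears somewhere in the path.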

Related

Unable to scrape the conversation among debaters in order to put them in a dictionary

I've created a script to fetch the whole conversation between the different debaters, excluding moderators. What I've written so far fetches the entire conversation. However, I would like to collect it as {speaker_name: (first speech, second speech, ...)}.
Webpage link
another one similar to the above link
webpage link
I've tried so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas'
def get_links(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".field-docs-content p:has( > strong:contains('MODERATOR:')) ~ p"):
        print(item.text)
if __name__ == '__main__':
    get_links(url)
How can I scrape the conversation among debaters and put them in a dictionary?
I don't hold much hope for this holding up across lots of pages, given the variability between the two pages I saw and the number of assumptions I have had to make. Essentially, I run a regex over the text of the participants and moderators nodes to isolate the lists of participants and moderators. I then loop over all speech paragraphs: each time I encounter a moderator at the start of a paragraph, I set a boolean store_paragraph = False and ignore subsequent paragraphs; likewise, each time I encounter a participant, I set store_paragraph = True and store that paragraph and the subsequent ones under the appropriate participant key in speaker_dict. I store each speaker_dict in a final results dictionary.
import requests, re
from bs4 import BeautifulSoup as bs
import pprint
links = ['https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas','https://www.presidency.ucsb.edu/documents/republican-presidential-candidates-debate-manchester-new-hampshire-0']
results = {}
p = re.compile(r'\b(\w+)\b\s+\(|\b(\w+)\b,')
with requests.Session() as s:
    for number, link in enumerate(links):
        r = s.get(link)
        soup = bs(r.content,'lxml')
        participants_tag = soup.select_one('p:has(strong:contains("PARTICIPANTS:"))')
        if participants_tag.select_one('strong'):
            participants_tag.strong.decompose()
        speaker_dict = {i[0].upper() + ':' if i[0] else i[1].upper() + ':': [] for string in participants_tag.stripped_strings for i in p.findall(string)}
        # print(speaker_dict)
        moderator_data = [string for string in soup.select_one('p:has(strong:contains("MODERATOR:","MODERATORS:"))').stripped_strings][1:]
        #print(moderator_data)
        moderators = [i[0].upper() + ':' if i[0] else i[1].upper() + ':' for string in moderator_data for i in p.findall(string)]
        store_paragraph = False
        for paragraph in soup.select('.field-docs-content p:not(p:contains("PARTICIPANTS:","MODERATOR:"))')[1:]:
            string_to_compare = paragraph.text.split(':')[0] + ':'
            string_to_compare = string_to_compare.upper()
            if string_to_compare in moderators:
                store_paragraph = False
            elif string_to_compare in speaker_dict:
                speaker = string_to_compare
                store_paragraph = True
            if store_paragraph:
                speaker_dict[speaker].append(paragraph.text)
        results[number] = speaker_dict
pprint.pprint(results[1])
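The store_paragraph toggle is easier to follow in isolation, without the network and parsing steps. Below is a minimal sketch of just that grouping logic on plain strings. The function name group_speeches, the speaker labels, and the sample paragraphs are all invented for illustration; they are not taken from the debate pages.

```python
def group_speeches(paragraphs, participants, moderators):
    """Collect paragraphs per participant, skipping moderator turns."""
    speeches = {name: [] for name in participants}
    speaker = None
    store_paragraph = False
    for text in paragraphs:
        # The label is whatever precedes the first colon, e.g. "SMITH:"
        label = text.split(':')[0].upper() + ':'
        if label in moderators:
            store_paragraph = False   # moderator turn: stop storing
        elif label in speeches:
            speaker = label           # participant turn: start storing
            store_paragraph = True
        if store_paragraph:
            speeches[speaker].append(text)
    return speeches


# Invented sample data
paragraphs = [
    "HOST: Welcome to the debate.",
    "SMITH: Thank you.",
    "I have a plan.",
    "HOST: Your response?",
    "Please go ahead.",
    "JONES: Thanks.",
]
result = group_speeches(paragraphs, ["SMITH:", "JONES:"], ["HOST:"])
print(result)
```

A paragraph without a recognised label inherits the state of the previous one, which is why "I have a plan." is stored under SMITH: and "Please go ahead." is dropped.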

How to fix user input when a plain variable assignment works fine

My code doesn't give the desired output when I use user input, but it works fine when I use a simple variable assignment.
I checked both the user input and the variable. Both are of type str.
When I use input, it gives the error below:
print("\nIPAbuse check for the IP Address: {} \nDatabase Check: \nConfidence of Abuse: \nISP: {} \nUsage: {} \nDomain Name: {} \nCountry: {} \nCity: {}".format(num,description1,description2,isp,usage,domain,country,city))
NameError: name 'description1' is not defined
# sys.stdout.write("Enter Source IP Address: ")
# sys.stdout.flush()
# ip = sys.stdin.readline()
ip = '212.165.108.173'
url = ""
num = str(ip)
req = requests.get(url + num)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
try:
    div = soup.find('div', {"class": "well"})
    description1 = div.h3.text.strip()
    description2 = div.p.text.strip()
    isp = soup.find("th", text="ISP").find_next_sibling("td").text.strip()
    usage = soup.find("th", text="Usage Type").find_next_sibling("td").text.strip()
    domain = soup.find("th", text="Domain Name").find_next_sibling("td").text.strip()
    country = soup.find("th", text="Country").find_next_sibling("td").text.strip()
    city = soup.find("th", text="City").find_next_sibling("td").text.strip()
except:
    isp = 'Invalid'
    usage = 'Invalid'
    domain = 'Invalid'
    country = 'Invalid'
    city = 'Invalid'
print(num, description1, description2, isp, usage, domain, country, city)
readline() keeps the trailing '\n' in the input, so the value differs from a hardcoded assignment like ip = '212.165.108.173'. The newline character is messing up the request: the lookup fails, the try block raises before description1 is assigned, and the except branch never defines it, hence the NameError. As a quick patch, check whether the last character of the user input is '\n' and make sure it doesn't end up in the URL for the request. Alternatively, I'd suggest using input(), as someone said in the comments, if only because it does not add the '\n' at the end.
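A small sketch of the difference, assuming the address arrives exactly as sys.stdin.readline() would return it; the helper name clean_ip is made up for illustration:

```python
def clean_ip(raw):
    """Strip the trailing newline (and any other surrounding whitespace)."""
    return raw.strip()


raw = '212.165.108.173\n'      # what sys.stdin.readline() hands back
hardcoded = '212.165.108.173'  # the assignment that works

print(raw == hardcoded)            # False: the newline makes them differ
print(clean_ip(raw) == hardcoded)  # True once the newline is stripped
```

Assigning default values to description1 and description2 in the except branch, as is already done for isp and the other fields, would also keep the final print from raising a NameError when the lookup fails.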

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Ideally, what I want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests
def get_source(url):
    return requests.get(url).content
def is_forum_link(self):
    return self.find('special string') != -1
def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")
main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')
threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always going to be "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which only keeps unique values. Before adding a URL to the set, remove the last instance of "/page-#" (where # is any number) and anything that comes after it.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
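The same trimming can be tried on the thread links from the question without fetching anything. This is only a sketch: the helper name strip_page_suffix is made up, and the hrefs are the sample paths quoted in the question.

```python
def strip_page_suffix(url):
    """Drop a trailing "/page-<number>..." part, keeping the final slash."""
    position = url.rfind('/page-')
    # len('/page-') == 6, so position + 6 is the first character after it
    if position > 0 and url[position + 6:position + 7].isdigit():
        return url[:position + 1]
    return url


hrefs = [
    'threads/my-gap-year-uni-story.13846/',
    'threads/my-gap-year-uni-story.13846/page-9',
    'threads/my-gap-year-uni-story.13846/page-10',
    'threads/my-gap-year-uni-story.13846/page-11',
]
unique_threads = set(strip_page_suffix(href) for href in hrefs)
print(unique_threads)
```

All four links collapse to the single entry threads/my-gap-year-uni-story.13846/ because the set discards the duplicates after trimming.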

Issue with web crawler: IndexError: string index out of range

I am making a web crawler. I'm not using Scrapy or anything; I'm trying to have my script do most things. I have searched for the issue, but I can't seem to find anything that helps with the error. I've tried switching around some of the variables to narrow down the problem. I am getting an error on line 24 saying IndexError: string index out of range. The functions run on the first URL (the original URL), then the second, and fail on the third in the original array. I'm lost; any help would be greatly appreciated! Note that I'm only printing all of them for testing; I'll eventually write them to a text file.
import requests
from bs4 import BeautifulSoup
# creating requests from user input
url = raw_input("Please enter a domain to crawl, without the 'http://www' part : ")
def makeRequest(url):
    r = requests.get('http://' + url)
    # Adding in BS4 for finding a tags in HTML
    soup = BeautifulSoup(r.content, 'html.parser')
    # Writes a as the link found in the href
    output = soup.find_all('a')
    return output
def makeFilter(link):
    # Creating array for our links
    found_link = []
    for a in link:
        a = a.get('href')
        a_string = str(a)
        # if statement to filter our links
        if a_string[0] == '/': # this is the line with the error
            # Relative links
            found_link.append(a_string)
        if 'http://' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        if 'http://www.' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://www.' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        #else:
        #    found_link.write(a_string + '\n') # testing only
    output = found_link
    return output
# Function for removing duplicates
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output
# Run the function with our list in this order -> Makes the request -> Filters the links -> Removes duplicates
def createURLList(values):
    requests = makeRequest(values)
    new_list = makeFilter(requests)
    filtered_list = remove_duplicates(new_list)
    return filtered_list
result = createURLList(url)
# print result
# for verifying and crawling resulting pages
for b in result:
    sub_directories = createURLList(url + b)
    crawler = []
    crawler.append(sub_directories)
    print crawler
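An empty href gives an empty string, and indexing a_string[0] on an empty string raises the IndexError. After a_string = str(a) try adding:
if not a_string:
    continue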
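An alternative that sidesteps the index entirely is str.startswith, which simply returns False for an empty string. A minimal sketch, with the helper name is_relative and the sample hrefs made up for illustration:

```python
def is_relative(href):
    """True for links such as "/about"; safe for empty or missing hrefs."""
    a_string = str(href)
    # startswith never raises on an empty string, unlike a_string[0]
    return a_string.startswith('/')


for href in ['/about', '', None, 'http://example.com/']:
    print(repr(href), is_relative(href))
```

Note that a missing href comes back as None, which str() turns into the text 'None'; that is why only the empty-string case triggered the IndexError.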

Scrapy, how to limit time per domain?

I have been searching for an answer, and although several similar questions have been asked on this forum, none of them has one. One suggestion is that it is possible to stop a spider after a certain time, but that is not suitable for me because I usually launch 10 websites per spider. So my challenge is that I have a spider for 10 websites and I would like to limit the time to 20 seconds per domain in order to avoid getting stuck at some webshop. How can I do it?
In general, I crawl 2000 company websites, and in order to finish in one day I divide them into 200 groups of 10 websites and launch 200 spiders in parallel. That may be amateurish, but it is the best I know. The computer almost freezes because the spiders consume all the CPU and memory, but the next day I have the results. What I am looking for is the employment pages on companies' websites. Does anyone have a better idea of how to crawl 2000 websites? If there is a webshop among the websites, the crawl could take days, which is why I would like to limit the time per domain.
Thank you in advance.
Marko
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt
# Url whitelist.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)
# Tab whitelist.
# We need to replace characters the same way as in the detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)
# Look for occupations in url.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url:
    occupations_url = occ_url.read().replace('\n', '').split(",")
occupations_url = map(str.strip, occupations_url)
# Look for occupations in tab.
# We need to replace characters the same way as in the detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab:
    occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8')
occupations_tab = occupations_tab.replace('Ŕ', 'č')
occupations_tab = occupations_tab.replace('L', 'č')
occupations_tab = occupations_tab.replace('Ő', 'š')
occupations_tab = occupations_tab.replace('Ü', 'š')
occupations_tab = occupations_tab.replace('Ä', 'ž')
occupations_tab = occupations_tab.replace('×', 'ž')
occupations_tab = occupations_tab.replace('\n', '').split(",")
occupations_tab = map(str.strip, occupations_tab)
#Join url whitelist and occupations.
url_whitelist_occupations = url_whitelist + occupations_url
#Join tab whitelist and occupations.
tab_whitelist_occupations = tab_whitelist + occupations_tab
#base = open("G:/myVE/vacancies/bazni.txt", "w")
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w")
class JobSpider(scrapy.Spider):
    #Name of spider
    name = "jobs"
    #start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0]
    #print urls
    #start_urls = map(str.strip, urls)
    #Start urls
    start_urls = ["http://www.alius.si"]
    print "\nSpletna stran ", start_urls, "\n"
    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []
    def parse(self, response):
        #Theoretically I could save the HTML of the webpage to be able to check later how it looked
        # at the time of downloading. That is important for validation, because it is easier to look at a nice HTML webpage than at naked text.
        # But I would have to write a pipeline http://doc.scrapy.org/en/0.20/topics/item-pipeline.html
        response.selector.remove_namespaces()
        #print "response url" , str(response.url)
        #Take url of response, because we would like to stay on the same domain.
        parsed = urlparse(response.url)
        #Base url.
        #base_url = get_base_url(response).strip()
        base_url = parsed.scheme+'://'+parsed.netloc
        #print "base url" , str(base_url)
        #If the url grows from the seeds, it's ok, otherwise not.
        if base_url in self.start_urls:
            #print "base url je v start"
            #base.write(response.url+"\n")
            #net1 = parsed.netloc
            #Take all urls, they are marked by "href" or "data-link". These are either webpages on our website or new websites.
            urls_href = response.xpath('//@href').extract()
            urls_datalink = response.xpath('//@data-link').extract()
            urls = urls_href + urls_datalink
            #print "povezave na tej strani ", urls
            #Loop through all urls on the webpage.
            for url in urls:
                #Test all new urls. DOES NOT WORK
                #print "url ", str(url)
                #If url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
                if not (url.startswith("http")):
                    #Join the partial url with the base url.
                    url = urljoin(base_url,url).strip()
                #print "new url ", str(url)
                new_parsed = urlparse(url)
                new_base_url = new_parsed.scheme+'://'+new_parsed.netloc
                #print "new base url ", str(new_base_url)
                if new_base_url in self.start_urls:
                    #print "yes"
                    url = url.replace("\r", "")
                    url = url.replace("\n", "")
                    url = url.replace("\t", "")
                    url = url.strip()
                    #Remove anchors '#', which point to a section on the same webpage, because this is the same webpage.
                    #But we keep question marks '?', which mean that different content is pulled from the database.
                    if '#' in url:
                        index = url.find('#')
                        url = url[:index]
                        if url in self.jobs_urls:
                            continue
                    #Ignore ftp and sftp.
                    if url.startswith("ftp") or url.startswith("sftp"):
                        continue
                    #Compare each url on the webpage with the original url, so that the spider doesn't wander away on the net.
                    #net2 = urlparse(url).netloc
                    #test.write("lokacija novega url "+ str(net2)+"\n")
                    #if net2 != net1:
                    #    continue
                    #test.write("ni ista lokacija, nadaljujemo\n")
                    #If the last character is a slash /, I remove it to avoid duplicates.
                    if url[len(url)-1] == '/':
                        url = url[:(len(url)-1)]
                    #If the url includes characters like %, ~ ... it is LIKELY NOT the one I am looking for and I ignore it.
                    #However in this case I exclude good urls like http://www.mdm.si/company#employment
                    if any(x in url for x in ['%', '~',
                        #images
                        '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                        '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                        #documents
                        '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                        '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                        #music and video
                        '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                        '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                        #compression and other
                        '.zip', '.rar', '.css', '.flv', '.xml',
                        '.ZIP', '.RAR', '.CSS', '.FLV', '.XML',
                        #Twitter, Facebook, Youtube
                        '://twitter.com', '://mobile.twitter.com', 'www.twitter.com',
                        'www.facebook.com', 'www.youtube.com',
                        #Feeds, RSS, archive
                        '/feed', '=feed', '&feed', 'rss.xml', 'arhiv'
                        ]):
                        continue
                    #We need to save the original url for xpath, in case we change it later (join it with base_url)
                    #url_xpath = url
                    #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
                    #if (urlparse(url).netloc == urlparse(base_url).netloc):
                    #The main part. We look for webpages whose urls include one of the employment words as strings.
                    #We will check the tab of the url as well. This is an additional filter, suggested by Dan Wu, to improve accuracy.
                    #tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
                    tabs = response.xpath('//a[@href="%s"]/text()' % url).extract()
                    # Sometimes tabs can be just empty spaces like '\t' and '\n', so in this case we replace them with [].
                    # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                    tabs = [tab.encode('utf-8') for tab in tabs]
                    tabs = [tab.replace('\t', '') for tab in tabs]
                    tabs = [tab.replace('\n', '') for tab in tabs]
                    tab_empty = True
                    for tab in tabs:
                        if tab != '':
                            tab_empty = False
                    if tab_empty == True:
                        tabs = []
                    # -- Instruction.
                    # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                    # keyword_url starts empty, then we add keywords as we find them in the url.
                    keyword_url = ''
                    #for keyword in url_whitelist:
                    for keyword in url_whitelist_occupations:
                        if keyword in url:
                            keyword_url = keyword_url + keyword + ' '
                    # a) If we find at least one keyword in the url, we continue.
                    if keyword_url != '':
                        #1. Tabs are empty.
                        if tabs == []:
                            #We found a url that includes one of the magic words and also the text includes a magic word.
                            #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                            if url not in self.jobs_urls :
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = keyword_url
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = ' '
                                print "Zaposlitvena podstran ", url
                                #We return the item.
                                yield item
                        #2. There are texts in tabs, one or more.
                        else:
                            #For the same partial url several texts are possible.
                            for tab in tabs:
                                #We search for keywords in tabs.
                                keyword_url_tab = ''
                                #for key in tab_whitelist:
                                for key in tab_whitelist_occupations:
                                    if key in tab:
                                        keyword_url_tab = keyword_url_tab + key + ' '
                                # If we find some keywords in tabs, then we have found keywords in both url and tab and we can save the url.
                                if keyword_url_tab != '':
                                    # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab. So we add the initial keyword_url.
                                    keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                    #We found a url that includes one of the magic words and also the tab includes a magic word.
                                    #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = ' '
                                        #item["keyword_url_tab"] = keyword_url_tab
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url
                                        #We return the item.
                                        yield item
                                #We haven't found any keywords in tabs, but the url is still good, because it contains some keywords, so we save it.
                                else:
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = keyword_url
                                        #item["keyword_url_tab"] = ' '
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url
                                        #We return the item.
                                        yield item
                    # b) If keyword_url is empty, there are no keywords in the url, but perhaps there are keywords in tabs. So we check tabs.
                    else:
                        for tab in tabs:
                            keyword_tab = ''
                            #for key in tab_whitelist:
                            for key in tab_whitelist_occupations:
                                if key in tab:
                                    keyword_tab = keyword_tab + key + ' '
                            if keyword_tab != '':
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = ' '
                                    #item["keyword_tab"] = keyword_tab
                                    print "Zaposlitvena podstran ", url
                                    #We return the item.
                                    yield item
                    #We don't add an "else" clause because we want to explore the employment webpage further to find possible new employment webpages.
                    #We keep looking for employment webpages until we reach the DEPTH set in settings.py.
                    yield Request(url, callback = self.parse)
        #else:
            #non_base.write(response.url+"\n")
Just use scrapyd to schedule 2000 single-website crawls. Set max_proc = 10 [1] to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT [2] to 20 to run every spider for 20 seconds. Stop using Windows natively, because it's a pain. I've observed Scrapy and scrapyd run faster inside a VM than natively on Windows. I might be wrong, so try it yourself to cross-check, but I have a strong feeling that an Ubuntu 14.04 VirtualBox image on Windows will be faster. Your crawl will take about 2000 * 20 / 10 = 4000 seconds, i.e. roughly 67 minutes.
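The time estimate is simple enough to check with a few lines. This sketch assumes every crawl runs for the full 20-second timeout and ignores scheduling and start-up overhead, so it is an approximation rather than a measurement; the function name is made up for illustration.

```python
def estimated_crawl_seconds(sites, timeout_seconds, parallel_spiders):
    """Rough total: every site uses its full timeout, spiders run in parallel."""
    return sites * timeout_seconds / parallel_spiders


seconds = estimated_crawl_seconds(2000, 20, 10)
minutes = round(seconds / 60, 1)
print(seconds, "seconds, about", minutes, "minutes")
```

With 2000 sites, a 20-second limit and 10 parallel processes this comes to 4000 seconds, a little over an hour.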
