I use Scrapy to get data from an API, but the server is laggy.
First I scrape one page to get some IDs, and I add them to a list.
After that, I check how many IDs I have, and I start scraping.
The maximum number of IDs I can put in a single request is 10: event_id=1,2,3,4,5,6,7,8,9,10.
The problem is that because there are many IDs (around 150), I have to make many requests, and the server takes 3-5 seconds to respond to each one. I want to request all the links at once and parse them later, if that is possible.
match = "https://api.---.com/v1/?token=???&event_id&event_id="
class ApiSpider(scrapy.Spider):
name = 'api'
allowed_domains = ['api.---.com']
start_urls = ['https://api.---.com/ids/&token=???']
def parse(self, response):
data = json.loads(response.body)
results = (data['results'])
for result in results:
id_list.append(result['id'])
yield from self.scrape_start()
def scrape_start(self):
if len(matches_id) >= 10:
qq = (
match + id_list[0] + "," + id_list[1] + "," + id_list[2] + "," + id_list[3] + "," +
id_list[4] + "," + id_list[
5] + "," + id_list[6] + "," + id_list[7] + "," + id_list[8] + "," + id_list[9])
yield scrapy.Request(qq, callback=self.parse_product)
del matches_id[0:10]
elif len(matches_id) == 9:
...
def parse_product(self, response):
data = (json.loads(response.body))
results = (data['results'])
for result in results:
...
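For reference, the batching above can be written more compactly by chunking id_list into groups of 10 and yielding one request per chunk; Scrapy then schedules those requests concurrently on its own (a sketch based on the code above, keeping the match URL and token placeholders as-is):

    def scrape_start(self):
        # Build one batched URL per group of up to 10 IDs and let Scrapy
        # schedule the resulting requests concurrently.
        for i in range(0, len(id_list), 10):
            chunk = id_list[i:i + 10]
            url = match + ",".join(str(event_id) for event_id in chunk)
            yield scrapy.Request(url, callback=self.parse_product)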
Try increasing CONCURRENT_REQUESTS, which defaults to 16, and, since all of your requests go to the same domain, CONCURRENT_REQUESTS_PER_DOMAIN, which defaults to 8.
As per the Scrapy docs on the latter:
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
Note that in some cases this can run into hardware bottlenecks, so don't increase the values by a lot at once. I'd recommend raising them gradually and observing system stats (CPU/network).
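For example, these settings can be raised per spider via custom_settings (the numbers below are illustrative, not recommendations):

class ApiSpider(scrapy.Spider):
    name = 'api'
    # Illustrative values -- raise them gradually and watch CPU/network usage.
    custom_settings = {
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
    }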
I am trying to write university names, department names and ratings to a file from https://www.whatuni.com/university-course-reviews/?pageno=14. It goes well until I reach a post without a department name, at which point I get the error
file.write(user_name[k].text + ";" + uni_names[k].text + ";" + department[k].text + ";" + date_posted[k].text +
IndexError: list index out of range
Here is the code I use. I believe I need to somehow write null or a space when the department doesn't exist. I tried using if not and else, but it didn't work for me. I would appreciate any help. Thank you.
from selenium import webdriver
from selenium.common import exceptions

driver = webdriver.Chrome()
driver.get('https://www.whatuni.com/university-course-reviews/?pageno=14')

for i in range(20):
    try:
        driver.refresh()
        uni_names = driver.find_elements_by_xpath('//div[@class="rlst_wrap"]/h2/a')
        department_names = driver.find_elements_by_xpath('//div[@class="rlst_wrap"]/h3/a')
        user_name = driver.find_elements_by_xpath('//div[@class="rev_name"]')
        date_posted = driver.find_elements_by_xpath('//div[@class="rev_dte"]')
        uni_rev = driver.find_elements_by_xpath('(//div[@class="reviw_rating"]/div[@class="rate_new"]/p)')
        uni_rating = driver.find_elements_by_xpath('(//div[@class="reviw_rating"]/div[@class="rate_new"]/span[starts-with(@class,"ml5")])')
        job_prospects = driver.find_elements_by_xpath('//span[text()="Job Prospects"]/following-sibling::span')
        course_and_lecturers = driver.find_elements_by_xpath('//span[text()="Course and Lecturers"]/following-sibling::span')
        if not course_and_lecturers:
            lecturers = "None"
        else:
            lecturers = course_and_lecturers
        uni_facilities = driver.find_elements_by_xpath('//span[text()="Facilities" or text()="Uni Facilities"]/following-sibling::span')
        if not uni_facilities:
            facilities = "None"
        else:
            facilities = uni_facilities
        student_support = driver.find_elements_by_xpath('//span[text()="Student Support"]/following-sibling::span')
        if not student_support:
            support = "None"
        else:
            support = student_support
        with open('uni_scraping.csv', 'a') as file:
            for k in range(len(uni_names)):
                if not department_names:
                    department = "None"
                else:
                    department = department_names
                file.write(user_name[k].text + ";" + uni_names[k].text + ";" + department[k].text + ";" + date_posted[k].text +
                           ";" + uni_rating[k].get_attribute("class") + ";" + job_prospects[k].get_attribute("class") +
                           ";" + lecturers[k].get_attribute("class") + ";" + facilities[k].get_attribute("class") +
                           ";" + support[k].get_attribute("class") + ";" + uni_rev[k].text + "\n")
        next_page = driver.find_element_by_class_name('mr0')
        next_page.click()
    except exceptions.StaleElementReferenceException as e:
        print(e)
driver.close()
Thank you Vimizen for the answer. I did what you suggested and it worked for me. I wrote something like this.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.whatuni.com/university-course-reviews/?pageno=14")
posts = []
driver.refresh()
post_elements = driver.find_elements_by_xpath('//div[@class="rlst_row"]')
for post_element_index in range(len(post_elements)):
    post_element = post_elements[post_element_index]
    uni_name = post_element.find_element_by_tag_name('h2')
    try:
        department_name = post_element.find_element_by_tag_name('h3')
        department = department_name.text
    except NoSuchElementException:
        department = "aaaaaaaa"
    user_name = post_element.find_element_by_class_name('rev_name')
    postdict = {
        "uni_name": uni_name.text,
        "department": department,
        "user_name": user_name.text
    }
    posts.append(postdict)
print(posts)
driver.close()
Best
You had the right instinct when you tried if not department_names, but that only works if the list is empty. In your case, the issue is that the list is too short.
Because of the universities without departments, department_names will be a shorter list than uni_names.
As a result, in your loop for k in range(len(uni_names)):, department[k].text will not always belong to the university at the same index, and at some point k will be larger than the length of your department list. That is why department[k] raises an error.
I don't know the most efficient way around this, but I think you could grab larger elements containing the full details of each university (the whole rlst_wrap, for example), then search inside each one for the details (with a regexp, for example). That way you would know when there is no department and avoid the issue.
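A minimal sketch of that idea (assuming the rlst_wrap container and the h2/h3 layout from the question, and the NoSuchElementException import used in the follow-up above):

post_blocks = driver.find_elements_by_xpath('//div[@class="rlst_wrap"]')
rows = []
for block in post_blocks:
    uni = block.find_element_by_tag_name('h2').text
    # Look for the department inside this block only, so a missing
    # department affects just this entry instead of shifting every index.
    try:
        department = block.find_element_by_tag_name('h3').text
    except NoSuchElementException:
        department = "None"
    rows.append((uni, department))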
I am trying to build an item from many parsing functions, because I am getting data from multiple URLs.
I iterate over a dictionary that I built using two for loops, which is why I am using two for loops here to get the variable needed to generate each URL.
Then, for every variable, I call the second parse function, passing it the URL it needs.
This is where I want to call the second parse function from my main parse:
for r in [1, 2]:
    for t in [1, 2]:
        dataName = 'lane' + str(r) + "Player" + str(t) + "Name"
        dataHolder = 'lane' + str(r) + "Player" + str(t)
        nameP = item[dataName]
        print('before parse ==> lane = ' + str(r) + " team = " + str(t))
        urlP = 'https://www.leagueofgraphs.com/summoner/euw/' + nameP + '#championsData-soloqueue'
        yield Request(urlP, callback=self.parsePlayer, meta={'item': item, "player": dataHolder})
I am using those print() calls to see in the output how my code is executing.
The same goes for my second parsing function, which is the following:
def parsePlayer(self, response):
    item = response.meta['item']
    player = response.meta['player']
    print('after parse ====> ' + player)
    mmr = response.css('.rank .topRankPercentage::text').extract_first().strip().lower()
    mmrP = player + "Mmr"
    item[mmrP] = mmr
    # yield item after the last iteration
(I know I did not explain every detail of the code, but I don't think that's needed to see my problem, not after you see what I'm getting from those prints.)
Result I get:
Expected result:
Also, for some reason, every time I run the spider I get a different random order of prints. This is confusing; I think it's something about the yield. I hope someone can help me with that.
Scrapy works asynchronously (as explained clearly in its official documentation), which is why the order of your prints seems random.
Besides the order, the expected output looks exactly the same as the result you get.
If you can explain why the order is relevant, we might be able to answer your question better.
If you want to yield 1 item with data of all 4 players in there, the following structure can be used:
def start_requests(self):
    # prepare the urls & players:
    urls_dataHolders = []
    for r in [1, 2]:
        for t in [1, 2]:
            dataName = 'lane' + str(r) + "Player" + str(t) + "Name"
            dataHolder = 'lane' + str(r) + "Player" + str(t)
            urlP = 'https://www.leagueofgraphs.com/summoner/euw/' + dataName\
                   + '#championsData-soloqueue'
            urls_dataHolders.append((urlP, dataHolder))
    # get the first url & dataholder
    url, dataHolder = urls_dataHolders.pop()
    yield Request(url,
                  callback=self.parsePlayer,
                  meta={'urls_dataHolders': urls_dataHolders,
                        'player': dataHolder})

def parsePlayer(self, response):
    item = response.meta.get('item', {})
    urls_dataHolders = response.meta['urls_dataHolders']
    player = response.meta['player']
    mmr = response.css(
        '.rank .topRankPercentage::text').extract_first().strip().lower()
    mmrP = player + "Mmr"
    item[mmrP] = mmr
    try:
        url, dataHolder = urls_dataHolders.pop()
    except IndexError:
        # list of urls is empty, so we yield the item
        yield item
    else:
        # still urls to go through
        yield Request(url,
                      callback=self.parsePlayer,
                      meta={'urls_dataHolders': urls_dataHolders,
                            'item': item,
                            'player': dataHolder})
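Note that this structure issues the player requests one after another rather than in parallel: each parsePlayer call adds its field to the item and passes it along through meta, and only the last call yields it, which is what guarantees all four values end up in a single item.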
I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Really, what I would ideally want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests


def get_source(url):
    return requests.get(url).content


def is_forum_link(self):
    return self.find('special string') != -1


def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")


main_url = "http://example.com/forum/"

forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])

print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink

print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always going to be "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which will only hold unique values. Then, before adding a URL to the set, remove the last occurrence of "/page-" and everything after it, provided it is in the format "/page-#", where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
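If you prefer, a similar normalisation can be written with a regular expression (a sketch; normalise_thread_url is just an illustrative helper name, and it assumes the "/page-#" part always sits at the end of the URL):

import re

def normalise_thread_url(url):
    # Strip a trailing "page-<digits>..." segment, keeping the slash before it.
    return re.sub(r'page-\d+[^/]*$', '', url)

forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.add(normalise_thread_url(link.attrs['href']))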
I have written a program to scrape some data from the web, as below.
import scrapy


class JPItem(scrapy.Item):
    question_content = scrapy.Field()
    best_answer = scrapy.Field()


class JPSpider(scrapy.Spider):
    name = "jp"
    allowed_domains = ['chiebukuro.yahoo.co.jp']

    def start_requests(self):
        url = 'https://chiebukuro.yahoo.co.jp/dir/list.php?did=2078297790&flg=1&sort=3&type=list&year=2004&month=1&day=1&page=1'
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        if str(response.css("div.qa-list small::text").extract()) == '条件に一致する質問はみつかりませんでした。':
            for y in range(2004, 2007):
                for m in range(1, 13):
                    for d in range(1, 32):
                        url = 'https://chiebukuro.yahoo.co.jp/dir/list.php?did=2078297790&flg=1&sort=3&type=list&year=' + str(y) + '&month=' + str(m) + '&day=' + str(d) + '&page=1'
                        yield scrapy.Request(url, self.parse)
        else:
            for i in range(0, 40):
                url = response.xpath('//ul[@id="qalst"]/li/dl/dt/a/@href')[i].extract()
                yield scrapy.Request(url, self.parse_info)
            next_page = response.css("div.qa-list p.flip a.next::attr(href)").extract_first()
            if next_page is not None:
                yield scrapy.Request(next_page, self.parse)

    def parse_info(self, response):
        item = JPItem()
        item['question_content'] = "\"" + ''.join(response.css("div.mdPstdQstn div.ptsQes p:not([class])").extract() + response.css("div.mdPstdQstn div.ptsQes p.queTxt::text").extract()).replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t").replace("<p>", "").replace("</p>", "").replace("<br>", "") + "\""
        item['best_answer'] = "\"" + ''.join(response.css("div.mdPstdBA div.ptsQes p.queTxt::text").extract() + response.css("div.mdPstdBA div.ptsQes p:not([class])").extract()).replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t").replace("<p>", "").replace("</p>", "") + "\""
        yield item
I found that there seems to be a problem with this line:
if str(response.css("div.qa-list small::text").extract()) == '条件に一致する質問はみつかりませんでした。':
since when I run the program it cannot detect this condition: even when the extracted text should be equal, as stated, it just skips to the else branch. I have tried using .encode("utf-8"), but it did not seem to solve the issue. Could anyone provide some suggestions on this issue?
Greatly appreciated.
As @paul trmbth pointed out, what you are doing here is comparing a list with a string, which is logically incorrect and will always return False. The options presented are to compare the string with:
response.css("div.qa-list small::text").extract_first(), which gives the first extracted element (here, a string). This is the preferred way, since extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.
Since extract() returns a list, response.css("div.qa-list small::text").extract()[0] will also work and give the first element.
In case you get a list of more than one string and want to take all the text together and work with it, a simple way to turn it into a single string is ''.join(response.css("div.qa-list small::text").extract()).
In your case the first method is apt, and you need not worry about utf-8 conversions, as Python handles those internally.
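Applied to the spider above, the check would become something like this (a sketch of the first option):

def parse(self, response):
    no_results_text = '条件に一致する質問はみつかりませんでした。'
    # extract_first() returns a single string (or None), so this is a
    # string-to-string comparison instead of list-to-string.
    if response.css("div.qa-list small::text").extract_first() == no_results_text:
        ...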
I am new to coding in Python (maybe a couple of days in) and I'm basically learning from other people's code on Stack Overflow. The code I am trying to write uses BeautifulSoup to get the pid and the corresponding price for motorcycles on Craigslist. I know there are many other ways of doing this, but my current code looks like this:
from bs4 import BeautifulSoup
from urllib2 import urlopen

u = ""
count = 0

while (count < 9):
    site = "http://sfbay.craigslist.org/mca/" + str(u)
    html = urlopen(site)
    soup = BeautifulSoup(html)
    postings = soup('p', {"class": "row"})
    f = open("pid.txt", "a")
    for post in postings:
        x = post.getText()
        y = post['data-pid']
        prices = post.findAll("span", {"class": "itempp"})
        if prices == "":
            w = 0
        else:
            z = str(prices)
            z = z[:-8]
            w = z[24:]
        filewrite = str(count) + " " + str(y) + " " + str(w) + '\n'
        print y
        print w
        f.write(filewrite)
    count = count + 1
    index = 100 * count
    print "index is " + str(index)
    u = "index" + str(index) + ".html"
It works fine, and as I keep learning I plan to optimize it. The problem I have right now is that entries without a price are still showing up. Is there something obvious that I am missing?
Thanks.
The problem is how you're comparing prices. You have:
prices = post.findAll("span", {"class":"itempp"})
In BeautifulSoup, .findAll returns a list of elements. When you compare that list to an empty string, it will always return False:
>>> [] == ""
False
Change if prices == "": to if prices == []: and everything should be fine.
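In context, inside the for post in postings: loop it would look something like this (a sketch based on the loop in the question; if you would rather skip postings without a price entirely, you could continue instead of writing 0):

prices = post.findAll("span", {"class": "itempp"})
if prices == []:
    # no <span class="itempp"> element found for this posting
    w = 0
else:
    z = str(prices)
    z = z[:-8]
    w = z[24:]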
I hope this helps.