Scrapy: how to save crawled blogs in their own files - python

I'm very new to scrapy, Python and coding in general. I have a project where I'd like to collect blog posts to do some content analysis on them in Atlas.ti 8. Atlas supports file types like .html, .txt, .docx and PDF.
I've built my crawler based on the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
My main issue is that I'm unable to save the posts in their own files. I can download them as one batch with scrapy crawl <crawler> -o filename.csv, but then I have to use VBA to split the csv into separate files row by row. This is a step I'd like to avoid.
My current code can be seen below.
import scrapy

class BlogCrawler(scrapy.Spider):
    name = "crawler"
    start_urls = ['url']

    def parse(self, response):
        postnro = 0
        for post in response.css('div.post'):
            postnro += 1
            yield {
                'Post nro: ': postnro,
                'date': post.css('.meta-date::text').get().replace('\r\n\t\ton', '').replace('\t', ''),
                'author': post.css('.meta-author i::text').get(),
                'headline': post.css('.post-title ::text').get(),
                'link': post.css('h1.post-title.single a').attrib['href'],
                'text': [item.strip() for item in response.css('div.entry ::text').getall()],
            }

            filename = f'post-{postnro}.html'
            with open(filename, 'wb') as f:
                f.write(???)

        next_page = response.css('div.alignright a').attrib['href']
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I've no idea how I should go about saving the results. I've tried passing response.body, response.text and TextResponse.text to f.write(), to no avail. I've also tried to collect the data in a for loop and save it like: f.write(date + '\n', author + '\n'...). Approaches like these produce empty, 0 KB files.
The reason I've set the file type to .html is that Atlas can take it as-is and the whitespace won't be an issue. In principle the file type could also be .txt. However, if I manage to save the posts as html, I avoid the secondary issue in my project: getall() returns a list, which makes strip(), replace() and the w3lib methods hard to apply when cleaning the data. The current code replaces the whitespace with commas, which is readable, but it could be better.
If anyone has ideas on how to save each blog post in a separate file, one post per file, I'd be happy to hear them.
Best regards,
Leeward

Managed to crack this after a good night's sleep and some hours of keyboard (and head) bashing. It is not pretty or elegant and does not make use of Scrapy's advanced features, but it suffices for now. It does not solve my secondary issue, but since this is my first crawling project, I can live with that. There were multiple issues with my code:
"postnro" was not being updated, so the code kept writing the same file over and over again. I was unable to make it work, so I used "date" instead. I could have used each post's unique id as well, but those were so random that I would not have known which file I was working with without opening it.
I could not figure out how to save the yield to a file, so I looped over what I wanted and saved the results one by one.
I switched the file type from .html to .txt, but it took me some time to figure out that I also had to switch 'wb' to plain 'w'.
For those interested, working code (so to speak) below:
def parse(self, response):
    for post in response.css('div.post'):
        date = post.css('.meta-date::text').get().replace('\r\n\t\ton ', '').replace('\t', '')
        author = post.css('.meta-author i::text').get()
        headline = post.css('.post-title ::text').get()
        link = post.css('h1.post-title.single a').attrib['href']
        text = [item.strip() for item in post.css('div.entry ::text').getall()]

        filename = f'post-{date}.txt'
        with open(filename, 'w') as f:
            f.write(str(date) + '\n' + str(author) + '\n' + str(headline) + '\n' + str(link) + '\n' + '\n' + str(text) + '\n')

    next_page = response.css('div.alignleft a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
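One thing worth hardening in the code above: the scraped date goes straight into the filename, and dates often contain spaces, commas or slashes that are illegal or awkward in filenames. A small helper along these lines could sanitize it first (a sketch; the make_filename name and the exact character whitelist are my own choices, not part of Scrapy):

```python
import re

def make_filename(date_text, ext='txt'):
    # Replace runs of characters that are unsafe in filenames
    # (spaces, commas, slashes, ...) with a single dash.
    safe = re.sub(r'[^\w.-]+', '-', date_text.strip()).strip('-')
    return f'post-{safe}.{ext}'

# e.g. make_filename('12 May, 2020') gives 'post-12-May-2020.txt'
```

Then `filename = make_filename(date)` replaces the f-string and the files stay safe to create on both Windows and Linux.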

Related

Web scraping every forum post (Python, Beautifulsoup)

Hello once again, fellow Stack'ers. Short description: I am web scraping some data from an automotive forum using Python and saving all data into CSV files. With some help from other Stack Overflow members I managed to get as far as mining through all pages for a certain topic, gathering the dates, title and link for each post.
I also have a separate script I am now struggling with implementing (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).
Would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but checking through it multiple times it seems right to me.
Here's the code snippet:
link += (div.get('href'))
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += (tempNext.get('href'))
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)
tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
My main issue with it so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts for a forum thread.
This is the output I'm getting:
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
1
So it does seem to find the correct links for the new pages and scrape them, however on the next iteration it prints the new dates AND the same exact pages. There's also a really weird 10-12 second delay after the last link is printed; only then does it hop down to print the number 1 and then churn out all the new dates.
But after moving on to the next forum thread's link, it scrapes the same exact data every time.
Sorry if it looks really messy; it is sort of a side project and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about my possibly wrong logic would be greatly appreciated!
So after spending a little more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all pages and continues on with the next link.
This is the fixed code for it if anyone will make any use of it.
link += (div.get('href'))
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
while tempNumber < 4:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)
    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += (tempNext.get('href'))
        print(tempNextPage)
    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1
tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
All I had to do was to separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.
The non-working bit: the first 2 threads of the provided link have multiple pages of posts; the following 10+ threads do not. I cannot figure out a way to check the for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
value outside of the loop to see if it's empty or not, because if it does not find a next-page element / href, it just uses the last one. But if I reset the value after each run, it no longer mines each page =l A solution that just created another problem :D
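One way around the stale tempNextPage problem is to stop using a for loop for what is really a single lookup: fetch the first matching element and treat None as "last page reached". BeautifulSoup's find() does exactly that; the same idea can be shown with a stdlib-only sketch (the find_next_page helper and the sample HTML here are mine, for illustration only):

```python
import re

def find_next_page(html):
    """Return the href of a 'Next Page -' link, or None when the thread
    has no further pages, so the caller can break out of the loop."""
    m = re.search(r'<a[^>]*title="Next Page -[^"]*"[^>]*href="([^"]+)"', html)
    return m.group(1) if m else None

# In the crawl loop: when find_next_page(...) returns None, break
# instead of reusing the href left over from the previous thread.
```

With soup3 the equivalent would be soup3.find(title=re.compile("^Next Page -")), checked against None before building the next URL.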
Many thanks, dear Norbis, for sharing your ideas and insights.
Since you offer only a snippet, I'll just try to provide an approach that shows how to log in to a phpBB forum using a payload:
import requests
forum = "the forum name"
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'username': 'username', 'password': 'password', 'redirect':'index.php', 'sid':'', 'login':'Login'}
session = requests.Session()
r = session.post(forum + "ucp.php?mode=login", headers=headers, data=payload)
print(r.text)
But wait: instead of manipulating the website with requests, we can also use browser automation; mechanize offers this.
That way we don't have to manage our own session and need only a few lines of code to craft each request.
An interesting example is on GitHub: https://github.com/winny-/sirsi/blob/317928f23847f4fe85e2428598fbe44c4dae2352/sirsi/sirsi.py#L74-L211

Scrapy follow link as well as encoding error

I've been trying to implement a parse function.
Essentially I figured out through the scrapy shell that
response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
gives me the url directing me to the next page. So I tried following the instructions with next_page. I took a look around Stack Overflow and it seems that everyone uses rule(LinkExtractor..., which I don't believe I need to use. I'm pretty sure I'm doing it completely wrong though. I originally had a for loop that added every link I wanted to visit to start_urls, because I knew they were all of the form *p1.html, *p2.html, etc., but I want to make this smarter.
def parse(self, response):
    items = []
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        items.append(itemx)
        with open('log.txt', 'a') as f:
            f.write('\ninformation: ' + itemx.get('information')
    # URL of next page: response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
    next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    if (response.url != response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')):
        if next_page:
            yield Request(response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')[0], self.parse)
    return items
but it does not work; I get a
next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
^ SyntaxError: invalid syntax
error. Additionally I know that the yield Request part is wrong. I want to recursively call parse and recursively add each page's scrape to the list items.
Thank you!
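For what it's worth, an "invalid syntax" reported on the next_page line is the classic symptom of an unbalanced line above it: the f.write(...) call in the snippet is missing its closing parenthesis, so Python keeps reading into the next statement and only complains there. A minimal reproduction of the effect:

```python
# The f.write(...) call lacks its closing ')', so the parser swallows
# the following line and reports the SyntaxError there, not on the
# line where the mistake actually is.
snippet = (
    "f.write('information: ' + info\n"   # unbalanced '(' on this line
    "next_page = 'page2.html'\n"
)
try:
    compile(snippet, '<demo>', 'exec')
    raised = False
except SyntaxError:
    raised = True
```

Closing the parenthesis on the f.write line makes the reported error disappear.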

Checking ALL links within links from a source HTML, Python

My code is to search a Link passed in the command prompt, get the HTML code for the webpage at the Link, search the HTML code for links on the webpage, and then repeat these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html on its webpage. My code reads all of these, but it does not visit these links individually to search for more links. If it did this, it should search through these links and find a link that ends in p10.html, and then it should report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and it's giving me a hard time.
My code..
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0

while (url_list and AmountVisited < maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http", url)  # Print the url being tested; this line is here only for testing
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s)  # Creates a list of all links in HTML code starting with http or https
        while urls_list:
            insert = urls_list.pop()
            while (insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
    checkedURLs = insert
Please help :)
Here is the code you wanted. However, please, stop using regexes for parsing HTML. BeautifulSoup is the way to go for that.
import re
from urllib import urlopen

def readwebpage(url):
    print "testing ", current
    return urlopen(url).read()

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while (yet_to_visit and AmountVisited < maxhits):
    print yet_to_visit
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html == None:
        print "* bad reference to http", current
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # Creates a list of all links in the HTML code
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print links
    visited_urls.append(current)
Not Python but since you mentioned you aren't tied strictly to regex, I think you might find some use in using wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursive depth to 10, meaning wget will only go as deep as 10 levels in, this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".
I'd change the regex to: urls_list = re.findall(r'href="(.*)"', s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s).
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
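To see the final pattern in action, here it is compiled and run over a small HTML snippet (the sample HTML is made up for the demo):

```python
import re

# Named group 'q' captures the opening quote; (?P=q) requires the same
# quote to close, and the non-greedy .*? keeps one match from spanning
# two attributes.
href_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')

html = '<a href="http://example.com/p2.html">2</a> <a href=\'https://example.org/\'>x</a>'
links = [m.group(2) for m in href_re.finditer(html)]
# links is ['http://example.com/p2.html', 'https://example.org/']
```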

Cleaning up data written from BeautifulSoup to Text File

I am trying to write a program that will collect specific information from an ebay product page and write that information to a text file. To do this I'm using BeautifulSoup and Requests and I'm working with Python 2.7.9.
I've been mostly using this tutorial (Easy Web Scraping with Python) with a few modifications. So far everything works as intended until it writes to the text file. The information is written, just not in the format that I would like.
What I'm getting is this:
{'item_title': u'Old Navy Pink Coat M', 'item_no': u'301585876394', 'item_price': u'US $25.00', 'item_img': 'http://i.ebayimg.com/00/s/MTYwMFgxMjAw/z/Sv0AAOSwv0tVIoBd/$_35.JPG'}
What I was hoping for was something that would be a bit easier to work with.
For example :
New Shirt 5555555555 US $25.00 http://ImageURL.jpg
In other words I want just the scraped text and not the brackets, the 'item_whatever', or the u'.
After a bit of research I suspect my problem is to do with the encoding of the information as its being written to the text file, but I'm not sure how to fix it.
So far I have tried,
def collect_data():
    with open('writetest001.txt', 'w') as x:
        for product_url in get_links():
            get_info(product_url)
            data = "'{0}','{1}','{2}','{3}'".format(item_data['item_title'], 'item_price', 'item_no', 'item_img')
            x.write(str(data))
In the hopes that it would make the data easier to format in the way I want. It only resulted in "NameError: global name 'item_data' is not defined" displayed in IDLE.
I have also tried using .split() and .decode('utf-8') in various positions but have only received AttributeErrors or the written outcome does not change.
Here is the code for the program itself.
import requests
import bs4

# Main URL for Harvesting
main_url = 'http://www.ebay.com/sch/Coats-Jackets-/63862/i.html?LH_BIN=1&LH_ItemCondition=1000&_ipg=24&rt=nc'

# Harvests Links from "Main" Page
def get_links():
    r = requests.get(main_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    return [a.attrs.get('href') for a in soup.select('div.gvtitle a[href^=http://www.ebay.com/itm]')]

print "Harvesting Now... Please Wait...\n"
print "Harvested:", len(get_links()), "URLs"
#print (get_links())
print "Finished Harvesting... Scraping will Begin Shortly...\n"

# Scrapes Select Information from each page
def get_info(product_url):
    item_data = {}
    r = requests.get(product_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    # Fixes the 'Details about ' problem in the Title
    for tag in soup.find_all('span', {'class': 'g-hdn'}):
        tag.decompose()
    item_data['item_title'] = soup.select('h1#itemTitle')[0].get_text()
    # Grabs the Price; if the item is on sale, grabs the sale price
    try:
        item_data['item_price'] = soup.select('span#prcIsum')[0].get_text()
    except IndexError:
        item_data['item_price'] = soup.select('span#mm-saleDscPrc')[0].get_text()
    item_data['item_no'] = soup.select('div#descItemNumber')[0].get_text()
    item_data['item_img'] = soup.find('img', {'id': 'icImg'})['src']
    return item_data

# Collects information from each page and writes to a text file
write_it = open("writetest003.txt", "w", "utf-8")

def collect_data():
    for product_url in get_links():
        write_it.write(str(get_info(product_url)) + '\n')

collect_data()
write_it.close()
You were on the right track.
You need a local variable to assign the results of get_info to. The variable item_data you tried to reference only exists within the scope of the get_info function. You can use the same variable name though, and assign the results of the function to it.
There was also a syntax issue in the section you tried with respect to how you're formatting the items.
Replace the section you tried with this:
for product_url in get_links():
    item_data = get_info(product_url)
    data = "{0},{1},{2},{3}".format(*(item_data[item] for item in ('item_title', 'item_price', 'item_no', 'item_img')))
    x.write(data)
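Run against a dict shaped like the one get_info returns (values borrowed from the question's output; the image URL is shortened to a placeholder for the demo), the corrected line produces the flat comma-separated text the asker wanted:

```python
item_data = {
    'item_title': u'Old Navy Pink Coat M',
    'item_no': u'301585876394',
    'item_price': u'US $25.00',
    'item_img': 'http://example.com/img.jpg',  # placeholder image URL
}
# Unpack the values in the desired column order into str.format.
data = "{0},{1},{2},{3}".format(
    *(item_data[item] for item in ('item_title', 'item_price', 'item_no', 'item_img'))
)
# data is 'Old Navy Pink Coat M,US $25.00,301585876394,http://example.com/img.jpg'
```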

Equivalent of wget in Python to download website and resources

The same thing was asked 2.5 years ago in Downloading a web page and all of its resource files in Python, but it doesn't lead to an answer, and the 'please see related topic' isn't really asking the same thing.
I want to download everything on a page to make it possible to view it just from the files.
The command
wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows
does exactly what I need. However, we want to be able to tie it in with other stuff that must be portable, so it needs to be in Python.
I've been looking at Beautiful Soup, scrapy, various spiders posted around the place, but these all seem to deal with getting data/links in clever but specific ways. Using these to do what I want seems like it will require a lot of work to deal with finding all of the resources, when I'm sure there must be an easy way.
thanks very much
You should be using an appropriate tool for the job at hand.
If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs that feature, and because wget does its job so well, nobody has bothered to try to write a python library to replace it.
Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?
My colleague wrote up this code, much of it pieced together from other sources I believe. It might have some quirks specific to our system, but it should help anyone wanting to do the same.
"""
Downloads all links from a specified location and saves to machine.
Downloaded links will only be of a lower level then links specified.
To use: python downloader.py link
"""
import sys,re,os,urllib2,urllib,urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')
def main():
link_list = []##create a list of all found links so there are no duplicates
restrict = sys.argv[1]##used to restrict found links to only have lower level
link_list.append(restrict)
parent_folder = restrict.rfind('/', 0, len(restrict)-1)
##a.com/b/c/d/ make /d/ as parent folder
while 1:
try:
crawling = tocrawl.pop()
#print crawling
except KeyError:
break
url = urlparse.urlparse(crawling)##splits url into sections
try:
response = urllib2.urlopen(crawling)##try to open the url
except:
continue
msg = response.read()##save source of url
links = linkregex.findall(msg)##search for all href in source
links = links + linksrc.findall(msg)##search for all src in source
for link in (links.pop(0) for _ in xrange(len(links))):
if link.startswith('/'):
##if /xxx a.com/b/c/ -> a.com/b/c/xxx
link = 'http://' + url[1] + link
elif ~link.find('#'):
continue
elif link.startswith('../'):
if link.find('../../'):##only use links that are max 1 level above reference
##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
parent_pos = url[2].rfind('/')
parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
parent_url = url[2][:parent_pos]
new_link = link.find('/')+1
link = link[new_link:]
link = 'http://' + url[1] + parent_url + link
else:
continue
elif not link.startswith('http'):
if url[2].find('.html'):
##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
a = url[2].rfind('/')+1
parent = url[2][:a]
link = 'http://' + url[1] + parent + link
else:
##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
link = 'http://' + url[1] + url[2] + link
if link not in link_list:
link_list.append(link)##add link to list of already found links
if (~link.find(restrict)):
##only grab links which are below input site
print link ##print downloaded link
tocrawl.add(link)##add link to pending view links
file_name = link[parent_folder+1:]##folder structure for files to be saved
filename = file_name.rfind('/')
folder = file_name[:filename]##creates folder names
folder = os.path.abspath(folder)##creates folder path
if not os.path.exists(folder):
os.makedirs(folder)##make folder if it does not exist
try:
urllib.urlretrieve(link, file_name)##download the link
except:
print "could not download %s"%link
else:
continue
if __name__ == "__main__":
main()
thanks for the replies
