Scrapy email file on completed scrape - python

I have a scraper that runs at a timed interval. I want to send an email when the scrape completes. What would be the best way to do this?
I was thinking of writing an extension, but I can't figure out how to access the file that the output is being written to from within the extension.

Have you considered hooking the spider_closed signal and using the scrapy.mail.MailSender service?
scrapy.signals.spider_closed(spider, reason)
[...]
reason (str) – a string which describes the reason why the spider was closed. If it was closed because the spider has completed scraping, the reason is 'finished'.
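A minimal sketch of such an extension follows. It assumes two custom settings, REPORT_RECIPIENTS and REPORT_FILE, which are hypothetical names and not part of Scrapy, and it builds the mailer with MailSender.from_settings as shown in the Scrapy mail documentation (newer releases may prefer a different constructor, so check your version). The Scrapy imports are deferred so the class can be tried out with a stand-in mailer.

```python
from pathlib import Path


class EmailOnFinish:
    """Scrapy extension that emails the output file when a spider finishes."""

    def __init__(self, mailer, recipients, report_path):
        self.mailer = mailer  # anything exposing MailSender's send() signature
        self.recipients = recipients
        self.report_path = Path(report_path)

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class itself can be exercised without Scrapy.
        from scrapy import signals
        from scrapy.mail import MailSender

        ext = cls(
            mailer=MailSender.from_settings(crawler.settings),
            # REPORT_RECIPIENTS and REPORT_FILE are hypothetical custom settings.
            recipients=crawler.settings.getlist("REPORT_RECIPIENTS"),
            report_path=crawler.settings.get("REPORT_FILE"),
        )
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        if reason != "finished":  # skip cancelled or aborted runs
            return None
        with self.report_path.open("rb") as fh:
            # attachs takes (name, mimetype, file object) tuples.
            return self.mailer.send(
                to=self.recipients,
                subject=f"Scrape finished: {spider.name}",
                body=f"Spider {spider.name} closed with reason {reason!r}.",
                attachs=[(self.report_path.name, "application/octet-stream", fh)],
            )
```

The extension would be enabled through the EXTENSIONS setting, pointing at wherever the class lives in your project. One caveat worth checking: the feed exporter also finalises its output when the spider closes, so whether the file is complete at this point depends on handler order.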

Related

An HTTP proxy like Fiddler AutoResponder

Hey, I'm trying to create something like Fiddler's AutoResponder.
I want to replace the content served for a specific URL with the content of another one. For example:
https://randomdomain.com content replaced with https://cliqqi.ml
I've researched everything I could but found nothing that works. I tried creating a proxy server, a Node.js script, and a Python script, all without success.
P.S. I'm doing this because I want to intercept an Electron app fetching its main game file from a site, and point that request at my own site instead.
If you're looking for a way to do this programmatically in Node.js, I've written a library you can use to do exactly that, called Mockttp.
You can use Mockttp to create an HTTPS-intercepting & rewriting proxy, which will allow you to send mock responses directly, redirect traffic from one address to another, rewrite anything including the headers & body of existing traffic, or just log everything that's sent & received. There's a full guide here: https://httptoolkit.tech/blog/javascript-mitm-proxy-mockttp/

How do I get URLs that are being accessed in my browser in 'real time'?

I want to write a program that returns the current or last visited URL by me on my computer (Windows 10) browser. Is there any way in which I can get that URL?
I tried using Python and SQLite to access the Chrome history database at C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data\Default\History and it worked, but while the browser is running the database is locked.
I know that with Wireshark one can see the packets sent when accessing a URL, but I cannot find the complete URL in those packets' fields, only the server name (e.g. stackoverflow.com).
I'd like to know whether there is a way in which I can see that information as it's done by Wireshark, but only to get the complete URL, nothing else. Thank you!
I found a solution to this by using mitmproxy: https://mitmproxy.org/. This video on YouTube helped me with the installation and setup process: https://www.youtube.com/watch?v=7BXsaU42yok. The video explains the installation on Mac, but it's not so different from Windows. Then you can use Python to capture and process the URLs contained within the HTTPS requests by using the flow.request.pretty_url property: https://docs.mitmproxy.org/stable/addons-scripting/.
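As a rough sketch of what such a script could look like, the addon below records flow.request.pretty_url for every request it sees. The class name and the in-memory list are illustrative choices, not something from the mitmproxy docs.

```python
import logging


class UrlLogger:
    """mitmproxy addon that records the full URL of every request."""

    def __init__(self):
        self.urls = []

    def request(self, flow):
        # mitmproxy calls this hook once per client request.
        url = flow.request.pretty_url
        self.urls.append(url)
        logging.info("visited %s", url)


# mitmproxy looks for a module-level `addons` list in scripts loaded with -s.
addons = [UrlLogger()]
```

Saved as, say, url_logger.py (a hypothetical file name), it would be loaded with `mitmdump -s url_logger.py` while the browser is configured to use the proxy and trusts the mitmproxy certificate.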

scrapy: spider quits without error messages before all requests yielded

If there are many requests in the scheduler, will the scheduler reject new requests?
I've run into a very tricky problem. I am trying to scrape a forum with all its posts and comments. The problem is that Scrapy never seems to finish its job and quits without any error message. I am wondering whether I yielded so many requests that Scrapy stopped accepting new ones and just quit.
But I could not find any documentation saying that Scrapy will quit if there are too many requests in the scheduler. Here is my code:
https://github.com/spacegoing/sentiment_mqd/blob/a46b59866e8f0a888b43aba6df0481a03136cf21/guba_spiders/guba_spiders/spiders/guba_spider.py#L217
The strange thing is that Scrapy only seems able to scrape 22 pages. If I start from page 1, it stops at page 21. If I start from page 21, it stops at page 41, and so on. No exception is raised, and the scraped results are the desired output.
1.
The code on GitHub you shared at a46b598 is probably not the exact version you ran locally for the sample jobs. For example, I couldn't find any line in it that would produce log lines like <timestamp> [guba] INFO: <url>.
Still, I assumed the difference isn't significant.
2.
It's a good idea to set the log level to DEBUG whenever you run into an issue.
3.
If you've got the log level configured to DEBUG, you'd probably see something like this:
2018-10-26 15:25:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://guba.eastmoney.com/topic,600000_22.html>: max redirections reached
Some more lines: https://gist.github.com/starrify/b2483f0ed822a02d238cdf9d32dfa60e
That happens because you're passing the full response.meta dict to the following requests (related code), and Scrapy's RedirectMiddleware relies on some meta values (e.g. "redirect_times" and "redirect_ttl") to perform the check.
And the solution is simple: pass only the values you need into next_request.meta.
4.
I also noticed that you're rotating the user agent strings, possibly to avoid crawl bans, but taking no other action. Your requests would still look suspicious, because:
Scrapy's cookie management is enabled by default, so the same cookie jar is used for all your requests.
All your requests come from the same source IP address.
So I'm unsure whether that's enough to scrape the whole site properly, especially since you're not throttling the requests.
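The fix from point 3 can be sketched as a small helper that copies only whitelisted keys out of response.meta, so bookkeeping values such as "redirect_times" and "redirect_ttl" are not carried over to the next request. The key names in KEEP_KEYS are hypothetical; use whatever your callbacks actually read.

```python
# Keys the next callback actually needs; these names are hypothetical.
KEEP_KEYS = ("stock_id", "page")


def pick_meta(meta, keys=KEEP_KEYS):
    """Return a new dict with only the whitelisted keys from response.meta."""
    return {key: meta[key] for key in keys if key in meta}


# Inside a spider callback this would be used roughly as:
#     yield scrapy.Request(next_url, callback=self.parse,
#                          meta=pick_meta(response.meta))
```

A whitelist is used rather than deleting the redirect keys, since other middlewares may also store their own state in meta.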

Node.js scraping with chrome-remote-interface

I have been trying to scrape a website protected by Distil Networks,
on which using Selenium (with Python) always fails.
I did a few searches, and my conclusion is that the site can detect that you are using Selenium by means of some JavaScript. I then took a look at chrome-remote-interface, which looks like what I want, but then I got stuck.
What I would like to do is to automate following steps:
Open a Chrome instance
Navigate to a page
Run some javascript
Collect data and save to file
Repeat steps 2 - 4
I know that I can open an instance of Chrome for debugging with:
google-chrome --remote-debugging-port=9222
And I can open a console on node by:
chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r
I can also run simple scripts like
Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})
But I can't get at the DOM directly from Node.js the way I can in the Chrome Developer Tools console. Basically, I want to run scripts from Node the same way I would in the Chrome Developer Tools console.
Also, there is not much documentation on using chrome-remote-interface for scraping. Are there any good links for that?
I know this was asked two years ago, but let me write it here for documentation purposes.
-- Tools of the trade --
I tried the same technique as you did (using the remote debugger for scraping), but instead of Python I used Node.js because of its asynchronous nature, which makes it easier to work with the WebSockets that the remote debugger relies on.
-- Runtime.evaluate --
One thing I noticed is that Runtime.evaluate isn't a valid option for recovering data if your expression involves asynchronous calls, because it returns the result of the calling function and not of the callback function. You have to stick with synchronous expressions.
Example:
Array.from(document.getElementsByTagName('tr'))
  .map((e) => e.children[2].innerHTML)
  .filter((e) => e.length > 0)
Another thing is that when your expression returns an array, Runtime.evaluate just mentions that the expression returned an array but does not return the array itself! (Infuriating, I know.)
I got around it by simply encoding the arrays as JSON strings in the page context, then decoding them back into objects when they arrive in Node.js. For example, the above expression would need to be:
JSON.stringify(
  Array.from(document.getElementsByTagName('tr'))
    .map((e) => e.children[2].innerHTML)
    .filter((e) => e.length > 0)
)
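The encode-then-decode round trip can be sketched in Python as below. This only builds and parses the protocol's JSON messages; sending them over the debugger WebSocket is left to whatever client you use. The helper names are made up for illustration, and the reply shape (a "result" object wrapping a remote object with a "value" field) is my understanding of the protocol, so verify it against a real reply.

```python
import json


def build_evaluate(msg_id, expression):
    """Build a Runtime.evaluate message whose result comes back as a JSON string."""
    return json.dumps({
        "id": msg_id,
        "method": "Runtime.evaluate",
        # Wrap the expression so the page serialises the value itself.
        "params": {"expression": f"JSON.stringify({expression})"},
    })


def decode_evaluate(raw_reply):
    """Turn the protocol reply back into a Python object."""
    reply = json.loads(raw_reply)
    remote_object = reply["result"]["result"]  # describes the evaluated value
    return json.loads(remote_object["value"])
```

For instance, build_evaluate(1, "Array.from(document.getElementsByTagName('tr')).map((e) => e.children[2].innerHTML)") produces the message to send, and decode_evaluate applied to the matching reply yields a plain Python list.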
-- Navigation --
When you trigger a page load using "Page.navigate", ".click()", ".submit()", "window.location.href=..." or any other way, it's important to know when the next page has completely loaded before sending more instructions with Runtime.evaluate.
I did this by asking the debugger to send me the page loading events (look for the Page.enable method in the documentation) and then waiting for the "Page.loadEventFired" event before sending more expressions.
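The waiting step might look like the sketch below, which assumes you already have some iterator of raw JSON frames from the debugger WebSocket and that Page.enable has been sent beforehand. The function name and the iterator-based interface are illustrative assumptions.

```python
import json


def wait_for_event(messages, event_name="Page.loadEventFired"):
    """Consume raw protocol messages until the named event arrives.

    `messages` is any iterator of JSON strings, e.g. frames read from the
    debugger WebSocket. Returns the event's params, or None if the stream
    ends first. Messages read before the event are discarded.
    """
    for raw in messages:
        message = json.loads(raw)
        # Events carry a "method" field; command replies carry an "id" instead.
        if message.get("method") == event_name:
            return message.get("params", {})
    return None
```

Because earlier messages are dropped, a real client would want to route command replies elsewhere before calling this.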
JavaScript expressions evaluated by Runtime.evaluate are executed within the page context just like what happens in the DevTools console.
You can interact with the DOM using the DOM domain, e.g., DOM.getDocument, DOM.querySelector, etc.
Also remember that chrome-remote-interface is mainly a library, meaning that it lets you write your own Node.js applications; chrome-remote-interface inspect is just a utility.
There are several places where you can get help:
open an issue on chrome-remote-interface;
the chrome-remote-interface wiki;
the Chrome DevTools Protocol Viewer;
the Chrome Debugging Protocol Google Group.
If you ask something more specific I'd be happy to try to help you with that.
Finally you may want to take a look at automated-chrome-profiling, which I think is structurally similar to what you're trying to achieve.

How Do You Pass New URLs to a Scrapy Crawler

I would like to keep a Scrapy crawler constantly running inside a Celery task worker, probably using something like this, or as suggested in the docs.
The idea would be to use the crawler for querying an external API that returns XML responses. I would like to pass the URL I want to query (or the query parameters, letting the crawler build the URL) to the crawler, and the crawler would make the call and give me back the extracted items. How can I pass this new URL to the crawler once it has started running? I do not want to restart the crawler every time I give it a new URL; instead, I want the crawler to sit idle, waiting for URLs to crawl.
The two methods I've spotted for running Scrapy inside another Python process use a new Process to run the crawler in. I would like to avoid forking and tearing down a new process every time I want to crawl a URL, since that is pretty expensive and unnecessary.
Just have a spider that polls a database (or file?) and, when presented with a new URL, creates and yields a new Request() object for it.
You can build it by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper. The spider gets a list of all the 'potential' proxies from the database and generates a Request() object for each one. When the responses come back, they are dispatched down the chain and verified by downstream middleware, and their records are updated by an item pipeline.
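The polling side of this idea could look like the sketch below, using SQLite as the database. The url_queue table and its columns are hypothetical; the answer does not say what schema was used.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the queue database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_queue ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, "
        "claimed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def claim_pending_urls(conn, limit=10):
    """Fetch up to `limit` unclaimed URLs and mark them as claimed."""
    rows = conn.execute(
        "SELECT id, url FROM url_queue WHERE claimed = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    if rows:
        conn.executemany(
            "UPDATE url_queue SET claimed = 1 WHERE id = ?",
            [(row_id,) for row_id, _ in rows],
        )
        conn.commit()
    return [url for _, url in rows]
```

Inside the spider, each returned URL would be turned into a Request() and yielded. Keeping the spider alive while the table is empty is a separate concern that this sketch does not address.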
You could use a message queue (like IronMQ--full disclosure, I work for the company that makes IronMQ as a developer evangelist) to pass in the URLs.
Then in your crawler, poll for the URLs from the queue, and crawl based on the messages you retrieve.
The example you linked to could be updated (this is untested and pseudocode, but you should get the basic idea):
import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ  # hyphens are not valid in module names

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    # timeout is the number of seconds the message will be reserved for,
    # making sure no other crawlers get that message. Set it to a safe value
    # (the max amount of time it will take you to crawl a page)
    msg = q.get(timeout=120)  # get messages from queue
    if len(msg["messages"]) < 1:  # if there are no messages waiting to be crawled
        time.sleep(1)  # wait one second
        continue  # try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it
