Recently I bought an IP rotation service from ProxyRack and I want to use it with Scrapy, but their Python example uses the requests library and I'm not sure how to translate it to Scrapy. Please help me. Here is their code; I want to do the same thing with Scrapy:
import requests
username = "vranesevic"
password = "svranesevic"
PROXY_RACK_DNS = "megaproxy.rotating.proxyrack.net:222"
urlToGet = "http://ip-api.com/json"
proxy = {"http":"http://{}:{}#{}".format(username, password, PROXY_RACK_DNS)}
r = requests.get(urlToGet , proxies=proxy)
print("Response:\n{}".format(r.text))
You can follow the Scrapy documentation on how to set up a custom proxy; if you are not familiar with it, here are the steps.
Step 1 - Open your project's middlewares.py file and paste the class below. Replace the URL with the one provided by ProxyRack (keep the http:// scheme), and put your ProxyRack username and password inside basic_auth_header.
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Replace with the proxy endpoint provided by ProxyRack
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header("<proxy_user>", "<proxy_pass>")
Step 2 - Go to your settings.py file and enable the downloader middlewares by pasting this at the bottom. Make sure you replace the word myproject with your project name.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
That's it and you are ready to go.
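For reference, here is a minimal sketch of the same middleware wired to the ProxyRack details from the question; the credentials and rotating DNS are copied verbatim from the requests example above, so adjust them to your own account:

from w3lib.http import basic_auth_header

class ProxyRackMiddleware(object):
    # Values taken from the requests example in the question
    username = "vranesevic"
    password = "svranesevic"
    proxy_dns = "megaproxy.rotating.proxyrack.net:222"

    def process_request(self, request, spider):
        # Route every request through the rotating ProxyRack endpoint
        request.meta["proxy"] = "http://{}".format(self.proxy_dns)
        # ProxyRack authenticates with HTTP Basic auth on the proxy connection
        request.headers["Proxy-Authorization"] = basic_auth_header(self.username, self.password)

Register it in DOWNLOADER_MIDDLEWARES exactly as in Step 2, just swapping in the class name.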
I'm trying to automate some data processing in my org, but I'm struggling with how to use the Office365 Python module to make the requests. When I use
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext, UserCredential
from office365.sharepoint.files.file import File
sharepoint_base_url = 'https://organisation.sharepoint.com/sites/suborganisation'
sharepoint_user = 'username@org.org'
sharepoint_password = 'password'
ctx = ClientContext(sharepoint_base_url).with_credentials(UserCredential(sharepoint_user, sharepoint_password))
web = ctx.web
ctx.load(web)
ctx.execute_query()
print(f"Web title: {web.properties['Title']}")
It will access the main site, but as soon as I try to work with any of the subdirectories I start to run into errors. I think the problem is either that I just don't understand the relative URLs, or that there's something about the structure of the site that I'm missing.
For example, the full URL for the folder I want to point to (with modification, obviously) is
folder_in_sharepoint = 'https://organisation.sharepoint.com/sites/suborganisation/current_project/Forms/AllItems.aspx?id=%2Fsites%2Fsuborganisation%2Fcurrent_project%2FApplications%202023&viewid=longalphanumericstring' # copied directly from the browser
def folder_details(ctx, folder_in_sharepoint):
    folder = ctx.web.get_folder_by_server_relative_url(folder_in_sharepoint)
    fold_names = []
    sub_folders = folder.files
    ctx.load(sub_folders)
    ctx.execute_query()
    for s_folder in sub_folders:
        fold_names.append(s_folder.properties["Name"])
    return fold_names

# listing objects in the folder
file_list = folder_details(ctx, folder_in_sharepoint)
I get the error
ClientRequestException: (None, None, "400 Client Error: Bad Request for url: https://organisation.sharepoint.com/sites/suborganisation/_api/Web/getFolderByServerRelativeUrl('https:%2F%2Forganisation.sharepoint.com%2Fsites%2Fsuborganisation%2Fcurrent_project%2FForms%2FAllItems.aspx%3Fid=%252Fsites%252Fsuborganisation%252Fcurrent_project%252FApplications%25202023%26viewid=longalphanumericstring')/Files")
It appears to me that ctx.web.get_folder_by_server_relative_url(folder_in_sharepoint) is where the problem is, and that it's something to do with what I'm passing as the relative URL, but I've tried commenting out every line and trying every permutation of the URL I can think of, and I'm still getting nowhere.
Any guidance appreciated.
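For context, get_folder_by_server_relative_url expects just the path portion of the URL rather than the full browser URL with Forms/AllItems.aspx and its query string. A minimal sketch under that assumption; the path segments below are reconstructed from the id= parameter in the browser URL and may not match the real library, so verify them in SharePoint first:

# Hypothetical server-relative path reconstructed from the id= parameter in the browser URL;
# confirm the real path in SharePoint before relying on it.
folder_relative_url = "/sites/suborganisation/current_project/Applications 2023"

folder = ctx.web.get_folder_by_server_relative_url(folder_relative_url)
files = folder.files
ctx.load(files)
ctx.execute_query()
for f in files:
    print(f.properties["Name"])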
I've created a script using Scrapy that rotates proxies to parse the address from a few hundred similar links like this one. I supply those links from a CSV file within the script.
The script does fine until it encounters a response URL like https://www.bcassessment.ca//Property/UsageValidation; once the script starts getting that page, it can't get past it. FYI, I'm using meta properties containing lead_link to retry with the original link instead of the redirected one, so I should be able to get past that barrier.
It doesn't happen when I use proxies with the requests library. To be clearer: while using the requests library, the script does encounter the /Property/UsageValidation page but gets past it successfully after a few retries.
The spider is like:
import csv

import scrapy
from scrapy.crawler import CrawlerProcess

class mySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link, self.parse, meta={"lead_link": lead_link, "download_timeout": 20}, dont_filter=True)

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I keep the script from getting stuck on that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make a session-safe proxy implementation for a Scrapy app, you need to add an additional cookiejar meta key wherever you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta = {"proxy":address, "cookiejar":address})
In this case, Scrapy's CookiesMiddleware will create a separate cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
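Applied to the start_requests from the question, a minimal sketch might look like the following; the proxy address is whatever your ProxiesMiddleware would otherwise pick, shown inline here only for illustration:

def start_requests(self):
    with open("output_main.csv", "r") as f:
        reader = csv.DictReader(f)
        for item in list(reader):
            lead_link = item['link']
            # Illustrative placeholder; in practice this comes from your rotation logic
            proxy_address = "http://some.proxy.host:8080"
            yield scrapy.Request(
                lead_link,
                self.parse,
                meta={
                    "lead_link": lead_link,
                    "download_timeout": 20,
                    "proxy": proxy_address,
                    # Keying the cookiejar on the proxy gives each proxy its own cookie session
                    "cookiejar": proxy_address,
                },
                dont_filter=True,
            )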
I am trying to use requests (python) to grab some pages from a website that requires me to be logged in.
I inspected the login page to check the username and password fields, but I found that the names of those fields are not the standard 'username' and 'password' used by most sites, as you can see from the screenshots below.
(screenshot: password field)
I used them that way in my Python script, but each time I got a syntax error. Sublime Text even displayed part of the name in orange, as you can see in the screenshot below.
From this I know there must be some problem with the names, but trying to escape the $ signs did not help.
Even the login.aspx request disappears before Google Chrome can register it in the Network tab.
The site is www dot bncnetwork dot net
I'd be happy if someone could help me figure out what to do about this.
Here is the code:
import requests
def get_project_page(seed_page):
    username = "*******************"
    password = "*******************"
    bnc_login = dict(ctl00$MainContent$txtEmailID=username, ctl00$MainContent$txtPassword=password)
    sess_req = requests.Session()
    sess_req.get(seed_page)
    sess_req.post(seed_page, data=bnc_login, headers={"Referer":"http://www.bncnetwork.net/MyBNC.aspx"})
    page = sess_req.get(seed_page)
    return page.text
You need to use strings for the keys; the $ will cause a syntax error if you don't:
data = {"ctl00$MainContent$txtPassword":password, "ctl00$MainContent$txtEmailID":email}
There are even validation fields etc. to be filled in as well; follow the logic from this answer to fill them out. All the fields can be seen in Chrome's dev tools:
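A sketch of that flow, assuming the usual ASP.NET hidden fields (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION) are present on the login page and that MyBNC.aspx (the Referer in the question) is the login endpoint; confirm the exact field names and URL in Chrome's dev tools:

import requests
from bs4 import BeautifulSoup

login_url = "http://www.bncnetwork.net/MyBNC.aspx"  # assumed from the Referer in the question

with requests.Session() as sess:
    # Fetch the login page and scrape every hidden ASP.NET form field
    soup = BeautifulSoup(sess.get(login_url).text, "html.parser")
    data = {tag["name"]: tag.get("value", "")
            for tag in soup.select("input[type=hidden]") if tag.has_attr("name")}
    # Add the credential fields, using string keys because of the $ characters
    data["ctl00$MainContent$txtEmailID"] = "you@example.com"
    data["ctl00$MainContent$txtPassword"] = "your_password"
    resp = sess.post(login_url, data=data, headers={"Referer": login_url})
    print(resp.status_code)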
I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
But how do I schedule all spiders in a project at once?
All help much appreciated!
My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.
YOURPROJECTNAME/commands/allcrawl.py :
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response)
Make sure to include the following in your settings.py
COMMANDS_MODULE = 'YOURPROJECTNAME.commands'
Then from the command line (in your project directory) you can simply type
scrapy allcrawl
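On newer setups, a Python 3 sketch of the same idea that talks to scrapyd directly through its listspiders.json and schedule.json endpoints; this assumes scrapyd is running on localhost:6800 and the project has already been deployed, and myproject must be replaced with your project name:

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"  # replace with your deployed project name

# Ask scrapyd which spiders the deployed project contains
spiders = requests.get(SCRAPYD + "/listspiders.json", params={"project": PROJECT}).json()["spiders"]

# Schedule a run for each spider
for spider in spiders:
    resp = requests.post(SCRAPYD + "/schedule.json", data={"project": PROJECT, "spider": spider})
    print(spider, resp.json().get("jobid"))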
Sorry, I know this is an old topic, but I've started learning scrapy recently and stumbled here, and I don't have enough rep yet to post a comment, so posting an answer.
From the common Scrapy practices documentation you'll see that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your spider runs among them.
I'm currently trying to get a grasp on pycurl. I'm attempting to log in to a website. After logging in, the site should redirect to the main page; however, when trying this script I just get returned to the login page. What might I be doing wrong?
import pycurl
import urllib
import StringIO
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
pageContents.seek(0)
print pageContents.readlines()
EDIT: As pointed out by Peter the URL should point to a login URL but the site I'm trying to get this to work for fails to show me what URL this would be. The form's action just points to the home page ( /index.html )
As you're troubleshooting this problem, I suggest getting a browser plugin like FireBug or LiveHTTPHeaders (I suggest Firefox plugins, but there are similar plugins for other browsers as well). Then you can exercise a request to the site and see what action (URL), method, and form parameters are being passed to the target server. This will likely help elucidate the crux of the problem.
If that's no help, you may consider using a different tool for your mechanization. I've used ClientForm and BeautifulSoup to perform similar operations. Based on what I've read in the pycURL docs and your code above, ClientForm might be a better tool to use. ClientForm will parse your HTML page, locate the forms on it (including login forms), and construct the appropriate request for you based on the answers you supply to the form. You could even use ClientForm with pycURL... but at least ClientForm will provide you with the appropriate action to which to POST, and construct all of the appropriate parameters.
Be aware, though, that if there is JavaScript handling any necessary part of the login form, even ClientForm can't help you there. You will need something that interprets the JavaScript to effectively automate the login. In that case, I've used SeleniumRC to control a browser (and I let the browser handle the JavaScript).
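For example, a rough sketch of the ClientForm-style approach described above, written here with mechanize (which provides the same form-parsing machinery); the URL, form index, and field names are assumptions that have to be adapted to the real login page:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)            # don't let robots.txt block the request
br.open("http://localhost/index.html") # the page that contains the login form (assumed)
br.select_form(nr=0)                   # assume the login form is the first form on the page
br["username"] = "user"                # keys must match the form's actual input names
br["password"] = "pass"
response = br.submit()                 # mechanize builds the POST to the form's action for us
print response.geturl()                # where we ended up after any redirects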
One golden rule: 'break the ice' by enabling debugging when you're trying to troubleshoot a pycurl example.
Note: don't forget to use p.close() after p.perform()
def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())

p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
Now you can see what your code is actually doing, because debugging is enabled:
import pycurl
import urllib
import StringIO
def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
p.close() # This is mandatory.
pageContents.seek(0)
print pageContents.readlines()