How to send multiple HTTP requests using Python

I'm trying to create a script that checks all possible top-level domain combinations for a given root domain name.
What I've done is generate a list of all possible TLD combinations. What I want to do is make multiple HTTP requests, analyze their results, and determine whether a root domain name has multiple active top-level domains or not.
Example:
A client of mine has these active domains:
domain.com
domain.com.ar
domain.ar
I've tried using grequests but get this error:
TypeError: 'AsyncRequest' object is not iterable
Code:
import grequests
responses = grequests.get([url for url in urls])
grequests.map(responses)

You cannot pass the whole list to grequests.get(); it expects a single URL. What you want is to build one request per URL, which you can do with an inline for loop, like this:
[grequests.get(u) for u in urls]
This is called a list comprehension, and more information about it can be found here: https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
The same thing with round brackets is a generator expression, which is what grequests.map() can consume, so the fixed code looks like this:
import grequests
responses = (grequests.get(u) for u in urls)
grequests.map(responses)
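If you then want to work out which TLD variants are actually active, a minimal sketch like the following could help, assuming a non-error status code counts as "active" and that urls holds the candidate domain URLs (both assumptions, not stated in the question):

import grequests

# hypothetical candidate URLs built from the root domain and a TLD list
urls = ['http://domain.com', 'http://domain.com.ar', 'http://domain.ar']

rs = (grequests.get(u, timeout=5) for u in urls)
responses = grequests.map(rs)

active = []
for url, response in zip(urls, responses):
    # grequests.map() returns None for requests that failed entirely
    # (e.g. the domain does not resolve)
    if response is not None and response.status_code < 400:
        active.append(url)

print(active)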

Related

Deleting all the CloudWatch rules using boto3

We have a ton of CloudWatch rules that we need to get rid of. I was working on a Python script to delete all of the CloudWatch rules, but I could only find delete_rule for a specific rule on the boto3 website, and I want to delete all the rules we have.
import boto3
client = boto3.client('events')
response = client.delete_rule(
    Name='string'
)
You have to do it in two stages.
Get the names of all the rules you have using list_rules.
Iterate over that list and delete the rules one by one using delete_rule.
import boto3

client = boto3.client('events')
rule_names = [rule['Name'] for rule in client.list_rules()['Rules']]
for rule_name in rule_names:
    response = client.delete_rule(Name=rule_name)
    print(response)
Depending on how many rules you actually have, you may need to run list_rules multiple times with NextToken.
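A sketch of that pagination loop, assuming the standard NextToken pattern that boto3 uses for list_rules:

import boto3

client = boto3.client('events')

rule_names = []
kwargs = {}
while True:
    page = client.list_rules(**kwargs)
    rule_names.extend(rule['Name'] for rule in page['Rules'])
    # keep going while the service reports more pages
    if 'NextToken' not in page:
        break
    kwargs['NextToken'] = page['NextToken']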
As a CloudWatch Events rule usually has at least one target, you need to:
Get the list of your rules using the list_rules method;
Loop through the list of rules and, within each iteration:
Get the list of the rule's target IDs using the list_targets_by_rule method;
Remove the targets using the remove_targets method;
Remove the rule using the delete_rule method.
So the resulting script will look like this:
import boto3

client = boto3.client('events')
rules = client.list_rules()['Rules']
for rule in rules:
    rule_targets = client.list_targets_by_rule(
        Rule=rule['Name']
    )['Targets']
    target_ids = [target['Id'] for target in rule_targets]
    # remove_targets rejects an empty Ids list, so only call it when there are targets
    if target_ids:
        remove_targets_response = client.remove_targets(
            Rule=rule['Name'],
            Ids=target_ids
        )
        print(remove_targets_response)
    delete_rule_response = client.delete_rule(
        Name=rule['Name']
    )
    print(delete_rule_response)

Is there any option to check whether a URL is working or not before passing it to the next function in Scrapy (Python)?

I have 2 function blocks in my scraper:
1. parse
2. parse_info
In the 1st block, I get the list of URLs.
Some of the URLs are working (they already have the 'https://www.example.com/' part).
The rest of the URLs are not working (they do not have the 'https://www.example.com/' part).
So before passing a URL to the 2nd block, i.e. parse_info, I want to validate it,
and if it is not working I want to edit it and add the required part (the 'https://www.example.com/' part).
You could leverage the requests module and check the status code of the website.
Similarly, if you're just trying to validate whether the URL contains a specific portion, i.e. the 'https://www.example.com/' part, you can do that with a regex query or a simple string check.
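A minimal sketch of both checks, assuming a plain requests call outside Scrapy's own request machinery is acceptable for the validation step (the helper names are made up; the base URL is the example one from the question):

import requests

BASE = 'https://www.example.com/'

def is_working(url):
    # treat any status below 400 as "working"; network errors count as not working
    try:
        return requests.head(url, timeout=5, allow_redirects=True).status_code < 400
    except requests.RequestException:
        return False

def add_base_if_missing(url):
    # if the URL lacks the base part, prepend it
    return url if url.startswith(BASE) else BASE + url.lstrip('/')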
My interpretation from your question is that you have a list of URLs, some of which have an absolute address like 'https://www.example.com/xyz' and some only have a relative reference like '/xyz' that belongs to the 'https://www.example.com' site.
If that is the case, you can use 'urljoin' to rationalise each of the URLs, for example:
>>> from urllib.parse import urljoin
>>> url = 'https://www.example.com/xyz'
>>> print(urljoin('https://www.example.com', url))
https://www.example.com/xyz
>>> url = '/xyz'
>>> print(urljoin('https://www.example.com', url))
https://www.example.com/xyz
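Applied to a whole list from the first block, that might look like this (the variable names and sample URLs are illustrative, not from the question):

from urllib.parse import urljoin

base = 'https://www.example.com'
raw_urls = ['https://www.example.com/xyz', '/abc', '/def']

# urljoin leaves absolute URLs untouched and resolves relative ones against the base
full_urls = [urljoin(base, u) for u in raw_urls]
print(full_urls)
# ['https://www.example.com/xyz', 'https://www.example.com/abc', 'https://www.example.com/def']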

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on a shell-script approach to get the pages with wget, which does work, but it is so un-Pythonic that I'm asking here what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split the pairs up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code, but I wanted to be clear about what was going on. I'm sure there is already a module out there that does this for you, too.
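There is in fact a standard-library helper for this: urllib.parse.parse_qsl turns a query string into key/value pairs, so the loop above collapses to:

from urllib.parse import parse_qsl

a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# parse_qsl returns a list of (key, value) tuples; dict() turns each into a payload
urldata = [dict(parse_qsl(param)) for param in a]
print(urldata)
# [{'PT001F01': '910', 'pf7331': '11'}, {'PT001F01': '910', 'pf7331': '12'}]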

Python: Running multiple http requests concurrently based on an initial request?

Currently I am trying to fetch from an API which has 2 endpoints:
GET /AllUsers
GET /user_detail/{id}
In order to get the details of all the users, I would have to call GET /AllUsers, and loop through the IDs to call the GET /user_detail/{id} endpoint 1 by 1. I wonder if it's possible to have multiple GET /user_detail/{id} calls running at the same time? Or perhaps there is a better approach?
This sounds like a great use case for grequests
import grequests
urls = [f'http://example.com/user_detail/{id}' for id in range(10)]
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
Edit: as an example of processing the responses to retrieve the JSON, you could:
data = []
for response in responses:
    data.append(response.json())
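To tie the two endpoints together, a sketch along these lines could work, assuming GET /AllUsers returns a JSON list of objects that each carry an id field (that response shape and the base URL are assumptions, not given in the question):

import grequests
import requests

BASE = 'http://example.com'  # hypothetical base URL

# initial request: fetch the list of users synchronously
users = requests.get(f'{BASE}/AllUsers').json()

# then fetch every user's detail concurrently
urls = [f'{BASE}/user_detail/{user["id"]}' for user in users]
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)

# requests that failed entirely come back as None
details = [r.json() for r in responses if r is not None]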

Does urllib.request.urlopen(url) use cache?

I have a long list of URLs that I need to check the response code of, where the links are repeated 2-3 times. I have written this script to check the response code of each URL.
connection = urllib.request.urlopen(url)
return connection.getcode()
The URLs come in XML in this format:
<entry key="something">url</entry>
<entry key="somethingelse">url</entry>
and I have to associate the response code with the key attribute, so I don't want to use a set.
Now I definitely don't want to make more than one request for the same URL, so I searched for whether urlopen uses a cache, but didn't find a conclusive answer. If it doesn't, what other technique can be used for this purpose?
You can store the URLs in a dictionary (urls = {}) as you make requests, and check whether you have already made a request to that URL later:
if url not in urls:
    urls[url] = urllib.request.urlopen(url).getcode()
return urls[url]
BTW, if you make requests to the same URLs repeatedly (multiple runs of the script) and need a persistent cache, I recommend using requests with requests-cache.
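A minimal sketch of that setup, assuming the usual requests-cache pattern of installing a global cache (the cache name is arbitrary):

import requests
import requests_cache

# creates a local SQLite-backed cache; repeated GETs for the same URL are served from it
requests_cache.install_cache('url_check_cache')

def get_code(url):
    return requests.get(url).status_code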
Why don't you create a python set() of the URLs? That way each url is included only once.
How are you associating the URL with the key? A dictionary?
You can use a dictionary to map the URL to its response and any other information you need to keep track of. If the URL is already in the dictionary, then you know the response. So you have one dictionary:
url_cache = {
    "url1": ("response", [key1, key2])
}
If you need to organize things differently it shouldn't be too hard with another dictionary.
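A sketch of that idea, assuming the (key, url) pairs have already been extracted from the XML (the sample entries below are made up):

import urllib.request

# hypothetical (key, url) pairs extracted from the XML
entries = [('something', 'http://example.com/a'),
           ('somethingelse', 'http://example.com/a'),
           ('other', 'http://example.com/b')]

url_cache = {}  # url -> (response code, [keys that use it])
for key, url in entries:
    if url in url_cache:
        # already requested this URL once; just record the extra key
        url_cache[url][1].append(key)
    else:
        code = urllib.request.urlopen(url).getcode()
        url_cache[url] = (code, [key])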
