I created a list of urls based on a pattern using string format.
Each url looks something like this:
https://www.myurl.com/somestr-0/#X
Where "X" goes from "A" to "Z" (code below).
Now I want to iterate through this list and get each url with requests, except that the "0" in each url should actually be any one- or two-digit number.
I used the re module to replace the "0" in my pattern, but I don't know how to use the output with requests.
import re
import string
alphabet = list(string.ascii_uppercase)
urls = [f'https://www.myurl.com/somestr-x/#{letter}' for letter in alphabet]
for url in urls:
    url = re.sub('x', r'\\d{1,2}', url)  # my attempt: this only inserts the literal text \d{1,2}
I want to be able to use every url with "any number" instead of the "0" without having to specify what number that would be exactly.
ETA : the "any number" can only be 1 or 2 digits and I want to avoid spamming the website with too many requests by "trying" every possible combination.
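You can use randrange from random. Note that the replacement passed to re.sub must be a string, so convert the number with str():
import random
import re
import requests
for url in urls:
    url = re.sub('x', str(random.randrange(1, 100)), url)  # random one- or two-digit number (1-99)
    response = requests.get(url)
    ...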
You could use requests. Supposing you only need a GET, you could fetch a URL with something like:
import requests
response = requests.get(url)
You only need to loop through all the urls you have and process the responses. More info at https://pypi.org/project/requests/
The line
url = re.sub('x',r'\\d{1,2}',url)
is problematic: you need to replace with an actual string, not a regular expression.
Try
import random
...the rest of your code
url = re.sub('x', str(random.randint(1, 99)), url)
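To tie the suggestions above together, here is a minimal sketch. It assumes that any single random one- or two-digit number is acceptable for each URL, so only one request per letter would be made. The helper name fill_placeholder is hypothetical, and the actual requests.get call is left as a comment so the snippet runs without network access.

```python
import random
import string


def fill_placeholder(template: str, low: int = 1, high: int = 99) -> str:
    """Replace the 'x' placeholder with one random integer in [low, high]."""
    number = random.randint(low, high)  # randint includes both endpoints
    # Replace the whole 'somestr-x' segment so no other 'x' can be touched.
    return template.replace("somestr-x", f"somestr-{number}")


templates = [
    f"https://www.myurl.com/somestr-x/#{letter}"
    for letter in string.ascii_uppercase
]
urls = [fill_placeholder(t) for t in templates]

for url in urls:
    print(url)
    # response = requests.get(url)  # one request per URL, after `import requests`
```

Because the number is picked once per URL, this makes 26 requests in total instead of trying every possible combination.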
I have very basic knowledge of python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, and each time I need to change 1 specific part of the url, then take the data and upload it to gsheets.
(The () signifies what part of the url I need to change)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using while and format {} to do it, but I was unsure how to change the string each time, bar writing out the names for variables by hand (defeating the whole purpose of this).
I already have a list of the symbols I need to use, but I don't know how to input them
Example of how I get 1 piece of data
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You might combine .format with a for loop. Consider the following simple example:
symbols = ["abc","xyz","123"]
for s in symbols:
    url = 'https://www.example.com?symbol={}'.format(s)
    print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. an f-string (requires Python 3.6 or newer), in which case the code would be
symbols = ["abc","xyz","123"]
for s in symbols:
    url = f'https://www.example.com?symbol={s}'
    print(url)
Alternatively, you might use the optional params argument of the requests.get function, as follows:
import requests
symbols = ["abc","xyz","123"]
for s in symbols:
    r = requests.get('https://www.example.com', params={'symbol':s})
    print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123
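Applied to the URL from the question, the same idea could look like the sketch below. It builds the URLs with urlencode from the standard library instead of sending requests, so it runs offline. The helper name build_url is hypothetical, the three symbols are the ones shown in the question, and "demo" stands in for a real API key.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"
symbols = ["MMM", "AOS", "ABT"]  # in practice, the full list of symbols


def build_url(symbol: str, api_key: str = "demo") -> str:
    """Build one query URL for a single symbol."""
    query = urlencode(
        {"function": "BALANCE_SHEET", "symbol": symbol, "apikey": api_key}
    )
    return f"{BASE_URL}?{query}"


urls = [build_url(s) for s in symbols]
for url in urls:
    print(url)
    # data = requests.get(url).json()  # after `import requests`
```

Keeping the symbol list in one place means that going from 3 to 500 symbols only changes the list, not the loop.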
I am trying to get a JSON response from the link passed to the urllib request, but it gives me an error saying that the URL can't contain control characters.
How can I solve the issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(start_url).read()
the error I get is :
URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
If the problem is with the whitespace, replace each space with %20:
url = url.replace(" ", "%20")
Spaces are not allowed in URL, I removed them and it seems to be working now:
import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()
Solr search strings can get pretty weird. It is better to use the quote function to percent-encode unsafe characters before making the request. Pass the URL delimiters in safe so that the scheme and the query separators stay intact. See the example below:
import urllib.request
from urllib.parse import quote
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(quote(start_url, safe=":/?&=")).read()
Better later than never...
You probably already found out by now but let's get it written here.
There can't be any space characters in the URL, and there are three: after bundle_fq, after sm_vid_Sectors_fq=, and after dm_field_deadlineTo_fq.
Remove those and you're good to go.
Like the error message says, there are some control characters in your url, which doesn't seem to be a valid one by the way.
You need to encode the control characters inside the URL. Especially spaces need to be encoded to %20.
Parsing the url first and then encoding the url elements would work.
import urllib.request
from urllib.parse import urlparse, quote
def make_safe_url(url: str) -> str:
    """
    Returns a parsed and quoted url
    """
    _url = urlparse(url)
    # Keep "=" and "&" unescaped so the query string keeps its key=value structure
    url = _url.scheme + "://" + _url.netloc + quote(_url.path) + "?" + quote(_url.query, safe="=&")
    return url
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()
The code returns the JSON-document despite the double forward-slash and the whitespace in the url.
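As an alternative sketch, assuming the stray spaces are accidental and should be dropped rather than encoded, the query string can be split into key/value pairs, stripped, and re-encoded with the standard library. The function name normalize_url is hypothetical, and the URL is a shortened version of the one in the question.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Strip stray whitespace from query keys/values and re-encode the URL."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters such as "sm_vid_Sectors_fq="
    pairs = [
        (key.strip(), value.strip())
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, quote(parts.path), urlencode(pairs), parts.fragment)
    )


raw_url = (
    "https://devbusiness.un.org/solr-sitesearch-output/10//0/"
    "ds_field_last_updated/desc?bundle_fq =procurement_notice"
    "&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English"
)
clean_url = normalize_url(raw_url)
print(clean_url)
# source = urllib.request.urlopen(clean_url).read()  # after `import urllib.request`
```

If the spaces are meaningful for a given endpoint, drop the strip() calls; urlencode will then encode each space as "+".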
I am trying to find a partial match for a string in Python with the regex function re.search(), without luck.
This is my code:
import re
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
for url in urls:
    c_url = re.compile(url)
    result = re.search(c_url, request_path)
    if isinstance(result, re.Match):
        allowed_url = url
        break
print(allowed_url)  # must be /colpos/papanicolaou2
What do I want to happen? If url is contained in request_path (in this case partially), I expect result to be a re.Match instance rather than None.
How can I achieve this? Is there a better way to know whether my request_path matches one of the urls?
The code above only works if url and request_path are exactly the same, which is not what I want. How should I use re.search() in Python to achieve this?
Thank you
I tried checking it with the "in" keyword instead of using re module. I think it is simpler and more readable.
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
allowed_urls = []
for url in urls:
    if url in request_path:
        allowed_urls.append(url)
print(allowed_urls) # this contains '/colpos/papanicolaou2' like you wanted
In case you just have 2 fixed (real) parts in your request_path, you could do the following (no loops, no regex, just Python):
/colpos/papanicolaou2/124579/1254
/part_1/part_2 /param1/param2/...
Code:
urls = ['colpos/prescription', 'colpos/transfer', 'colpos/papanicolaou2', 'colpos/biopsia']
request_path = "/colpos/papanicolaou2/124579/1254"
p1, p2, params = request_path[1:].split('/', 2)
if '/'.join([p1, p2]).lower() not in urls:
    # raise Error(404)
    print("url not found")
Note: You would need to make it more stable for production usage :)
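A slightly stricter variant of the checks above is sketched here. It assumes that an allowed URL should only match at the start of the request path and only on a whole path segment, which neither the "in" test nor a bare re.search guarantees. The function name match_allowed is hypothetical.

```python
from typing import Iterable, Optional

ALLOWED_PREFIXES = [
    "/colpos/prescription",
    "/colpos/transfer",
    "/colpos/papanicolaou2",
    "/colpos/biopsia",
]


def match_allowed(request_path: str, prefixes: Iterable[str]) -> Optional[str]:
    """Return the first prefix that request_path starts with, or None."""
    for prefix in prefixes:
        # Require an exact match or a following "/" so that
        # "/colpos/transferred" does not match "/colpos/transfer".
        if request_path == prefix or request_path.startswith(prefix + "/"):
            return prefix
    return None


allowed_url = match_allowed("/colpos/papanicolaou2/124579/1254", ALLOWED_PREFIXES)
print(allowed_url)
```

If regular expressions are still preferred, passing each URL through re.escape() before compiling avoids surprises from characters such as "." or "?".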
I have some urls
http://go.mail.ru/search?fr=vbm9&fr2=query&q=%D0%BF%D1%80%D0%BE%D0%B3%D1%83%D0%BB%D0%BA%D0%B0+%D0%B0%D0%BA%D1%82%D0%B5%D1%80%D1%8B&us=10&usln=1
https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi&aqs=chrome..69i57j69i61&sourceid=chrome&es_sm=0&ie=UTF-8
https://yandex.ru/search/?text=%D0%BE%D1%82%D0%BA%D1%83%D0%B4%D0%B0%20%D0%B2%D0%B5%D0%B7%D1%83%D1%82%20%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D1%83%20%D0%B2%20%D1%81%D0%B5%D0%BA%D0%BE%D0%BD%D0%B4%20%D1%85%D0%B5%D0%BD%D0%B4&clid=2073067
When I open these URLs in a browser, I see that they are searches for:
прогулка актеры
NaoOmiKi
откуда везут одежду в секонд хенд
I want to write code to get these values. I tried
get = urlparse(url)
print urllib.unquote(get[4])
But it doesn't work correctly for all URLs. What should I use?
urlparse parses a URL into 6 components: scheme, netloc, path, params, query, fragment. You correctly use index 4 to get the query.
The query, however, is a &-separated string of key=value pairs with the values URL-encoded. You try to unquote the entire string, while you are only interested in the value of the text or q key.
You can use urlparse.parse_qs (urllib.parse.parse_qs in Python 3) to parse the query string and look for the q or text key in the returned dict.
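A minimal Python 3 sketch of that approach follows. It assumes the search term is always stored under either q or text, as in the three URLs from the question; the function name extract_search_term is hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

SEARCH_KEYS = ("q", "text")  # assumption: the only keys that hold the search term


def extract_search_term(url: str) -> Optional[str]:
    """Return the decoded search term of a search-engine URL, or None."""
    # parse_qs decodes percent-escapes and "+" and maps each key to a list of values
    query = parse_qs(urlparse(url).query)
    for key in SEARCH_KEYS:
        if key in query:
            return query[key][0]
    return None


urls = [
    "http://go.mail.ru/search?fr=vbm9&fr2=query&q=%D0%BF%D1%80%D0%BE%D0%B3%D1%83%D0%BB%D0%BA%D0%B0+%D0%B0%D0%BA%D1%82%D0%B5%D1%80%D1%8B&us=10&usln=1",
    "https://www.google.ru/search?q=NaoOmiKi&oq=NaoOmiKi&aqs=chrome..69i57j69i61&sourceid=chrome&es_sm=0&ie=UTF-8",
    "https://yandex.ru/search/?text=%D0%BE%D1%82%D0%BA%D1%83%D0%B4%D0%B0%20%D0%B2%D0%B5%D0%B7%D1%83%D1%82%20%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D1%83%20%D0%B2%20%D1%81%D0%B5%D0%BA%D0%BE%D0%BD%D0%B4%20%D1%85%D0%B5%D0%BD%D0%B4&clid=2073067",
]
terms = [extract_search_term(u) for u in urls]
for term in terms:
    print(term)
```

Other search engines may use different parameter names, in which case SEARCH_KEYS would need to be extended.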
I have this line in my Python script:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-GB']")
but sometimes the last two letters (the country code) of the storeURL-GB key change, so I am trying to use something like this, but it doesn't work:
url = tree.find("//video/products/product/read_only_info/read_only_value[@key='storeURL-\.*']")
Any suggestions, please?
You should probably try .xpath() and starts-with():
urls = tree.xpath("//video/products/product/read_only_info/read_only_value[starts-with(@key, 'storeURL-')]")
if urls:
    url = urls[0]
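If lxml is not available, the same prefix filter can be approximated with the standard library's xml.etree.ElementTree, whose limited XPath support has no starts-with(), so the check is done in Python instead. This is a swapped-in technique, not the .xpath() call above, and the sample XML below is a made-up document that only mirrors the element names from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element and attribute names come from the question.
SAMPLE_XML = """
<video>
  <products>
    <product>
      <read_only_info>
        <read_only_value key="other">ignored</read_only_value>
        <read_only_value key="storeURL-US">https://example.com/us</read_only_value>
      </read_only_info>
    </product>
  </products>
</video>
"""

root = ET.fromstring(SAMPLE_XML)

# iter() walks every descendant with the given tag, wherever it is nested.
matches = [
    element
    for element in root.iter("read_only_value")
    if element.get("key", "").startswith("storeURL-")
]
url = matches[0].text if matches else None
print(url)
```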