I have a URL of the form:
http://www.foo.com/bar?arg1=x&arg2=y
If I do:
request.url
I get:
http://www.foo.com/bar?arg1=x&arg2=y
Is it possible to get just http://www.foo.com/bar?
Looks like request.urlparts.path might be a way to do it.
Full documentation here.
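If this is Bottle's request object (which request.urlparts suggests), a minimal sketch for rebuilding the URL without the query string could look like this; the route and return value are just illustrative:

from bottle import request, route

@route('/bar')
def bar():
    # request.urlparts is a SplitResult: (scheme, netloc, path, query, fragment)
    parts = request.urlparts
    return '{0}://{1}{2}'.format(parts.scheme, parts.netloc, parts.path)
    # e.g. 'http://www.foo.com/bar' for a request to /bar?arg1=x&arg2=y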
Edit:
There is a way to do this via the requests library:
r.json()['headers']['Host']
I personally find the split function better.
You can use the split function with ? as the delimiter to do this.
url = request.url.split("?")[0]
I'm not sure if this is the most effective/correct method though.
If you just want to remove the parameters to get the base URL, do
url = url.split('?', 1)[0]
This will split the URL at the '?' and give you the base URL. Or even:
url = url[:url.find('?')]
(though note that find returns -1 when there is no '?', which would drop the last character, so the split form is safer).
You can also use urlparse; this is explained in the Python docs at: https://docs.python.org/2/library/urlparse.html
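For example, a minimal sketch using urlparse (urllib.parse in Python 3) to rebuild the URL without its query string:

from urlparse import urlsplit, urlunsplit  # Python 3: from urllib.parse import urlsplit, urlunsplit

url = 'http://www.foo.com/bar?arg1=x&arg2=y'
parts = urlsplit(url)
base = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
print(base)  # http://www.foo.com/bar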
I have 2 function blocks in my scraper:
1. Parse
2. Parse_info
In the 1st block, I get the list of URLs.
Some of the URLs work (they already have the 'https://www.example.com/' part), while the rest do not (they are missing the 'https://www.example.com/' part).
So before passing a URL to the 2nd block, i.e. parse_info, I want to validate it, and if it is not working I want to edit it and add the required 'https://www.example.com/' part.
You could leverage the requests module and check the status code of the website.
Similarly, if you're just trying to validate whether the URL contains a specific portion, i.e. 'https://www.example.com/', you can perform a regex check.
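For example, a rough sketch of both checks; example.com is the placeholder domain from the question, and ensure_absolute/is_working are made-up helper names:

import re
import requests

def ensure_absolute(url, base='https://www.example.com/'):
    # Prepend the site prefix if the URL does not already start with it.
    if not re.match(r'https://www\.example\.com/', url):
        url = base + url.lstrip('/')
    return url

def is_working(url):
    # Optionally confirm the URL responds with 200 before handing it to parse_info.
    return requests.get(url).status_code == 200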
My interpretation of your question is that you have a list of URLs, some of which have an absolute address like 'https://www.example.com/xyz' and some only have a relative reference like '/xyz' that belongs to the 'https://www.example.com' site.
If that is the case, you can use 'urljoin' to rationalise each of the URLs, for example:
>>> from urllib.parse import urljoin
>>> url = 'https://www.example.com/xyz'
>>> print(urljoin('https://www.example.com', url))
https://www.example.com/xyz
>>> url = '/xyz'
>>> print(urljoin('https://www.example.com', url))
https://www.example.com/xyz
I've been reading and displaying data from the Facebook Graph API like this:
facebook_info = urllib2.urlopen("https://graph.facebook.com/%s/me?fields=first_name,last_name,email&access_token=" % settings.FACEBOOK_API_VERSION + access_token)
facebook_info = facebook_info.read()
return facebook_info
I was wondering if there is a better way to do this in Python. I was thinking of something like requests.get(...), where I don't use urllib2.urlopen and the '+' sign to concatenate.
Although the requests library is excellent and recommended for its ease of use, I don't think using requests.get will be inherently better in this case. Your code seems fine and will work for what it does. Why do you want to change it? Style?
Or perhaps you want to build the URL in a clearer way?
url_template = "https://graph.facebook.com/{api_version}/me?fields=first_name,last_name,email&access_token={token}"
url = url_template.format(
    api_version=settings.FACEBOOK_API_VERSION,
    token=access_token,
)
facebook_info = requests.get(url).json()
return facebook_info
I need to make a request with a query string, like this: ?where[id]=CXP, but when I define it in the requests params (params = {'where[id]': 'CXP'}) the request returns an internal server error (500).
r = requests.get('http://myurl', params=params)
What is the correct way to make this request?
Thanks.
Friends, I think I solved it.
params accepts a string in the request.
So I think it's not the best way, but I'm passing it as:
r = requests.get('http://myurl', params='where[service]=CXP')
And it works fine!
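For what it's worth, the difference appears to be in how requests encodes the two forms. A small sketch (using the placeholder URL from the question, and preparing the request without sending it) shows the resulting URLs:

import requests

# With a dict, requests percent-encodes the brackets:
req = requests.Request('GET', 'http://myurl', params={'where[service]': 'CXP'}).prepare()
print(req.url)  # http://myurl/?where%5Bservice%5D=CXP

# With a raw string, the brackets are passed through literally:
req = requests.Request('GET', 'http://myurl', params='where[service]=CXP').prepare()
print(req.url)  # http://myurl/?where[service]=CXP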
It is very much possible that the server requires JSON formatting. Even though a 500 doesn't look like that, it is still worth a try.
From requests version 2.4.2 onwards, you can simply use json as a parameter. It will take care of the encoding for you. All you need to do is:
r = requests.get('http://myurl', json=params)
If you are (for whatever reason) using an older version of requests, you can try the following (json.dumps plus an explicit Content-Type header is essentially what the json parameter does for you):
import simplejson as json
r = requests.get('http://myurl', data=json.dumps(params), headers={'Content-Type': 'application/json'})
I always recommend simplejson over json package as it is updated more frequently.
EDIT: If you want to prevent urlencoding, your only solution might be to pass a string.
r = requests.get('http://myurl', params='where[id]=CXP')
I can get the filler variable from the URL below just fine.
url(r'^production/(?P<filler>\w{7})/$', views.Filler.as_view()),
In my view I can retrieve the filler as expected. However, if I try to use a URL pattern like the one below:
url(r'^production/(?P<filler>)\w{7}/(?P<day>).*/$', views.CasesByDay.as_view()),
Both variables (filler, day) are blank.
You need to include the entire parameter in parentheses. It looks like you did that in your first example but not in the second.
Try:
url(r'^production/(?P<filler>\w{7})/(?P<day>.*)/$', views.CasesByDay.as_view()),
See the official documentation for more information and examples: URL dispatcher
I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then keeping only the chars before that &; I'm not quite sure how to do this though.
Cheers
Use the urlparse module. Check this function:
import urlparse
def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function, with & as the delimiter. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing a URL is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and only discover one day in error logs.
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
id = dict(i.split('=') for i in parts[1].split('&'))['CONTENT_ITEM_ID']
new_url = parts[0] + '?CONTENT_ITEM_ID=' + id
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
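For instance, older parse_qsl implementations split on both characters (a small sketch; recent Python 3 releases split only on '&' unless you pass separator=';'):

from urlparse import parse_qsl  # Python 3: from urllib.parse import parse_qsl

print(parse_qsl('CONTENT_ITEM_ID=1234;foo=bar'))
# [('CONTENT_ITEM_ID', '1234'), ('foo', 'bar')] on implementations that accept ';'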
Besides urlparse there is also furl, which has, IMHO, a better API.
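A rough sketch of furl usage from memory (f.args and f.url are taken from furl's README-style examples; please verify against the furl documentation):

from furl import furl

f = furl('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&foo=bar')
# Drop every query parameter except CONTENT_ITEM_ID.
for key in list(f.args):
    if key != 'CONTENT_ITEM_ID':
        del f.args[key]
print(f.url)  # expected: http://www.domainname.com/page?CONTENT_ITEM_ID=1234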