I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of a URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then slicing off everything from that & onward, but I'm not quite sure how to do this.
Cheers
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
>>> process_url(url)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus of being easier to adapt if you decide you also want to keep more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
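On Python 3 the module moved to urllib.parse; here is a minimal sketch of the same function under that assumption:
import urllib.parse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    # keep only the query items whose names we were asked to keep
    parsed = urllib.parse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urllib.parse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])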
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function, with & as a parameter. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
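One caveat: str.find returns -1 when there is no & in the URL, and url[:-1] would then silently drop the last character. A split-based sketch avoids that edge case:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234"  # no extra parameters this time
url = url.split("&", 1)[0]  # leaves the URL unchanged when there is no '&'
print url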
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules. E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and discover one day in error logs.
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)  # non-greedy match: everything up to the first '&'
if m:  # re.search returns None when the URL has no '&'
    print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# skip bare flags like 'param2' that have no '=', which would otherwise break dict()
params = dict(i.split('=') for i in parts[1].split('&') if '=' in i)
id = params['CONTENT_ITEM_ID']
new_url = parts[0] + '?CONTENT_ITEM_ID=' + id
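The standard library can also do this parsing for you; a sketch using parse_qs (it lives in urlparse on Python 2 and urllib.parse on Python 3):
import urlparse

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# parse_qs maps each parameter name to a list of values; bare flags like 'param2' are dropped
params = urlparse.parse_qs(parts[1])
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID'][0]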
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
Besides urlparse there is also furl, which IMHO has a better API.
Related
I have a list of URLs:
print (url_list)
[https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/,
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71]
How do I filter this list so that only the URLs from v.redd.it are provided in the output and all other URLs are excluded?
This is what I meant (possible output):
Your required URLs:
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71
Excluded URLs:
https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/
I am new to regex and Python, so please help me with this. Thanks in advance.
There is no need for a regex to do this. Instead, you can filter your list based on whether the host of the URL is v.redd.it (and get proper URL parsing instead of implementing a regex to attempt the same):
from urllib.parse import urlparse
urls = ['https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/', 'https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
v_redd_it = filter(lambda x: urlparse(x).netloc == 'v.redd.it', urls)
# you can then either iterate over the filter directly, or convert it to a list
video_links = list(v_redd_it)
# ['https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
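Equivalently, a list comprehension avoids the lambda; a small sketch using the same urls list:
from urllib.parse import urlparse

video_links = [u for u in urls if urlparse(u).netloc == 'v.redd.it']
# ['https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']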
I want to print only the "words" that start with "/watch" from the string, and then add all the '/watch...' links to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace; I don't know what to do, so I appreciate all help.
Yeah, re is definitely overkill here. This is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy iterator, so wrap it in list() if you want an actual list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
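For instance, here is a minimal pandas sketch of the same filtering, assuming pandas is installed and reusing the a_list example from above:
import pandas as pd

s = pd.Series(["/watch/blah", "not/watch"])
print(s[s.str.startswith("/watch")].tolist())  # ['/watch/blah']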
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re
pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}") -- this pattern also matches the full links
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # prints a match object for each occurrence of /watch
You can collect all the URLs in a list and proceed. Good luck!!
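For example, a sketch of collecting the matches into a list (the sample text below is made up from hrefs in the question's output):
import re

pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}")
sample_text = "/watch?v=PbVt_O1kFpA /user/thedjdoge /watch?v=1lcksCjvuSs"
watch_links = [m.group(0) for m in pattern.finditer(sample_text)]
# ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']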
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href']  # I think 'href' will be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
I have a problem. I can't figure out a way to do this so I'm asking for someone to help.
I have URL like below:
https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4
And I want to remove the last part of the URL and change /f/ to /d/, so that I get the URL below:
https://abc.xyz/d/b
Keep in mind the /b segment; it changes regularly.
I have tried the line below, based on another answer, but it didn't work for me:
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0])
A more efficient way of splitting is to use rsplit and set maxsplit to 1. This saves you the time and hassle of rejoining the string.
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
url = old_url.rsplit("/", 1)[0]
# print(url) => https://abc.xyz/f/b
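To also get the /f/ to /d/ swap the question asks for, the replace can be chained on; a small sketch:
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = old_url.rsplit("/", 1)[0].replace("/f/", "/d/")
# print(new_url) => https://abc.xyz/d/b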
Is this what you want?
'/'.join(url.split('/')[:-1])
You don't really need regex for this.
The following will work for a link:
k = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
print("/".join(k.split("/")[:-1]))
After split() or rsplit(), you sometimes need to take a specific number of elements from the front of the list, then join() them and replace() /f/ with /d/, as shown below:
url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = "/".join(url.split("/")[:5]).replace("f", "d")
print(new_url)
url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = "/".join(url.rsplit("/")[:5]).replace("f", "d")
print(new_url)
Output:
https://abc.xyz/f/b
For example if I have https://stackoverflow.com/questions/ask I'd like to cut it to stackoverflow.com/questions/ask or if I have http://www.samsung.com/au/ I'd like to cut it to samsung.com/au/.
I want to make a template tag for this but not sure what to return:
def clean_url(url):
    return ?
template
{{ url|clean_url }}
Any idea?
Here is a quick and dirty way to isolate the domain provided it starts with something//
def clean(url):
    return url.partition('//')[2].partition('/')[0]
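For the first URL from the question this gives:
>>> clean('https://stackoverflow.com/questions/ask')
'stackoverflow.com'
Note that this keeps only the domain and drops the path.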
urllib.parse will do most of this for you:
import urllib.parse

def clean_url(url):
    parts = list(urllib.parse.urlsplit(url))
    parts[0] = ""  # blank out the scheme
    cleaned = urllib.parse.urlunsplit(parts)[2:]  # strip the leading '//'
    return cleaned
Note this does not cut off the "www.", but you shouldn't do that; that can be a critical part of the domain name. If you really want that, add:
if cleaned.startswith("www."):
    cleaned = cleaned[4:]
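With the first URL from the question this gives:
>>> clean_url('https://stackoverflow.com/questions/ask')
'stackoverflow.com/questions/ask'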
For the use cases you described, you can just split on the double slash and go with that or work from there.
def clean_url(url):
    clean = url.split('//')[1]
    if clean[0:4] == 'www.':
        return clean[4:]
    return clean
However, because the subdomain (such as 'www') can be used as a significant part of the url, you may want to keep that in. For example, www.pizza.com and pizza.com could be links to different pages.
Other things to consider are the urlparse library or regex but they may be overkill for this.
I have a URL of the form:
http://www.foo.com/bar?arg1=x&arg2=y
If I do:
request.url
I get:
http://www.foo.com/bar?arg1=x&arg2=y
Is it possible to get just http://www.foo.com/bar?
Looks like request.urlparts.path might be a way to do it.
Edit:
If you are using the requests library against an endpoint that echoes the request back (httpbin-style), you can also read the host out of the response:
r.json()['headers']['Host']
I personally find the split function better. You can use split with ? as the delimiter to do this:
url = request.url.split("?")[0]
I'm not sure if this is the most effective/correct method though.
If you just want to remove the parameters to get the base URL, do
url = url.split('?',1)[0]
This will split the URL at the '?' and give you the base URL.
or even
url = url[:url.find('?')]
(though note that find returns -1 when there is no '?', which would silently drop the last character, so the split version is safer)
You can also use urlparse; this is explained in the Python docs at https://docs.python.org/2/library/urlparse.html
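For instance, a sketch with urlparse that rebuilds the URL with an empty query string (on Python 3 the module is urllib.parse):
import urlparse

url = 'http://www.foo.com/bar?arg1=x&arg2=y'
parts = urlparse.urlsplit(url)
base = urlparse.urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
print base  # http://www.foo.com/bar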