How to remove part of a URL using Python?

I have a problem that I can't figure out, so I'm asking for help.
I have URL like below:
https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4
I want to remove the last part of the URL and replace /f/ with /d/, so that I get the URL below:
https://abc.xyz/d/b
Keep in mind the /b segment: it changes regularly, so the solution can't hardcode it.
I have tried the line below, based on another answer, but it didn't work for me:
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0])

A more efficient way of splitting is to use rsplit and set maxsplit to 1. This saves you the time and hassle of rejoining the string.
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
url = old_url.rsplit("/", 1)[0]
# print(url) => https://abc.xyz/f/b
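The question also asks for /f/ to become /d/; chaining replace() onto the rsplit() result covers both steps. A minimal sketch:

```python
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"

# Drop the last path segment, then swap the /f/ segment for /d/
new_url = old_url.rsplit("/", 1)[0].replace("/f/", "/d/")
print(new_url)  # https://abc.xyz/d/b
```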

Is this what you want?
'/'.join(url.split('/')[:-1])

You don't really need regex for this.
The following will work for a link:
k = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
print("/".join(k.split("/")[:-1]))

After split() or rsplit(), you sometimes want to keep only the leading elements of the list, join() them back together, and then replace() the /f/ segment with /d/, as shown below:
url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = "/".join(url.split("/")[:5]).replace("/f/", "/d/")
print(new_url)
url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = "/".join(url.rsplit("/")[:5]).replace("/f/", "/d/")
print(new_url)
Output:
https://abc.xyz/d/b

Related

Which Regular Expression should be used here?

I have a list of urls :
print (url_list)
[https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/,
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71]
How can I filter this list so that only the URLs on v.redd.it/ appear in the output and all other URLs are excluded?
This is what I meant (possible output):
Your required urls :
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71
Excluded Urls :
https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/
I am new to regex and Python, so please help me with this. Thanks in advance.
There is no need for a regex here. Instead, you can filter your list based on whether the host of the URL is v.redd.it (and get proper URL parsing instead of reimplementing it with a regex):
from urllib.parse import urlparse
urls = ['https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/', 'https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
v_redd_it = filter(lambda x: urlparse(x).netloc == 'v.redd.it', urls)
# you can then either iterate over the filter directly, or convert it to a list
video_links = list(v_redd_it)
# ['https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
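Since the question's desired output also shows the excluded URLs, the same netloc check can split the list into two; a sketch:

```python
from urllib.parse import urlparse

urls = [
    'https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/',
    'https://v.redd.it/5tmm5wx5lgz71',
    'https://v.redd.it/n35a5fo8cez7',
]

# Partition on the URL's host
required = [u for u in urls if urlparse(u).netloc == 'v.redd.it']
excluded = [u for u in urls if urlparse(u).netloc != 'v.redd.it']
print(required)  # the two v.redd.it links
print(excluded)  # the www.reddit.com link
```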

Line 7, IndexError: list index out of range

Please help me fix this; below is the code I've already tried.
I really appreciate your help.
import urllib.request
import re
search_keyword="ill%20wiat"
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"watch?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])
First of all, check the page you are trying to parse. You wrote:
r"watch?v=(\S{11})"
Just remember that the ? character here is parsed as a regex operator, not the literal string you want, so you need to write it as:
/watch[?]v=(\S{11})
so that your regex is parsed properly.
Second: it is good practice to print your list to see what you get, and to iterate over it with a for loop instead of accessing index [0] directly.
In your case you get this error simply because your list of IDs is empty.
The following code works for me:
import urllib.request
import re

search_keyword = "ill%20wiat"
url = "https://www.youtube.com/results?search_query=" + search_keyword
with urllib.request.urlopen(url) as response:
    video_ids = re.findall(r"/watch[?]v=(\S{11})", response.read().decode())
for video in video_ids:
    print("https://www.youtube.com/watch?v=" + video)
P.S. Don't wrap your code in try/except just to swallow errors like this.
If urllib won't give you the data, use requests:
import requests
html = requests.get('https://www.youtube.com/results?search_query=' + search_keyword)
text = html.text
text holds all the HTML data, so run your search against text.
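To see concretely why the unescaped ? fails, here is a self-contained sketch running both patterns against a small made-up HTML snippet (the hrefs are hypothetical; real YouTube markup differs):

```python
import re

# Hypothetical snippet standing in for the downloaded page
html = '<a href="/watch?v=dQw4w9WgXcQ">a</a> <a href="/watch?v=1lcksCjvuSs">b</a>'

# Unescaped: '?' makes the preceding 'h' optional, so the pattern looks for
# "watc(h)v=", which never occurs in the page.
print(re.findall(r"watch?v=(\S{11})", html))    # []

# Escaped with a character class: '[?]' matches a literal question mark.
print(re.findall(r"watch[?]v=(\S{11})", html))  # ['dQw4w9WgXcQ', '1lcksCjvuSs']
```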

How to print only specific words of a string

I want to only print all the "words" that start with "/watch" from the string, and then add all the '/watch...' to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace, don't know what to do so I appreciate all help.
Yeah, re is definitely overkill here. This is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy iterator, so wrap it in list() if you want a list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re
pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}") -- this pattern captures the full links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # prints a match object for each occurrence of /watch
You can collect all the URLs in a list and proceed. Good Luck!!
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href']  # 'href' should be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
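Since the goal is to add all the '/watch...' hrefs to a list, the startswith() test folds naturally into a list comprehension; a sketch using a few hrefs from the output above:

```python
hrefs = ['/', '/feed/trending', '/watch?v=PbVt_O1kFpA', '/user/thedjdoge', '/watch?v=1lcksCjvuSs']

# Keep only the hrefs that begin with /watch
watch_links = [h for h in hrefs if h.startswith('/watch')]
print(watch_links)  # ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']
```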

Bottle - Is it possible to retrieve URL without parameters?

I have an URL of the form:
http://www.foo.com/bar?arg1=x&arg2=y
If I do:
request.url
I get:
http://www.foo.com/bar?arg1=x&arg2=y
Is it possible to get just http://www.foo.com/bar?
Looks like request.urlparts.path might be a way to do it.
Full documentation here.
Edit:
There is also a way to do this via the requests library, e.g.:
r.json()['headers']['Host']
(this only works when the response body is JSON that echoes the request headers back, as httpbin does).
I personally find the split function better.
You can use split with '?' as the delimiter:
url = request.url.split("?")[0]
I'm not sure if this is the most effective/correct method though.
If you just want to remove the parameters to get the base URL, do:
url = url.split('?', 1)[0]
This splits the URL at the '?' and gives you the base URL.
Or even:
url = url[:url.find('?')]
(careful: if there is no '?', find() returns -1 and this slices off the last character).
You can also use urlparse; this is explained in the Python docs at https://docs.python.org/2/library/urlparse.html
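For completeness, in Python 3 the urlparse functions live in urllib.parse; a sketch that rebuilds the URL with the query and fragment blanked out:

```python
from urllib.parse import urlsplit, urlunsplit

url = 'http://www.foo.com/bar?arg1=x&arg2=y'
parts = urlsplit(url)

# Keep scheme, host and path; drop query and fragment
base = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
print(base)  # http://www.foo.com/bar
```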

Slicing URL with Python

I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then keeping the chars before that &; not quite sure how to do this, though.
Cheers
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url(a)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function with & as a parameter. That way, you'd extract the base URL (with the first parameter attached) and the remaining parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and will discover one day in error logs.
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
id = dict(i.split('=') for i in parts[1].split('&'))['CONTENT_ITEM_ID']
new_url = parts[0] + '?CONTENT_ITEM_ID=' + id
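In Python 3 the same dictionary trick can lean on urllib.parse rather than hand-splitting; a sketch that keeps only CONTENT_ITEM_ID (note that parse_qs silently drops bare parameters like param2 that have no '='):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = urlsplit(url)
params = parse_qs(parts.query)  # {'CONTENT_ITEM_ID': ['1234']}

# Re-encode only the parameter we want to keep
query = urlencode({'CONTENT_ITEM_ID': params['CONTENT_ITEM_ID'][0]})
new_url = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))
print(new_url)  # http://www.domainname.com/page?CONTENT_ITEM_ID=1234
```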
An ancient question, but still: I'd like to remark that query string parameters can also be separated by ';', not only '&'.
Besides urlparse there is also furl, which IMHO has a better API.
