Which Regular Expression should be used here? - python

I have a list of URLs:
print (url_list)
[https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/,
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71]
How can I filter this list so that only the URLs containing v.redd.it/ are kept in the output and all other URLs are excluded?
This is what I mean (desired output):
Your required urls :
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71
Excluded Urls :
https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/
I am new to regex and python, so please help me with this. Thanks in advance

There is no need for a regex here; instead, you can filter your list based on whether the host of the URL is v.redd.it (and get proper URL parsing instead of writing a regex that attempts to do the same):
from urllib.parse import urlparse
urls = ['https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/', 'https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
v_redd_it = filter(lambda x: urlparse(x).netloc == 'v.redd.it', urls)
# you can then either iterate over the filter directly, or convert it to a list
video_links = list(v_redd_it)
# ['https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
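If you also want to report the excluded URLs, as in the output the question sketches, the same host check can partition the list in one pass. A minimal sketch (the variable names are my own):
from urllib.parse import urlparse

urls = ['https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/',
        'https://v.redd.it/5tmm5wx5lgz71',
        'https://v.redd.it/n35a5fo8cez7']
required, excluded = [], []
for url in urls:
    # route each URL by its host instead of matching it with a regex
    (required if urlparse(url).netloc == 'v.redd.it' else excluded).append(url)
print('Your required urls :', required)
print('Excluded Urls :', excluded)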

Related

How to print only specific words of a string

I want to print only the "words" that start with "/watch" from the string, and then add all of the '/watch...' links to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace; I don't know what to do, so I appreciate all help.
Yeah, re is definitely overkill here. This is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy iterator in Python 3, so wrap it in list() if you want the list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re
pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}")  # this pattern finds the full links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # prints a match object showing where each /watch occurs
You can collect all the URLs in a list and proceed. Good Luck!!
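For instance, here is a hedged sketch of collecting the matches into a list with findall; the 11-character ID class is an assumption about YouTube video IDs:
import re

text = "/watch?v=PbVt_O1kFpA /user/thedjdoge /watch?v=1lcksCjvuSs"
# assumption: video IDs are 11 characters drawn from letters, digits, '_' and '-'
watch_links = re.findall(r"/watch\?v=[\w-]{11}", text)
print(watch_links)  # ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']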
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href']  # I think 'href' will be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
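Since the question also asks to add the '/watch...' entries to a list, the same check works in a list comprehension; a minimal sketch over a hand-made list of hrefs:
hrefs = ['/feed/trending', '/watch?v=PbVt_O1kFpA', '/user/thedjdoge', '/watch?v=1lcksCjvuSs']
watch_list = [href for href in hrefs if href.startswith('/watch')]
print(watch_list)  # ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']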

Regex to parse out a part of URL using python

I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find the formats such as .jpg, .gif, .png, .ico, .aspx, .html, and .jpeg, and parse backwards from each one until a "/" is found. I also want to check for multiple occurrences throughout the string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing an individual command for each of the formats, is there a way to do everything with a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.
This builds a list of (name, extension) pairs:
import re
results = []
for link in data:
    # note: search only finds the first occurrence per line; use findall for all of them
    matches = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    if matches:  # guard against lines with no match
        results.append((matches.group(1), matches.group(2)))
This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z])\.'  # escape the dot so it matches a literal '.'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
There is the assumption that the urls all have the general form you provided.
You might try this:
data['parse'] = re.findall(r'[^/]+\.[a-z]+(?=\s|$)', data['url'])
That will pick out all of the file names with their extensions (the lookahead accepts a filename followed by whitespace or at the end of the line; requiring a literal trailing space would miss the last filename on a line). If you want to remove the extensions, the code above returns a list, which you can then process with a list comprehension and re.sub like so:
[re.sub(r'\.[a-z]+$', '', exp) for exp in data['parse']]
Use the .join function to create a string, as demonstrated in Totem's answer.
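Putting it together, a single pattern can cover all of the listed extensions and multiple occurrences per line, and ' '.join produces the desired rows; a sketch assuming the data is a list of strings like the sample:
import re

lines = [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
]
# one alternation for every listed extension; [^/.]+ reaches back to the last '/'
pattern = re.compile(r'/([^/.]+)\.(?:jpe?g|gif|png|ico|aspx|html)\b')
parsed = [' '.join(pattern.findall(line)) for line in lines]
print(parsed)  # ['a-b-c-d', 'e-f-g-h a-a-a-a']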

Filter a list of strings based on several possible endings

I don't know any Python, but I need to customize a script a little bit.
There are strings parsed in the script and put into a list (I guess).
Then these strings are filtered based on whether they start with "http". What I want to add is a filter based on their file extension as well: all links ending in html or xml shall be filtered.
This is the code that filters all hyperlinks:
links = filter (lambda x:x.startswith("http://") , links)
I don't know the proper syntax for an OR operator in something like .endswith(".html") OR .endswith(".xml").
I know this would filter all links ending in .html, but I also need the .xml links:
links = filter (lambda x:x.startswith("http://") , links)
links = filter (lambda x:x.endswith(".html") , links)
If you're on Python 2.5 or later, you can pass a tuple of suffixes to endswith. Thanks to hcwhsa for pointing that out:
links = filter(lambda x:x.endswith((".html", ".xml")), links)
If you're using an earlier version, you can use the or operator:
links = filter(lambda x:x.endswith(".html") or x.endswith(".xml"), links)
Though you will want to lowercase x if you're not sure it's already lowercased.
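For example, a quick sketch of the lowercasing (the example data is my own):
links = ['http://a.example/INDEX.HTML', 'http://b.example/feed.xml']
links = list(filter(lambda x: x.lower().endswith(('.html', '.xml')), links))
print(links)  # both kept, despite the uppercase suffix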
I would probably do this with a generator expression rather than filter, and certainly without successive calls to filter:
links = [link for link in links if link.startswith('http://') and link.endswith(('.html', '.xml'))]
I think the best way is to check this with a regex:
>>> import re
>>> c = r"^http://.+\.(html|xml)"
>>> re.match(c, 'hello')
>>> re.match(c, 'http://data.com/word.html')
<_sre.SRE_Match object at 0x1d2a100>
And the answer will be
import re
regex = r"^http://.+\.(html|xml)"
links = filter(lambda x: re.match(regex, x), links)
links = list(filter(lambda x: x.endswith((".html", ".xml")), links))

How can I create a new list with each element a regular expression of each element of another list in python?

I've been playing around with some different syntaxes, but nothing seems to work. I've gotten errors like "TypeError: 'str' object does not support item assignment" and a few others.
for url in urls:
    for i in range(len(urls)):
        filename[i] = [re.search(r'/([.]+)(.pdf)') for url[i] in urls]
I've also tried:
for url in urls:
    filename[i] = [re.search(r'/([.]+)(.pdf)') for url[i] in urls for i in range(len(urls))]
And a few other permutations. I've been thinking a list comprehension is the best way to go about doing this, but I'm not sure of the syntax I'd need to use.
Any help, please?
You were on the right track with a list comprehension, but you were making it more complex than it is:
[re.search(r'/([.]+)(.pdf)', url) for url in urls]
Note that this will be a list of matches, if you want, for example, the groups it finds, ignoring any urls it doesn't match, you could do:
[match.groups() for match in (re.search(r'/([.]+)(.pdf)', url) for url in urls) if match]
Which uses a nested generator expression to do a check if the match has been made, and extract the groups from it if it has:
>>> urls = ["http://wwww.example.com/test.html", "http://www.example.com/test.pdf"]
>>> [match.groups() for match in (re.search(r'/([^/]+)(.pdf)', url) for url in urls) if match]
[('test', '.pdf')]
Here I made a slight edit to your regex (replacing [.]+ with [^/]+) so that it works for my examples; this is done just for the demonstration.
Of course, you could do whatever you want with the match objects here, instead of just getting match.groups().
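For instance, if all you want is the filename stems, pulling group(1) out of each match is a small variation (a sketch with made-up example URLs):
import re

urls = ["http://www.example.com/test.html", "http://www.example.com/test.pdf"]
stems = [m.group(1) for m in (re.search(r'/([^/]+)\.pdf', url) for url in urls) if m]
print(stems)  # ['test']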
Using a list comprehension, I created a list of URLs; the regex result is only inserted into the new list if the URL matches the pattern:
import re
file_regex = re.compile(r'/([a-zA-Z0-9]+)(\.pdf)')
# use search rather than match here: match only anchors at the start of the
# string, and these URLs start with 'http', not '/'
filenames = [file_regex.search(url) for url in urls if file_regex.search(url)]
print(filenames)

Slicing URL with Python

I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then keeping only the characters before it, but I'm not quite sure how to do this.
Cheers
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example (with url bound to your string):
>>> process_url(url)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function with & as the separator. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing URLs is never as simple as it seems, which is why there are the urlparse and urllib modules. E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and will only discover one day in error logs.
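Note that urllib.splitquery exists only in Python 2; a rough Python 3 sketch of the same idea using urllib.parse.urlsplit (my own adaptation, not the original answer):
from urllib.parse import urlsplit, urlunsplit

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
parts = urlsplit(url)
# keep only the first query parameter, as in the Python 2 version above
result = urlunsplit(parts._replace(query=parts.query.split("&")[0]))
print(result)  # http://www.domainname.com/page?CONTENT_ITEM_ID=1234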
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the URL string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# skip bare params like 'param2' that have no '=', which would otherwise break dict()
params = dict(i.split('=') for i in parts[1].split('&') if '=' in i)
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
Besides urlparse there is also furl, which IMHO has a better API.
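A sketch assuming the third-party furl package (pip install furl); I am writing its API from memory, so double-check against furl's documentation:
from furl import furl

f = furl('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
for key in list(f.args):
    if key != 'CONTENT_ITEM_ID':
        del f.args[key]  # drop every other query parameter
print(f.url)  # http://www.domainname.com/page?CONTENT_ITEM_ID=1234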
