I want to print only the "words" that start with "/watch" from the string, and then add all the '/watch...' links to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace? I don't know what to do, so I appreciate all help.
Yeah, re is definitely overkill here. This is a perfect use case for filter:
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that filter returns a lazy filter object (an iterator), not a list, so wrap it in list() if you want the list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning, look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
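For completeness, here is a minimal sketch of the same filtering with pandas, assuming it is installed (the sample hrefs are made up):

import pandas as pd

hrefs = pd.Series(['/watch?v=PbVt_O1kFpA', '/feed/trending', '/watch?v=1lcksCjvuSs'])
watch_links = hrefs[hrefs.str.startswith('/watch')]  # boolean-mask filtering via the .str accessor
print(watch_links.tolist())
# ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']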
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re

pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}")  # this variant matches the full links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m)  # prints a match object for each position at which /watch occurs
You can collect all the URLs in a list and proceed. Good luck!
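For collecting them, re.findall is the shortcut; a small sketch with a made-up input string:

import re

text = '/feed/trending /watch?v=PbVt_O1kFpA /user/thedjdoge /watch?v=1lcksCjvuSs'
watch_links = re.findall(r"/watch\?v=[a-zA-Z_0-9]{11}", text)
print(watch_links)
# ['/watch?v=PbVt_O1kFpA', '/watch?v=1lcksCjvuSs']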
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href']  # I think 'href' will be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
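Putting it together with your loop and collecting the matches into the list you asked for:

watch_links = []
for links in recommended_vid:
    href = links['href']
    if href.startswith('/watch'):
        watch_links.append(href)

# or, equivalently, as a list comprehension:
watch_links = [a['href'] for a in recommended_vid if a['href'].startswith('/watch')]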
Related
I have a list of URLs:
print (url_list)
[https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/,
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71]
How do I filter this list so that only the URLs on v.redd.it/ are provided in the output and all other URLs are excluded?
This is what I meant (possible output):
Your required urls :
https://v.redd.it/5tmm5wx5lgz71,
https://v.redd.it/n35a5fo8cez7,
https://v.redd.it/3j0sexqs8ez71,
https://v.redd.it/cmy0l1roxfz71
Excluded Urls :
https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/
I am new to regex and python, so please help me with this. Thanks in advance
There is no need for a regex to do this; instead, you can filter your list based on whether the host of the URL is v.redd.it (and get proper URL parsing instead of implementing a regex that attempts the same):
from urllib.parse import urlparse
urls = ['https://www.reddit.com/r/funnyvideos/comments/qfvclx/winners_of_the_rfunnyvideos_oscars_2021/', 'https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
v_redd_it = filter(lambda x: urlparse(x).netloc == 'v.redd.it', urls)
# you can then either iterate over the filter directly, or convert it to a list
video_links = list(v_redd_it)
# ['https://v.redd.it/5tmm5wx5lgz71', 'https://v.redd.it/n35a5fo8cez7']
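If you also want the excluded URLs, as in your expected output, one sketch is to partition the list in a single pass:

included, excluded = [], []
for u in urls:
    (included if urlparse(u).netloc == 'v.redd.it' else excluded).append(u)
print('Your required urls :', included)
print('Excluded Urls :', excluded)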
Please help me fix this; below is the code I've already tried.
I really appreciate your help.
import urllib.request
import re
search_keyword="ill%20wiat"
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"watch?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])
First of all, check the page you are trying to parse. You wrote:
r"watch?v=(\S{11})"
Just remember that the ? char here will be parsed as a regex operator, not the literal string you want, so first of all you need to write it like:
r"/watch[?]v=(\S{11})"
so your regex will be parsed properly.
Second: it is good practice to print your list to see what you get, and to iterate over the list with a for loop instead of accessing index [0] directly.
In your case you get this error simply because your list of IDs is empty.
The following code works for me:
import urllib.request
import re

search_keyword = "ill%20wiat"
url = "https://www.youtube.com/results?search_query=" + search_keyword
with urllib.request.urlopen(url) as response:
    video_ids = re.findall(r"/watch[?]v=(\S{11})", response.read().decode())

for video in video_ids:
    print("https://www.youtube.com/watch?v=" + video)
P.S. Don't wrap your code in try/except just to swallow errors like this one.
urllib won't give you the data.
Use:
import requests

html = requests.get('https://www.youtube.com/results?search_query=' + search_keyword)
text = html.text
text contains all the HTML data, so search within text.
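Combined with the regex fix above, a minimal sketch of the requests-based version (YouTube's results page changes over time, so whether it still embeds IDs this way isn't guaranteed):

import re
import requests

search_keyword = "ill%20wiat"
html = requests.get("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"/watch[?]v=(\S{11})", html.text)
for video in video_ids:
    print("https://www.youtube.com/watch?v=" + video)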
I'm kinda new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse a file of (to me) unknown symbols and put the result into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing, or what all these symbols below mean:
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should work.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials, and also at the re module's documentation.
Probably you're looking for something like this:
import re

headerMapping = {'type': (1, 5), 'pubid': (6, 11), 'batchID': (12, 21),
                 'batchDate': (22, 29), 'batchTime': (30, 35)}

poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)

parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {}  # fresh dict per header, so the list entries don't all share one dict
    for key in headerMapping:
        start = headerMapping[key][0] - 1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
Then you have a list of dicts, and each dict contains the data for each attribute. I assume that you have your data file in text, which is a string. One dict is made for each structure found (a POA Batch Header in this example).
If you want to parse it further, you have to write a function to parse each attribute, for example the date:
def batchDate(batch):
    return (batch[0:2] + '-' + batch[2:4] + '-20' + batch[4:])

for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate(header['batchDate'])})
Remember, that's just an example and I don't know the documentation of your data! I guess it works like that, but the rest is up to you.
I am working with a huge list of URL's. Just a quick question I have trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then slicing off the chars from that & onward, but I'm not quite sure how to do this.
Cheers
Use the urlparse module. Check this function:
import urlparse  # Python 2 module; on Python 3 this lives in urllib.parse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url(a)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function, with & as the separator. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing URLs is never as simple as it seems, which is why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and discover one day in error logs.
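On Python 3, a sketch of the same idea with urllib.parse, which also handles bare parameters like param2:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
parts = urlsplit(url)
kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k == "CONTENT_ITEM_ID"]
print(urlunsplit(parts._replace(query=urlencode(kept))))
# http://www.domainname.com/page?CONTENT_ITEM_ID=1234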
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
params = dict(i.split('=') for i in parts[1].split('&') if '=' in i)  # skip valueless params like 'param2'
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
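A quick sketch of handling both separators with re.split:

import re

query = "CONTENT_ITEM_ID=1234;param2&param3"
print(re.split(r"[&;]", query))
# ['CONTENT_ITEM_ID=1234', 'param2', 'param3']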
Besides urlparse there is also furl, which IMHO has a better API.
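A small sketch of what that looks like, assuming furl is installed (pip install furl); the API details here are from memory, so double-check them against furl's docs:

from furl import furl

f = furl('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
print(f.args.get('CONTENT_ITEM_ID'))  # '1234' -- query args behave like a dict
print(f.remove(args=['param2', 'param3']).url)
# http://www.domainname.com/page?CONTENT_ITEM_ID=1234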
I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells. Now I am struggling to get the actual contents of the 'cell':
here is my HTML snippet:
headerRows[0][10].contents
[<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">
</font></font></font>]
Note that this is a one-element Python list ([...]).
I need the value Apples Produced but can't get to it.
Any suggestions would be appreciated.
Suggestions for a good book that explains this would earn my eternal gratitude.
Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold tag,
say it is:
[<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">
</font></font></font>]
and I still need the value Apples Produced?
I am trying to learn to read/understand the documentation, and your response will help.
I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than I have been able to do from the BeautifulSoup documentation. I learned to program in the Fortran era and now I am learning Python, and I am amazed at its power - BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.
Cheers
The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:
headerRows[0][10].findNext('b').string
A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:
>>> s = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([s.string for s in s.findAll(text=True)])
u'Test 1 More Test 2'
headerRows[0][10].contents[0].find('b').string
I have a base class with which I extend all my Beautiful Soup classes, adding a bunch of methods that help me get at text within a group of elements whose structure I don't necessarily want to rely on. One of those methods is the following:
def clean(self, val):
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<.*?>', '', val)  # remove tags
    val = re.sub(r'\s+', ' ', val)   # collapse internal whitespace
    return val.strip()               # remove leading & trailing whitespace
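Applied to the snippet from the question, a standalone sketch of the same logic (the function name is hypothetical, written as a free function for illustration):

import re

def clean_text(val):  # free-function version of the method above
    val = str(val)
    val = re.sub(r'<.*?>', '', val)  # remove tags
    val = re.sub(r'\s+', ' ', val)   # collapse internal whitespace
    return val.strip()

print(clean_text('<font size="+0"><b>Apples  Produced</b>\n</font>'))
# Apples Produced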