Streaming audio (YouTube) - python

I'm writing a CLI for a music media platform. One of the features is going to be that you can play YouTube videos directly from the CLI. I don't really have an idea of how to do it, but this approach sounded the most reasonable:
I'm going to use one of those sites that let you download music from YouTube, for example http://keepvid.com/, and then directly stream and play it. But I have one problem: is there any Python library capable of doing this, and if so, do you have any concrete examples?
I've been looking, but I've found nothing, not even with GStreamer.

You need two things to be able to download a YouTube video: the video ID, which is the v= parameter of the URL, and a hidden field t=, which is present in the page source. I have no idea what this t value is, but it's what you need :)
You can then download the video using a URL in the format:
http://www.youtube.com/get_video?video_id=*******&t=*******
where the stars represent the values obtained.
I'm guessing you can ask for the video ID from user input, as it's straightforward to obtain. Your program would then download the HTML source for that video, parse it for the t value, and download the video using the newly constructed URL.
For example, if you open this link in your browser, it should download the video, or you can use a downloading program such as Wget:
http://www.youtube.com/get_video?video_id=3HrSN7176XI&t=vjVQa1PpcFNM4c8MbEhsnGaNvYKoYERIJ-hK7ErLpUI=
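A minimal sketch of that flow in Python. The regex is a guess at how the t value might appear in the page source, and the get_video endpoint may no longer behave as described, so treat this as an illustration of the approach rather than working code:

import re
import requests

def build_download_url(video_id):
    # Fetch the watch page and scrape the hidden t= value described above.
    # The pattern below is an assumption about how the value appears.
    html = requests.get('http://www.youtube.com/watch?v=' + video_id).text
    match = re.search(r'"t":\s*"([^"]+)"', html)
    if match is None:
        raise ValueError('could not find the t value in the page source')
    return ('http://www.youtube.com/get_video?video_id=%s&t=%s'
            % (video_id, match.group(1)))

print(build_download_url('3HrSN7176XI'))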

It appears that KeepVid is simply a JavaScript bookmarklet that links you to a KeepVid download page where you can then download the YouTube video in any one of a variety of formats. So, unless you want to figure out how to stream the file that it links you to, it's not easily doable. You'd have to scrape the page returned and figure out which URL you wanted to download, and then you'd have to stream from that URL (and some of the formats may or may not be streamable anyway).
And as an aside: even though they don't have terms of service specified, I'd imagine that since they appear to be mostly advertisement-supported, going around their ad-supported webpage and using their functionality directly would be ethically questionable.

Related

Ripping video links out of HTML pages using Python

I have a bunch of HTML pages with video players embedded in them via various HTML tags: the <video> tag, but also others. What all of these different approaches to holding video files have in common is that they link to various common video hosting websites, such as:
YouTube
Rumble
Bitchute
Brighteon
Odysee
Vimeo
Dailymotion
Videos from different hosting websites may be embedded in different ways; for example, they may or may not use the <video> tag.
Correction: I am not dealing with a bunch of websites, I am dealing with a bunch of HTML pages stored locally. I want to use Python to rip the links to these videos from the HTML pages and save them into a JSON file, which could later be read and fed into youtube-dl to download all these videos.
My question is, how exactly would I go about doing this? What kind of strategy should I pursue? Should I just read each HTML file as plain text in Python and use some kind of algorithm or regular expression to look for links to these video hosting websites? If so, I am bad at regular expressions and would like some assistance with how to find links to the video websites in the text using regular expressions in Python.
Alternatively, I could make use of HTML's DOM structure. I do not know whether this is possible in Python, but the idea is to read the HTML not as a simple text file but as a DOM tree, traversing up and down the tree to pick out only the tags that have videos embedded in them. I do not know how to do that either.
I guess what I'm trying to say is: I first need some kind of strategy to achieve my goal, and then I need to know what kind of code or APIs to use in order to achieve it, i.e. ripping links to video files out of the HTML and saving them somewhere.
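A minimal sketch of the DOM-tree approach described above, assuming BeautifulSoup as the parser (a library choice of mine, not something from the question) and local .html files in the current directory:

import json
from pathlib import Path
from bs4 import BeautifulSoup  # pip install beautifulsoup4

VIDEO_HOSTS = ('youtube.com', 'youtu.be', 'rumble.com', 'bitchute.com',
               'brighteon.com', 'odysee.com', 'vimeo.com', 'dailymotion.com')

def extract_video_links(html_path):
    soup = BeautifulSoup(html_path.read_text(encoding='utf-8'), 'html.parser')
    links = set()
    # Embeds differ per host, so check every attribute that commonly
    # carries a media URL rather than relying on <video> alone.
    for tag in soup.find_all(['a', 'iframe', 'embed', 'source', 'video']):
        for attr in ('href', 'src', 'data-src'):
            url = tag.get(attr)
            if url and any(host in url for host in VIDEO_HOSTS):
                links.add(url)
    return sorted(links)

all_links = {str(p): extract_video_links(p) for p in Path('.').glob('*.html')}
Path('video_links.json').write_text(json.dumps(all_links, indent=2))

The resulting JSON maps each file name to its list of video URLs, which youtube-dl can read one per line (e.g. via youtube-dl -a).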

How can I download free videos from YouTube or any other site with Python, without an external video downloader?

Problem
Hi! Most answers refer to using pytube with Python, but the problem is that pytube doesn't work for many YouTube videos now. It's outdated, and I always get errors. I also want to be able to get free videos from other sites that are not YouTube.
And I know there are free sites and paid programs that let you put in a url, and it'll download it for you. But I want to understand the process of what's happening.
The following code works for easy things. Obviously it can be improved, but let's keep it super simple...
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r2---sn-vgqsknes.googlevideo.com/videoplayback?expire=1585432044&ei=jHF_XoXwBI7PDv_xtsgN&ip=12.345.678.99&id=743bcee1c959e9cd&itag=244&aitags=133%2C134%2C135%2C136%2C137%2C160%2C242%2C243%2C244%2C247%2C248%2C278%2C298%2C299%2C302%2C303&source=youtube&requiressl=yes&mh=1w&mm=31%2C26&mn=sn-vgqsknes%2Csn-ab5szn7z&ms=au%2Conr&mv=m&mvi=4&pl=23&pcm2=yes&initcwndbps=3728750&vprv=1&mime=video%2Fwebm&gir=yes&clen=22135843&dur=283.520&lmt=1584701992110857&mt=1585410393&fvip=5&keepalive=yes&beids=9466588&c=WEB&txp=5511222&sparams=expire%2Cei%2Cip%2Cid%2Caitags%2Csource%2Crequiressl%2Cpcm2%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRgIhAI3WtBFTf4kklX4xl859U8yzqavSzu-2OEn8tvHPoqAWAiEAlSDPhPdb5y4xPxPoXJFCNKr-h2c4jxKU8sAaaxxa7ok%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRQIhAJkFK4xhfLraysF13jSZpHCoklyhJrwLjNSCQ1v7IzeXAiBLpVpYf72Gp-dlvwTM2tYzMcVl4Axzm2ARd7fN1gPW-g%3D%3D&alr=yes&cpn=EvFJNwgO-zNQOWkz&cver=2.20200327.05.01&ir=1&rr=12&fexp=9466588&range=15036316-15227116&rn=14&rbuf=0'
r = requests.get(good_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
This works. But when I want a YouTube video (and I obviously can't use a regular YouTube URL, because the document request is different from the video request)...
Steps taken
I'll check the network tab in the dev tools, and it's all filled with a bunch of XHR requests. The headers for them always have the very long URL for the request, accept-ranges: bytes, and content-type: video/webm, or something similar for mp4, etc.
Then I'll copy the URL for that XHR, change the file extension to the correct one, and run the program.
Result
Sometimes that downloads a small chunk of the video with no sound (a few seconds long), and other times it downloads a bigger chunk but with no image. But I want the entire video with sound.
Question
Can someone please help me understand how to do this, and explain what's happening, whether it's on another site or on YouTube?
Why does good_url work, but not bad_url? I figured it might be a timeout thing, so I grabbed that XHR and immediately tested it from Python, but still no luck.
A related question (don't worry about answering this one, unless required)...
Sometimes YouTube has blob URLs in the HTML too, for example: <video src='blob:https://www.youtube.com/f4484c06-48ed-4531-a6ee-6a3ae0291d26'>...
I've read various answers on what blobs are, and I'm not understanding it, because it looks to me like a blob URL is doing an XHR to change a URL in the DOM, as if it were trying to do the equivalent of an internal redirect on a web server for a private file served based on view/object-level permissions. Is that what's happening? Because I don't see the point, especially when these videos are free. The way I've done something similar, such as with lazy loading, is to have a data-src attribute with the correct value and an onload event handler that switches the data-src value to the src value.
You can try this video:
https://www.youtube.com/watch?v=GAvr5_EtOnI
From the bad_url, remove &range=3478828-4655264. The range parameter asks the server for a single byte range of the stream, which is why you only get a chunk. Note also that YouTube serves video and audio as separate streams, so a video-only stream will play without sound.
Try this code:
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r1---sn-gvcp5mp5u5-q5js.googlevideo.com/videoplayback?expire=1631119417&ei=2ZM4Yc33G8S8owPa1YiYDw&ip=42.0.7.242&id=o-AKG-sNstgjok92lJp_o4pF_iJ2MWD4skzEvFcTLl8LX8&itag=396&aitags=133,134,135,136,137,160,242,243,244,247,248,278,394,395,396,397,398,399&source=youtube&requiressl=yes&mh=gB&mm=31,29&mn=sn-gvcp5mp5u5-q5js,sn-i3belney&ms=au,rdu&mv=m&mvi=1&pl=24&initcwndbps=82500&vprv=1&mime=video/mp4&ns=mh_mFY1G7qq0apTltxepCQ8G&gir=yes&clen=7874090&dur=213.680&lmt=1600716258020131&mt=1631097418&fvip=1&keepalive=yes&fexp=24001373,24007246&beids=9466586&c=WEB&txp=5531432&n=Q3AfqZKoEoXUzw&sparams=expire,ei,ip,id,aitags,source,requiressl,vprv,mime,ns,gir,clen,dur,lmt&lsparams=mh,mm,mn,ms,mv,mvi,pl,initcwndbps&lsig=AG3C_xAwRQIgYZMQz5Tc2kucxFsorprl-3e4mCxJ3lpX1pbX-HnFloACIQD-CuHGtUeWstPodprweaA4sUp8ZikyxySZp1m3zlItKg==&alr=yes&sig=AOq0QJ8wRQIhAJZg4q9vLal64LO6KAyWkpY1T8OTlJRd9wNXrgDpNOuQAiB77lqm4Ka9uz2CAgrPWMSu6ApTf5Zqaoy5emABYqCB_g==&cpn=5E-Sqvee9UG2ZaNQ&cver=2.20210907.00.00&rn=19&rbuf=90320'
r = requests.get(good_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
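To automate the suggestion, here's a small sketch that strips the range parameter from a copied googlevideo URL. In the URLs above, range is not among the signed sparams, which is why removing it doesn't invalidate the signature:

from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def strip_range(url):
    # Drop the range=start-end parameter so the server returns the whole
    # stream instead of the single chunk the player asked for.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != 'range']
    return urlunsplit(parts._replace(query=urlencode(query)))

full_url = strip_range(bad_url)  # bad_url as defined in the snippet above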

Python save Stream File to Local File

I'm working on a Python scraping project for politician videos. I have isolated this link (and others like it):
http://vod.europarl.europa.eu/wmv/nas/nasvod02/vod0804/2014/wm/VODUnit_20140414_20110100_20125200_63884a681455e14a75888c.wmv
It downloads what looks like a streaming video file, but it's only 295 bytes and will only play in VLC. Is there a way, in Python, to save that URL to an actual local video file?
I've opened the file with Notepad++, and the included URLs either download the same file or result in an error, but the file itself plays in VLC after buffering.
Thank you so much in advance for any insight!
Update: I've been trying everything I can think of to find out the actual location of the video; the full 1:50 video should take up at least a few MB...
Perhaps someone knows how to see the source VLC uses to get the video; the two links in the stream file are not useful (at least to me so far). Could there be a redirect at some point?
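For reference, the generic way to save an HTTP response to a local file in chunks looks like the sketch below. With the URL above it will still only fetch the 295-byte file: a tiny file that plays in VLC is most likely a metafile pointing at a stream served over a streaming protocol (e.g. MMS/RTSP) that a plain HTTP request can't follow.

import requests

url = ('http://vod.europarl.europa.eu/wmv/nas/nasvod02/vod0804/2014/wm/'
       'VODUnit_20140414_20110100_20125200_63884a681455e14a75888c.wmv')

r = requests.get(url, stream=True)
with open('speech.wmv', 'wb') as f:
    # Write in chunks so large files never have to fit in memory.
    for chunk in r.iter_content(chunk_size=64 * 1024):
        f.write(chunk)
print('saved', r.headers.get('Content-Length'), 'bytes')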

Search/Filter/Select/Manipulate data from a website using Python

I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search for a name, filter the results down to those with a specific type, pick the option to save those results (as opposed to emailing them), pick a format to save them in, and then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration, but even if that works, I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data that I want, and using a function from the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button here is what I get:
[screenshot of the save button's markup omitted]
This will depend a lot on the website you're targeting and how its search is implemented.
For some websites, like Reddit, there's an open API where you can add a .json extension to a URL and get a JSON response as opposed to pure HTML.
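For example (Reddit rejects requests with a default User-Agent, so an arbitrary one is set here):

import requests

# Appending .json to many Reddit URLs returns the listing as JSON.
resp = requests.get('https://www.reddit.com/r/pics/new.json',
                    headers={'User-Agent': 'my-scraper/0.1'})
posts = resp.json()['data']['children']
print(posts[0]['data']['title'])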
For a REST API or any JSON response, you can load it as a Python dictionary using the json module, like this:
import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it, which makes everything much easier than driving a browser through something like Selenium. If there's an API, you can also easily scale your solution up or down.
If there's no API, then you have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much as possible generated and filtered without actually pushing any buttons on the page, by learning how the site's search engine works with GET and POST requests. For example, if you're looking for books within a range, manually conduct that search and look at how the URL changes. If you're lucky, you'll see that your search criteria are in the URL; using this, your program can conduct a search by visiting that URL, so it won't have to fill out a form and push buttons or drop-downs (see the sketch after this list).
If you do have to use the browser through Selenium (for example, if you want to save the whole page with its HTML, CSS, and JS files, you have to press Ctrl+S and then click the "Save" button), then you need libraries that let you drive the keyboard from Python. Such libraries exist for Ubuntu, and they allow you to press any key and even key combinations.
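A sketch of the no-button approach from the second point. The base URL and parameter names below are made up for illustration; the real ones come from watching the URL (or the network tab) while searching manually:

import requests

params = {
    'q': 'lincoln',       # the search term
    'type': 'books',      # the filter picked from the search form
    'format': 'csv',      # the "save as" format, if exposed as a parameter
}
resp = requests.get('https://example.gov/search', params=params)
resp.raise_for_status()
with open('results.csv', 'wb') as f:
    f.write(resp.content)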
An example of what's possible:
I wrote a script that logs me in to a website, navigates to a page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e., it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it ran in a virtual Ubuntu machine on my Mac, which meant that while it was doing all that work I could still use my machine. If you don't use a virtual machine, you'll either have to leave the script running without interfering with it, or write a much more robust program, which IMO is not worth coding since you can just use a virtual machine.

Retrieving media (images, videos etc.) from links in Perl

Similar to Reddit's r/pic sub-reddit, I want to aggregate media from various sources. Some sites use OEmbed specs to expose media on the page but not all sites do it. I was browsing through Reddit's source because essentially they 'scrape' links that users submit, retrieve images, videos etc. They create thumbnails which are then displayed along the link on their site. Now, I would like to do something similar and I looked at their code[1] and it seems that they have custom scrapers for each domain that they recognize and then they have a generic Scraper class that uses simple logic to get images from any domain (basically they retrieve the web-page, parse the html and then determine the largest image on the page which they then use to generate a thumbnail).
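For reference, a simplified reconstruction of that generic "largest image" heuristic in Python (my own sketch based on the description above, not Reddit's actual code):

import requests
from io import BytesIO
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from PIL import Image

def largest_image_url(page_url):
    # Fetch the page, measure every <img>, and keep the biggest one.
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    best_url, best_area = None, 0
    for img in soup.find_all('img', src=True):
        src = urljoin(page_url, img['src'])
        try:
            im = Image.open(BytesIO(requests.get(src).content))
        except Exception:
            continue  # not a decodable image; skip it
        if im.width * im.height > best_area:
            best_url, best_area = src, im.width * im.height
    return best_url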
Since it's open source, I could probably reuse the code for my application, but unfortunately I have chosen Perl, as this is a hobby project and I'm trying to learn the language. Is there a Perl module with similar functionality? If not, is there a Perl module similar to the Python Imaging Library? It would be handy to determine image sizes without downloading the whole image, and to generate thumbnails.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Image::Size is the specialised module for determining image sizes in various formats. It should be enough to read the first 1000 octets or so from a resource into a buffer (enough for the diverse image headers) and operate on that, though I have not tested this.
I do not know any general scraping module that has an API for HTTP range requests in order to avoid downloading the whole image resource, but it is easy to subclass WWW::Mechanize.
Try PerlMagick; installation instructions are also listed there.
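The thread asks for Perl, but to illustrate the range-request idea concretely, here is the same technique in Python with Pillow. The URL is hypothetical, and 1000 octets is not guaranteed to cover every file's header, so treat it as a sketch:

import requests
from io import BytesIO
from PIL import Image  # pip install Pillow

# Hypothetical URL; ask the server for only the first 1000 octets.
resp = requests.get('https://example.com/photo.jpg',
                    headers={'Range': 'bytes=0-999'})
# Image.open only parses the header, so the truncated body is usually
# enough to learn the dimensions without downloading the whole file.
img = Image.open(BytesIO(resp.content))
print(img.size)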
