Ripping video links out of HTML pages using Python

I have a bunch of HTML pages with video players embedded in them via various different HTML tags: the <video> tag, but also others. What all of these different approaches to holding video files have in common is that they link to various common video websites, such as:
YouTube
Rumble
Bitchute
Brighteon
Odysee
Vimeo
Dailymotion
Videos originating from different video hosting websites may be embedded in different ways. For example, they may or may not use the <video> tag.
To clarify: I am not dealing with a bunch of websites, I am dealing with a bunch of HTML pages stored locally. I want to use Python to rip the links to these videos out of the HTML pages and save them into some JSON file, which could later be read and fed to youtube-dl to download all of these videos.
My question is, how exactly would I go about doing this? What kind of strategy should I pursue? Should I just read each HTML file as plain text in Python and use some kind of algorithm or regular expression to look for links to these video hosting websites? If so, I am bad at regular expressions and would like some assistance with how to find links to the video websites in the text using regular expressions in Python.
Alternatively, I could make use of HTML's DOM structure. I do not know whether this is possible in Python, but the idea would be to read the HTML not as plain text but as a DOM tree, and traverse the tree to pick up only the tags that have videos embedded in them. I do not know how to do that either.
I guess what I'm trying to say is that I first need some kind of strategy for achieving my overall goal, and then I need to know what code or APIs to use to achieve it: ripping links to video files out of the HTML and saving them somewhere.
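A minimal sketch of the DOM-based approach might look like this, assuming BeautifulSoup is installed; the directory name, host list, and attribute names below are illustrative assumptions, not a complete solution:

import json
import os
from bs4 import BeautifulSoup

# Hosts from the question; extend as needed.
VIDEO_HOSTS = ("youtube.com", "youtu.be", "rumble.com", "bitchute.com",
               "brighteon.com", "odysee.com", "vimeo.com", "dailymotion.com")

def extract_video_links(html_path):
    with open(html_path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")
    links = set()
    # Check the tags and attributes that commonly carry a video URL.
    for tag in soup.find_all(["video", "source", "iframe", "embed", "a"]):
        for attr in ("src", "href", "data-src"):
            url = tag.get(attr)
            if url and any(host in url for host in VIDEO_HOSTS):
                links.add(url)
    return sorted(links)

all_links = {}
for name in os.listdir("pages"):          # hypothetical directory of saved HTML pages
    if name.endswith(".html"):
        all_links[name] = extract_video_links(os.path.join("pages", name))

with open("video_links.json", "w") as out:
    json.dump(all_links, out, indent=2)   # feed these URLs to youtube-dl later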

Related

Getting potentially large amounts of data from a website: Should I use Scrapy or urllib2?

I'm not new to programming—but am (very) new to web-scraping. I'd like to get data from this website in this manner:
Get the team-data from the given URL and store it in some text file.
"Click" the links of each of the team members and store that data in some other text file.
Click various other specific links and store data in its own separate text file.
Again, I'm quite new to this. I have tried opening the specified website with urllib2 (in hopes of being able to parse it with BeautifulSoup), but opening it resulted in a timeout.
Ultimately, I'd like to do something like specify a team's URL to a script, and have said script update associated text files of the team, its players, and various other things in different links.
Considering what I want to do, would it be better to learn how to create a web-crawler, or directly do things via urllib2? I'm under the impression that a spider is faster, but will basically click on links at random unless told to do otherwise (I do not know whether or not this impression is accurate).
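For reference, a rough sketch of the direct approach the question describes, using urlopen with an explicit timeout and BeautifulSoup (urllib2 in Python 2, urllib.request in Python 3). The URL, CSS selector, and file names are placeholders, not the actual site:

from urllib.request import urlopen
from bs4 import BeautifulSoup

TEAM_URL = "http://example.com/team"      # placeholder for the team page

html = urlopen(TEAM_URL, timeout=30).read()
soup = BeautifulSoup(html, "html.parser")

# Store the team data itself.
with open("team.txt", "w") as f:
    f.write(soup.get_text())

# "Click" each team-member link by fetching its href in turn.
for link in soup.select("a.player"):      # placeholder selector
    member_html = urlopen(link["href"], timeout=30).read()
    member_soup = BeautifulSoup(member_html, "html.parser")
    with open(link.get_text(strip=True) + ".txt", "w") as f:
        f.write(member_soup.get_text())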

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of scripts that I can loop over to search for the text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve SEO and to integrate better with social network sharing.
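For the first suggestion (pulling JSON out of the JavaScript), a rough sketch might look like this, where soup is the BeautifulSoup object from the question and the variable name "pageData" is a placeholder for whatever the page actually uses:

import json
import re

# Non-greedy match on a braced object; a simplification that will not
# handle nested braces in every case.
pattern = re.compile(r"var\s+pageData\s*=\s*(\{.*?\})\s*;", re.DOTALL)

for script in soup("script", {"type": "text/javascript"}):
    if not script.string:
        continue
    match = pattern.search(script.string)
    if match:
        data = json.loads(match.group(1))   # now a plain Python dict
        break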

Retrieving media (images, videos etc.) from links in Perl

Similar to Reddit's r/pic sub-reddit, I want to aggregate media from various sources. Some sites use OEmbed specs to expose media on the page but not all sites do it. I was browsing through Reddit's source because essentially they 'scrape' links that users submit, retrieve images, videos etc. They create thumbnails which are then displayed along the link on their site. Now, I would like to do something similar and I looked at their code[1] and it seems that they have custom scrapers for each domain that they recognize and then they have a generic Scraper class that uses simple logic to get images from any domain (basically they retrieve the web-page, parse the html and then determine the largest image on the page which they then use to generate a thumbnail).
Since it's open source I could probably reuse the code for my application, but unfortunately I have chosen Perl, as this is a hobby project and I'm trying to learn Perl. Is there a Perl module with similar functionality? If not, is there a Perl module similar to the Python Imaging Library? It would be handy to determine image sizes without downloading the whole image, and to generate thumbnails.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Image::Size is the specialised module for determining image sizes from various formats. It should be enough to read the first 1000 octets or so of a resource into a buffer (enough for the various image headers) and operate on that. I have not tested this.
I do not know of any general scraping module that has an API for HTTP range requests (to avoid downloading the whole image resource), but it is easy to subclass WWW::Mechanize.
Try PerlMagick; installation instructions are also listed there.

Streaming audio (YouTube)

I'm writing a CLI for a music-media-platform. One of the features is going to be that you can directly play YouTube videos from the CLI. I don't really have an idea of how to do it, but this one sounded the most reasonable:
I'm going to use one of those sites where you can download music from YouTube, for example http://keepvid.com/, and then directly stream and play it. But I have one problem: is there any Python library capable of doing this, and if so, do you have any concrete examples?
I've been looking, but found nothing, not even with GStreamer.
You need two things to be able to download a YouTube video: the video id, which is represented by the v= section of the URL, and a hidden field t= which is present in the page source. I have no idea what this t value is, but it's what you need :)
You can then download the video using a URL in the format;
http://www.youtube.com/get_video?video_id=*******&t=*******
Where the stars represent the values obtained.
I'm guessing you can ask for the video id from user input, as it's straightforward to obtain. Your program would then download the HTML source for that video, parse the source for the t value, then download the video using the newly constructed URL.
For example, if you open this link in your browser, it should download the video, or you can use a downloading program such as Wget;
http://www.youtube.com/get_video?video_id=3HrSN7176XI&t=vjVQa1PpcFNM4c8MbEhsnGaNvYKoYERIJ-hK7ErLpUI=
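A sketch of that flow in Python might look like the following. Note that the get_video endpoint and the t parameter are exactly as described in this answer and may no longer be served by YouTube, and the regex used to locate the t value is a guess at the page structure, not a known format:

import re
from urllib.request import urlopen

video_id = "3HrSN7176XI"   # video id taken from user input

# Download the watch page and look for the hidden t value (assumed location).
page = urlopen("http://www.youtube.com/watch?v=" + video_id).read().decode("utf-8", "ignore")
match = re.search(r'"t"\s*:\s*"([^"]+)"', page)
if match:
    download_url = ("http://www.youtube.com/get_video?video_id=%s&t=%s"
                    % (video_id, match.group(1)))
    print(download_url)    # hand this URL to wget or a streaming player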
It appears that KeepVid is simply a JavaScript bookmarklet that links you to a KeepVid download page where you can then download the YouTube video in any one of a variety of formats. So, unless you want to figure out how to stream the file that it links you to, it's not easily doable. You'd have to scrape the page returned and figure out which URL you wanted to download, and then you'd have to stream from that URL (and some of the formats may or may not be streamable anyway).
And as an aside, even though they don't have a terms of service specified, I'd imagine that since they appear to be mostly advertisement-supported that abusing their functionality by going around their advertisement-supported webpage would be ethically questionable.

RSS screen scraper

Can anyone point me towards a ready made RSS screen scraper, preferably in Python in order to get full text RSS feeds?
There's a good list of them here, which mentions Feed Parser, which you use like this:
import feedparser
python_wiki_rss_url = "http://www.python.org/cgi-bin/moinmoin/" \
    "RecentChanges?action=rss_rc"
feed = feedparser.parse(python_wiki_rss_url)
You can then do things like:
for item in feed["items"]:
    print item["title"]
feedparser.org is great
Sorry, but it doesn't exist in Python, though such things do exist in PHP. You are more than welcome to use and improve the one I made, named scraped. It does not handle all sites; it is a recipe-based system that currently only handles the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking. It involves a ton of analysis of the different types of HTML and XML. Even the three sites mentioned above require vastly different scraping algorithms, WSJ being by far the most complex. They clutter their HTML with so much useless junk, mainly just to stop you.
Here is the program I was talking about; it requires lxml, but the readme explains everything. It reads the config files, parses partial RSS feeds, takes the links and then scrapes those links, producing an RSS 2.0 XML file in the end, which I mainly convert into an ebook for my Kindle. I use lxml, BeautifulSoup and feedparser.
http://tinyurl.com/yh3s9pa
You can also look at the calibre project, which uses a similar recipe-based method to the one I use.
