I'm working on a Python scraping project for politician videos. I have isolated this link (and others like it):
http://vod.europarl.europa.eu/wmv/nas/nasvod02/vod0804/2014/wm/VODUnit_20140414_20110100_20125200_63884a681455e14a75888c.wmv
It downloads what looks like a streaming video file, but it's only 295 bytes and will only play in VLC. Is there a way, in Python, to save that URL to an actual local video file?
I've opened the file with Notepad++, and the URLs it contains either download the same file or result in an error; the file itself, though, plays in VLC after buffering.
Thank you so much in advance for any insight!
Update: I've been trying everything I can think of to find the actual location of the video; the full 1:50 video should take up at least a few MB...
Perhaps someone knows how to see the source VLC actually uses to get the video; the two links in the stream file are not useful (at least to me so far). Could there be a redirect at some point?
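Since the 295-byte file plays in VLC, it is most likely a Windows Media metafile (an ASX playlist): plain text whose <ref href="..."/> entries point at the real mms:// stream, which VLC follows internally. A minimal sketch of one way to save the stream locally, assuming the metafile really is ASX and that VLC is installed (the --sout chain and mux=asf are educated guesses for a WMV/MMS stream; requests is a third-party package):

import re
import subprocess
import requests

META_URL = ('http://vod.europarl.europa.eu/wmv/nas/nasvod02/vod0804/2014/wm/'
            'VODUnit_20140414_20110100_20125200_63884a681455e14a75888c.wmv')

# The tiny "video" should be an ASX metafile: text with <ref href="mms://..."/>
meta = requests.get(META_URL).text
stream_urls = re.findall(r'href="([^"]+)"', meta, re.IGNORECASE)

# Let VLC do the streaming and dump it to disk; assumes vlc is on PATH.
subprocess.run(['vlc', '-I', 'dummy', stream_urls[0],
                '--sout=#std{access=file,mux=asf,dst=saved_video.wmv}',
                'vlc://quit'])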
Good morning all.
I have a generic question about the best approach to handle large files with Django.
I created a Python project where the user can read a binary file (usually between 30 and 100 MB). Once the file is read, the program processes it and shows relevant metrics to the user: basically the max, min, average, and standard deviation of the data.
At the moment, you can only run this project from the command line. I'm trying to create a user interface so that anyone can use it, so I decided to build a web page using Django. The page is very simple: the user uploads files, selects which file to process, and the page shows the metrics.
Working on my local machine, I was able to implement it: files are uploaded, saved locally, and then processed. I then created an S3 account, and now the files are all uploaded to S3. The problem I'm having is that reading the file with smart_open (https://pypi.org/project/smart-open/) is really slow (a 30 MB file takes 300 s), but if I download the file first and then read it, it takes only 8 s.
My question is: what is the best approach to retrieve files from S3 and process them? I'm thinking of simply downloading the file to my server, processing it, and then deleting it. I've tried this on localhost and it's fast: downloading from S3 takes 5 s and processing takes 4 s.
Would this be a good approach? I'm a bit afraid that if, for instance, I have 10 users at the same time and each one creates a report, the server will need 10 × 30 MB = 300 MB of space. Is this practical, or will I fill up the server?
Thank you for your time!
Edit
To give a bit more context, what's making it slow is the f.read() calls. Due to the format of the binary file, I have to read it in the following way:
name = f.read(30)
unit = f.read(5)
data_length = int.from_bytes(f.read(2), 'little')  # the byte order here is an assumption
data = f.read(data_length)  # <- this is the part that takes a long time when I read
                            #    directly from S3; if I download the file first, it's super fast
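One way to avoid those per-call round trips (a sketch, assuming boto3 and that the whole file fits in memory, which at 30-100 MB it should) is to fetch the object once and parse it from an in-memory buffer:

import io
import boto3

s3 = boto3.client('s3')  # assumes AWS credentials are configured
obj = s3.get_object(Bucket='YOURBUCKETNAME', Key='YOURKEY')
f = io.BytesIO(obj['Body'].read())  # one network transfer, then in-memory reads

name = f.read(30)
unit = f.read(5)
data_length = int.from_bytes(f.read(2), 'little')
data = f.read(data_length)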
All,
After some experimenting, I found a solution that works for me.
import os
import boto3

s3 = boto3.client('s3')
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
read_file('temp_file_name')   # process the local copy, then clean up
os.remove('temp_file_name')
I don't know whether this is the best approach or what its possible downsides are. I'll use it, and I'll come back to this post if I end up switching to a different solution.
The problem with my previous approach was that f.read() was taking too long; it seems that every read has to go back to S3 (or something like that), and that round trip is what's slow. What ended up working for me was to download the file directly to my server, read it, and then delete it once I'm done. With this solution I got the same speeds as when working locally (reading directly from my laptop).
If you are working with medium-sized files (30-50 MB), this approach seems to work. My only concern is whether the server will run out of disk space if we try to download a really large file.
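To address the concurrent-users worry from the question (ten requests all writing 'temp_file_name' would clobber each other), here is a variation of the same download-process-delete approach using tempfile, so each request gets its own file that is always cleaned up (a sketch; the bucket/key arguments are placeholders and read_file is the existing processing function):

import os
import tempfile
import boto3

s3 = boto3.client('s3')

def process_s3_file(bucket, key):
    # Each call gets a unique temp file, so concurrent users don't collide.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as data:
            s3.download_fileobj(bucket, key, data)
        return read_file(path)  # the existing processing function
    finally:
        os.remove(path)  # clean up even if processing raises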
I have written a program that reads a text file line by line; each line contains a link to a PDF file. The program downloads the PDF, splits every page into its own PDF, and then converts each page to JPEG. I've now been asked to add pause/resume functionality. The sequence is: read one line, download from the link, split the PDF into per-page PDFs, make images of all the pages, then move on to the next link. The problem is that we don't know the data size in advance and we have to do further work on the images, so we need to be able to pause the program in between and resume later; it could be an hour, 4 hours, or even more. It has to start and stop on user command at any point in time. Please help.
Thanks in advance
A quick update: I haven't found an exact way to integrate pause/resume into the script, but I did find a workaround. I create a log file: after downloading a PDF and creating images of all its pages, I write that link to the log file. At the start of the program, I compare the two files (the log file and the file containing the links), loop over the difference, and download those files. So when I stop the program and start it again, it compares the links in the log file (the ones whose processing is complete, i.e. images of all pages have been created) against the file containing all the links, and resumes downloading from the link whose processing was interrupted. Please let me know if there is a better approach, or if real pause/resume functionality exists.
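A minimal sketch of that workaround (file names and the process_pdf helper are hypothetical; the key point is that a link is logged only after it has been fully processed, which is what makes restarting safe):

def process_pdf(link):
    # Hypothetical helper: downloads the PDF, splits the pages, renders JPEGs.
    ...

def run(links_path='links.txt', log_path='done.log'):
    try:
        with open(log_path) as log:
            done = set(line.strip() for line in log)
    except FileNotFoundError:
        done = set()

    with open(links_path) as links, open(log_path, 'a') as log:
        for raw in links:
            link = raw.strip()
            if not link or link in done:
                continue  # already fully processed in an earlier run
            process_pdf(link)
            # Log only after all images exist, so an interrupted run
            # simply redoes the current link.
            log.write(link + '\n')
            log.flush()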
Hello guys!
I'm working on a personal project, and I have a little problem, which is the following:
I'm using the moviepy module to build videos with sound. When I execute my program, I get the result I expected, i.e. the video I wanted, and it plays fine on my PC as an mp4 video in VLC.
But when I try to export this video to my iPhone (I've tried Google Drive, Dropbox, email, etc.), the sound disappears.
What makes me think the mistake comes from my code?
On my phone, I can play the video just fine directly from Google Drive (or wherever), but as soon as I try to save it to my gallery (photos/videos), the sound disappears. Even when I post the video directly from Dropbox to Instagram, it says "this video has no sound".
Obviously, I've tried different .mp4 videos (not created with moviepy), and those work fine.
I'm exhausted after spending hours trying to figure out where the problem could come from, which is why I'm asking for your help.
Here is the function that builds my videos with sound, and where I'm 100% sure the problem lies:
import moviepy.editor as mpy
from moviepy.editor import AudioFileClip

def build_sentence_video(self):
    logo = (mpy.ImageClip(self.logo_path).set_position('center', 0)
            .resize(width=100, height=100).set_position((10, 5)))
    clip = mpy.VideoClip(self.make_simple_frame, duration=10)
    video = (mpy.CompositeVideoClip([clip, logo], size=VIDEO_SIZE)
             .on_color(color=WHITE, col_opacity=1).set_duration(4))
    video_with_new_audio = video.set_audio(
        AudioFileClip(self.audio, buffersize=200000, fps=44100))
    video_with_new_audio.write_videofile(self.fr_word + ".mp4", fps=2)
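For what it's worth, iOS players are picky about audio codecs: moviepy's default audio codec for .mp4 output is (or at least used to be) MP3 via libmp3lame, which VLC happily plays but iPhones and Instagram generally do not. A minimal sketch of the same write call with an explicit AAC audio track, in case the default codec is the culprit:

video_with_new_audio.write_videofile(
    self.fr_word + ".mp4",
    fps=2,
    codec='libx264',     # standard H.264 video for an mp4 container
    audio_codec='aac',   # AAC audio is what iOS expects in an mp4
)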
Sorry in advance if my English is awful.
Thank you for your help!
After long research, I can't find a way to get the MPD URL for a YouTube video. My question is: is it even possible to get it?
In theory, YouTube has to have a manifest file so that it can play out the segment files. My research led me to some old posts where people had allegedly found a way to do it: a Python script that is supposed to extract the MPD file. After analyzing what exactly it does, I found that on line 68 it looks for dashmpd, which is no longer contained in the page source. I thought maybe the parameter name had changed and tried to look for some URL, but without success.
mpdurl = html[html.find("dashmpd"):]
A topic where the MPD file is used; another similar topic.
So again, my question is: is it even possible to extract the MPD URL/file from a YouTube video? Or is it encrypted and no longer possible? Does it make a difference whether the video is in webm or mp4 format?
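Not a direct answer about the raw manifest, but as a point of comparison: youtube-dl (and its fork yt-dlp) sidesteps the .mpd file and extracts the individual DASH format URLs instead. A sketch of inspecting what it finds (assuming youtube-dl is installed via pip; VIDEO_ID is a placeholder):

import youtube_dl  # pip install youtube-dl (yt-dlp has the same API)

with youtube_dl.YoutubeDL({'quiet': True}) as ydl:
    info = ydl.extract_info('https://www.youtube.com/watch?v=VIDEO_ID',
                            download=False)

for fmt in info['formats']:
    # Each DASH audio/video rendition carries its own direct URL.
    print(fmt['format_id'], fmt['ext'], fmt['url'][:60])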
I'm writing a CLI for a music/media platform. One of the features is going to be the ability to play YouTube videos directly from the CLI. I don't really have an idea of how to do it, but the following approach sounded the most reasonable:
I'm going to use one of those sites where you can download music from YouTube, for example http://keepvid.com/, and then directly stream and play it. But I have one problem: is there any Python library capable of doing this, and if so, do you have any concrete examples?
I've been looking, but I've found nothing, not even with GStreamer.
You need two things to download a YouTube video: the video id, which is the v= parameter of the URL, and a hidden field t=, which is present in the page source. I have no idea what this t value is, but it's what you need :)
You can then download the video using a URL of the format:
http://www.youtube.com/get_video?video_id=*******&t=*******
Where the stars represent the values obtained.
I'm guessing you can ask for the video id from user input, as it's straightforward to obtain. Your program would then download the HTML source for that video, parse the source for the t value, then download the video using the newly constructed URL.
For example, if you open this link in your browser, it should download the video; or you can use a downloading program such as Wget:
http://www.youtube.com/get_video?video_id=3HrSN7176XI&t=vjVQa1PpcFNM4c8MbEhsnGaNvYKoYERIJ-hK7ErLpUI=
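A rough sketch of the flow just described, for illustration only: the pattern used to locate the t value is a guess, the endpoint historically served FLV, and the whole get_video scheme may no longer work at all:

import re
import urllib.request

video_id = '3HrSN7176XI'
html = urllib.request.urlopen(
    'http://www.youtube.com/watch?v=' + video_id).read().decode('utf-8', 'replace')

# Assumption: the hidden t value appears as "t": "..." somewhere in the source;
# the real pattern may differ (or the parameter may be gone entirely).
match = re.search(r'"t"\s*:\s*"([^"]+)"', html)
if match:
    video_url = ('http://www.youtube.com/get_video?video_id=%s&t=%s'
                 % (video_id, match.group(1)))
    urllib.request.urlretrieve(video_url, video_id + '.flv')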
It appears that KeepVid is simply a JavaScript bookmarklet that links you to a KeepVid download page where you can then download the YouTube video in any one of a variety of formats. So, unless you want to figure out how to stream the file that it links you to, it's not easily doable. You'd have to scrape the page returned and figure out which URL you wanted to download, and then you'd have to stream from that URL (and some of the formats may or may not be streamable anyway).
And as an aside: even though they don't specify terms of service, they appear to be mostly advertisement-supported, so bypassing their ad-supported webpage by scripting around it would be ethically questionable.