Make requests download a video file faster - python

I have the following code to download an .mp4 file, however it's taking too long per video. This doesn't have anything to do with my internet connection, as wget is ramming through it like a sharp knife while the following code is chugging along like an old cow.
I think the code is streaming the video in real time and writing it to disk as it goes; is that correct? Is there any way to download the video as a whole instead, or to speed this download up with another module?
import os
import requests

def download_file(link, temp):
    r = requests.get(link, stream=True)
    folder_name = r"K:\Archive\Videos"
    existing_files = [file.replace(".mp4", "") for file in os.listdir(folder_name)]
    if temp not in existing_files:
        with open(f"{folder_name}/{temp}.mp4", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
Going through the code, I'm starting to think that perhaps it's the part that's checking if the file exists that's taking up the time. I have 400,000 files in the folder, so perhaps this checking mechanism isn't efficient?

I think your assessment is right. Enumerating 400,000 files takes a lot of time, so you should cache the listing in a global and only build it once:
folder_name = r"K:\Archive\Videos"
existing_files = None

def download_file(link, temp):
    global existing_files
    if existing_files is None:
        existing_files = [file.replace(".mp4", "") for file in os.listdir(folder_name)]
    # ... rest of the function as before
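A fuller sketch along those lines, with two additions of my own that are not part of the answer above: a set instead of a list (membership tests become O(1)) and a larger chunk_size (the 1 MB value is a guess worth tuning):
import os
import requests

folder_name = r"K:\Archive\Videos"
existing_files = None  # populated on the first call

def download_file(link, temp):
    global existing_files
    if existing_files is None:
        # build the cache once; a set gives O(1) membership tests
        existing_files = {file.replace(".mp4", "") for file in os.listdir(folder_name)}
    if temp in existing_files:
        return
    r = requests.get(link, stream=True)
    with open(f"{folder_name}/{temp}.mp4", "wb") as f:
        # a larger chunk size cuts per-iteration overhead; 1 MB is an assumption
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
    existing_files.add(temp)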

Related

Python in-memory files for caching large files

I am doing very large data processing (16 GB files) and I would like to try to speed up the operations by storing the entire file in RAM, in order to deal with disk latency.
I looked into the existing libraries but couldn't find anything that would give me the flexibility of interface I'm after.
Ideally, I would like to use something with integrated read_line() method so that the behavior is similar to the standard file reading interface.
Homer512 answered my question somewhat; mmap is a nice option that results in caching.
Alternatively, Numpy has some interfaces for parsing byte streams from a file directly into memory and it also comes with some string-related operations.
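For reference, a minimal sketch of the mmap route mentioned above (the path is a placeholder); mmap objects expose readline(), so usage stays close to a regular file object:
import mmap

# map the file into memory and read it line by line; 'large_file.txt' is illustrative
with open('large_file.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        line = mm.readline()
        while line:
            # process the line (a bytes object) here
            line = mm.readline()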
Just read the file once. The OS will cache it for the next read operations. It's called the page cache.
For example, run this code and see what happens (replace the path with one pointing to a file on your system):
import time

filepath = '/home/mostafa/Videos/00053.MTS'
n_bytes = 200_000_000  # read 200 MB

t0 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t1 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t2 = time.perf_counter()

print(f'duration 1 = {t1 - t0}')
print(f'duration 2 = {t2 - t1}')
On my Linux system, I get:
duration 1 = 2.1419905399670824
duration 2 = 0.10992361599346623
Note how much faster the second read operation is, even though the file was closed and reopened. This is because, in the second read operation, the file is read from RAM, thanks to the Linux kernel's page cache. Also note that if you run this program again, you'll see that both operations finish quickly, because the file was cached from the previous run:
duration 1 = 0.10143039299873635
duration 2 = 0.08972924604313448
So in short, just use this code to preload the file into RAM:
with open(filepath, 'rb') as f:
    f.read()
Finally, it's worth mentioning that the page cache is practically non-deterministic, i.e. your file might not get fully cached, especially if the file is big compared to the available RAM. But in cases like that, caching the file might not be practical after all. In short, default OS page caching is not the most sophisticated approach, but it is good enough for many use cases.

How to monitor changes to a particular website CSS?

I wish to monitor the CSS of a few websites (these websites aren't my own) for changes and receive a notification of some sort when they do. If you could share any experience you've had with this and point me in the right direction on how to code it, I'd appreciate it greatly.
I'd like this script/app to notify a Slack group on change, which I assume will require a webhook.
Not asking for code, just any advice about particular APIs and other tools that may be of benefit.
I would suggest a modification of tschaefermedia's answer.
Crawl website for .css files, save.
Take an md5 of each file.
Then compare the md5 of the new file with the md5 of the old file.
If the md5 is different then the file changed.
Below is a function to take md5 of large files.
import hashlib
import math
import os

def md5(file_name):
    # make an md5 hash object
    hash_md5 = hashlib.md5()
    # open the file as binary, read only
    with open(file_name, 'rb') as f:
        i = 0
        # read 4096 bytes at a time and feed each chunk into the running hash
        # (b'' is the sentinel that ends the iteration)
        for chunk in iter(lambda: f.read(4096), b''):
            i += 1
            # m.update(a); m.update(b) is equivalent to m.update(a + b)
            hash_md5.update(chunk)
    # sanity check (not otherwise used): chunk count should match the file size
    file_size = os.path.getsize(file_name)
    expected_i = int(math.ceil(float(file_size) / 4096.0))
    correct_i = i == expected_i
    # hex digest of the whole file
    md5_chunk_file = hash_md5.hexdigest()
    return md5_chunk_file
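A sketch of how the comparison step might be wired up; the stored_hashes dict and the paths are illustrative, not from the answer:
# hypothetical usage: compare the fresh hash with the one recorded on the last crawl
stored_hashes = {}  # e.g. loaded from disk; maps CSS file path -> last known md5

def css_changed(css_path):
    new_hash = md5(css_path)
    old_hash = stored_hashes.get(css_path)
    stored_hashes[css_path] = new_hash
    return old_hash is not None and old_hash != new_hash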
I would suggest using GitHub in your workflow. That gives you a good view of what changed and a way to revert to older versions.
One possible solution:
Crawl the website for .css files and save their change dates and/or file sizes.
After each crawl, compare this information and, if changes are detected, use the Slack API to notify. I haven't worked with Slack yet, so for this part of the solution maybe someone else can give advice.
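For the Slack notification, an incoming webhook is usually enough; a minimal sketch, assuming you have created a webhook in Slack (the URL below is a placeholder):
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(message):
    # incoming webhooks accept a simple JSON payload with a "text" field
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message})
    resp.raise_for_status()

# notify_slack("style.css changed on example.com")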

Memory issues while parsing json file in ijson

This tutorial https://www.dataquest.io/blog/python-json-tutorial/ has a 600MB file that they work with, however when I run their code
import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)
I'm running into 10+ minutes of waiting for the file to be read into ijson, and I'm really confused about how this is supposed to be reasonable. Shouldn't there be parsing? Am I missing something?
The main problem is not that you are creating a list after parsing (that only collects the individual results into a single structure), but that you are using the default pure-Python backend provided by ijson.
There are other backends that are way faster. ijson's homepage explains how to import them. The yajl2_cffi backend is the fastest currently available, but I've created a new yajl2_c backend (there's a pull request pending acceptance) that performs even better.
On my laptop (Intel(R) Core(TM) i7-5600U), using the yajl2_cffi backend your code runs in ~1.5 minutes. Using the yajl2_c backend it runs in ~10.5 seconds (Python 3) and ~15 seconds (Python 2.7.12).
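For reference, switching backends is just a different import (this sketch assumes the yajl2 C library and the cffi bindings are installed):
# use the yajl2_cffi backend instead of the default pure-Python one
import ijson.backends.yajl2_cffi as ijson

filename = "md_traffic.json"
with open(filename, 'rb') as f:
    columns = list(ijson.items(f, 'meta.view.columns.item'))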
Edit: @lex-scarisbrick is of course also right that you can quickly break out of the loop if you are only interested in the column names.
This looks like a direct copy/paste of the tutorial found here:
https://www.dataquest.io/blog/python-json-tutorial/
The reason it's taking so long is the list() around the output of the ijson.items function. This effectively forces parsing of the entire file before returning any results. Taking advantage of the ijson.items being a generator, the first result can be returned almost immediately:
import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    for item in ijson.items(f, 'meta.view.columns.item'):
        print(item)
        break
EDIT: The very next step in the tutorial is print(columns[0]), which is why I included printing the first item in the answer. Also, it's not clear whether the question was for Python 2 or 3, so the answer uses syntax that works in both, albeit inelegantly.
I tried running your code and killed the program after 25 minutes. So yes, 10 minutes is reasonably fast.

Reading JSON from gigabytes of .txt files and add to the same list

I have 300 txt files (each between 80-100 MB) whose contents I need to put into a single list object and use all at the same time. I already created a working solution, but unfortunately it crashes with a MemoryError when I load more than 3 txt files. I'm not sure whether it matters, but I have a lot of RAM, so I could easily load 30 GB into memory if that would solve the problem.
Basically I would like to loop through the 300 txt files inside the same for loop. Is it possible to create a list object that holds 30 GB of content, or to achieve this some other way? I would really appreciate it if somebody could explain the ideal solution or share any useful tips.
Here is what I tried; it produces the MemoryError after loading 3 txt files.
import json

def addContentToList(filenm):
    with open(filenm, encoding="ISO-8859-1") as v:
        jsonContentTxt.extend(json.load(v))

def createFilenameList(name):
    for r in range(2, 300):
        file_str = "%s%s.txt" % (name, r,)
        filenames.append(file_str)

filename1 = 'log_1.txt'
filename2 = 'log_'
filenames = []
jsonContentTxt = []

with open(filename1, encoding="ISO-8859-1") as f:
    jsonContentTxt = json.load(f)

createFilenameList(filename2)

for x in filenames:
    addContentToList(x)

json_data = json.dumps(jsonContentTxt)
content_list = json.loads(json_data)
print(content_list)
Put down the chocolate-covered banana and step away from the European currency systems.
Text files are a really bad way to store data like this. You should use a database; I recommend PostgreSQL or SQLite.
Apart from that, your error is probably due to using a 32-bit version of Python (which caps your memory allocation at around 2 GB); use a 64-bit build instead. Even so, I think you'd be better off using a more appropriate tool for the job rather than allocating 30 GB of memory.
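A minimal sketch of the database route with SQLite, assuming each log_*.txt file holds a JSON array of records (which is what the json.load calls above suggest); the table and file names are illustrative:
import json
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (data TEXT)")

for r in range(1, 300):
    path = "log_%d.txt" % r
    # load one file at a time so only ~100 MB is in memory at once
    with open(path, encoding="ISO-8859-1") as f:
        records = json.load(f)
    conn.executemany(
        "INSERT INTO records (data) VALUES (?)",
        ((json.dumps(rec),) for rec in records),
    )
    conn.commit()

conn.close()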

Extract video from .swf using Python

I've written code that generates links to videos such as the one below.
Once obtained, I try to download it in this manner:
import urllib.request
import os
url = 'http://www.videodetective.net/flash/players/?customerid=300120&playerid=351&publishedid=319113&playlistid=0&videokbrate=750&sub=RTO&pversion=5.2%22%20width=%22670%22%20height=%22360%22'
response = urllib.request.urlopen(url).read()
outpath = os.path.join(os.getcwd(), 'video.mp4')
videofile = open(outpath , 'wb')
videofile.write(response)
videofile.close()
All I get is a 58kB file in that directory that can't be read. Could someone point me in the right direction?
With your code, you aren't downloading the encoded video file, but the Flash application (in CWS format) that is used to play the video. It is executed in the browser and dynamically loads and plays the video. You'd need to apply some reverse engineering to figure out the actual video source. The following is my attempt at it:
Decompressing the SWF file
First, save the 58K file you mentioned on your hard disk under the name test.swf (or similar).
You can then use the small Perl script cws2fws for that:
perl cws2fws test.swf
This results in a new file named test.fws.swf in the same directory.
Searching for the configuration URL in the FWS file
I used a simple
strings test.fws.swf | grep http
Which yields:
...
cookieOhttp://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=
...
Interesting. Let's try to put our known customerid, playerid and publishedid arguments to this URL:
http://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=300120&playerid=351&publishedid=319113
If we open that in a browser, we can see the player configuration XML, which in turn points us to
http://www.videodetective.net/flash/players/playlist.aspx?videokbrate=450&version=4.6&customerid=300120&fmt=3&publishedid=&sub=
Now if we open that, we can finally see the source URL:
http://cdn.videodetective.net/svideo/mp4/450/6993/293732.mp4?c=300120&r=450&s=293732&d=153&sub=&ref=&fmt=4&e=20111228220329&h=03e5d78201ff0d2f7df9a
Now we can download this h264 video file and we are finished.
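For that last step, a small streaming download in the same urllib style as the question; the URL is the one found above (its e and h parameters look time-limited, so treat it as illustrative):
import shutil
import urllib.request

video_url = 'http://cdn.videodetective.net/svideo/mp4/450/6993/293732.mp4?c=300120&r=450&s=293732&d=153&sub=&ref=&fmt=4&e=20111228220329&h=03e5d78201ff0d2f7df9a'

# stream the response to disk instead of holding it all in memory
with urllib.request.urlopen(video_url) as response, open('video.mp4', 'wb') as out_file:
    shutil.copyfileobj(response, out_file)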
Automating the whole process in a Python script
This is an entirely different task (left as an exercise to the reader).
