Why does subprocess ffmpeg corrupt the file? - python

I have the following code, which reads a video and saves it to another path. The problem is that the saved file is not playable:
import subprocess
import shlex
from io import BytesIO

file = open("a.mkv", "rb")
with open('a.mkv', 'rb') as fh:
    buf = BytesIO(fh.read())

args = shlex.split('ffmpeg -i pipe: -codec copy -f rawvideo pipe:')
proc = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate(input=buf.getbuffer())
proc.wait()

f = open("a.mp4", "wb")
f.write(out)
f.close()
I need to keep working with in-memory buffers so the video comes out with the correct size. How can I solve this?

You could use ffmpeg-python
pip install ffmpeg-python
And then:
import ffmpeg
from io import BytesIO

with open('a.mkv', 'rb') as fh:
    buf = BytesIO(fh.read())

process = (
    ffmpeg
    .input('pipe:')
    .output('a.mp4')
    .overwrite_output()
    .run_async(pipe_stdin=True)
)
process.communicate(input=buf.getbuffer())
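As a side note on the original snippet: -f rawvideo throws away the container and writes bare packet data, so the bytes saved as a.mp4 are not a valid MP4, which is why the file does not play. If the converted data really has to come back through stdout instead of going straight to a file, a minimal sketch of one workaround (assuming the streams are MP4-compatible) is to ask ffmpeg for a fragmented MP4, which, unlike the default MP4 muxer, does not need a seekable output:

import shlex
import subprocess

with open("a.mkv", "rb") as fh:
    data = fh.read()

# Fragmented MP4 can be written to a pipe; plain "-f mp4" cannot, because the
# muxer needs to seek back to write the moov atom.
args = shlex.split(
    'ffmpeg -i pipe: -codec copy -f mp4 -movflags frag_keyframe+empty_moov pipe:'
)
proc = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
out, _ = proc.communicate(input=data)

with open("a.mp4", "wb") as f:
    f.write(out)

Using -f matroska and saving the result as .mkv would work the same way, without needing the -movflags option.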

And isn't there any way to pipe stream data like in Node.js? For example, start a pipeline that downloads bytes from S3, processes them with ffmpeg, and uploads the result back to S3. In Node.js I guess this can be done chunk by chunk instead of filling the RAM, whereas in Python the usual idea is to create a temp file on the backend server and write the files there.
Yes, there is, but there is no prescribed mechanism in Python like there is in Node.js. You need to run your own threads (or asyncio coroutines), one to send data to the FFmpeg process and another to receive data from it. Here is a sketch of what I would do:
from threading import Thread
from queue import Queue
import subprocess as sp

# let's say the input is mp4 and the output mkv, copying all the streams
# NOTE: you cannot pipe out mp4
args = ['ffmpeg', '-f', 'mp4', '-i', '-', '-c', 'copy', '-f', 'matroska', '-']
proc = sp.Popen(args, stdin=sp.PIPE, stdout=sp.PIPE)

# queue fed by whatever downloads the data; None is the end-of-data sentinel
data_queue = Queue()

def writer():
    while True:
        # get the next downloaded data block
        data = data_queue.get()
        data_queue.task_done()
        if data is None:
            break
        try:
            nbytes = proc.stdin.write(data)
        except (OSError, ValueError):
            # stdin stream closed / FFmpeg terminated, end the thread as well
            break
        if not nbytes and proc.stdin.closed:  # just in case
            break

def reader():
    # output block size: I use the frame byte size for raw data,
    # but it would be different for receiving encoded data
    blocksize = ...  # set to something reasonable
    while True:
        try:
            data = proc.stdout.read(blocksize)
        except (OSError, ValueError):
            # stdout stream closed / FFmpeg terminated, end the thread as well
            break
        if not data:  # done, no more data
            break
        # upload the data
        ...

writer_thread = Thread(target=writer)
reader_thread = Thread(target=reader)
writer_thread.start()
reader_thread.start()
writer_thread.join()  # first wait until all the data are written
proc.stdin.close()    # triggers ffmpeg to stop waiting for input and wrap up its encoding
proc.wait()           # waits for ffmpeg
reader_thread.join()  # wait till all the ffmpeg outputs are processed
I tried several different approaches for my ffmpegio.streams.SimpleFilterBase class and settled on this one.
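To tie this back to the S3 part of the question, here is a hedged sketch (the bucket and key names are hypothetical placeholders; it assumes boto3 and an input format that can be demuxed from a pipe) that streams the source object into ffmpeg and uploads ffmpeg's output without ever buffering the whole file in RAM:

import shutil
import subprocess as sp
from threading import Thread

import boto3

s3 = boto3.client('s3')
proc = sp.Popen(['ffmpeg', '-i', '-', '-c', 'copy', '-f', 'matroska', '-'],
                stdin=sp.PIPE, stdout=sp.PIPE)

def feed():
    # stream the S3 object's bytes into ffmpeg's stdin in chunks
    body = s3.get_object(Bucket='my-bucket', Key='input.mkv')['Body']
    shutil.copyfileobj(body, proc.stdin)
    proc.stdin.close()

t = Thread(target=feed, daemon=True)
t.start()
# upload_fileobj reads ffmpeg's stdout part by part, so the result is never held whole in memory
s3.upload_fileobj(proc.stdout, 'my-bucket', 'output.mkv')
t.join()
proc.wait()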

Related

Read and write from pigz and subprocess with python3

I am trying to use the pigz utility from Linux to speed up file decompression and compression. I managed to open a file with pigz using the subprocess.Popen() function, but after several tries I cannot manage to read the stream from Popen(), modify some lines, and write the result directly to a new compressed file using pigz and subprocess as well. In the end I use gzip.open() from the gzip library to write the new file, and the whole process is as slow as reading and writing directly with gzip.open().
Question:
In the following code, is there a way to modify the data coming from the output of subprocess and write it directly to a compressed file using subprocess and pigz, in order to speed up the whole operation?
import gzip
import shlex
import sys
from subprocess import Popen, PIPE

inputFile = "file1.txt.gz"
outputFile = "file2.txt.gz"

def pigzStream2(inputFile, outputFile):
    cmd = f'pigz -dkc {inputFile}'  # -d decompress, -k keep (do not delete) the original file, -c write all output to stdout
    if not sys.platform.startswith("win"):
        cmd = shlex.split(cmd)
    res = Popen(cmd, stdout=PIPE, stdin=PIPE, bufsize=1, text=True)
    with res.stdout as f_in:
        with gzip.open(outputFile, 'ab') as f_out:
            count = 0
            while True:
                count += 1
                line = f_in.readline()
                if line.startswith('#'):
                    line = f"line {count} changed"
                if not line:
                    print(count)
                    break
                f_out.write(line.encode())
    return 0
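One approach that should work, sketched under the assumption that pigz is installed and with the file names above treated as placeholders: decompress with one pigz process, edit the lines in Python, and feed them to a second pigz process that compresses straight into the output file, so gzip.open() never enters the picture:

import shlex
from subprocess import Popen, PIPE

def pigz_filter(input_file, output_file):
    # first pigz process decompresses to stdout
    reader = Popen(shlex.split(f'pigz -dkc {input_file}'), stdout=PIPE, text=True)
    with open(output_file, 'wb') as out_fh:
        # second pigz process compresses its stdin straight into the output file
        writer = Popen(['pigz', '-c'], stdin=PIPE, stdout=out_fh, text=True)
        count = 0
        for line in reader.stdout:
            count += 1
            if line.startswith('#'):
                line = f"line {count} changed\n"
            writer.stdin.write(line)
        writer.stdin.close()
        writer.wait()
    reader.wait()
    return 0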

scipy.io.wavfile.read() the stdout from FFmpeg

After searching for a long time, I still cannot find a way to use scipy.io.wavfile.read() to read the bytes from the stdout of FFmpeg 3.3.6.
Here is example code that works perfectly; however, it needs to save a converted file to disk.
import subprocess
import scipy.io.wavfile as wavfile

command = 'ffmpeg -i in.mp3 out.wav'
subprocess.run(command)

with open('out.wav', 'rb') as wf:
    rate, signal = wavfile.read(wf)
print(rate, signal)
And here is the code where I try to get the FFmpeg output from stdout and load it into scipy wavfile.
import io
import subprocess
import scipy.io.wavfile as wavfile
command = 'ffmpeg -i in.mp3 -f wav -'
proc = subprocess.run(command, stdout=subprocess.PIPE)
rate, signal = wavfile.read(io.BytesIO(proc.stdout))
print(rate, signal)
Sadly, it raises a ValueError.
Traceback (most recent call last):
File ".\err.py", line 8, in <module>
rate, signal = wavfile.read(io.BytesIO(proc.stdout))
File "C:\Users\Sean Wu\AppData\Local\Programs\Python\Python36\lib\site-
packages\scipy\io\wavfile.py", line 246, in read
raise ValueError("Unexpected end of file.")
ValueError: Unexpected end of file.
Are there any methods to solve this problem?
Apparently when the output of ffmpeg is sent to stdout, the program does not fill in the RIFF chunk size of the file header. Instead, the four bytes where the chunk size should be are all 0xFF. scipy.io.wavfile.read() expects that value to be correct, so it thinks the length of the chunk is 0xFFFFFFFF bytes.
When you give ffmpeg an output file to write, it correctly fills in the RIFF chunk size, so wavfile.read() is able to read the file in that case.
A work-around for your code is to patch the RIFF chunk size manually before the data is passed to wavfile.read() via an io.BytesIO() object. Here's a modification of your script that does that. Note: I had to use command.split() for the first argument of subprocess.run(). I'm using Python 3.5.2 on Mac OS X. Also, my test file name is "mpthreetest.mp3".
import io
import subprocess
import scipy.io.wavfile as wavfile

command = 'ffmpeg -i mpthreetest.mp3 -f wav -'
proc = subprocess.run(command.split(), stdout=subprocess.PIPE)

riff_chunk_size = len(proc.stdout) - 8
# Break up the chunk size into four bytes, held in b.
q = riff_chunk_size
b = []
for i in range(4):
    q, r = divmod(q, 256)
    b.append(r)

# Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
riff = proc.stdout[:4] + bytes(b) + proc.stdout[8:]

rate, signal = wavfile.read(io.BytesIO(riff))
print("rate:", rate)
print("len(signal):", len(signal))
print("signal min and max:", signal.min(), signal.max())

Use StringIO as stdin with Popen

I have the following shell script that I would like to write in Python (of course grep . is actually a much more complex command):
#!/bin/bash
(cat somefile 2>/dev/null || (echo 'somefile not found'; cat logfile)) \
| grep .
I tried this (which lacks an equivalent to cat logfile anyway):
#!/usr/bin/env python
import StringIO
import subprocess

try:
    myfile = open('somefile')
except:
    myfile = StringIO.StringIO('somefile not found')
subprocess.call(['grep', '.'], stdin=myfile)
But I get the error AttributeError: StringIO instance has no attribute 'fileno'.
I know I should use subprocess.communicate() instead of StringIO to send strings to the grep process, but I don't know how to mix both strings and files.
p = subprocess.Popen(['grep', '...'], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)
output, output_err = p.communicate(myfile.read())
Don't use bare except, it may catch too much. In Python 3:
#!/usr/bin/env python3
from subprocess import check_output

cmd = ['grep', '.']  # the command from the question
try:
    file = open('somefile', 'rb', 0)
except FileNotFoundError:
    output = check_output(cmd, input=b'somefile not found')
else:
    with file:
        output = check_output(cmd, stdin=file)
It works for large files (the file is redirected at the file descriptor level; there is no need to load it into memory).
If you have a file-like object (without a real .fileno()), you could write to the pipe directly using the .write() method:
#!/usr/bin/env python3
import io
from shutil import copyfileobj
from subprocess import Popen, PIPE
from threading import Thread

try:
    file = open('somefile', 'rb', 0)
except FileNotFoundError:
    file = io.BytesIO(b'somefile not found')

def write_input(source, sink):
    with source, sink:
        copyfileobj(source, sink)

cmd = ['grep', 'o']
with Popen(cmd, stdin=PIPE, stdout=PIPE) as process:
    Thread(target=write_input, args=(file, process.stdin), daemon=True).start()
    output = process.stdout.read()
The following answer uses shutil as well, which is quite efficient, but avoids running a separate thread, which would otherwise never end and go zombie when stdin ends (as in the answer from @jfs):
import io
import os
import subprocess
from shutil import copyfileobj

file = 'somefile'
file_exists = os.path.isfile(file)

with open(file) if file_exists else io.StringIO("Some text here ...\n") as string_io:
    with subprocess.Popen("cat", stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True) as process:
        copyfileobj(string_io, process.stdin)
        # the subsequent code is not executed until copyfileobj ends,
        # ... but the subprocess is already consuming the input
        process.stdin.close()  # close it, or the subprocess won't end
        # Do some online processing of process.stdout, for example:
        for line in process.stdout:
            print(line)  # do something
As an alternative to closing stdin and parsing the output line by line, if the output is known to fit in memory:
...
stdout_text, stderr_text = process.communicate()

fifo - reading in a loop

I want to use os.mkfifo for simple communication between programs. I have a problem with reading from the fifo in a loop.
Consider this toy example, where I have a reader and a writer working with the fifo. I want to be able to run the reader in a loop to read everything that enters the fifo.
# reader.py
import os
import atexit

FIFO = 'json.fifo'

@atexit.register
def cleanup():
    try:
        os.unlink(FIFO)
    except:
        pass

def main():
    os.mkfifo(FIFO)
    with open(FIFO) as fifo:
        # for line in fifo:              # closes after a single read
        # for line in fifo.readlines():  # closes after a single read
        while True:
            line = fifo.read()  # will return empty lines (non-blocking)
            print repr(line)

main()
And the writer:
# writer.py
import sys

FIFO = 'json.fifo'

def main():
    with open(FIFO, 'a') as fifo:
        fifo.write(sys.argv[1])

main()
If I run python reader.py and later python writer.py foo, "foo" will be printed, but the fifo will be closed and the reader will exit (or spin inside the while loop). I want the reader to stay in the loop, so I can execute the writer many times.
Edit
I use this snippet to handle the issue:
def read_fifo(filename):
    while True:
        with open(filename) as fifo:
            yield fifo.read()
but maybe there is some neater way to handle it, instead of repeatedly opening the file...
Related
Getting readline to block on a FIFO
You do not need to reopen the file repeatedly. You can use select to block until data is available.
import select

with open(FIFO_PATH) as fifo:
    while True:
        select.select([fifo], [], [fifo])
        data = fifo.read()
        do_work(data)
In this example you won't read EOF.
A FIFO works (on the reader side) exactly this way: it can be read from until all writers are gone; then it signals EOF to the reader.
If you want the reader to continue reading, you'll have to open the FIFO again and read from there. So your snippet is exactly the way to go.
If you have multiple writers, you'll have to ensure that each data portion they write is smaller than PIPE_BUF in order not to mix up the messages.
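A variant of the same idea, sketched here as an alternative rather than a drop-in for the code above (Linux-specific assumption): keep a dummy write descriptor open in the reader process itself, so the read end always has at least one writer, never sees EOF, and simply blocks until the next writer sends data:

# reader.py
import os

FIFO = 'json.fifo'
os.mkfifo(FIFO)

# open the read end without blocking, then open a dummy write end so the FIFO
# never runs out of writers and therefore never signals EOF
read_fd = os.open(FIFO, os.O_RDONLY | os.O_NONBLOCK)
keep_open = os.open(FIFO, os.O_WRONLY)
os.set_blocking(read_fd, True)  # switch back to blocking reads

while True:
    data = os.read(read_fd, 65536)  # blocks until some writer sends data
    print(repr(data))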
The following methods on the standard library's pathlib.Path class are helpful here:
Path.is_fifo()
Path.read_text/Path.read_bytes
Path.write_text/Path.write_bytes
Here is a demo:
# reader.py
import os
from pathlib import Path

fifo_path = Path("fifo")
os.mkfifo(fifo_path)

while True:
    print(fifo_path.read_text())  # blocks until data becomes available
# writer.py
import sys
from pathlib import Path
fifo_path = Path("fifo")
assert fifo_path.is_fifo()
fifo_path.write_text(sys.argv[1])

How to test a directory of files for gzip and uncompress gzipped files in Python using zcat?

I'm in my 2nd week of Python and I'm stuck on a directory of zipped/unzipped logfiles, which I need to parse and process.
Currently I'm doing this:
import glob
import os
import sys
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f, 'rb')
    new_file_name = f + "_unzipped"
    last_pos = file.tell()
    # test for gzip
    if file.read(2) == b'\x1f\x8b':
        file.seek(last_pos)
        # unzip to new file
        out = open(new_file_name, "wb")
        process = subprocess.Popen(["zcat", f], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            if process.poll() != None:
                break
        output = io_method(process.communicate()[0])
        exitCode = process.returncode
        if exitCode == 0:
            print "done"
            out.write(output)
            out.close()
        else:
            raise ProcessException(command, exitCode, output)
which I've "stitched" together using these SO answers (here) and blogposts (here)
However, it does not seem to work, because my test file is 2.5GB and the script has been chewing on it for 10+mins plus I'm not really sure if what I'm doing is correct anyway.
Question:
If I don't want to use the gzip module and need to decompress chunk by chunk (the actual files are >10 GB), how do I uncompress the data and save it to a file using zcat and subprocess in Python?
Thanks!
This should read the first line of every file in the logs subdirectory, unzipping as required:
#!/usr/bin/env python
import glob
import gzip
import subprocess

for f in glob.glob('logs/*'):
    if f.endswith('.gz'):
        # Open a compressed file. Here is the easy way:
        # file = gzip.open(f, 'rb')
        # Or, here is the hard way:
        proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
        file = proc.stdout
    else:
        # Otherwise, it must be a regular file
        file = open(f, 'rb')
    # Process file, for example:
    print f, file.readline()
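To address the save-to-file part directly, here is a hedged sketch (it assumes zcat is on the PATH and reuses the question's _unzipped naming): stream zcat's stdout into the output file in fixed-size chunks, so nothing close to 10 GB is ever held in memory.

import glob
import shutil
import subprocess

for f in glob.glob('logs/*'):
    with open(f, 'rb') as src:
        if src.read(2) != b'\x1f\x8b':  # not gzip-compressed, skip it
            continue
    proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
    with open(f + '_unzipped', 'wb') as out:
        shutil.copyfileobj(proc.stdout, out, length=1024 * 1024)  # 1 MiB chunks
    proc.wait()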
