Backblaze video duration issue in the merged large video file - python

I have a video file of roughly 100 MB that I have split into 3 parts of 35 MB, 35 MB, and 30 MB.
Steps I have done:
I called start_large_file and got the fileId.
I successfully uploaded all the video parts using upload_part, providing the fileId, part_number, sha1, content length, and input_stream.
Finally, I called the finish_large_file API with the fileId and the sha1 array of all the parts. The API gave a successful response with action equal to upload.
Now, when I hit the merged file URL, the video duration is equal to that of part 1, but the size is equal to 100 MB.
So the issue is with the merged video's duration. The duration should be equal to that of all the parts combined.

Splitting a video file with FFmpeg will result in multiple shorter videos, each with its own header containing the length of the video, amongst other metadata. When the parts are recombined, the player will look at the header at the beginning of the file for the video length. It doesn't know that there is additional content after that first part.
If you're on Linux or a Mac, you can use the split command, like this:
split -b 35M my_video.mp4 video_parts_
This will result in three output files:
video_parts_aa - 35MB
video_parts_ab - 35MB
video_parts_ac - 30MB
These are the files you should upload (in order!). When they are recombined the result will be identical to the original file.
The easiest way to do this on Windows seems to be to obtain the split command via Cygwin.
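If you would rather not install Cygwin, the same byte-level split can be done in a few lines of Python. This is a minimal sketch (the file name and 35 MB chunk size are just the values from the question); the resulting parts must still be uploaded in order:
CHUNK_SIZE = 35 * 1024 * 1024  # 35 MB, matching the question

with open("my_video.mp4", "rb") as src:
    part_number = 1
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        # Write each raw byte range to its own part file: my_video.part1, my_video.part2, ...
        with open("my_video.part%d" % part_number, "wb") as dst:
            dst.write(chunk)
        part_number += 1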

Related

How to merge wma files to mp3 (with header editing)?

I have some .wma files which I am trying to merge into a single one.
I started with Python, reading the files in bytes and writing them into a new one, just as I tried the cmd command copy /b file1.wma + file2.wma + else.wma total.wma.
All attempts came up with the same result: my total file was as large in bytes as the real total of my segments, but when I try to open the file it plays only the first segment, both in length (time) and in content, meaning that I have a 15 MB, 10-second voice file :-)
I tried this with different .wma files, but each time the result is the first file in length and content, and the total of them in size.
My assumption is that somewhere in the .wma data frames (maybe in the file header) there is data about the length of the current file, so that after merging, when the player attempts to play the file, it reads that length and stops after that time, or something like that.
So I need to edit that data frame or header (if it even exists) in a way that matches my final output, or simply ignore it.
But I don't know whether that is right, or how I can do it.
.wma file sample: https://github.com/Fsunroo/PowerPointVoiceExtract (media1.wma and media2.wma for example)
Note: there is no such problem with web applications that join songs (maybe they do edit the header?).
Note 2: this is part of my code which extracts voice from a PowerPoint file.
I solved the problem by using moviepy.editor.
The corrected project is accessible at: https://github.com/Fsunroo/PowerPointVoiceExtract
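For reference, the moviepy route looks roughly like this. It is a sketch rather than the exact code from the repository; it assumes ffmpeg is available and uses the sample file names mentioned above. Because the clips are decoded and re-encoded, the header of the output describes the full combined duration:
from moviepy.editor import AudioFileClip, concatenate_audioclips

# Decode both segments, join them, and re-encode into a single file
clips = [AudioFileClip("media1.wma"), AudioFileClip("media2.wma")]
merged = concatenate_audioclips(clips)
merged.write_audiofile("total.mp3")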

How to Decompress a TAR file into TXT (read a CEL file) in either Python or R

I was wondering if anyone knows how to decompress TAR files in R and how to extract data from large numbers of GZ files. In addition, does anyone know how to read large amounts of data (in the hundreds of files) simultaneously while maintaining the integrity of the data files? At some point my computer can't handle the amount of data and begins to produce garbled output.
I am a novice programmer still learning. I was given an assignment to analyze and cross-reference data on similar genes found between different cell structures for a disease trait. I managed to access TXT dataset files and format them to be recognized by another program known as GSEA.
1.) I installed a piece of software known as "WinZip", which helped me decompress my TAR files into GZ files.
I stored these files in a newly created folder under "Downloads".
2.) I then tried to use R to access the files with this code:
>untar("file.tar", list=TRUE)
And it produced approximately 170 results (it converted the TAR into GZ files).
3.) When I tried to read one of the GZ files, it generated over a thousand lines of alphanumeric characters unintelligible to me.
>989 ™šBx
>990 33BŸ™šC:LÍC\005€
>991 LÍB¬
>992 B«™šBꙚB™™šB¯
>993 B¡
>994 BŸ
>995 C\003
>996 BŽ™šBð™šB¦
>997 B(
>998 LÍAòffBó
>999 LÍBñ™šBó
>1000 €
> [ reached 'max' / getOption("max.print") -- omitted 64340 rows ]
Warning messages:
>1: In read.table("GSM2458563_Control_1_0.CEL.gz") :
line 1 appears to contain embedded nulls
>2: In read.table("GSM2458563_Control_1_0.CEL.gz") :
line 2 appears to contain embedded nulls
>3: In read.table("GSM2458563_Control_1_0.CEL.gz") :
line 3 appears to contain embedded nulls
>4: In read.table("GSM2458563_Control_1_0.CEL.gz") :
line 4 appears to contain embedded nulls
>5: In read.table("GSM2458563_Control_1_0.CEL.gz") :
line 5 appears to contain embedded nulls
>6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
What I am trying to do is access all of these files at once without overloading the computer, while maintaining the integrity of the data. Then, I want to access the information properly so that it resembles some sort of data table (ideally, I was wondering whether a conversion from TAR to TXT would have been possible so that GSEA could read and identify the data).
Does anyone know of any programs compatible with Windows that could properly decompress and read such files, or any R commands that would help me generate or convert such data files?
Background Research
So I've been working on it for around an hour - here are the results.
The file that you are trying to open, GSM2458563_Control_1_0, is compressed inside a .gz file, which contains a .CEL file, therefore it's not readable as plain text.
Such files are published by the National Center for Biotechnology Information (NCBI).
I've seen Python 2 code to open them:
from Bio.Affy import CelFile
with open('GSM2458563_Control_1_0.CEL') as file:
    c = CelFile.read(file)
I've found documentation about Bio.Affy in version 1.74 of biopython.
Yet the current biopython README says:
"...Biopython 1.76 was our final release to support Python 2.7 and Python 3.5."
Nowadays Python 2 is deprecated, not to mention that the library mentioned above has evolved and changed tremendously.
Solution
So I found another way around it, using R.
My Specs:
Operating System: Windows, 64-bit
RStudio: Version 1.3.1073
R Version: R-4.0.2 for Windows
I've pre-installed the dependencies mentioned below.
Use the GEOquery::getGEO function to fetch the file from NCBI GEO.
# Prerequisites
# Download and install Rtools from http://cran.r-project.org/bin/windows/Rtools/
# Install BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GEOquery")
library(GEOquery)
# Download and open the data
gse <- getGEO("GSM2458563", GSEMatrix = TRUE)
show(gse)
# ****** Data Table ******
# ID_REF VALUE
# 1 7892501 1.267832
# 2 7892502 3.254963
# 3 7892503 1.640587
# 4 7892504 7.198422
# 5 7892505 2.226013
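If you would rather stay in Python, the third-party GEOparse package offers a similar workflow. This is a sketch of that alternative (it is not part of the original answer and assumes GEOparse and pandas are installed):
import GEOparse

# Download the GSM record from NCBI GEO; the expression values are exposed as a pandas DataFrame
gsm = GEOparse.get_GEO(geo="GSM2458563", destdir=".")
print(gsm.table.head())  # ID_REF / VALUE columns, comparable to the R output above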

My Python script sometimes produces 2-3 times too many jpgs (pdf2image), but not always

I am using pdf2image to convert pdfs to jpgs in about 1600 folders. I have looked around and adapted code from many SO answers, but this one section seems to be overproducing jpgs in certain folders (it is hard to tell which).
In one particular case, using an Adobe Acrobat tool on the same pdfs creates 447 jpgs (the correct amount), but my script makes 1059. I looked through the output and found that some pdf pages are saved as jpgs multiple times and inserted into the page sequences of other pdf files.
For example:
PDF A has 1 page and creates PDFA_page_1.jpg.
PDF B has 44 pages but creates PDFB_page_1.jpg through ...page_45.jpg, because PDF A shows up again as page_10.jpg. If this is confusing, let me know.
I have tried messing with the index portion of the loop (specifically: taking the +1 away, using pages instead of page, and placing the naming convention in a variable rather than directly in the .save and .move calls).
I also tried using the fmt='jpg' parameter in pdf2image, but I was unable to produce the correct naming scheme because I am unsure how to iterate the page numbers without the for page in pages loop.
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf") and pdf_file.startswith("602024"):
        # Convert function from pdf2image
        pages = convert_from_path(pdf_file, 72, output_folder=final_directory)
        print(pages)
        pdf_file = pdf_file[:-4]
        for page in pages:
            # Save with the designated naming scheme: <pdf file name> + page index
            jpg_name = "%s-page_%d.jpg" % (pdf_file, pages.index(page) + 1)
            page.save(jpg_name, "JPEG")
            # Move the jpg to the mini_jpg folder
            shutil.move(jpg_name, 'mini_jpg')
            #no_Converted += 1

# Delete ppm files
dir_name = final_directory
ppm_remove_list = os.listdir(dir_name)
for ppm_file in ppm_remove_list:
    if ppm_file.endswith(".ppm"):
        os.remove(os.path.join(dir_name, ppm_file))
There are no error messages, just 2 - 3 times as many jpgs as I expected in just SOME cases. Folders with many single-page pdfs do not experience this problem, nor do folders with a single multi-page pdf. Some folders with multiple multi-page pdfs also function correctly.
If you can create a reproducible example, feel free to open an issue on the official repository: https://github.com/Belval/pdf2image. I am not sure that I understand how that could happen.
Do provide PDF examples, otherwise I can't test.
As an aside, instead of pages.index, use for i, page in enumerate(pages) and the page number will be i + 1.
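The suggested change would look roughly like this, a sketch reusing the variables from the loop in the question:
for i, page in enumerate(pages):
    # i + 1 is the page number, so identical-looking pages no longer collide the way pages.index() can
    jpg_name = "%s-page_%d.jpg" % (pdf_file, i + 1)
    page.save(jpg_name, "JPEG")
    shutil.move(jpg_name, 'mini_jpg')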

How to only write 'x' amount of a Byte object (raw .ogg sound file) to a file in Python

I have a script to write musical notes to a directory based on the byte object returned from my HTTP request. The problem I am having is that the .ogg sound file is 5 seconds long, and ideally I would like to shorten this to 0.5 seconds. Is it possible to do so by simply dropping chunks of the byte object?
I know that via pysoundfile it is possible to use the frame count and sample rate to calculate the duration and therefore write 'x' frames. This only works when the rate is known, however, and the sample rate for these files is not known because the musical notes are extracted in raw form.
Some of the code I have written is below.
for notenumbers in range(48, 64+1):
    note = requests.get(url.format(instrument, notenumbers))
    notebinary = note.content
    time.sleep(3)
    with open("E:\\useraccount\\x\\x\\" + str(dirname) + "\\" + str(instrname) + "\\" + str(instrname) + "-" + str(notenumbers) + ".ogg", "wb") as o:
        print("Creating file named: " + str(instrname) + ":" + str(notenumbers) + ".ogg")
        o.write(notebinary)
Thank you if you are able to help with this!
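For what it's worth, the pysoundfile idea mentioned above could look roughly like this. It is only a sketch with a placeholder file name; it assumes libsndfile can decode the .ogg, and it reads the sample rate from the file itself rather than assuming a fixed rate:
import soundfile as sf

# Read the full note, then keep only the first half second of frames
data, samplerate = sf.read("instrument-48.ogg")
half_second = int(0.5 * samplerate)
sf.write("instrument-48-short.ogg", data[:half_second], samplerate)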

GitHub API response with fewer files

I am new to the GitHub API.
I am writing a Python program (using requests) that should list all the changed/added files of a pull request in a given repository.
Using the API, I am able to list all the pull requests and get their numbers. However, when I try to get information about the files, the response does not contain all the files in the pull request.
pf = session.get(f'https://api.github.com/repos/{r}/pulls/{pull_num}/files')
pj = pf.json()
pprint.pprint(pf.json())
for i in range(len(pj)):
    print(pj[i]['filename'])
(I know there might be a prettier way; Python is not really my cup of coffee yet. But when I compare pf.text with the output of this snippet, the result is identical.)
I know that there is a limit of 300 files, as mentioned in the documentation, but the problem occurs even if the total number is less than 300.
I created a test repo with a single pull request that adds files called file1, file2, ..., file222, and after I send the GET request, the response only contains the filenames of:
file1, file10, file100, file101, file102, file103, file104, file105, file106, file107, file108, file109, file11, file110, file111, file112, file113, file114, file115, file116, file117, file118, file119, file12, file120, file121, file122, file123, file124, file125
Is there another limit that I don't know about? Or why would the response contain only those filenames? How do I get all of them?
I found a solution a while after I posted the question. The API paginates the results: each response contains only one page of entries (30 by default) plus a Link header pointing to the next page. The files listed in the question are just the first page, which is why they are the first few in alphabetical order.
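A sketch of following that pagination with requests (it reuses the session, r, and pull_num from the question; per_page=100 is the largest page size GitHub allows):
url = f'https://api.github.com/repos/{r}/pulls/{pull_num}/files?per_page=100'
filenames = []
while url:
    resp = session.get(url)
    resp.raise_for_status()
    filenames.extend(item['filename'] for item in resp.json())
    # requests parses the Link header into resp.links; stop when there is no "next" page
    url = resp.links.get('next', {}).get('url')

for name in filenames:
    print(name)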
