I'm currently working on a little Python script to equalize MP3 files.
I've read some documentation about the MP3 file format (at https://en.wikipedia.org/wiki/ID3)
and I've noticed that the ID3v2 format has fields for equalization (EQUA, EQU2).
Using the Python library mutagen, I've tried to extract this information from the MP3, but the field isn't present.
What's the right way to equalize an MP3 file regardless of the ID3 version?
Thanks in advance. Creekorful
There are two high-level approaches you can take: modify the encoded audio stream, or put metadata on it describing the desired change. Modifying the audio stream is the most compatible but generally less desirable. Note that ID3v1 has no place for this metadata; only ID3v2.2 and up do.
Depending on what you mean by equalize, you might want equalization information stored in the EQA/EQUA/EQU2 frames, or a replay gain volume adjustment stored in the RVA/RVAD/RVA2 frames. Mutagen supports all of these except EQA/EQUA. If you need those, it should be straightforward to add them based on the actual specification (see section 4.12 of http://id3.org/id3v2.4.0-frames). With tests, they could likely be contributed back to the project.
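For the replay gain route, a minimal sketch using Mutagen's RVA2 support might look like this (the file name and the gain/peak values are placeholders, and the file is assumed to already carry an ID3v2 tag):

```python
# Minimal sketch: store and read back a replay-gain style volume
# adjustment in an ID3v2.4 RVA2 frame with Mutagen. "song.mp3" and the
# gain/peak values are placeholders; ID3() raises ID3NoHeaderError if
# the file has no ID3 tag yet.
from mutagen.id3 import ID3, RVA2

tags = ID3("song.mp3")
# channel=1 is the master volume channel; gain is in dB, peak in [0, 1].
tags.add(RVA2(desc="track", channel=1, gain=-3.5, peak=0.9))
tags.save()

# Read the adjustment back.
for frame in tags.getall("RVA2"):
    print(frame.desc, frame.channel, frame.gain, frame.peak)
```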
Note that Quod Libet, the player paired with Mutagen, prefers to read and store replay gain information in a TXXX frame.
I have a lot of PDF, DOC[X], TIFF, and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts), and the document kind isn't limited to contracts.
While processing a pack, I don't know what kinds of documents it contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for some possible approaches to solve this programmatically.
I've tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are scans, this sounds at its core like something that could be approached with computer vision, though that is currently well above my level of programming.
Projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or you could possibly get away with OCR-reading the scans and working based on the contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
However, that still leaves the question of how you would tell your program that a given part of the image means there are three separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main factor in how complex a problem you are trying to solve.
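If you go the OCR route, a minimal sketch with pytesseract could look like this (it assumes the Tesseract binary is installed, and "page_001.tiff" is a placeholder for one scanned page):

```python
# Minimal sketch: OCR one scanned page with pytesseract so its text can
# be searched or classified. Requires the Tesseract binary to be
# installed; "page_001.tiff" is a placeholder file name.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("page_001.tiff"))
print(text)
```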
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
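For illustration, a rough sketch of that setup (the toy pages, labels, and TF-IDF features here are my assumptions, not details from the original):

```python
# Rough sketch: a first-page/not-first-page classifier with
# scikit-learn. The toy data and TF-IDF features are illustrative;
# labels are 1 for "first page of a document", 0 otherwise.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def first_100_tokens(text):
    # Trim each page to its first 100 tokens, as described above.
    return " ".join(text.split()[:100])

pages = [
    "CONTRACT No. 42 between Acme Corp and ...",   # first page
    "continued: terms and conditions, section 2",  # continuation page
]
labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(preprocessor=first_100_tokens),
    SGDClassifier(random_state=0),
)
model.fit(pages, labels)

# A pack of pages can then be split into documents wherever the model
# predicts a first page.
print(model.predict(pages))
```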
I am trying to find datasets like the LJ Speech Dataset made by Keith Ito. I need to use these datasets with Tacotron 2 (Link), so I think they need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it's made to work with it, and that other datasets should have the same structure as LJ. I downloaded the dataset and found that it's structured like this:
main folder:
- wavs
  - 001.wav
  - 002.wav
  - etc.
- metadata.csv: a CSV file containing the text spoken in each .wav, in a form like this: **001.wav | hello etc.**
So, my question is: are there other datasets like this one for further training?
But I think there might be problems; for example, the voice in one dataset would be different from the voice in another. Would this cause too many problems?
And could different slang or things like that also cause problems?
There are a few resources:
The main ones I would look at are Festvox (aka CMU ARCTIC), http://www.festvox.org/dbs/index.html, and LibriVox, https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self-plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla provides a collection of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns: the first is the path/name of the WAV file (without the .wav extension), and the second is the text that was spoken.
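For example, a minimal metadata.csv might look like this (pipe-separated, one clip per line; the file names and sentences are placeholders):

```
001|Hello, this is the first clip.
002|And this is the second one.
```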
Unless you are training Tacotron with speaker embeddings (a multi-speaker model), you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimal amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recordings need to be short 22050 Hz, 16-bit audio clips.
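To get recordings into that format, a minimal sketch using librosa and soundfile (my choice of tools here, not something the original answer specifies) could be:

```python
# Minimal sketch: convert a recording to a 22050 Hz, 16-bit mono WAV
# clip. librosa and soundfile are one choice of tools; "raw.wav" and
# "clip.wav" are placeholder file names.
import librosa
import soundfile as sf

audio, sr = librosa.load("raw.wav", sr=22050, mono=True)
sf.write("clip.wav", audio, sr, subtype="PCM_16")
```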
As for slang or local colloquialisms: not sure, but as long as the word sounds match what's written (i.e. the phonemes match up), I would expect the system to be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
1. Download the audio from the audiobook.
2. Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
3. Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence (see the sketch after this list).
4. Create the metadata.csv file mapping audio clips to text segments. (The format the post describes includes extra columns that aren't really needed for training and are mainly for use by Mozilla's online database.)
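For step 3, the forced alignment with Aeneas could look roughly like this (all paths and the language code are placeholders; this follows the library's documented task API):

```python
# Rough sketch of step 3: force-align an audiobook with its text using
# Aeneas, producing a JSON sync map that can later be used to cut the
# audio sentence by sentence. All paths are placeholders.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config = "task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/path/to/audiobook.mp3"
task.text_file_path_absolute = "/path/to/book.txt"
task.sync_map_file_path_absolute = "/path/to/alignment.json"

# Run the alignment and write the sync map to disk.
ExecuteTask(task).execute()
task.output_sync_map_file()
```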
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.
I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of (I assume) formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills and hence are all presented in the exact same way, with defined fields and images. So what I am looking to do is to extract these fields.
What would be the best way to achieve this?
> I've got a PDF file that I'm trying to obtain specific data from.
In general, this is probably impossible (or extremely difficult), and details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing-complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a human reader sees the digits in a document the same way whether they are stored as text or as a JPEG image, and in practice many PDF documents contain that kind of data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data.
You need to understand exactly how that PDF file was generated (with what exact software, from what actual data). That could take a lot of time (maybe several years of full-time reverse-engineering work) without help.
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you'd be better off accessing that database directly.
Perhaps using metadata or comments in your PDF file might help in guessing how it was generated.
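For example, a minimal sketch with PyPDF2 to inspect those metadata fields ("bill.pdf" is a placeholder; this uses the 2.x API, while older versions use PdfFileReader/getDocumentInfo()):

```python
# Minimal sketch: inspect the document-information dictionary with
# PyPDF2 (2.x API) to guess what software produced the PDF.
# "bill.pdf" is a placeholder file name.
from PyPDF2 import PdfReader

reader = PdfReader("bill.pdf")
info = reader.metadata
print("Producer:", info.producer)
print("Creator:", info.creator)
```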
The source of the data might produce various kinds of PDF files. For example, my cheap scanner is able to produce PDF, but your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need image-recognition techniques (i.e. OCR) to do so.
I used the Python wave module to read the first frame from a .wav file, and it returned this:
b'\x00\x00\x00\x00\x00\x00'
What does each byte mean, and will the layout be the same for every frame or only some?
I've done some research and found that there are bytes in front of the sound data that give information about the .wav file, so does Python skip this information and go straight to the sound data, or do I have to separate it out manually?
According to Python, there are 2 channels and a sample width of 3 bytes.
UPDATE
I have successfully created the waveform for the .wav file; it wasn't as difficult as I first thought. Now to show it whilst the song is playing...
The wave module reads the header for you, which is why it can tell you how many channels there are, and what the sample width is.
Reading frames gives you direct access to the raw sample data, but because the WAV format is a bit of a mixed, confused beast, how you need to interpret each frame depends on the sample width and channel count. See this article for a good in-depth discussion of that.
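For the file described in the question (2 channels, sample width 3), each frame is 2 * 3 = 6 bytes, and a rough sketch of decoding one frame could look like this ("song.wav" is a placeholder):

```python
# Rough sketch: decode one frame of a 2-channel, 24-bit (sample width 3)
# WAV file. PCM samples are little-endian signed integers, so the
# b'\x00\x00\x00\x00\x00\x00' frame from the question is silence on
# both channels. "song.wav" is a placeholder file name.
import wave

with wave.open("song.wav", "rb") as w:
    assert w.getnchannels() == 2 and w.getsampwidth() == 3
    frame = w.readframes(1)  # 6 bytes: left sample, then right sample

left = int.from_bytes(frame[0:3], "little", signed=True)
right = int.from_bytes(frame[3:6], "little", signed=True)
print(left, right)  # prints "0 0" for the frame in the question
```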
Is there a good way to identify (or at least approximate) the graphics program used to obtain a particular image? For instance, I want to know if there is a certain signature that these programs embed into an image. Any suggestions?
If not, is there a reference listing all the meta-information that can be extracted from an image?
Certain image file formats do have metadata, but it is format-dependent. Digital cameras usually write some of their information into the metadata; EXIF is what comes to mind. Images not acquired through a digital camera may or may not have relevant metadata, so you can't consider metadata of any sort to be a guaranteed reliable identifier. That's about as much as I can give as an answer, alas; I'm sure someone else has more details.
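As an illustration, a minimal sketch that dumps whatever EXIF tags an image carries, using Pillow ("photo.jpg" is a placeholder; many images will simply have no EXIF data):

```python
# Minimal sketch: dump an image's EXIF tags with Pillow. "photo.jpg" is
# a placeholder; images without EXIF data produce no output.
from PIL import Image
from PIL.ExifTags import TAGS

exif = Image.open("photo.jpg").getexif()
for tag_id, value in exif.items():
    print(TAGS.get(tag_id, tag_id), value)
```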