I have created a Morse code generator that converts English sentences into Morse code. It also converts this text-based Morse code into an audio file: for each dot I append a dot.wav file to the output WAV file, and for each dash a dash.wav file.
I now want to open this wave file and read its content to figure out the order in which these dashes and dots are placed.
I have tried the following code:
import wave
import struct

waveFile = wave.open(r"C:\Users\Gaurav Keswani\Documents\Eclipse\Morse Code Converter\src\resources\sound\morse.wav", 'r')
x = waveFile.readframes(20)
print(struct.unpack("<40H", x))
This gives me the following output:
(65089, 65089, 3093, 3093, 11895, 11895, 18629, 18629, 25196, 25196,
29325, 29325, 31986, 31986, 32767, 32767, 31265, 31265, 27532, 27532,
22485, 22485, 15762, 15762, 7895, 7895, 103, 103, 57228, 57228, 49571,
49571, 42790, 42790, 37667, 37667, 34362, 34362, 32776, 32776)
I don't know what to make of this output. Can anyone help?
If you want a general solution for detecting Morse code, you are going to have to take a look at what it looks like as a waveform (tom10's link to this question should help here, if you can install numpy and matplotlib; if not, you can use the stdlib's csv module to export a file that you can open in your favorite spreadsheet program). Work out how you, as a human, can distinguish dots, dashes, and spaces; turn that into an algorithm (a series of steps that even a literal-minded moron can follow); then turn that algorithm into code. Or you may be able to find a library that has already done this for you.
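For example, here is a minimal sketch of the csv-export idea (it assumes 16-bit signed samples; check getnchannels() and getsampwidth() on your file and adjust the format string accordingly):

import struct
import wave

# Dump every sample as an index,amplitude pair so the waveform
# can be charted in a spreadsheet (assumes 16-bit signed samples).
w = wave.open('morse.wav', 'rb')
count = w.getnframes() * w.getnchannels()
samples = struct.unpack('<%dh' % count, w.readframes(w.getnframes()))
w.close()

with open('morse.csv', 'w') as f:
    for i, s in enumerate(samples):
        f.write('%d,%d\n' % (i, s))

Chart the second column and the dots, dashes, and silences should be plainly visible.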
But for your specific case, you only need to detect exact copies of the contents of dot.wav and dash.wav within your larger file. (At least assuming you're not using any lossy compression, which usually you aren't in .wav files.) So, this is really just a substring search.
Think about how you'd detect the strings 'dot' and 'dash' within a string like 'dash dash dash dash dash dot dash dot dot dot dot dot '. For such a simple problem, you could use a stupid brute-force algorithm, and it would be fine:
def find(haystack, needle, start):
    for i in range(start, len(haystack)):
        if haystack[i:i+len(needle)] == needle:
            return i
    return len(haystack)
def decode_morse(morse):
    i = 0
    while i < len(morse):
        next_dot = find(morse, 'dot', i)
        next_dash = find(morse, 'dash', i)
        if next_dot < next_dash:
            if next_dot < len(morse):
                yield '.'
            i = next_dot + len('dot')    # skip past the match so we don't find it again
        else:
            if next_dash < len(morse):
                yield '-'
            i = next_dash + len('dash')
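Used on a string like the one above, it produces what you'd expect:

>>> ''.join(decode_morse('dash dash dot dot dot '))
'--...'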
Now, if you're searching a list of numbers instead of a string, how does this have to change? Barely at all; you can slice a list, compare two lists, etc. just like you can with strings.
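For instance, with some made-up sample numbers:

>>> haystack = [0, 5, 9, 5, 0, 5, 9, 5, 0]
>>> needle = [5, 9, 5]
>>> find(haystack, needle, 0)
1
>>> find(haystack, needle, 2)
5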
The only real problem you'll run into is that you don't have the whole list in memory at once, just 20 frames at a time. What happens if a dot starts in frame 19 and ends in frame 20? If your files aren't too big, this is easy to solve: just read all the frames into memory in one giant list, then search the whole thing. But otherwise, you have to do some buffering.
For example (ignoring error handling and proper end-of-file treatment, and dealing only with dashes for simplicity -- of course you have to do both of those properly in your real code):
buf = []
while True:
    while len(buf) < 2*len(dash):
        buf.extend(waveFile.readframes(20))
    next_dash = find(buf, dash, 0)
    if next_dash < len(buf):
        yield '-'
        buf = buf[next_dash + len(dash):]   # keep only the leftover after the dash
    else:
        buf = buf[-len(dash):]              # keep a full dash length, in case one straddles the boundary
We're making sure we always have at least two dash lengths in our buffer. And we always keep the leftover after the first dash (if one was found), or a full dash length (if not), in the buffer, and add the next read onto that. That's actually overkill; think it through and work out exactly what you need to make sure you never miss a dash that falls between two buffers. But the point is, as long as you get that right, you can't miss any dots or dashes.
Related
I'm looking for a way to optimize an algorithm that I have already developed. As the title of my question says, I am dealing with comma-delimited strings that sometimes contain any number of embedded commas. This is all being done in the context of big data, so speed is important. What I have here does everything I need it to; however, I have to believe there is a faster way of doing it. If you have any suggestions I would love to hear them. Thank you in advance.
code:
import os,re
commaProblemA=re.compile('^"[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]+$')
commaProblemB=re.compile('^[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]*"$')
#example string
#these are read from a file in practice
z=',,"N/A","DWIGHT\'s BEET FARM,INC.","CAMUS,ALBERT",35.00,0.00,"NIETZSCHE,FRIEDRICH","God, I hope this works, fast.",,,35.00,,,"",,,,,,,,,,,"20,4,2,3,2,33","223,2,3,,34 00:00:00:000000",,,,,,,,,,,,0,,,,,,"ERW-400",,,,,,,,,,,,,,,1,,,,,,,"BLA",,"IGE6560",,,,'
testList = z.split(',')
for i in testList:
    if re.match(commaProblemA, i):
        startingIndex = testList.index(i)
        endingIndex = testList.index(i)
        count = 0
        while True:
            endingIndex += 1
            if re.match(commaProblemB, testList[endingIndex]):
                diff = endingIndex - startingIndex
                while count < diff:
                    testList[startingIndex] = (testList[startingIndex] + "," + testList[startingIndex+1])
                    testList.pop(startingIndex+1)
                    count += 1
                break
print(str(testList))
print(len(testList))
If you really want to do this yourself instead of using a library, first some tips:
don't use split() on CSV data (it's also bad for performance).
for performance: don't use regexes.
The regular way to scan the data would be like this (pseudo code, assuming single line csv):
for each line
    bool insideQuotes = false;
    while not end of line {
        if currentChar == '"'
            insideQuotes = !insideQuotes; // ( ! meaning 'not')
            // this also handles the case of escaped quotes inside the field
            // (if escaped with an extra quote)
        else if currentChar == ',' and !insideQuotes
            // separator found - handle field
    }
For even better performance you could open the file in binary mode and handle the newlines yourself while scanning. This way you don't need to scan for a line, copy it in a buffer (for example with getline() or a similar function) and then scan that buffer again to extract the fields.
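In Python, that scan might look something like this minimal sketch (it assumes single-line records and quotes escaped by doubling, and it keeps the quote characters in the fields, like your regex approach does):

def split_csv_line(line):
    # Split one CSV record on commas that fall outside quoted fields.
    fields = []
    current = []
    inside_quotes = False
    for ch in line:
        if ch == '"':
            # A quote toggles state; a doubled quote inside a field
            # toggles twice, so it passes through unharmed.
            inside_quotes = not inside_quotes
            current.append(ch)
        elif ch == ',' and not inside_quotes:
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))
    return fields

Applied to the z string from the question, this keeps "DWIGHT's BEET FARM,INC." together as a single field, with no regexes and a single pass over the data.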
I am having an issue with writing a list to a file. I am annotating certain files to change them into a certain format: I read sequence alignment files, store them in lists, do the necessary formatting, and then write them to a new file. The problem is that while my list of sequence alignments is structured correctly, the output written to the new files is misformatted (it does not replicate my list structure). I include only a section of the output, and what it should look like, because the list itself is far too long to post.
OUTPUT WRITTEN TO FILE:
>
TRFE_CHICK
From XALIGN
MKLILCTVLSLGIAAVCFAAP (seq spans multiple lines) ...
ADYIKAVSNLRKCS--TSRLLEAC*> (end of sequence, * should be on a newline, followed by > on a newline as well)
OUTPUT IS SUPPOSED TO BE WRITTEN AS:
>
TRFE_CHICK
From XALIGN
MKLILCTVLSLGIAAVCFAAP (seq spans many lines) ...
ADYIKAVSNLRKCS--TSRLLEAC
*
>
It does this misformatting multiple times over. I have tried pickling and unpickling the list but that misformats it further.
My code for producing the list and writing to file:
new = []
for line in alignment1:
    if line.endswith('*\n'):
        new.append(line.strip('*\n'))
        new.append('*')
    else:
        new.append(line)

new1 = []
for line in new:
    if line.startswith('>'):
        twolines = line[0] + '\n' + line[1:]
        new1.append(twolines)
        continue
    else:
        new1.append(line)

for line in new1:
    alignfile_annot.write(line)
Basically, I have coded it so that it reads the alignment file, inserts a line between the end of the sequence and the * character, and also so that > followed by the ID code is always on a new line. This is the way my list is built, but not the way it is written to the file. Does anyone know why the misformatting happens?
Apologies for the long text, I tried to keep it as short as possible to make my issue clear
I'm running Python 2.6.5
new.append(line.strip('*\n'))
new.append('*')
You have a list of lines (with newline terminators each), so you need to include \n for these two lines, too:
new.append(line[:-2] + "\n") # slice as you just checked line.endswith("*\n")
new.append("*\n")
Remember the strip (or slice, as I've changed it to) will remove the newline, so splitting a single item in the list with a value of "...*\n" into two items of "..." and "*" actually removes a newline from what you had originally.
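You can see the difference in the interpreter:

>>> line = 'ADYIKAVSNLRKCS--TSRLLEAC*\n'
>>> line.strip('*\n')            # newline stripped along with the *
'ADYIKAVSNLRKCS--TSRLLEAC'
>>> line[:-2] + '\n'             # * removed, newline kept
'ADYIKAVSNLRKCS--TSRLLEAC\n'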
This is an assignment; I have put in a good effort since I am new to Python programming.
I am running the following function, which takes an image and a phrase (spaces will be removed, so just text) as arguments. I have already been given all the import and preprocessing code; I just need to implement this function. I can only use getpixel, putpixel, load, and save, which is why coding this has been a hard task for me.
def InsertoImage(srcImage, phrase):
    pix = srcImage.load()
    for index, value in enumerate(phrase):
        pix[10+index, 15] = phrase[index]
    srcImage.save()
    pass
This code is giving a SystemError that says "new style getargs format but argument is not a tuple".
Edit:
C:\Users\Nave\Desktop\a1>a1_template.py lolmini.jpg Hi
Traceback (most recent call last):
  File "C:\Users\Nave\Desktop\a1\a1_template.py", line 31, in <module>
    doLOLImage(srcImage, phrase)
  File "C:\Users\Nave\Desktop\a1\a1_template.py", line 23, in doLOLImage
    pix[10+index,15] = phrase[index]
SystemError: new style getargs format but argument is not a tuple
Edit:
Ok, thanks, I understood, and am now posting the code. But I am getting an error on the if statement, and I'm not sure why it isn't working. Here is the full code; sorry for not adding it entirely before:
from __future__ import division

# letters, numbers, and punctuation are dictionaries mapping (uppercase)
# characters to Images representing that character
# NOTE: There is no space character stored!
from imageproc import letters, numbers, punctuation, preProcess
# This is the function to implement
def InserttoImage(srcImage, phrase):
    pix = srcImage.load()
    for index, value in enumerate(phrase):
        if value in letters:
            pix[10+index, 15] = letters[value]
        elif value in numbers:
            pix[10+index, 15] = numbers[value]
        elif value in punctuation:
            pix[10+index, 15] = punctuation[value]
    srcImage.save()
    pass
# This code is performed when this script is called from the command line via:
# 'python .py'
if __name__ == '__main__':
    srcImage, phrase = preProcess()
    InserttoImage(srcImage, phrase)
Thanks. letters, numbers, and punctuation are dictionaries that map each key character to an image of that character (the font).
But there is still an issue with pix[10+index, 15], as it is giving this error:
pix[10+index, 15] = letters[value]
SystemError: new style getargs format but argument is not a tuple
You seem to be confusing two very different concepts. Following from the sample code you posted, let's assume that:
srcImage = A Python Image Library image, generated from lolmini.jpg.
phrase = A string, 'Hi'.
You're trying to get phrase to appear as text written on top of srcImage. Your current code shows that you plan on doing this by accessing the individual pixels of the image, and assigning a letter to them.
This doesn't work for a few reasons. The primary two are that:
You're working with single pixels. A pixel is a picture element. It only ever displays one colour at a time. You cannot represent a letter with a single pixel. The pixel is just a dot. You need multiple pixels together, to form a coherent shape that we recognize as a letter.
What does your text of Hi actually look like? When you envision it being written on top of the image, are the letters thin? Do they vary in their size? Are they thick and chunky? Italic? Do they look handwritten? These are all attributes of a font face. Currently, your program has no idea what those letters should look like. You need to give your program the name of a font, so that it knows how to draw the letters from phrase onto the image.
The Python Imaging Library comes with a module specifically for helping you draw fonts. The documentation for it is here:
The ImageFont Module
Your code shows that you have the general idea correct — loop through each letter, place it in the image, and increment the x value so that the next letter doesn't overlap it. Instead of working with the image's pixels, though, you need to load in a font and use the methods shown in the above-linked library to draw them onto the image.
If you take a look at the draw.text() function in the linked documentation, you'll see that you can in fact skip the need to loop through each letter, instead passing the entire string to be used on the image.
I could've added sample code, but as this is a homework assignment I've intentionally left any out. With the linked documentation and your existing code, you hopefully shouldn't have any troubles seeing this through to completion.
Edit:
Just read your comment to another answer, indicating that you are only allowed to use getpixel() and putpixel() for drawing onto the source image. If this is indeed the case, your workload has just increased exponentially.
My comments above stand — a single pixel will not be able to represent a letter. Assuming you're not allowed any outside source code, you will need to create data structures that contain the locations of multiple pixels, which are then all drawn in a specific location in order to represent a particular letter.
You will then need to do this for every letter you want to support.
If you could include the text of the assignment verbatim, I think it would help those here to better understand all of your constraints.
Actually, upon further reading, I think the problem is that you are trying to assign a character value to a pixel. You have to figure out some way to actually draw the characters on the image (and within the image's boundaries).
Also, as a side note, since you are using
for index,value in enumerate(phrase):
You could use value instead of phrase[index]
My suggestion to the general problem is to create an image that contains all of the characters, at known coordinates (top, bottom, left, right) and then transfer the appropriate parts of the character image into the new output image.
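For illustration only, here is a rough sketch of that pixel-transfer idea, using the per-character images from the question's letters/numbers/punctuation dicts rather than one big character sheet (glyph.size and PIL's pixel-access objects are assumed; all names are illustrative):

def blit_glyph(destPix, glyph, x0, y0):
    # Copy every pixel of one character image into the destination
    # image at offset (x0, y0).
    width, height = glyph.size
    glyphPix = glyph.load()
    for x in range(width):
        for y in range(height):
            destPix[x0 + x, y0 + y] = glyphPix[x, y]

def insert_phrase(srcImage, phrase):
    pix = srcImage.load()
    x = 10
    for ch in phrase:
        for table in (letters, numbers, punctuation):
            if ch in table:
                glyph = table[ch]
                blit_glyph(pix, glyph, x, 15)
                x += glyph.size[0]   # advance so glyphs don't overlap
                break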
Just try this:
pix[10+index:15] = letters[value]
Use ":" instead of ","
I have a file of about 4MB (which I call the big one). This file has about 160000 lines in a specific format, and I need to cut it at regular intervals (not equal intervals), i.e. at the end of a certain pattern, and write each part into another file.
Basically, what I want is to copy the information from the big file into many smaller files: as I read the big file, I keep writing the information into one file, and after a certain pattern occurs, I end that file and start writing the following lines into another file, and so on.
Normally, if it were a small file, I guess it could be done with file.readline() to read each line, check whether the pattern has ended, write the line to a file if not, and if the pattern has ended, change the file name and open a new file, and so on. But how do I do it for this big file?
Thanks in advance.
I didn't mention the file format as I felt it is not necessary; I will mention it if required.
I would first read all of the allegedly-big file in memory as a list of lines:
with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()
should take little more than 4MB -- tiny even by the standard of today's phones, much less ordinary computers.
Then, perform whatever processing you need to determine the beginning and ending of each group of lines you want to write out to a smaller file (I'm not sure from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution, where they're fully allowed to -- this also covers the more constrained use cases, with no real performance penalty, though the code might be a tad simpler if the constraints were very rigid).
Say that you put these numbers in lists starts (index from 0 of the first line to write, included), ends (index from 0 of the first line NOT to write -- may legitimately and innocuously be len(lines) or more), and names (the filenames to which you want to write), all lists having the same length of course.
Then, lastly:
assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])
...and that's all you need to do!
Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on.
It's an error if the number of "closing ends" differ from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.
A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)
outfile = 0
starts = []
ends = []
names = []

for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1
That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.
As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
A 4MB file is very small, it fits in memory for sure. The fastest approach would be to read it all and then iterate over each line searching for the pattern, writing out the line to the appropriate file depending on the pattern (your approach for small files.)
I'm not going to get into the actual code, but pseudo code would do this.
BIGFILE="filename"
SMALLFILE="smallfile1"
while(readline(bigfile)) {
    write(SMALLFILE, line)
    if(line matches pattern) {
        SMALLFILE="smallfile++"
    }
}
Which is really bad code, but maybe you get the point. I should also have said that it doesn't matter how big your file is since you have to read the file anyway.
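In actual Python, that sketch might come out something like this (the pattern test and the file naming are placeholders for whatever your format needs):

count = 1
small = open('smallfile1', 'w')
with open('bigfile.txt') as bigfile:
    for line in bigfile:
        small.write(line)
        if 'pattern' in line:    # placeholder for your real end-of-section test
            small.close()
            count += 1
            small = open('smallfile%d' % count, 'w')
small.close()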
I've done a mysqldump of a large database, ~300MB. It has made an error, though: it has not escaped any quotes contained in any <o:p>...</o:p> tags. Here's a sample:
...Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again...
Is it possible to write a script (preferably in Python, but I'll take anything!) that would be able to scan and fix these errors automatically? There's quite a lot of them and Notepad++ can't handle a file of that size very well...
If the "lines" your file is divided into are of reasonable lengths, and there are no binary sequences in it that "reading as text" would break, you can use fileinput's handy "make believe I'm rewriting a file in place" functionality:
import re
import fileinput

tagre = re.compile(r"<o:p>.*?</o:p>")

def sub(mo):
    return mo.group().replace(r"'", r"\'")

for line in fileinput.input('thefilename', inplace=True):
    print tagre.sub(sub, line),
If not, you'll have to simulate the "in-place rewriting" yourself, e.g. (oversimplified...):
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        while True:
            b = inf.read(1024*1024)
            if not b: break
            ouf.write(tagre.sub(sub, b))
and then move 'fixed' to take place of 'thefilename' (either in code, or manually) if you need that filename to remain after the fixing.
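In code, that final move can be done with the stdlib, e.g. (note that os.rename won't overwrite an existing file on Windows; shutil.move is an alternative):

import os
os.rename('fixed', 'thefilename')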
This is oversimplified because one of the crucial <o:p> ... </o:p> parts might end up getting split between two successive megabyte "blocks" and therefore not identified (in the first example, I'm assuming each such part is always fully contained within a "line" -- if that's not the case then you should not use that code, but the following, anyway). Fixing this requires, alas, more complicated code...:
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        while True:
            b = getblock(inf)
            if not b: break
            ouf.write(tagre.sub(sub, b))
with e.g.
partsofastartag = '<', '<o', '<o:', '<o:p'
def getblock(inf):
    b = ''
    while True:
        newb = inf.read(1024 * 1024)
        if not newb: return b
        b += newb
        if any(b.endswith(p) for p in partsofastartag):
            continue
        if b.count('<o:p>') != b.count('</o:p>'):
            continue
        return b
As you see, this is pretty delicate code, and therefore, what with it being untested, I can't know that it is correct for your problem. In particular, can there be cases of <o:p> that are NOT matched by a closing </o:p> or vice versa? If so, then a call to getblock could end up returning the whole file in quite a costly way, and even the RE matching and substitution might backfire (the latter would also occur if SOME of the single-quotes in such tags are already properly escaped, but not all).
If you have at least a GB or so of memory, avoiding the delicate issues with block division IS feasible, since everything should fit in memory, making the code much simpler:
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        b = inf.read()
        ouf.write(tagre.sub(sub, b))
However, the other issues mentioned above (possible unbalanced opening/closing tags, etc.) might remain -- only you can study your existing defective data and see whether it affords such a reasonably simple approach to fixing!