Python: Check if number at specific position is larger?

I have a very large file that cannot be opened in a text editor.
I need to check (1) whether a line starts with a specific string and (2) whether a number at a specific position (column 148, 3 digits) is smaller than a predefined number. If both conditions hold, the complete line should be printed.
So I tried the following code, but it doesn't work:
fobj = open("test2.txt")
for line in fobj:
    if (line.startswith("ABS")) and (fp.seek(3, 148) < 400):
        print line.rstrip()
Can anyone help me?

To compare a string with a number you need to convert it first:
int(fp.seek(3, 148)) < 400
You also have to check that the string contains only digits.
But seek() is not the function you are looking for; it moves the file position to a specific byte offset and does not give you text from the current line.
Look here: seek() function?
If your number is always at the same position, you can slice the line instead. A 3-digit field needs a slice of length 3:
int(line[147:150]) < 400  # use line[148:151] if the column is counted from 0
Alternatively, try it with regular expressions and string operations:
http://pymotw.com/2/re/
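Putting the slicing advice together, here is a minimal sketch. It assumes the column is counted from 1 (so the field is line[147:150]); the function name matching_lines and its parameters are made up for illustration. Because it iterates line by line, the whole file never has to fit in memory.

```python
def matching_lines(lines, prefix="ABS", start=147, width=3, limit=400):
    """Yield lines that start with prefix and whose fixed-width number is below limit."""
    for line in lines:
        if not line.startswith(prefix):
            continue
        # Assumption: the 3-digit field begins at 0-based index 147 (column 148).
        field = line[start:start + width]
        try:
            value = int(field)
        except ValueError:
            # Line too short or field not numeric: skip it.
            continue
        if value < limit:
            yield line.rstrip("\n")
```

It could be used as: with open("test2.txt") as fobj, loop over matching_lines(fobj) and print each result.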

Related

Trouble with indexing a string in Python

I am trying to check the first character in each line from a separate data file. This is the loop that I am using, but for some reason I get an error that says string index out of range.
for line_no in length:
    line_being_checked = linecache.getline(file_path, line_no)
    print(line_being_checked[0])
From what I understand (my English is not great), length is the number of lines you want to check in the file.
You could do something like this:
for line in open("file.txt", "r").read().splitlines():
    print(line[0])
This way, you can be sure that the number of lines is correct.
As for the error, it is possible that you have an empty line, so you could check len(line) to see whether that is the case.
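A small sketch of that empty-line check, with a made-up helper name first_characters. It may also be worth knowing that linecache.getline counts lines from 1 and returns an empty string for a line number that does not exist, which would produce the same "string index out of range" error.

```python
def first_characters(lines):
    """Return the first character of every non-empty line."""
    firsts = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text:  # an empty line has no index 0
            firsts.append(text[0])
    return firsts
```

For a file, this could be called as first_characters(open("file.txt")).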

hex header of file, magic numbers, python

I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read first 24 bytes of a file:
with open(from_folder+"/"+i, "rb") as myfile:
    header = str(myfile.read(24))
then I look for pattern in it:
if y[1] in header:
    shutil.move(from_folder+"/"+i, to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and = r'\x47\x40\x00'
The file has these bytes inside its header.
The program does NOT find this pattern (r'\x47\x40\x00') in the file header.
So I tried to print header, and Python shows the first two bytes as 'G@' instead of '\x47\x40'.
And if I search for 'G@'+r'\x00' in header, everything is OK. It finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it. Not for some strange 'G@'+r'\x00'.
OR
Why does Python see the first two bytes as 'G@' and not as '\x47\x40', though it shows the rest of the header in hex? Is there a way to fix it?
Update: I got it working by converting the header to hex:
with open(from_folder+"/"+i, "rb") as myfile:
    header = myfile.read(24)
header = str(binascii.hexlify(header))[2:-1]
The result I get is:
4740001b0000b00d0001c100000001efff3690e23dffffff
And I can work with it.
P.S. But anyway, if anybody will explain what was the problem with 2 first bytes - I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
No need to convert it to a string by str.
Print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
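You can't search aBytes with a str pattern using the in operator, since aBytes is a bytes object rather than a string; in only works here if the pattern is bytes as well, for example b'\x47\x40\x00' in aBytes.
If you would rather apply a string search for r'\x00\x47\x40', convert the bytes to a string of escaped hex values first: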
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there are a number of separate issues at play here:
print tries to print something human readable, which works only for bytes that are printable ASCII characters, here 0x47 ('G') and 0x40 ('@').
You can't search bytes for a str pattern with in, so either use a bytes pattern or convert the bytes to a string containing fixed-length hex representations as substrings, as shown.
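To make the difference concrete, here is a short sketch. The header value is sample data that only mirrors the first bytes quoted in the question; the variable names are illustrative. It shows that the raw string is 12 literal characters while the bytes literal is 3 byte values, and that a bytes pattern can be searched for directly.

```python
import binascii

# Sample data: the first 8 bytes of the header shown in the question.
header = b"\x47\x40\x00\x1b\x00\x00\xb0\x0d"

raw_pattern = r"\x47\x40\x00"    # raw str: 12 literal characters (backslash, x, 4, 7, ...)
bytes_pattern = b"\x47\x40\x00"  # bytes: 3 byte values, the same as b"G@\x00"

# A bytes pattern can be searched for directly in the bytes that were read.
found = bytes_pattern in header

# Alternative: compare hex text against hex text.
hex_header = binascii.hexlify(header).decode("ascii")
found_in_hex = "474000" in hex_header
```

One caveat with the hex-text approach: a match can start in the middle of a byte (at an odd character offset), so the bytes search is the safer of the two.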

Import string that looks like a list "[0448521958, +61439800915]" from JSON into Python and make it an actual list?

I am extracting a string out of a JSON document using python that is being sent by an app in development. This question is similar to some other questions, but I'm having trouble just using x = ast.literal_eval('[0448521958, +61439800915]') due to the plus sign.
I'm trying to get each phone number as a string in a python list x, but I'm just not sure how to do it. I'm getting this error:
raise ValueError('malformed string')
ValueError: malformed string
Your problem is not just the +.
The first number starts with 0, which marks it as an octal number. Octal only supports the digits 0-7, but this number ends with 8 (and also contains other digits above 7).
So it turns out your problems don't stop there.
You can use a regex to remove the plus:
fixed_string = re.sub(r'\+(\d+)', r'\1', '[0445521757, +61439800915]')
ast.literal_eval(fixed_string)
I don't know what you can do about the octal number problem, however.
I think the problem is that ast.literal_eval is trying to interpret the phone numbers as numbers instead of strings. Try this:
s = '[0448521958, +61439800915]'  # avoid naming a variable str; it shadows the built-in
s.strip('[]').split(', ')
Result:
['0448521958', '+61439800915']
Technically that string isn't valid JSON. If you want to ignore the +, you could strip it out of the file or string before you evaluate it. If you want to preserve it, you'll have to enclose the value with quotes.
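If the spacing between the numbers is not guaranteed, a regular expression is a bit more forgiving than split(', '). This is only a sketch; the helper name phone_numbers is made up, and it assumes every number is a run of digits with an optional leading +.

```python
import re

def phone_numbers(text):
    """Return every run of digits, with an optional leading +, as a string."""
    return re.findall(r"\+?\d+", text)
```

Because the values stay strings, the leading zero and the plus sign are both preserved.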

Reading a wave file in Python

I have created a Morse code generator that converts English sentences into Morse code. It also converts this text-based Morse code into an audio file. If the character is a dot, I append a dot.wav file to the output wave file, followed by a dash.wav file if the next character is a dash.
I now want to open this wave file and read its content to figure out the order in which these dashes and dots are placed.
I have tried the following code:
import struct
import wave

waveFile = wave.open(r"C:\Users\Gaurav Keswani\Documents\Eclipse\Morse Code Converter\src\resources\sound\morse.wav", 'r')
x = waveFile.readframes(20)
print (struct.unpack("<40H", x))
This gives me the output as:
(65089, 65089, 3093, 3093, 11895, 11895, 18629, 18629, 25196, 25196,
29325, 29325, 31986, 31986, 32767, 32767, 31265, 31265, 27532, 27532,
22485, 22485, 15762, 15762, 7895, 7895, 103, 103, 57228, 57228, 49571,
49571, 42790, 42790, 37667, 37667, 34362, 34362, 32776, 32776)
I don't know what to make of this output. Can anyone help?
If you want a general solution to detecting Morse code, you are going to have to take a look at what it looks like as a waveform (tom10's link to this question should help here if you can install numpy and matplotlib; if not, you can use the stdlib's csv module to export a file that you can use in your favorite spreadsheet program); work out how you as a human can distinguish dots, dashes, and spaces; turn that into an algorithm (a series of steps that even a literal-minded moron can follow); then turn that algorithm into code. Or you may be able to find a library that's already done this for you.
But for your specific case, you only need to detect exact copies of the contents of dot.wav and dash.wav within your larger file. (At least assuming you're not using any lossy compression, which usually you aren't in .wav files.) So, this is really just a substring search.
Think about how you'd detect the strings 'dot' and 'dash' within a string like 'dash dash dash dash dash dot dash dot dot dot dot dot '. For such a simple problem, you could use a stupid brute-force algorithm, and it would be fine:
def find(haystack, needle, start):
    # Brute-force search; returns len(haystack) if needle is not found.
    for i in range(start, len(haystack)):
        if haystack[i:i+len(needle)] == needle:
            return i
    return len(haystack)

def decode_morse(morse):
    i = 0
    while i < len(morse):
        next_dot = find(morse, 'dot', i)
        next_dash = find(morse, 'dash', i)
        if next_dot < next_dash:
            yield '.'
            i = next_dot + len('dot')    # continue after the match
        elif next_dash < len(morse):
            yield '-'
            i = next_dash + len('dash')  # continue after the match
        else:
            break                        # nothing left to find
Now, if you're searching a list of numbers instead of a string, how does this have to change? Barely at all; you can slice a list, compare two lists, etc. just like you can with strings.
The only real problem you'll run into is that you don't have the whole list in memory at once, just 20 frames at a time. What happens if a dot starts in frame 19 and ends in frame 20? If your files aren't too big, this is easy to solve: just read all the frames into memory in one giant list, then search the whole thing. But otherwise, you have to do some buffering.
For example (ignoring error handling and dealing with the end of the file properly, and dealing only with dashes for simplicity—of course you have to do both of those properly in your real code):
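# Body of a generator function. Assumption: dash holds the bytes of
# dash.wav's frames as a list, so it can be compared with slices of buf.
buf = []
while True:
    while len(buf) < 2*len(dash):
        buf.extend(waveFile.readframes(20))
    next_dash = find(buf, dash, 0)
    if next_dash < len(buf):
        yield '-'
        buf = buf[next_dash+len(dash):]  # keep what follows the match
    else:
        buf = buf[-len(dash):]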
We're making sure we always have at least two dash lengths in our buffer. And we always keep the leftover after the first dot or dash (if one was found) or a full dash length (if not) in the buffer, and add the next buffer to that. That's actually overkill; think it through and work out exactly what you need to keep to make sure you never miss a dash that falls between two buffers. But the point is, as long as you get that right, you can't miss any dots or dashes.
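For files small enough to read in one go, the in-memory variant can be sketched with bytes.find, since readframes returns a bytes object. The function name decode_frames is made up, and the sketch assumes dot and dash are the exact frame bytes of dot.wav and dash.wav. If both patterns match at the same position (for example when the dot sound is a prefix of the dash sound), it prefers the longer one.

```python
def decode_frames(data, dot, dash):
    """Return a string of '.' and '-' for each exact copy of dot/dash found in data."""
    symbols = []
    i = 0
    while i < len(data):
        hits = []
        for symbol, pattern in ((".", dot), ("-", dash)):
            pos = data.find(pattern, i)
            if pos != -1:
                # Sort key: earliest position first, then the longest pattern.
                hits.append((pos, -len(pattern), symbol))
        if not hits:
            break  # no more dots or dashes
        pos, neg_len, symbol = min(hits)
        symbols.append(symbol)
        i = pos - neg_len  # jump past the matched pattern
    return "".join(symbols)
```

The data argument could come from waveFile.readframes(waveFile.getnframes()), with dot and dash read the same way from their own files.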

Python: using int() on a string that is not an integer literal

Note: I was using the wrong source file for my data - once that was fixed, my issue was resolved. It turns out, there is no simple way to use int(..) on a string that is not an integer literal.
This is an example from the book "Machine Learning In Action", and I cannot quite figure out what is wrong. Here's some background:
from numpy import *

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1])) # Problem here.
        index += 1
    return returnMat,classLabelVector
The .txt file is as follows:
40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
38344 1.669788 0.134296 didntLike
...
I am getting an error on the line classLabelVector.append(int(listFromLine[-1])) because, I believe, int(..) is trying to parse a string (i.e. "largeDoses") that is not an integer literal. Am I missing something?
I looked up the documentation for int(), but it only seems to parse numbers and integer literals:
http://docs.python.org/2/library/functions.html#int
Also, an excerpt from the book explains this section as follows:
Finally, you loop over all the lines in the file and strip off the return line character with line.strip(). Next, you split the line
into a list of elements delimited by the tab character: '\t'. You take
the first three elements and shove them into a row of your matrix, and
you use the Python feature of negative indexing to get the last item
from the list to put into classLabelVector. You have to explicitly
tell the interpreter that you’d like the integer version of the last
item in the list, or it will give you the string version. Usually,
you’d have to do this, but NumPy takes care of those details for you.
Strings like "largeDoses" cannot be converted to integers. In folder Ch02 of that code project there are two data files; use the second one, datingTestSet2.txt, instead of loading the first.
You can use ast.literal_eval and catch the ValueError raised for a malformed string (by the way, int('9.4') will also raise an exception).
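If the text labels have to be handled anyway, one option is to try int() first and fall back to a lookup table. This is only a sketch: parse_label is a made-up helper and the numeric codes in codes are hypothetical, not taken from the book's data files.

```python
def parse_label(token, label_codes):
    """Return token as an int, or look it up in label_codes if it is not numeric."""
    try:
        return int(token)
    except ValueError:
        return label_codes[token]

# Hypothetical mapping from text labels to integer class codes.
codes = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}
```

In file2matrix, the problem line could then read classLabelVector.append(parse_label(listFromLine[-1], codes)).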
