I am parsing dates from a measurement file (about 200k lines). Each line contains a date and a measurement. The date format is "2013-08-07-20-46", i.e. "%Y-%m-%d-%H-%M" in strptime terms. Every so often the timestamp has a bad character (the data came over a serial link that had interruptions). Such an entry looks like: 201-08-11-05-15 .
My parsing line to convert the time string into seconds is:
time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
I got it online and don't fully understand how it works. (But it works.)
My problem is preventing the program from exiting with an error when a format mismatch happens. Is there a way to keep strptime from raising, and instead have it gracefully return an error flag, in which case I would simply discard the data line and move on to the next? Yes, I could perform a pattern check with a regexp, but I was wondering whether some smart mismatch handling is already built into strptime.
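For reference, the one-liner can be unpacked into its three steps, which makes it easier to see where a ValueError can come from (the sample timestamp below is made up in the same format):

```python
import datetime
import time

dt = "2013-08-07-20-46"  # hypothetical well-formed timestamp

# Step 1: strptime parses the string against the format, raising
# ValueError on any mismatch (this is where bad lines blow up)
parsed = datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M")

# Step 2: timetuple() converts the datetime into a time.struct_time
tt = parsed.timetuple()

# Step 3: mktime() interprets that struct_time as local time and
# returns seconds since the epoch as a float
sec = time.mktime(tt)
print(sec)
```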
Update for Anand S Kumar:
It worked for a few bad lines but then it failed.
fp = open('bmp085.dat', 'r')
for line in fp:
    [dt, t, p] = string.split(line)
    try:
        sec = time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple()) - sec0
    except ValueError:
        print 'Bad data : ' + line
        continue  # move on to the next line
    print sec, p, t
    t_list.append(sec)
    p_list.append(p)
fp.close()
Output:
288240.0 1014.48 24.2
288540.0 1014.57 24.2
288840.0 1014.46 24.2
Bad data : �013-08-11-05-05 24.2! 1014.49
Bad data : 2013=0▒-11-05-10 �24.2 1014.57
Bad data : 201�-08-11-05-15 24.1 1014.57
Bad data : "0�#-08-1!-p5-22 24.1 1014.6
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError: too many values to unpack
>>>
Update for Anand S Kumar:
It crashed again.
for line in fp:
    print line
    dt, t, p = line.split(' ', 2)
    try:
        sec = time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple()) - sec0
    except ValueError:
        print 'Bad data : ' + line
        continue  # move on to the next line
    print sec, p, t
Failed:
2013-08-11�06-t5 03/9 9014.y
Bad data : 2013-08-11�06-t5 03/9 9014.y
2013-08-11-06-50 (23. 1014.96
295440.0 (23. 1014.96
2013-08-11-06%55 23.9 !�1015.01
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
TypeError: must be string without null bytes, not str
>>> fp.close()
>>>
You can use try..except, catching any ValueError, and if such an error occurs, move on to the next line. Example -
try:
    time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
except ValueError:
    continue  # if you are looping over the lines, this moves on to the next iteration
If you are doing something else (for example a function call per line), return None or similar in the except block instead.
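For example, a small helper (the name parse_seconds is made up) that returns None for bad timestamps instead of raising:

```python
import datetime
import time

def parse_seconds(dt, fmt="%Y-%m-%d-%H-%M"):
    """Return seconds since the epoch for dt, or None if dt doesn't match fmt."""
    try:
        return time.mktime(datetime.datetime.strptime(dt, fmt).timetuple())
    except ValueError:
        return None

print(parse_seconds("2013-08-07-20-46"))  # a float
print(parse_seconds("201-08-11-05-15"))   # None (corrupted year field)
```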
The second ValueError you are getting should be occurring in the line -
[dt,t,p]= string.split(line)
This occurs because a particular line splits into more than 3 elements. One fix is to use the maxsplit argument of str.split() so the line is split at most twice, yielding at most 3 parts. Example -
dt,t,p = line.split(None,2)
Or if you really want to use string.split() -
[dt,t,p]= string.split(line,None,2)
Or, if you are not expecting spaces inside any of the fields, you can move the line causing the ValueError inside the try..except block and treat it as a bad line.
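Putting both fixes together - maxsplit plus the try..except - a sketch over a few made-up lines (including corrupted ones like yours, with an arbitrary bad byte) might look like:

```python
import datetime
import time

# Assumed line format: "<timestamp> <temperature> <pressure>"
lines = [
    "2013-08-11-05-00 24.2 1014.48",
    "201\xef-08-11-05-15 24.1 1014.57",      # corrupted timestamp -> bad line
    "2013-08-11-05-05 24.2! 1014.49 extra",  # junk fields, but timestamp is fine
]

for line in lines:
    try:
        # maxsplit=2 yields at most 3 parts, so extra whitespace-separated
        # junk lands in the last field instead of breaking the unpacking
        dt, t, p = line.split(None, 2)
        sec = time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
    except ValueError:
        print("Bad data:", line)
        continue
    print(sec, p, t)
```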
Use try - except in a for-loop:
for dt in data:
    try:
        print time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
    except ValueError:
        print "Wrong format!"
        continue
Output for data = ["1998-05-14-15-45","11998-05-14-15-45","2002-05-14-15-45"]:
895153500.0
Wrong format!
1021383900.0
Related
I am receiving an integer error when reading from my CSV sheet. It's giving me problems reading the last column. I know there are non-digit characters in the last column, but how do I extract just the digits from such a value? The API function psspy.two_winding_chng_4 requires an input using single quotes ' ' as shown below in that function (3rd element of the array).
Traceback (most recent call last):
File "C:\Users\RoszkowskiM\Desktop\win4.py", line 133, in <module>
psspy.two_winding_chng_4(from_,to,'%s'%digit,[_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i],[_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f, max_value, min_value,_f,_f,_f],[])
File ".\psspy.py", line 25578, in two_winding_chng_4
TypeError: an integer is required
ValueError: invalid literal for int() with base 10: 'T1'
The code:
for row in data:
    data_location, year_link, from_, to, min_value, max_value, name2, tla_2, digit = row[5:14]
    output = 'From Bus #: {}\tTo Bus #: {}\tVMAX: {} pu\tVMIN: {} pu\t'
    if year_link == year and data_location == location and tla_2 == location:
        from_ = int(from_)
        to = int(to)
        min_value = float(min_value)
        max_value = float(max_value)
        digit = int(digit)
        print(output.format(from_, to, max_value, min_value))
        _i = psspy.getdefaultint()
        _f = psspy.getdefaultreal()
        psspy.two_winding_chng_4(from_,to,'%s'%digit,[_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i,_i],[_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f,_f, max_value, min_value,_f,_f,_f],[])
The easiest and probably most usable option would be to use your own function to filter out everything but the digits. Example:
def return_digits(string):
    return int(''.join([x for x in string if x.isdigit()]))
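For instance (repeating the function so the snippet is self-contained; note it would raise ValueError on a string containing no digits at all):

```python
def return_digits(string):
    # Keep only the digit characters, then convert the remainder to int.
    # Caveat: int('') raises ValueError if the string has no digits at all.
    return int(''.join([x for x in string if x.isdigit()]))

print(return_digits('T1'))    # -> 1
print(return_digits('12a3'))  # -> 123
```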
I am extracting API calls from a log as strings, storing them in a file, then in another process reading them back from this file.
I want to catch the lines that exist, whose length is above 5, and which start with a double quote.
if input_string:
    if len(input_string) > 5:
        if input_string[0] == '"':
This throws the following exception:
File "/home/api_calls_simulator.py", line 58, in doWork
if (input_string[0] == '"'):
IndexError: string index out of range
input_string comes from:
call_file = open("./myfile.txt")
for input_string in call_file:
example of matching line:
"/api/v2/product/id/2088?user_name=website&key=secretkey&format=json" 200 2790 "-" hit
What am I missing here?
I'm new to programming, and experimenting with Python 3. I've found a few topics which deal with IndexError but none that seem to help with this specific circumstance.
I've written a function which opens a text file, reads it one line at a time, and slices the line up into individual strings which are each appended to a particular list (one list per 'column' in the record line). Most of the slices are multiple characters [x:y] but some are single characters [x].
I'm getting an IndexError: string index out of range message, when as far as I can tell, it isn't. This is the function:
def read_recipe_file():
    recipe_id = []
    recipe_book = []
    recipe_name = []
    recipe_page = []
    ingred_1 = []
    ingred_1_qty = []
    ingred_2 = []
    ingred_2_qty = []
    ingred_3 = []
    ingred_3_qty = []
    f = open('recipe-file.txt', 'r')  # open the file
    for line in f:
        # slice out each component of the record line and store it in the appropriate list
        recipe_id.append(line[0:3])
        recipe_name.append(line[3:23])
        recipe_book.append(line[23:43])
        recipe_page.append(line[43:46])
        ingred_1.append(line[46])
        ingred_1_qty.append(line[47:50])
        ingred_2.append(line[50])
        ingred_2_qty.append(line[51:54])
        ingred_3.append(line[54])
        ingred_3_qty.append(line[55:])
    f.close()
    return recipe_id, recipe_name, recipe_book, recipe_page, ingred_1, ingred_1_qty, \
        ingred_2, ingred_2_qty, ingred_3, ingred_3_qty
This is the traceback:
Traceback (most recent call last):
File "recipe-test.py", line 84, in <module>
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, ingred_3_qty = read_recipe_file()
File "recipe-test.py", line 27, in read_recipe_file
ingred_1.append(line[46])
IndexError: string index out of range
The code which calls the function in question is:
print('To show list of recipes: 1')
print('To add a recipe: 2')
user_choice = input()
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, \
    ingred_3, ingred_3_qty = read_recipe_file()
if int(user_choice) == 1:
    print_recipe_table(recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty,
                       ingred_2, ingred_2_qty, ingred_3, ingred_3_qty)
elif int(user_choice) == 2:
    # code to add recipe
The failing line is this:
ingred_1.append(line[46])
There are more than 46 characters in each line of the text file I am trying to read, so I don't understand why I'm getting an out of bounds error (a sample line is below). If I change to the code to this:
ingred_1.append(line[46:])
to read a slice, rather than a specific character, the line executes correctly, and the program fails on this line instead:
ingred_2.append(line[50])
This leads me to think it is somehow related to appending a single character from the string, rather than a slice of multiple characters.
Here is a sample line from the text file I am reading:
001Cheese on Toast Meals For Two 012120038005002
I should probably add that I'm well aware this isn't great code overall - there are lots of ways I could generally improve the program, but as far as I can tell the code should actually work.
This will happen if some of the lines in the file are empty or at least short. A stray newline at the end of the file is a common cause, since that comes up as an extra blank line. The best way to debug a case like this is to catch the exception, and investigate the particular line that fails (which almost certainly won't be the sample line you reproduced):
try:
    ingred_1.append(line[46])
except IndexError:
    print(line)
    print(len(line))
Catching this exception is also usually the right way to deal with the error: you've detected a pathological case, and now you can consider what to do. You might for example:
continue, which will silently skip processing that line,
Log something and then continue
Bail out by raising a new, more topical exception, e.g. raise ValueError("Line too short").
Printing something relevant, with or without continuing, is almost always a good idea if this represents a problem with the input file that warrants fixing. Continuing silently is a good option if it is something relatively trivial that you know can't cause knock-on errors in the rest of your processing. You may want to differentiate between the "too short" and "completely empty" cases by detecting the empty case early, at the top of your loop:
if not line.strip():
    # skip blank lines (lines read from a file keep their trailing
    # newline, so a "blank" line is actually "\n")
    continue
And handling the error for the other case appropriately.
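A minimal sketch combining the early blank-line skip with the except handler (the sample lines are invented stand-ins for records from the recipe file):

```python
# Hypothetical stand-ins for lines read from recipe-file.txt:
# one full-length record, one blank line, one truncated record.
lines = ["001Cheese on Toast".ljust(60) + "\n", "\n", "too short\n"]

ingred_1 = []
for line in lines:
    if not line.strip():
        continue  # completely empty (or whitespace-only) line
    try:
        ingred_1.append(line[46])
    except IndexError:
        print("Line too short (%d chars): %r" % (len(line), line))

print(ingred_1)  # only the full-length record contributed a character
```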
The reason changing it to a slice works is because string slices never fail. If both indexes in the slice are outside the string (in the same direction), you will get an empty string, e.g.:
>>> 'abc'[4]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 'abc'[4:]
''
>>> 'abc'[4:7]
''
Your code fails on line[46] because line contains fewer than 47 characters. The slice operation line[46:] still works because an out-of-range string slice returns an empty string.
You can verify that the line is too short by replacing
ingred_1.append(line[46])
with
try:
    ingred_1.append(line[46])
except IndexError:
    print('line = "%s", length = %d' % (line, len(line)))
I get this error whenever I try running Reducer python program in Hadoop system. The Mapper program is perfectly running though. Have given the same permissions as my Mapper program. Is there a syntax error?
Traceback (most recent call last):
File "reducer.py", line 13, in <module>
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
The error ValueError: need more than 1 value to unpack is thrown when you do a multiple assignment with too few values on the right-hand side. So it looks like the line has no \t in it, meaning line.split('\t', 1) results in a single value, causing something like word, count = ("foo",).
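A quick demonstration of the failure mode and the skip-on-error fix:

```python
line = "foo"  # a malformed line with no tab separator
print(line.split('\t', 1))  # ['foo'] -- only one element to unpack

try:
    word, count = line.split('\t', 1)
except ValueError:
    # Python 2 says "need more than 1 value to unpack";
    # Python 3 says "not enough values to unpack (expected 2, got 1)"
    print("skipping malformed line: %r" % line)
```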
I cannot answer in detail. However, I solved the same issue when I removed some extra print statements I had added in the mapper. It is probably related to how that printed output ends up in what the reducer reads from sys.stdin. I know you have probably already solved the issue by now.
I changed line.split('\t', 1) to line.split(' ', 1) and it worked.
In case the space character is not visible, to be perfectly clear: it should be line.split('(one space here)', 1).
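If you are unsure whether the separator is a tab or one or more spaces, splitting on None (any run of whitespace) is a more forgiving variant worth considering:

```python
# split(None, 1) treats any run of whitespace (tabs or spaces) as one
# separator, so all three variants unpack the same way
for raw in ["word\t3", "word 3", "word   3"]:
    word, count = raw.split(None, 1)
    print(word, count)
```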
I have been trying to parse a file with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None
    try:
        for (ev, el) in it:
            count += 1
            last = el
    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))
    print('count: {0}'.format(count))
This is of course a simplified version of my code, but this is enough to break my program. I get this error with some files if I remove the try-catch block:
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
from yparse import analyze; analyze('file.xml')
File "C:\Python27\yparse.py", line 10, in analyze
for (ev, el) in it:
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic though, if a file works it will always work. If a file fails, it always fails and always fails at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?
Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean that it sometimes works and sometimes fails with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
[badcharref.xml]
<a>&#11;</a>
[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', &#not-a-digit, etc.
Update 2 I was wrong, the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.
# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough.
BYTE_OFFSETS = True
import sys, re, codecs

fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m:
        break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
Output:
comments.xml
6615405
10205764
10213901
10213936
10214123
13292514
...
155656543
155656564
157344876
157722583
posts.xml
7607143
12982273
12982282
12982292
12982302
12982310
16085949
16085955
...
36303479
36303494 <<=== whoops
38942863
...
785292911
801282472
848911592
As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
In fact, all of these entities appear in the text: a set of numeric character references resolving to unprintable control characters (they do not render legibly here). Most are not allowed in XML. It looks like this parser is quite strict; you'll need to find another that is not so strict, or pre-process the XML.
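One possible pre-processing step, building on the regex and the valid code point ranges shown above, is to strip out the invalid references before feeding the text to ElementTree (a sketch, not a drop-in fix; it assumes the references are at least syntactically well-formed):

```python
import re

# Numeric character references, decimal (&#NN;) or hex (&#xNN;)
_CHARREF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")

def _valid_xml_codepoint(num):
    # Valid in XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF]
    #                   | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (num in (0x9, 0xA, 0xD)
            or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD
            or 0x10000 <= num <= 0x10FFFF)

def strip_invalid_charrefs(text):
    """Drop numeric character references that XML 1.0 forbids."""
    def repl(m):
        num = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
        return m.group(0) if _valid_xml_codepoint(num) else ""
    return _CHARREF.sub(repl, text)

print(strip_invalid_charrefs("<a>ok&#65;bad&#5;</a>"))  # <a>ok&#65;bad</a>
```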
I'm not sure if this answers your question, but if you want to use an exception with the ParseError raised by element tree, you would do this:
except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))
Source: http://effbot.org/zone/elementtree-13-intro.htm
It might also be worth noting here that you can catch the error and avoid stopping your program entirely by using what you're already using later in the function: place the statement:
it = ET.iterparse(file(xml))
inside a try/except block:
try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.