Python - Output creating blank spaces - python

I'm encountering a very strange issue in a python script I've written. This is the piece of code that is producing abnormal results:
EDIT: I've included the entire loop in the code segment now.
data = open(datafile,'r')
outERROR = open(outERRORfile,'w')
precision=[0]
scale=[0]
lines = data.readlines()
limit = 0
if filetype == 'd':
    for line in lines:
        limit += 1
        if limit > checklimit:
            break
        columns = line.split(fieldDelimiter)
        for i in range(len(columns) - len(precision)):
            precision.append(0)
        for i in range(len(columns) - len(scale)):
            scale.append(0)
        if len(datatype) != len(precision):
            sys.exit() #Exits the script if the number of data types (fields found in the DDL file) doesn't match the number of columns found in the data file
        i = -1
        for eachcolumn in columns:
            i += 1
            if len(rstrip(columns[i])) > precision[i]:
                precision[i] = len(rstrip(columns[i]))
            if columns[i].find('.') != -1 and (len(rstrip(columns[i])) - rstrip(columns[i]).find('.')) > scale[i]:
                scale[i] = len(rstrip(columns[i])) - rstrip(columns[i]).find('.') -1
            if datatype[i][0:7] == 'integer':
                if int(columns[i]) < -2147483648 or int(columns[i]) > 2147483647:
                    outERROR.write("Integer value too high or too low to fit inside Integer data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:9] == 'smallint':
                if int(columns[i]) < -32768 or int(columns[i]) > 32767:
                    outERROR.write("Smallint value too high or too low to fit inside Smallint data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:7] == 'byteint':
                if int(columns[i]) < -128 or int(columns[i]) > 127:
                    outERROR.write("Byteint value too high or too low to fit inside Byteint data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:4] == 'date':
                if DateParse(columns[i],format1[i]) > -1:
                    pass
                elif DateParse(columns[i],format2[i]) > -1:
                    pass
                elif DateParse(columns[i],format3[i]) > -1:
                    pass
                else:
                    outERROR.write('Date format error, column: ' + str(i + 1) + ', value: ' + columns[i])
            if datatype[i][0:9] == 'timestamp':
                if DateParse(columns[i],timestamp1[i]) > -1:
                    pass
                elif DateParse(columns[i],timestamp2[i]) > -1:
                    pass
                elif DateParse(columns[i],timestamp3[i]) > -1:
                    pass
                else:
                    outERROR.write('Timestamp format error, column: ' + str(i + 1) + ', value: ' + columns[i] + '\n')
            if (datatype[i][0:7] == 'decimal'
                    or datatype[i][0:7] == 'integer'
                    or datatype[i][0:7] == 'byteint'
                    or datatype[i][0:5] == 'float'
                    or datatype[i][0:8] == 'smallint'):
                try:
                    y = float(columns[i])
                except ValueError:
                    outERROR.write('Character found in numeric data type, column: ' + str(i + 1) + ', value: ' + columns[i] + "\n")
                else:
                    pass
This is part of a loop that reads a data file. It checks the declared type of each field and, for numeric types, tries to convert the value to a float to see whether the data file actually contains numeric data. If the value is not numeric, it writes the error you see above to a text file (currently outERROR). When I wrote this and tested it on a small data file (4 lines) it worked fine, but when I run it on a larger file (several thousand rows) my error file fills with a bunch of blank spaces, and only a few of the error messages are written.
Here is what the error file looks like when I run the script with 4 rows:
Character found in numeric data type, column: 6, value: 24710a35
Character found in numeric data type, column: 7, value: 0a04
Character found in numeric data type, column: 8, value: 0a02
Character found in numeric data type, column: 6, value: 56688a12
Character found in numeric data type, column: 7, value: 0a09
Character found in numeric data type, column: 8, value: 0a06
Character found in numeric data type, column: 6, value: 12301a04
Character found in numeric data type, column: 7, value: 0a10
Character found in numeric data type, column: 8, value: 0a02
Character found in numeric data type, column: 6, value: 25816a56
Character found in numeric data type, column: 7, value: 0a09
Character found in numeric data type, column: 8, value: 0a06
This is the expected output.
When I run it on larger files, I start to get blank spaces at the top of the error file, and only the last 40-50 or so error writes actually appear as text in the file. The larger the file, the more blank spaces it outputs. I'm completely lost on this. I've read some of the other questions about mysterious blank lines and spaces here on stackoverflow.com, but they don't seem to address my issue.
EDIT: outERROR is the name I've given to the error file that the output is writing to. It is a simple .txt file.
This is a sample of the data file:
257|1463|64|1|7|9551a22|0a05|0a02|N|O|1998-06-18|1998-05-15|1998-06-27|COLLECT COD|FOB|ackages sleep bold realmsa f|
258|1062|68|1|8|7704a48|0a00|0a07|R|F|1994-01-20|1994-03-21|1994-02-09|NONE|REG AIR|ully about the fluffily silent dependencies|
258|1962|95|2|40|74558a40|0a10|0a01|A|F|1994-03-13|1994-02-23|1994-04-05|DELIVER IN PERSON|FOB|silent frets nod daringly busy, bold|
258|1618|19|3|45|68382a45|0a07|0a07|R|F|1994-03-04|1994-02-13|1994-03-30|DELIVER IN PERSON|TRUCK|regular excuses-- fluffily ruthl|
Specifically the columns that are causing output to the error file are:
|9551a22|0a05|0a02|
|7704a48|0a00|0a07|
|74558a40|0a10|0a01|
|68382a45|0a07|0a07|
So each line should cause 3 writes to the error file, specifying these values. It works fine for a small number of lines, but when it reads a large number of lines I start getting these mysterious blank spaces. This problem occurs only when I have numeric fields that contain characters.

At a guess, perhaps you have control characters in the input stream that cause some unexpected behaviour. Not sure exactly what outERROR is in the above context, but you can imagine, for example, that a form feed character in the input could have this sort of effect.
Try cleaning the data of non-printable characters first and see if that helps.
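One way to try the cleaning step is sketched below. It is only an illustration written for Python 3: the helper name strip_control_chars is made up here, and it assumes that tabs and newlines should survive while every other non-printable character (form feeds, NUL bytes and so on) is dropped.

```python
def strip_control_chars(text, keep="\t\n"):
    """Drop non-printable characters, except those listed in keep.

    Hypothetical helper for illustration: str.isprintable() is False for
    control characters such as form feed ('\x0c') and NUL ('\x00').
    """
    return "".join(ch for ch in text if ch.isprintable() or ch in keep)


# A made-up line containing a form feed and a NUL byte.
dirty = "9551a22\x0c|0a05\x00|0a02\n"
print(repr(strip_control_chars(dirty)))
```

Running each input line through such a filter before splitting it would show whether stray control characters are involved at all.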

Call open with 'rb' and 'wb' to ensure binary mode; otherwise the system may alter the data when it translates line endings.

Related

Summing a column in csv using Python

I work with large csv files and wanted to test if we can sum a numeric
column using Python. I generated a random data set:
id,first_name,last_name,email,gender,money
1,Clifford,Casterou,ccasterou0#dropbox.com,Male,53
2,Ethyl,Millichap,emillichap1#miitbeian.gov.cn,Female,58
3,Jessy,Stert,jstert2#gnu.org,Female,
4,Doy,Beviss,dbeviss3#dedecms.com,Male,80
5,Josee,Rust,jrust4#epa.gov,Female,13
6,Hedvige,Ahlf,hahlf5#vkontakte.ru,Female,67
On line 3 you will notice that the value is missing (I removed that data on
purpose to test).
I wrote the code :
import csv
with open("mock_7.txt","r+",encoding='utf8') as fin:
    headerline = fin.readline()
    amount = 0
    debit = 0
    value = 0
    for row in csv.reader(fin):
        # var = row.rstrip()
        value =row[5].replace('',0)
        value= float(value)
        debit+=value
    print (debit)
I got the error :
Traceback (most recent call last):
  File "sum_csv1_v2.py", line 11, in <module>
    value+= float(value)
TypeError: must be str, not float
As I am new to Python, my plan was to replace the empty cells with zero, but I think I am missing something here. Also, my script is based on comma-separated files, and I'm sure it won't work for files with other delimiters. Can you help me improve this code?
The original exception, now lost in the edit history,
TypeError: replace() argument 2 must be str, not int
is the result of str.replace() expecting string arguments, but you're passing an integer zero. Instead of replace you could simply check for empty string before conversion:
value = row[5]
value = float(value) if value else 0.0
Another option is to catch the potential ValueError:
try:
value = float(row[5])
except ValueError:
value = 0.0
This might hide the fact that the column contains "invalid" values other than just missing values.
Note that had you passed string arguments the end result would probably not have been what you expected:
In [2]: '123'.replace('', '0')
Out[2]: '0102030'
In [3]: float(_)
Out[3]: 102030.0
As you can see an empty string as the "needle" ends up replacing around each and every character in the string.
The latest exception in the question, after fixing the other errors, is the result of the float(value) conversion working and
value += float(value)
being equal to:
value = value + float(value)
and as the exception states, strings and floats don't mix.
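The question also asks about files that use other delimiters. A minimal sketch of that, assuming a header row with a named column: the function name sum_column and the sample rows are made up for illustration, and the delimiter argument is simply passed through to the csv module.

```python
import csv


def sum_column(lines, column, delimiter=","):
    """Sum one named column, counting empty or missing cells as 0.0.

    lines can be an open file or any iterable of text lines.
    Non-numeric text still raises ValueError, so bad data is not hidden.
    """
    total = 0.0
    for row in csv.DictReader(lines, delimiter=delimiter):
        cell = (row[column] or "").strip()
        if cell:
            total += float(cell)
    return total


# Made-up sample in the same shape as the question's data.
sample = [
    "id,first_name,money",
    "1,Clifford,53",
    "3,Jessy,",
    "4,Doy,80",
]
print(sum_column(sample, "money"))
```

For a pipe-delimited file the call would become sum_column(fin, "money", delimiter="|"), with fin opened in text mode.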
The problem with your code is that you call replace() without first checking whether row[5] is empty.
Fixed code:
import csv
with open("mock_7.txt","r+",encoding='utf8') as fin:
    headerline = fin.readline()
    debit = 0
    for row in csv.reader(fin):
        # Treat an empty cell as zero before converting
        if row[5].strip() == '':
            row[5] = 0
        value = float(row[5])
        debit += value
    print (debit)
output:
271.0

The String is Not Read Fully

I wrote a program that generates a string of digits (0, 1, 2 and 3) of length s and writes it to the decode.txt file. Below is the code:
import numpy as np
n_one =int(input('Insert the amount of 1: '))
n_two =int(input('Insert the amount of 2: '))
n_three = int(input('Insert the amount of 3: '))
l = n_one+n_two+n_three
n_zero = l+1
s = (2*(n_zero))-1
data = [0]*n_zero + [1]*n_one + [2]*n_two + [3]*n_three
print ("Data string length is %d" % len(data))
while data[0] == 0 and data[s-1]!=0:
    np.random.shuffle(data)
datastring = ''.join(map(str, data))
datastring = str(int(datastring))
files = open('decode.txt', 'w')
files.write(datastring)
files.close()
print("Data string is : %s " % datastring)
The problem occurs when I read the file from another program: that program doesn't pick up the last character of the string.
For example, if the generated string is 30112030000, the other program reads only 3011203000; the last 0 is missing.
But if I type 30112030000 directly into the .txt file, every character is read. I can't figure out what is wrong in my code.
Thank you
Some programs might not like the fact that the file doesn't end with a newline. Try adding files.write('\n') before you close it.
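A minimal sketch of that suggestion, assuming the reading program does expect a newline-terminated line (the question does not say how it reads the file). The helper name write_with_newline is made up; the with block closes the file even if the write fails.

```python
def write_with_newline(path, text):
    """Write text to path as a single line that ends with a newline."""
    with open(path, "w") as handle:
        handle.write(text + "\n")
```

In the question's script this would replace the open/write/close trio with write_with_newline('decode.txt', datastring).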

Python - splitting lines in txt file by semicolon in order to extract a text title...except sometimes the title has semicolons in it

So, I have an extremely inefficient way to do this that works, which I'll show, as it will help illustrate the problem more clearly. I'm an absolute beginner in python and this is definitely not "the python way" nor "remotely sane."
I have a .txt file where each line contains information about a large number of .csv files, following format:
File; Title; Units; Frequency; Seasonal Adjustment; Last Updated
(first entry:)
0\00XALCATM086NEST.csv;Harmonized Index of Consumer Prices: Overall Index Excluding Alcohol and Tobacco for Austria©; Index 2005=100; M; NSA; 2015-08-24
and so on, repeats like this for a while. For anyone interested, this is the St.Louis Fed (FRED) data.
I want to rename each file (currently named with the alphanumeric code at the start, 00XA etc.) to the text name. So, just split by semicolon, right? Except, sometimes, the text title has semicolons within it (and I want all of the text).
So I did:
data_file_data_directory = 'C:\*****\Downloads\FRED2_csv_3\FRED2_csv_2'
rename_data_file_name = 'README_SERIES_ID_SORT.txt'
rename_data_file = open(data_file_data_directory + '\\' + rename_data_file_name)
for line in rename_data_file.readlines():
    data = line.split(';')
    if len(data) > 2 and data[0].rstrip().lstrip() != 'File':
        original_file_name = data[0]
These last 2 lines deal with the fact that there is some introductory text that we want to skip, and we don't want to rename based on the legend at the top (!= 'File'). It saves the 00XAL__.csv as the old name. It may be possible to make this more elegant (I would appreciate the tips), but it's the next part (the new, text name) that gets really ugly.
        if len(data) ==6:
            new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1][:-2].replace(':',' -').replace('"','').replace('/',' or ')
        else:
            if len(data) ==7:
                new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2][:-2].replace(':',' -').replace('"','').replace('/',' or ')
            else:
                if len(data) ==8:
                    new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                else:
                    if len(data) ==9:
                        new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                    else:
                        if len(data) ==10:
                            new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[5][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                        else:
                            (etc)
The reason for all this is that there is no way to know, for each line in the file, how many items are in the list created by splitting it on semicolons. Ideally, the list would have length 6, matching the key at the top of my example of the data. However, for every semicolon in the text name, the length increases by 1, and we want everything before the last four items in the list (counting backwards from the right: date, seasonal adjustment, frequency, units/index) but after the .csv code. In other words, I want the text "title": everything on each line after .csv but before units/index.
Really what I want is just a way to save the entirety of the text name as "new_name" for each line, even after I split each line by semicolon, when I have no idea how many semicolons are in each text name or the line as a whole. The above code achieves this, but OMG, this can't be the right way to do this.
Please let me know if it's unclear or if I can provide more info.
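The rule described above (one file field on the left, four fixed fields on the right, and the title in between) can be sketched with a limited split from each side. This is only an illustration: the function name split_series_line is made up, and it assumes every data line has at least five semicolons.

```python
def split_series_line(line):
    """Split one README line into
    (file, title, units, frequency, seasonal_adjustment, last_updated).

    The title keeps any semicolons it contains.
    """
    # Everything before the first semicolon is the file name.
    file_name, rest = line.rstrip("\n").split(";", 1)
    # The last four fields are fixed, so split only four times from the
    # right; whatever is left over is the title.
    title, units, frequency, adjustment, updated = rest.rsplit(";", 4)
    fields = (file_name, title, units, frequency, adjustment, updated)
    return tuple(field.strip() for field in fields)
```

A line with too few fields raises ValueError when unpacking, which could be used to skip the introductory text; the chain of replace() calls then only needs to be applied once, to the title.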

parsing array contents and adding the values

I have several files that end in ".log". The three lines before the final blank line contain the data of interest.
Example File contents (Last four lines. fourth line is blank):
Total: 150
Success: 120
Error: 30
I am reading these contents into an array and trying to find an elegant way to:
1) Extract the numeric data for each category (Total, Success, Error), and error out if the second part is not numeric
2) Add them all up
I came up with the following code (getLastXLines function excluded for brevity) that returns the aggregate:
def getSummaryData(testLogFolder):
    (path, dirs, files) = os.walk(testLogFolder).next()
    #aggregate = [grandTotal, successTotal, errorTotal]
    aggregate = [0, 0, 0]
    for currentFile in files:
        fullNameFile = path + "\\" + currentFile
        if currentFile.endswith(".log"):
            with open(fullNameFile,"r") as fH:
                linesOfInterest=getLastXLines(fH, 4)
            #If the file doesn't contain expected number of lines
            if len(linesOfInterest) != 4:
                print fullNameFile + " doesn't contain the expected summary data"
            else:
                for count, line in enumerate(linesOfInterest[0:-1]):
                    results = line.split(': ')
                    if len(results)==2:
                        aggregate[count] += int(results[1])
                    else:
                        print "error with " + fullNameFile + " data. Not adding the total"
    return aggregate
Being relatively new to Python, and seeing the power of it, I feel there may be a more powerful and efficient way to do this. Maybe there is a short list comprehension for this kind of thing? Please help.
def getSummaryData(testLogFolder):
    summary = {'Total':0, 'Success':0, 'Error':0}
    (path, dirs, files) = os.walk(testLogFolder).next()
    for currentFile in files:
        fullNameFile = path + "\\" + currentFile
        if currentFile.endswith(".log"):
            with open(fullNameFile,"r") as fH:
                for pair in [line.split(':') for line in fH.read().split('\n')[-5:-2]]:
                    try:
                        summary[pair[0].strip()] += int(pair[1].strip())
                    except ValueError:
                        print pair[1] + ' is not a number'
                    except KeyError:
                        print pair[0] + ' is not "Total", "Success", or "Error"'
    return summary
Piece by piece:
fH.read().split('\n')[-5:-2]
Here we take the three summary lines. This assumes the file ends with a blank line followed by a final newline, so the split leaves two empty strings at the end of the list and [-5:-2] skips them.
line.split(':') for line in
Each of those lines is split at the colon.
try:
    summary[pair[0].strip()] += int(pair[1].strip())
Now we try to use the first part as a key and the second part as a number, and add it to the running total.
except ValueError:
    print pair[1] + ' is not a number'
except KeyError:
    print pair[0] + ' is not "Total", "Success", or "Error"'
If the value isn't a number, or the key isn't one we are looking for, we print an error.
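For readers on Python 3, the same idea can be sketched without print statements or os.walk(...).next(). The names parse_summary and aggregate below are made up for illustration, and the input is assumed to be the last few lines of each log file, already read into lists.

```python
from collections import Counter

KEYS = ("Total", "Success", "Error")


def parse_summary(lines):
    """Return {"Total": n, "Success": n, "Error": n} from summary lines.

    Raises ValueError if a value is not numeric or a key is missing.
    """
    summary = {}
    for line in lines:
        name, _, value = line.partition(":")
        name = name.strip()
        if name in KEYS:
            summary[name] = int(value)  # int() tolerates surrounding spaces
    missing = [key for key in KEYS if key not in summary]
    if missing:
        raise ValueError("missing summary fields: " + ", ".join(missing))
    return summary


def aggregate(files_lines):
    """Add up the summaries of several files (one list of lines per file)."""
    total = Counter()
    for lines in files_lines:
        total.update(parse_summary(lines))
    return dict(total)


print(aggregate([
    ["Total: 150", "Success: 120", "Error: 30", ""],
    ["Total: 10", "Success: 7", "Error: 3", ""],
]))
```

Looking keys up by name rather than by position means the three lines may appear in any order, at the cost of a slightly longer function than a single comprehension.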

Converting/concatenating integer to string with python

I'm trying to read the last line from a text file. Each line starts with a number, so the next time something is inserted, the new number will be incremented by 1.
For example, this would be a typical file
1. Something here date
2. Something else here date
#next entry would be "3. something date"
If the file is blank I can enter an entry with no problem. However, when there are already entries I get the following error
LastItemNum = lineList[-1][0:1] +1 #finds the last item's number
TypeError: cannot concatenate 'str' and 'int' objects
Here's my code for the function
def AddToDo(self):
    FILE = open(ToDo.filename,"a+") #open file for appending and reading
    FileLines = FILE.readlines() #read the lines in the file
    if os.path.getsize("EnteredInfo.dat") == 0: #if there is nothing, set the number to 1
        LastItemNum = "1"
    else:
        LastItemNum = FileLines[-1][0:1] + 1 #finds the last items number
    FILE.writelines(LastItemNum + ". " + self.Info + " " + str(datetime.datetime.now()) + '\n')
    FILE.close()
I tried to convert LastItemNum to a string but I get the same "cannot concatenate" error.
Convert the slice to an integer before adding 1:
LastItemNum = int(lineList[-1][0:1]) +1
Then convert LastItemNum back to a string before writing it to the file:
LastItemNum=str(LastItemNum)
Alternatively, use string formatting.
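A small sketch of the same conversion, with one change that is not in the answer above: instead of slicing the first character with [0:1], it reads everything before the first dot, so entry numbers of 10 and above also work. The function name next_item_number is made up for illustration.

```python
def next_item_number(lines):
    """Return the number to use for the next entry.

    lines holds the file's existing lines, e.g. "2. Something else date".
    """
    entries = [line for line in lines if line.strip()]
    if not entries:
        return 1
    # Text before the first "." is the item number, as a string.
    last_number = entries[-1].split(".", 1)[0]
    return int(last_number) + 1


existing = ["1. Something here date\n", "2. Something else here date\n"]
number = next_item_number(existing)
# String formatting converts the integer for us.
print("{}. {}".format(number, "something date"))
```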
