Summing a column in csv using Python - python

I work with large csv files and wanted to test if we can sum a numeric
column using Python. I generated a random data set:
id,first_name,last_name,email,gender,money
1,Clifford,Casterou,ccasterou0#dropbox.com,Male,53
2,Ethyl,Millichap,emillichap1#miitbeian.gov.cn,Female,58
3,Jessy,Stert,jstert2#gnu.org,Female,
4,Doy,Beviss,dbeviss3#dedecms.com,Male,80
5,Josee,Rust,jrust4#epa.gov,Female,13
6,Hedvige,Ahlf,hahlf5#vkontakte.ru,Female,67
On line 3 you will notice that the value is missing (I removed that data on purpose for testing).
I wrote the code :
import csv
with open("mock_7.txt", "r+", encoding='utf8') as fin:
    headerline = fin.readline()
    amount = 0
    debit = 0
    value = 0
    for row in csv.reader(fin):
        # var = row.rstrip()
        value = row[5].replace('', 0)
        value = float(value)
        debit += value
    print(debit)
I got the error :
Traceback (most recent call last):
File "sum_csv1_v2.py", line 11, in <module>
value+= float(value)
TypeError: must be str, not float
As I am new to Python, my plan was to convert the empty cells to zero, but I think I am missing something here. Also, my script assumes comma-separated files, so I'm sure it won't work for files with other delimiters. Can you help me improve this code?

The original exception, now lost in the edit history,
TypeError: replace() argument 2 must be str, not int
is the result of str.replace() expecting string arguments, but you're passing an integer zero. Instead of replace you could simply check for empty string before conversion:
value = row[5]
value = float(value) if value else 0.0
Another option is to catch the potential ValueError:
try:
    value = float(row[5])
except ValueError:
    value = 0.0
This might hide the fact that the column contains "invalid" values other than just missing values.
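If that silent defaulting worries you, a small sketch (the `to_float` name and the `invalid` list are my own, not from the question) that still treats empty cells as zero but records any non-empty, non-numeric cell for later inspection:

```python
def to_float(cell, row_no, invalid):
    """Convert a cell to float: empty cells become 0.0; non-empty,
    non-numeric cells are recorded in `invalid` for later inspection."""
    try:
        return float(cell)
    except ValueError:
        if cell.strip():                      # non-empty but not numeric
            invalid.append((row_no, cell))
        return 0.0

invalid = []
values = [to_float(c, n, invalid) for n, c in enumerate(["53", "", "x7", "13"])]
print(values)   # [53.0, 0.0, 0.0, 13.0]
print(invalid)  # [(2, 'x7')]
```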
Note that had you passed string arguments the end result would probably not have been what you expected:
In [2]: '123'.replace('', '0')
Out[2]: '0102030'
In [3]: float(_)
Out[3]: 102030.0
As you can see an empty string as the "needle" ends up replacing around each and every character in the string.
The latest exception in the question, after fixing the other errors, is the result of the float(value) conversion working and
value += float(value)
being equal to:
value = value + float(value)
and as the exception states, strings and floats don't mix.
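Putting these pieces together, a minimal sketch of a reusable helper (the `sum_column` name and its parameters are my own) that also addresses the asker's concern about other delimiters via `csv.reader`'s `delimiter` argument:

```python
import csv

def sum_column(path, column, delimiter=","):
    """Sum a numeric column, treating empty or non-numeric cells as 0.0."""
    total = 0.0
    with open(path, newline="", encoding="utf8") as fin:
        reader = csv.reader(fin, delimiter=delimiter)
        next(reader)                  # skip the header row
        for row in reader:
            try:
                total += float(row[column])
            except ValueError:        # empty or non-numeric cell
                total += 0.0
    return total
```

For the sample data above this yields 271.0 (53 + 58 + 0 + 80 + 13 + 67); passing `delimiter="\t"` would handle a tab-separated file the same way.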

The problem with your code is that you're calling replace() without checking whether row[5] is empty.
Fixed code:
import csv
with open("mock_7.txt", "r+", encoding='utf8') as fin:
    headerline = fin.readline()
    debit = 0
    for row in csv.reader(fin):
        if row[5].strip() == '':
            row[5] = 0
        value = float(row[5])
        debit += value
    print(debit)
(Note the stray value += float(value) line from the original code has been dropped; it would double every value.)
output:
271.0

Related

Keep Getting ValueError: not enough values to unpack (expected 2, got 1) for a text file for sentiment analysis?

I am trying to turn this text file into a dictionary using the code below:
with open("/content/corpus.txt", "r") as my_corpus:
    wordpoints_dict = {}
    for line in my_corpus:
        key, value = line.split('')
        wordpoints_dict[key] = value
print(wordpoints_dict)
It keeps returning:
ValueError Traceback (most recent call last)
<ipython-input-18-8cf5e5efd882> in <module>()
2 wordpoints_dict = {}
3 for line in my_corpus:
----> 4 key, value = line.split('-')
5 wordpoints_dict[key] = value
6 print(wordpoints_dict)
ValueError: not enough values to unpack (expected 2, got 1)
The data in the text file looks like this:
Text Data
You are trying to split a text value at '-' and unpack it into two values (the key before the dash, the value after the dash). However, some lines in your txt file do not contain a dash, so there are not two values to unpack. Try checking for blank lines, as this could be a cause of the issue.
Your code doesn't match the error message. I'm going to assume that the error message is the correct one...
Just add a little logic to handle the case where there isn't a - on a line. I wouldn't be surprised if you fixed that problem and then hit the other side of that problem, where the line has more than one -. If that occurs in your file, you'll have to deal with that case as well, as you'll get a "too many values to unpack" error then. Here's your code with the added boilerplate for doing both of these things:
with open("/content/corpus.txt", "r") as my_corpus:
    wordpoints_dict = {}
    for line in my_corpus:
        parts = line.split('-')
        if len(parts) == 1:
            parts = (line, '')   # If no '-', use an empty second value
        elif len(parts) > 2:
            parts = parts[:2]    # If too many items from split, use the first two
        key, value = [x.strip() for x in parts]  # strip leading and trailing spaces
        wordpoints_dict[key] = value
print(wordpoints_dict)
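As an aside, str.partition always returns exactly three parts, which sidesteps the unpacking error without any length checks; a small sketch with made-up lines:

```python
wordpoints_dict = {}
for line in ["apple-3", "no dash here", "banana-7"]:
    # partition splits at the first '-'; sep is '' when no '-' is present
    key, sep, value = line.partition('-')
    wordpoints_dict[key.strip()] = value.strip()

print(wordpoints_dict)  # {'apple': '3', 'no dash here': '', 'banana': '7'}
```

Lines with more than one dash keep everything after the first dash in `value`, which may or may not be what you want.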

Hyphens in csv columns/unknown data causing int conversion errors

I know how to convert between data types. Unfortunately, something in the data is obviating my str to int conversion during the cleaning process.
My code executes normally when I don't cast to an int. When I examined the csv file I realized that there are hyphens in the BeginDate and EndDate columns. I thought this was the reason for my ValueError, but have learned in the comments that this is not the case.
raw text
from csv import reader

opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]  # BeginDate
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    #bd = int(bd)
    # I've stopped the loop after the first row "moma[0]",
    # therefore no other cells should be causing the error.
    if row == moma[0]:
        print(bd)
        print(type(bd))
As per the comments section, you discovered that the parentheses represent a negative number. Almost certainly, you have a cell that is not an integer type. An easy way to find the issue is to wrap your conversion in a try/except. For now, just print the cell; later, you will need to decide what to do with it.
from csv import reader

opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    try:
        bd = int(bd)
    except ValueError:
        print(bd)  # Just to find your bad cell; otherwise choose what to do with it.
For example, if I have a csv with the following data;
FName, LName, Number
James, Jones, (20)
Sam, Smith, (30)
Someone, Else, nan
and I run the code (changing to row[2] instead of row[5]), I will get a printed result of "nan" because the conversion to int fails. This tells me that I have a row that contains something other than an integer.
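Since the comments established that the parentheses denote a negative number, a hedged sketch of a converter that preserves the sign instead of merely stripping the brackets (the `to_int` name is my own):

```python
def to_int(cell):
    """Convert an accounting-style cell such as '(20)' to -20;
    plain digit strings pass through unchanged."""
    cell = cell.strip()
    if cell.startswith("(") and cell.endswith(")"):
        return -int(cell[1:-1])
    return int(cell)

print(to_int("(20)"))  # -20
print(to_int("30"))    # 30
```

A cell like "nan" would still raise ValueError here, so the try/except shown above remains useful.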
Adding my own answer because this was the solution in code. SteveJ's comments led me to ask myself questions resulting in absolute filters so I marked his answer as correct.
I didn't know a number with a leading zero is not an integer in Python. Some of the cells started with a leading zero and certainly looked like an integer eg 0196. In addition, I tried to use 0000 as a placeholder cell for unknown dates. The exceptions to the leading zero rule in Python are numbers that contain all zeros like 0000. However, since I was filtering out zeros with other conditions, it was safer to use 1111 as my placeholder integer.
I had to get aggressive with the cleaning and create filters that eliminated all possible outliers even though I could not see them. A "Just In Case Filter" to filter out everything that did not leave me with a 4 digit number string. Now I have 4 digit year integers with 1111 integer placeholder cells so all is good.
In the end, I was able to clean it using these filters.
def clean_date(string):
    bad_chars = ["(", ")", "\n", "\r", "\t"]
    for char in bad_chars:
        string = string.replace(char, "")
    if len(string) > 4:
        string = string[:4]
    elif len(string) < 4:
        string = "1111"  # Don't use "0000" for padding, placeholders etc.
    elif " " in string:
        string = "1111"
    elif string.isdigit() == False:
        string = "1111"
    elif len(string.split('1', 1)[0]):
        string = "1111"
    return string
for row in moma:
    bd = row[5]  # BeginDate/Birth Date
    bd = clean_date(bd)
    bd = int(bd)  # Conversion
    if row == moma[0]:
        print(bd)
        print(type(bd))
# Date of birth as an int
# 1841 <class 'int'>

generate string with length equal to length of time in file, with 1 label per second , python

I have a file like this:
https://gist.github.com/manbharae/70735d5a7b2bbbb5fdd99af477e224be
What I want to do is generate 1 label for 1 second.
Since this above file is 160 seconds long, there should be 160 labels. in other words I want to generate string of length 160.
However I'm ending up having an str of len 166 instead of 160.
My code :
import os

filename = './test_file.txt'
ann = []
with open(filename, 'r') as f:
    for line in f:
        _, end, label = line.strip().split('\t')
        ann.append((int(float(end)), 'MIT' if label == 'MILAN' else 'not-MIT'))

str = ''
prev_value = 0
for s in ann:
    value = s[0]
    letter = 'M' if s[1] == 'MIT' else 'x'
    str += letter * (value - prev_value)
    print str
    prev_value = value

name_of_file, file_ext = os.path.splitext(os.path.basename(filename))
print "\n\nfile_name processed:", name_of_file
print str
print "length of string", len(str), "\n\n"
My final output:
xxxxxxxMxMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxMMMMMMMMMMMMMMMMMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
166.
Which is wrong. The string should be 160 characters, one character per second, because the file is 160 seconds long.
There is some small bug somewhere, unable to find it.
Please advise what's going wrong here?
Thanks.
One thing I tried was adding an if condition to break out of the loop once a length of 160 is reached, like this:
if ann[len(ann)-1][0] == len(str):
    break
However, it didn't help. AFAIK something is going wrong in the last iteration, because everything is fine until then.
I looked at : https://stackoverflow.com/a/14927311/4932791
https://stackoverflow.com/a/1424016/4932791
The reason it doesn't add up is that there are two places where a negative number of letters would be added, because the end time is lower than the previous one (multiplying a string by a negative count yields an empty string, and since prev_value then moves backwards, the following spans are counted twice, inflating the length):
(69, 'not-MIT')
(68, 'not-MIT')
(76, 'not-MIT')
(71, 'not-MIT')
For future reference: it's better not to call your variable 'str', as str() is already a built-in function in Python.
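One way to guard against such out-of-order end times is to clamp each repetition count at zero and carry a running maximum; a sketch with made-up data (the numbers are illustrative, not from the linked file):

```python
# 8 arrives after 10, i.e. an out-of-order end time
ann = [(5, 'MIT'), (10, 'not-MIT'), (8, 'not-MIT'), (12, 'MIT')]

out = ''
prev = 0
for end, label in ann:
    letter = 'M' if label == 'MIT' else 'x'
    out += letter * max(0, end - prev)  # negative counts add nothing
    prev = max(prev, end)               # never move backwards

print(out)       # MMMMMxxxxxMM
print(len(out))  # 12 == the last end time, as intended
```

Whether clamping is the right policy depends on what the overlapping annotations mean in the source data.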

Convert string (letter) from file text to integer

I'm new to machine learning. I ran into a problem when trying to convert letters with int(). I use Python 3.5 on Mac OS. This is my code:
from numpy import zeros  # zeros() comes from NumPy

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine1 = line.split('\t')
        listFromLine = zeros(3)
        i = 0
        for value in listFromLine1:
            if value.isdigit():
                valueAsInt = int(value)
                listFromLine[i] = valueAsInt
            i += 1
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine1[-1]))
        index += 1
    return returnMat, classLabelVector
This is my txt file:
23 8 1 f
7 8 5 j
5 9 1 j
6 6 6 f
This is the error:
classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f'
Can anybody help me with these problems?
If I understand your desired outcome correctly, you want to return a list with n lists in it. Each list will be along the line of [23. 8. 1.]. Then you want a second list that takes the last number of each list like this: [1, 5, 1, 6].
Assuming this is all correct, the reason you are getting classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f' is because you are not returning any numbers, but a string instead. I found 3 issues that should fix the error.
First, I found no '\t' in your text document. I instead used listFromLine1 = line.split(' ') and it split based on spaces. This could simply be from the way it copied over when you posted, though.
Second, when you assign a value for each position in listFromLine you then ignore it and append from listFromLine1 which you did nothing to, so it remains a string.
Third, try using if value.isnumeric(): instead of if value.isdigit():.
Fixing these few problems should get the program working. Also, you open the file and run fr.readlines() twice and never close it, making the program do the same work twice for the same information. You should rewrite it to open the file only once, using with open() as fr:, which closes it automatically when done.
EDIT: if you want the second list to be the letters instead [f, j, j, f] then keep it as listFromLine1 and use str() instead of int(): classLabelVector.append(str(listFromLine1[-1]))
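Putting those suggestions together, a sketch of a single-pass rewrite (plain lists stand in for the NumPy array here for brevity, the split is on any whitespace since no tabs were found in the posted file, and the trailing letter is kept as a string):

```python
def file2matrix(filename):
    returnMat = []
    classLabelVector = []
    with open(filename) as fr:                 # single pass, auto-closed
        for line in fr:
            parts = line.strip().split()       # split on any whitespace
            returnMat.append([int(v) for v in parts[:3]])
            classLabelVector.append(parts[-1])  # keep the letter as a string
    return returnMat, classLabelVector
```

For the posted file this would give a label list of ['f', 'j', 'j', 'f']; wrapping `returnMat` in `numpy.array(...)` afterwards restores the original matrix shape if needed.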

Python - Output creating blank spaces

I'm encountering a very strange issue in a python script I've written. This is the piece of code that is producing abnormal results:
EDIT: I've included the entire loop in the code segment now.
data = open(datafile, 'r')
outERROR = open(outERRORfile, 'w')
precision = [0]
scale = [0]
lines = data.readlines()
limit = 0
if filetype == 'd':
    for line in lines:
        limit += 1
        if limit > checklimit:
            break
        columns = line.split(fieldDelimiter)
        for i in range(len(columns) - len(precision)):
            precision.append(0)
        for i in range(len(columns) - len(scale)):
            scale.append(0)
        if len(datatype) != len(precision):
            sys.exit()  # Exits the script if the number of data types (fields found in the DDL file) doesn't match the number of columns found in the data file
        i = -1
        for eachcolumn in columns:
            i += 1
            if len(rstrip(columns[i])) > precision[i]:
                precision[i] = len(rstrip(columns[i]))
            if columns[i].find('.') != -1 and (len(rstrip(columns[i])) - rstrip(columns[i]).find('.')) > scale[i]:
                scale[i] = len(rstrip(columns[i])) - rstrip(columns[i]).find('.') - 1
            if datatype[i][0:7] == 'integer':
                if int(columns[i]) < -2147483648 or int(columns[i]) > 2147483647:
                    outERROR.write("Integer value too high or too low to fit inside Integer data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:8] == 'smallint':
                if int(columns[i]) < -32768 or int(columns[i]) > 32767:
                    outERROR.write("Smallint value too high or too low to fit inside Smallint data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:7] == 'byteint':
                if int(columns[i]) < -128 or int(columns[i]) > 127:
                    outERROR.write("Byteint value too high or too low to fit inside Byteint data type, column: " + str(i + 1) + ", value: " + columns[i] + "\n")
            if datatype[i][0:4] == 'date':
                if DateParse(columns[i], format1[i]) > -1:
                    pass
                elif DateParse(columns[i], format2[i]) > -1:
                    pass
                elif DateParse(columns[i], format3[i]) > -1:
                    pass
                else:
                    outERROR.write('Date format error, column: ' + str(i + 1) + ', value: ' + columns[i])
            if datatype[i][0:9] == 'timestamp':
                if DateParse(columns[i], timestamp1[i]) > -1:
                    pass
                elif DateParse(columns[i], timestamp2[i]) > -1:
                    pass
                elif DateParse(columns[i], timestamp3[i]) > -1:
                    pass
                else:
                    outERROR.write('Timestamp format error, column: ' + str(i + 1) + ', value: ' + columns[i] + '\n')
            if (datatype[i][0:7] == 'decimal'
                    or datatype[i][0:7] == 'integer'
                    or datatype[i][0:7] == 'byteint'
                    or datatype[i][0:5] == 'float'
                    or datatype[i][0:8] == 'smallint'):
                try:
                    y = float(columns[i])
                except ValueError:
                    outERROR.write('Character found in numeric data type, column: ' + str(i + 1) + ', value: ' + columns[i] + "\n")
                else:
                    pass
This is part of a loop that reads a data file. Basically it checks the declared type of each field (to determine whether it's supposed to be numeric) and tries to turn the value into a float to see whether the data file actually contains numeric data. If it's not numeric, it writes the error you see above to a text file (currently named outERROR). When I wrote this and tested it on a small data file (4 lines) it worked fine, but when I run it on a larger file (several thousand rows) my error file suddenly fills with a bunch of blank spaces, and only a few of the error messages are created.
Here is what the error file looks like when I run the script with 4 rows:
Character found in numeric data type, column: 6, value: 24710a35
Character found in numeric data type, column: 7, value: 0a04
Character found in numeric data type, column: 8, value: 0a02
Character found in numeric data type, column: 6, value: 56688a12
Character found in numeric data type, column: 7, value: 0a09
Character found in numeric data type, column: 8, value: 0a06
Character found in numeric data type, column: 6, value: 12301a04
Character found in numeric data type, column: 7, value: 0a10
Character found in numeric data type, column: 8, value: 0a02
Character found in numeric data type, column: 6, value: 25816a56
Character found in numeric data type, column: 7, value: 0a09
Character found in numeric data type, column: 8, value: 0a06
This is the expected output.
When I run it on larger files, I start to get blank spaces at the top of the error file, and only the last 40-50 or so error writes actually appear as text in the file. The larger the file, the more blank spaces it outputs. I'm completely lost on this; I've read some of the other questions on stackoverflow.com about mysterious blank lines and spaces, but they don't seem to address my issue.
EDIT: outERROR is the name I've given to the error file that the output is writing to. It is a simple .txt file.
This is a sample of the data file:
257|1463|64|1|7|9551a22|0a05|0a02|N|O|1998-06-18|1998-05-15|1998-06-27|COLLECT COD|FOB|ackages sleep bold realmsa f|
258|1062|68|1|8|7704a48|0a00|0a07|R|F|1994-01-20|1994-03-21|1994-02-09|NONE|REG AIR|ully about the fluffily silent dependencies|
258|1962|95|2|40|74558a40|0a10|0a01|A|F|1994-03-13|1994-02-23|1994-04-05|DELIVER IN PERSON|FOB|silent frets nod daringly busy, bold|
258|1618|19|3|45|68382a45|0a07|0a07|R|F|1994-03-04|1994-02-13|1994-03-30|DELIVER IN PERSON|TRUCK|regular excuses-- fluffily ruthl|
Specifically the columns that are causing output to the error file are:
|9551a22|0a05|0a02|
|7704a48|0a00|0a07|
|74558a40|0a10|0a01|
|68382a45|0a07|0a07|
So each line should cause 3 writes to the error file, specifying these values. It works fine for a small number of lines, but when it reads a large number of lines I start getting these mysterious blank spaces. This problem occurs only when I have numeric fields that contain characters.
At a guess, perhaps you have control characters in the input stream that cause some unexpected behaviour. I'm not sure exactly what outERROR is in the above context, but you can imagine, for example, that a form feed character in the input could have this sort of effect.
Try cleaning the data of non-printable characters first and see if that helps.
Call open with 'rb' and 'wb' to ensure binary mode; otherwise the data can be altered by the system trying to normalise line endings.
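Following the first suggestion, a minimal sketch (the helper name is my own) of cleaning non-printable control characters out of each line before it is split:

```python
def strip_control_chars(text):
    """Drop ASCII control characters (e.g. a stray form feed) while
    keeping tabs and newlines, which line/field parsing relies on."""
    return ''.join(ch for ch in text if ch.isprintable() or ch in '\t\n')

line = "9551a22|0a05\x0c|0a02\n"          # contains a hidden form feed
print(repr(strip_control_chars(line)))    # '9551a22|0a05|0a02\n'
```

Applying this to each line before `line.split(fieldDelimiter)` would rule control characters in or out as the cause of the blank output.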
