Hyphens in csv columns/unknown data causing int conversion errors


I know how to convert between data types. Unfortunately, something in the data is breaking my str-to-int conversion during the cleaning process.
My code executes normally when I don't cast to an int. When I examined the CSV file I realized that there are hyphens in the BeginDate and EndDate columns. I thought this was the reason for my ValueError, but have learned in the comments that this is not the case.
from csv import reader
opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]  # BeginDate
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    #bd = int(bd)
    # I've stopped the loop after the first row "moma[0]",
    # therefore no other cells should be causing the error.
    if row == moma[0]:
        print(bd)
        print(type(bd))

As you discovered in the comments section, the parentheses represent a negative number. Almost certainly, you have a cell that is not an integer. An easy way to find the problem is to wrap your conversion in a try/except. For now, just print the cell; later, you will need to decide what to do with it.
from csv import reader
opened_file = open('/Users/tymac/Artworks.csv')
read_data = reader(opened_file)
moma = list(read_data)
moma_header = moma[0]
moma = moma[1:]
for row in moma:
    bd = row[5]
    bd = bd.replace("(", "")
    bd = bd.replace(")", "")
    try:
        bd = int(bd)
    except ValueError:
        print(bd)  # Just to find your bad cell, otherwise choose what to do with it.
For example, if I have a CSV with the following data:
FName, LName, Number
James, Jones, (20)
Sam, Smith, (30)
Someone, Else, nan
and I run the code (changing to row[2] instead of row[5]), I will get a printed result of "nan" because the conversion to int fails. This tells me that I have a row that contains something other than an integer.
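To make the bad cells easier to locate in a large file, the same try/except idea can also record the line number. The following is a minimal sketch under stated assumptions: the helper name find_bad_cells and the in-memory SAMPLE data are hypothetical, and the parentheses are stripped the same way as in the code above.

```python
import csv
import io

# Hypothetical sample data, mirroring the small example above.
SAMPLE = """FName,LName,Number
James,Jones,(20)
Sam,Smith,(30)
Someone,Else,nan
"""

def find_bad_cells(csv_file, column):
    """Return (line_number, raw_value) for every cell that int() rejects."""
    bad = []
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for row in reader:
        raw = row[column]
        cleaned = raw.replace("(", "").replace(")", "").strip()
        try:
            int(cleaned)
        except ValueError:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, raw))
    return bad

bad_cells = find_bad_cells(io.StringIO(SAMPLE), column=2)
print(bad_cells)  # [(4, 'nan')]
```

With a real file, the io.StringIO object would be replaced by an open file handle and column by the index of the date column.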

Adding my own answer because this was the solution in code. SteveJ's comments led me to ask myself questions resulting in absolute filters so I marked his answer as correct.
I didn't know that a number written with a leading zero is not a valid integer literal in Python. Some of the cells started with a leading zero and certainly looked like integers, e.g. 0196. In addition, I tried to use 0000 as a placeholder for unknown dates. The exception to the leading-zero rule in Python is a number that consists entirely of zeros, like 0000. However, since I was filtering out zeros with other conditions, it was safer to use 1111 as my placeholder integer.
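As a quick check on the leading-zero point, the sketch below probes how int() treats a few strings. It is worth noting that the leading-zero restriction applies to integer literals typed into Python 3 source code; int() applied to a string such as "0196" accepts the leading zero, while empty cells or cells with leftover characters raise ValueError. The helper name try_int is hypothetical.

```python
def try_int(text):
    """Return int(text), or None when int() raises ValueError."""
    try:
        return int(text)
    except ValueError:
        return None

print(try_int("0196"))    # 196: a leading zero inside a string is accepted
print(try_int("0000"))    # 0
print(try_int(""))        # None: an empty cell cannot be converted
print(try_int("(1841)"))  # None: the parentheses are still attached
```

A check like this on the raw cells can show whether the failures come from leading zeros or from empty and non-numeric values.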
I had to get aggressive with the cleaning and create filters that eliminated all possible outliers, even the ones I could not see. This is a "just in case" filter that replaces everything that does not leave me with a 4-digit number string. Now I have 4-digit year integers, with 1111 as the integer placeholder, so all is good.
In the end, I was able to clean the data using these filters.
def clean_date(string):
    bad_chars = ["(", ")", "\n", "\r", "\t"]
    for char in bad_chars:
        string = string.replace(char, "")
    if len(string) > 4:
        string = string[:4]
    elif len(string) < 4:
        string = "1111"  # Don't use "0000" for padding, placeholders etc.
    elif " " in string:
        string = "1111"
    elif not string.isdigit():
        string = "1111"
    elif len(string.split('1', 1)[0]):
        # Anything that does not start with "1" becomes the placeholder
        string = "1111"
    return string
for row in moma:
    bd = row[5]  # BeginDate/Birth Date
    bd = clean_date(bd)
    bd = int(bd)  # Conversion
    if row == moma[0]:
        print(bd)
        print(type(bd))
# Date of birth as an int
# 1841 <class 'int'>

Related

Add linebreaks after N special characters are observed

I have a CSV file whose data is in the wrong format: several records run together on one line. Based on the number of pipes, I need to add a newline character to make the data ready for consumption.
Can we count the number of pipes and add a newline (\n) character?
Example:
sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0
Expected output:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0

Something like this?
_in = "sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0"
_out = ""
pipeCount = 0
for char in _in:
    if pipeCount == 6:
        _out = _out+char+"\n"
        pipeCount = 0
    else:
        _out = _out+char
    if char == "|":
        pipeCount += 1
print(_out)
I am not sure I understood the criterion for adding a newline (see the comments on the question), but my output conforms with your expectation:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0
The output is still a string, but you can just as easily make it a list of strings.
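One way the list variant could look is sketched below. It assumes, as in the sample, that every record ends with exactly one character after its sixth pipe; the function name split_records and the pipes_per_record parameter are hypothetical.

```python
def split_records(text, pipes_per_record=6):
    """Split run-together records into a list, closing a record on the
    first character that follows its last pipe."""
    records = []
    current = []
    pipe_count = 0
    for char in text:
        current.append(char)
        if pipe_count == pipes_per_record:
            # This character is the one after the final pipe: close the record
            records.append("".join(current))
            current = []
            pipe_count = 0
        elif char == "|":
            pipe_count += 1
    if current:
        # Keep any trailing, incomplete record instead of dropping it
        records.append("".join(current))
    return records

raw = "sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0"
records = split_records(raw)
print("\n".join(records))
```

Collecting characters in a list and joining once per record avoids repeated string concatenation, which matters for long inputs.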

Summing a column in csv using Python

I work with large CSV files and wanted to test whether we can sum a numeric column using Python. I generated a random data set:
id,first_name,last_name,email,gender,money
1,Clifford,Casterou,ccasterou0#dropbox.com,Male,53
2,Ethyl,Millichap,emillichap1#miitbeian.gov.cn,Female,58
3,Jessy,Stert,jstert2#gnu.org,Female,
4,Doy,Beviss,dbeviss3#dedecms.com,Male,80
5,Josee,Rust,jrust4#epa.gov,Female,13
6,Hedvige,Ahlf,hahlf5#vkontakte.ru,Female,67
On line 3 you will notice that the value is missing (I removed that data on purpose to test).
I wrote the code:
import csv
with open("mock_7.txt","r+",encoding='utf8') as fin:
    headerline = fin.readline()
    amount = 0
    debit = 0
    value = 0
    for row in csv.reader(fin):
        # var = row.rstrip()
        value = row[5].replace('',0)
        value = float(value)
        debit += value
    print(debit)
I got the error:
Traceback (most recent call last):
  File "sum_csv1_v2.py", line 11, in <module>
    value+= float(value)
TypeError: must be str, not float
As I am new to Python, my plan was to replace the empty cells with zero, but I think I am missing something here. Also, my script is written for comma-separated files, and I'm sure it won't work for files with other delimiters. Can you help me improve this code?

The original exception, now lost in the edit history,
TypeError: replace() argument 2 must be str, not int
is the result of str.replace() expecting string arguments, but you're passing an integer zero. Instead of replace you could simply check for empty string before conversion:
value = row[5]
value = float(value) if value else 0.0
Another option is to catch the potential ValueError:
try:
    value = float(row[5])
except ValueError:
    value = 0.0
This might hide the fact that the column contains "invalid" values other than just missing values.
Note that had you passed string arguments the end result would probably not have been what you expected:
In [2]: '123'.replace('', '0')
Out[2]: '0102030'
In [3]: float(_)
Out[3]: 102030.0
As you can see, an empty string as the "needle" matches at every position, so the replacement ends up inserted before, between and after every character in the string.
The latest exception in the question, after fixing the other errors, occurs because value still holds a string at that point: the float(value) conversion works, but
value += float(value)
is equivalent to:
value = value + float(value)
and, as the exception states, strings and floats don't mix.
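Putting the pieces together, a minimal sketch of the whole summation might look like the following. The function name sum_column and the in-memory SAMPLE are hypothetical; the delimiter parameter is passed through to csv.DictReader so the same function could handle other separators, and blank cells are treated as zero, which is an assumption about what the asker wants.

```python
import csv
import io

# The sample data from the question, held in memory for the sketch.
SAMPLE = """id,first_name,last_name,email,gender,money
1,Clifford,Casterou,ccasterou0#dropbox.com,Male,53
2,Ethyl,Millichap,emillichap1#miitbeian.gov.cn,Female,58
3,Jessy,Stert,jstert2#gnu.org,Female,
4,Doy,Beviss,dbeviss3#dedecms.com,Male,80
5,Josee,Rust,jrust4#epa.gov,Female,13
6,Hedvige,Ahlf,hahlf5#vkontakte.ru,Female,67
"""

def sum_column(csv_file, column, delimiter=","):
    """Sum a numeric column by header name, counting blank cells as zero."""
    total = 0.0
    for row in csv.DictReader(csv_file, delimiter=delimiter):
        cell = (row.get(column) or "").strip()
        if cell:
            # A non-numeric cell still raises ValueError, so bad data is not hidden
            total += float(cell)
    return total

total = sum_column(io.StringIO(SAMPLE), "money")
print(total)  # 271.0
```

For a file on disk, io.StringIO(SAMPLE) would be replaced by an open file object, for example one opened with newline='' as the csv module documentation recommends.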

The problem with your code is that you call replace() without checking whether row[5] is empty.
Fixed code:
import csv
with open("mock_7.txt", "r", encoding='utf8') as fin:
    headerline = fin.readline()
    debit = 0
    for row in csv.reader(fin):
        # Treat an empty cell as zero before converting
        if row[5].strip() == '':
            row[5] = 0
        value = float(row[5])
        debit += value
print(debit)
output:
271.0

Parsing numbers in strings from a file

I have a txt file like this:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" add up to 100%.
def addPoll(pid,party,state,res,filetype):
    with open('Polls.txt', 'a+') as file:  # open file temporarily for writing and reading
        lines = file.readlines()  # get all lines from file
        file.seek(0)
        next(file)  # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines:  # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print(y)
        #else:
        #file.write(pid + "," + party + "," + state + "," + res+"\n")
        #file.close()
    return "pass"
print(addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f'))
So in my code I manage to split off everything after the last ',' and put it into a list, but I'm not sure how I can get only the numbers out of that text.

You can use regex to find all the numbers:
import re
for line in lines:
    numbers = re.findall(r'\d+', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
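The re.findall() function finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous run of digits. This returns a list of strings, which we convert to a list of ints and then sum. Note that, applied to whole lines as here, the pattern also matches the digit in the pid field (the 5 in SC5), which is why the first data row sums to 97 instead of 92; run it on the res field alone to total only the percentages.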

It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can parse the field manually, since it appears to be structured: split('-') and then rsplit(' ', 1) each '-'-separated part (the last piece should be the percent). If you're trying to enforce a format, I'd definitely go this route, but a regex is also a fine solution for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex (this needs import re at the top of the script):
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former enforces stricter formatting requirements on row['res'] and will allow you to extract things like candidate names later.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
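As a rough end-to-end illustration of the DictReader plus regex approach, the sketch below totals the percentages per poll. The function name poll_totals and the in-memory SAMPLE are hypothetical; the third row reuses the arguments from the addPoll call in the question. Because only the res field is searched, digits in the pid are ignored.

```python
import csv
import io
import re

# A few rows in the question's format, held in memory for the sketch.
SAMPLE = """pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
123,Democratic,OH,bla bla 50%-Asd ASD 50%
"""

PERCENT = re.compile(r"(\d+)%")  # digits immediately followed by a percent sign

def poll_totals(csv_file):
    """Map each pid to the sum of the percentages found in its res field."""
    totals = {}
    for row in csv.DictReader(csv_file):
        percents = [int(m.group(1)) for m in PERCENT.finditer(row["res"])]
        totals[row["pid"]] = sum(percents)
    return totals

totals = poll_totals(io.StringIO(SAMPLE))
incomplete = [pid for pid, total in totals.items() if total != 100]
print(totals)      # {'SC5': 92, 'FX2': 95, '123': 100}
print(incomplete)  # ['SC5', 'FX2']
```

Whether a total below 100 should be rejected or merely flagged is a decision for the caller; the sketch only reports it.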

generate string with length equal to length of time in file, with 1 label per second , python

I have a file like this:
https://gist.github.com/manbharae/70735d5a7b2bbbb5fdd99af477e224be
What I want to do is generate one label per second.
Since the file above is 160 seconds long, there should be 160 labels. In other words, I want to generate a string of length 160.
However, I end up with a str of length 166 instead of 160.
My code :
import os
filename = './test_file.txt'
ann = []
with open(filename, 'r') as f:
    for line in f:
        _, end, label = line.strip().split('\t')
        ann.append((int(float(end)), 'MIT' if label == 'MILAN' else 'not-MIT'))
str = ''
prev_value = 0
for s in ann:
    value = s[0]
    letter = 'M' if s[1] == 'MIT' else 'x'
    str += letter * (value - prev_value)
    print(str)
    prev_value = value
name_of_file, file_ext = os.path.splitext(os.path.basename(filename))
print("\n\nfile_name processed:", name_of_file)
print(str)
print("length of string", len(str), "\n\n")
My final output:
xxxxxxxMxMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxMMMMMMMMMMMMMMMMMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
166.
This is wrong. The string should be 160 characters long, one character per second, because the file is 160 seconds long.
There is some small bug somewhere that I am unable to find.
Please advise: what's going wrong here?
Thanks.
One thing I tried was adding an if condition to break out of the loop once a length of 160 is reached, like this:
if ann[len(ann)-1][0] == len(str):
    break
However, it didn't help.
AFAIK, something is going wrong in the last iteration, because until then everything is fine.
I looked at: https://stackoverflow.com/a/14927311/4932791
https://stackoverflow.com/a/1424016/4932791

The total doesn't add up because on two occasions the end value is lower than the previous one, so the code asks for a negative number of letters:
(69, 'not-MIT')
(68, 'not-MIT')
(76, 'not-MIT')
(71, 'not-MIT')
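Multiplying a string by a negative number gives an empty string, but prev_value is still moved back to the lower value, so the following row adds those seconds a second time: 1 extra character for the drop from 69 to 68 and 5 for the drop from 76 to 71, which accounts for the 6 surplus characters.
For future reference: it's better not to name your variables 'str', as 'str' is already the name of a built-in type in Python.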
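One possible way to guard against such backward jumps is sketched below: a row only contributes the seconds beyond what has already been covered. Whether skipping overlapping rows is the right policy for this data is an assumption, as is the function name labels_per_second; the sample annotations are made up for illustration.

```python
def labels_per_second(annotations):
    """Build one label character per second from (end_second, label) pairs,
    ignoring any part of a row that falls before the seconds already covered."""
    pieces = []
    covered = 0  # number of seconds already labelled
    for end, label in annotations:
        if end <= covered:
            continue  # this row ends before the covered point: nothing new to add
        letter = 'M' if label == 'MIT' else 'x'
        pieces.append(letter * (end - covered))
        covered = end
    return ''.join(pieces)

# Hypothetical annotations containing one backward jump (5 then 4).
sample = [(3, 'not-MIT'), (5, 'MIT'), (4, 'not-MIT'), (8, 'not-MIT')]
labels = labels_per_second(sample)
print(labels, len(labels))  # xxxMMxxx 8
```

With this guard the length of the result always equals the largest end value seen, so a 160-second file yields 160 characters.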

Format individual characters differently within an Excel cell

I have a column in Excel 2013 containing letters and the digits 1, 2, 3 and 4 (representing pinyin pronunciations and tone values). They are all in the same font and format, but I would like to convert only the numbers to superscript. It does not seem that I can use any of Excel's built-in find-and-replace functionality to replace a single character in a cell with its superscript version: the entire cell format gets changed. I saw a thread, "Format individual characters in a single Excel cell with python", which apparently holds a solution, but that was the first time I had heard of Python or xlwt.
Since I have never used Python and xlwt, can someone give me a basic step-by-step set of instructions to install those utilities, customize the script and run it?
Sample:
Li1Shi4
Qin3Fat1
Gon1Lin3
Den1Choi3
Xin1Nen3
Script from other thread:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
font0 = xlwt.easyfont('')
font1 = xlwt.easyfont('bold true')
font2 = xlwt.easyfont('color_index red')
style = xlwt.easyxf('font: color_index blue')
seg1 = ('bold', font1)
seg2 = ('red', font2)
seg3 = ('plain', font0)
seg4 = ('boldagain', font1)
ws.write_rich_text(2, 5, (seg1, seg2, seg3, seg4))
ws.write_rich_text(4, 1, ('xyz', seg2, seg3, '123'), style)
wb.save('rich_text.xls')
What is the syntax that will achieve the "find numbers and replace with superscript"? Is it a font or a style? The code from the other thread seems to manually input "seg1" , "seg2" , "seg3" etc. Or am I misunderstanding the code?
Thanks in advance. I am using Windows 8, 64 bit, Excel 2013.

I'm bored and in a teaching mood, so, here's a long "answer" that also explains a little bit about how you can figure these things out for yourself in the future :)
I typed abc123def into a cell, and recorded a macro using the macro recorder.
This is where you should always start if you don't know what the correct syntax is.
In any case, I selected the numeric part of this cell, and right-clicked, format cell, change font to superscript.
This is what the macro recorder gives me. This is a lot of code. Fortunately, it's a lot of junk.
Sub Macro2()
    With ActiveCell.Characters(Start:=1, Length:=3).Font 'Applies to the first 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=4, Length:=3).Font 'Applies to the middle 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = True
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=7, Length:=3).Font 'Applies to the last 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
End Sub
What it represents is three blocks of formatting: the first is the first 3 characters that aren't changed, then the 3 that we applied superscript to, and then the last three characters.
Almost all of this is default properties, since I made no other changes, so I can revise it to this:
Sub Macro2()
    With ActiveCell.Characters(Start:=4, Length:=3).Font
        .Superscript = True
    End With
End Sub
Now we can see that there are two important parts to this. The first part is how to specify which characters to format. This is done by referring to a cell's .Characters:
ActiveCell.Characters(Start:=4, Length:=3).Font
So we can see that this macro refers to the characters in positions 4-6 of the string "abc123def", or "123".
The next, obvious part is to set the .Font.Superscript property to True.
Now you want to generalize this so that you can apply it anywhere. The above code has the Start and Length arguments "hardcoded". We need to make it dynamic. The easiest way to do this is to go 1 character at a time and check whether it's numeric; if so, apply the superscript.
Sub ApplySuperscriptToNumbers()
    Dim i As Long
    Dim str As String
    Dim rng As Range
    Dim cl As Range
    '## Generally should work on any contiguous "Selection" of cell(s)
    Set rng = Range(Selection.Address)
    '## Iterate over each cell in this selection
    For Each cl In rng.Cells
        str = cl.Value
        '## Iterate over each character in the cell
        For i = 1 To Len(str)
            '## Check if this character is numeric
            If IsNumeric(Mid(str, i, 1)) Then
                '## Apply superscript to this 1 character
                cl.Characters(Start:=i, Length:=1).Font.Superscript = True
            End If
        Next
    Next
End Sub
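Since the question mentions Python, the character scan in the macro above can also be sketched there. The snippet only computes which spans would need superscript, as 1-based Start and Length pairs in the style of Characters(Start, Length); it does not write to a workbook, and the function name digit_runs is hypothetical.

```python
import re

def digit_runs(text):
    """Return (start, length) pairs for each run of ASCII digits in text,
    with start counted from 1 as in VBA's Characters(Start, Length)."""
    return [(m.start() + 1, m.end() - m.start())
            for m in re.finditer(r"[0-9]+", text)]

print(digit_runs("Li1Shi4"))    # [(3, 1), (7, 1)]
print(digit_runs("abc123def"))  # [(4, 3)]
```

Grouping consecutive digits into one run means a cell such as abc123def would need a single formatting call instead of three.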
