Python: using int() on a string that is not an integer literal - python

Note: I was using the wrong source file for my data - once that was fixed, my issue was resolved. It turns out, there is no simple way to use int(..) on a string that is not an integer literal.
This is an example from the book "Machine Learning In Action", and I cannot quite figure out what is wrong. Here's some background:
from numpy import *
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1])) # Problem here.
        index += 1
    return returnMat,classLabelVector
The .txt file is as follows:
40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
38344 1.669788 0.134296 didntLike
...
I am getting an error on the line classLabelVector.append(int(listFromLine[-1])) because, I believe, int(..) is trying to parse a string (i.e. "largeDoses") that is not an integer literal. Am I missing something?
I looked up the documentation for int(), but it only seems to parse numbers and integer literals:
http://docs.python.org/2/library/functions.html#int
Also, an excerpt from the book explains this section as follows:
Finally, you loop over all the lines in the file and strip off the return line character with line.strip(). Next, you split the line
into a list of elements delimited by the tab character: '\t'. You take
the first three elements and shove them into a row of your matrix, and
you use the Python feature of negative indexing to get the last item
from the list to put into classLabelVector. You have to explicitly
tell the interpreter that you’d like the integer version of the last
item in the list, or it will give you the string version. Usually,
you’d have to do this, but NumPy takes care of those details for you.

Strings like "largeDoses" cannot be converted to integers. The Ch02 folder of that code project contains two data files; use the second one, datingTestSet2.txt, instead of the first.

You can use ast.literal_eval and catch the ValueError it raises for a malformed string (by the way, int('9.4') will also raise an exception).
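If the text labels have to be kept, one possible workaround is to try int() first and fall back to a lookup table when it raises ValueError. This is only a sketch: the LABEL_CODES mapping and the parse_label helper are hypothetical names and values chosen for illustration, not something taken from the book or its data files.

```python
# Hypothetical mapping from the text labels in the question's data file to integer codes.
LABEL_CODES = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}

def parse_label(token, label_codes=LABEL_CODES):
    """Return an int for token: parse it directly if possible, otherwise look it up."""
    try:
        return int(token)
    except ValueError:
        # int() rejects anything that is not an integer literal, e.g. "largeDoses" or "9.4".
        return label_codes[token]

# One tab-separated line in the format shown in the question.
line = "40920\t8.326976\t0.953952\tlargeDoses"
fields = line.strip().split("\t")
label = parse_label(fields[-1])
```

A label that is neither an integer literal nor a key of the mapping still raises KeyError, which makes unexpected data visible instead of silently skipping it.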

Related

Trouble with indexing a string in Python

I am trying to check the first character in each line from a separate data file. This is the loop that I am using, but for some reason I get an error that says string index out of range.
for line_no in length:
    line_being_checked = linecache.getline(file_path, line_no)
    print(line_being_checked[0])
From what I understand (my English is not very good), length is the number of lines you want to check in the file.
You could do something like this:
for line in open("file.txt", "r").read().splitlines():
    print(line[0])
This way, you can be sure that the length is correct.
As for the error, it is possible that you have an empty line, so you could check len(line) to see whether that is the case.
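The empty-line check suggested above could look roughly like this. It is a sketch only: first_characters is a hypothetical helper, and the temporary file with its contents is made up for the demonstration.

```python
import os
import tempfile

def first_characters(path):
    """Return the first character of every non-empty line in the file at path."""
    result = []
    with open(path, "r") as handle:
        for line in handle.read().splitlines():
            if len(line) == 0:
                # Indexing line[0] on an empty line raises "string index out of range".
                continue
            result.append(line[0])
    return result

# Demonstration on a temporary file with hypothetical contents, including an empty line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\n")
chars = first_characters(tmp.name)
os.remove(tmp.name)
```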

Python TypeError: expected a string or other character buffer object when importing text file

I am pretty new to Python. For this task, I am trying to import a text file, add <s> and </s> to it, and remove punctuation from the text. I tried the method from "How to strip punctuation from a text file".
import string
def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w') as out_file:
        with open('moviereview.txt') as file:
            for line in file:
                line = ' '.join(line.split(' '))
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')
    return out_file
However, I get an error saying:
TypeError: expected a string or other character buffer object
My thought is that after I split and join the line, I get a list of strings, so I cannot use str.translate() to process it. But it seems like everyone else does the same thing and it works,
e.g. https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/ in the example code from line 13.
So I am really confused, can anyone help? Thanks!
On Python 2, only unicode types have a translate method that takes a dict. If you intend to work with arbitrary text, the simplest solution here is to just use the Python 3 version of open on Py2; it will seamlessly decode your inputs and produce unicode instead of str.
As of Python 2.6+, replacing the normal built-in open with the Python 3 version is simple. Just add:
from io import open
to the imports at the top of your file. You can also remove line = ' '.join(line.split(' ')); that's definitionally a no-op (it splits on single spaces to make a list, then rejoins on single spaces). You may also want to add:
from __future__ import unicode_literals
to the very top of your file (before all of your code); that will make all of your uses of plain quotes automatically unicode literals, not str literals (prefix actual binary data with b to make it a str literal on Py2, bytes literal on Py3).
The above solution is best if you can swing it, because it will make your code work correctly on both Python 2 and Python 3. If you can't do it for whatever reason, then you need to change your translate call to use the API Python 2's str.translate expects, which means removing the definition of translate_table entirely (it's not needed) and just doing:
line = line.translate(None, string.punctuation)
For Python 2's str.translate, the arguments are a one-to-one mapping table for all values from 0 to 255 inclusive as the first argument (None if no mapping needed), and the second argument is a string of characters to delete (which string.punctuation already provides).
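For comparison, here is a minimal sketch of the dict-based table on Python 3, where every str is Unicode and translate accepts a mapping of code points. The helper names strip_punctuation and tag_line and the sample sentence are made up for illustration.

```python
import string

# Map every punctuation code point to None, which tells str.translate to delete it.
translate_table = dict((ord(char), None) for char in string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(translate_table)

def tag_line(line):
    """Strip the trailing newline and punctuation, then wrap the line in <s>...</s>."""
    # Punctuation is removed before the tags are added, because < > and / are punctuation too.
    return "<s>" + strip_punctuation(line.rstrip("\n")) + "</s>"

tagged = tag_line("Great movie, really!\n")
```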
Answering here because a comment doesn't let me format code properly:
def r():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    a = []
    with open('out.txt', 'w') as of:
        with open('test.txt' ,'r') as f:
            for l in f:
                l = l.translate(translate_table)
                a.append(l)
                of.write(l)
    return a
This code runs fine for me with no errors. Can you try running that, and responding with a screenshot of the code you ran?

Replace words in list that later will be used in variable

I have a file which currently stores the string eeb39d3e-dd4f-11e8-acf7-a6389e8e7978,
which I am trying to pass as a variable to my subprocess command.
My current code looks like this:
with open(logfilnavn, 'r') as t:
    test = t.readlines()
    print(test)
But this prints ['eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n'] and I don't want the part with ['\n'] to be passed into my command, so I'm trying to remove them by using replace.
with open(logfilnavn, 'r') as t:
    test = t.readlines()
    removestrings = test.replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')
    print(removestrings)
I get an exception saying:
'list' object has no attribute 'replace'
So how can I replace these with nothing and store the result as a string for my subprocess command?
readlines() returns a list. Try print(test[0].strip())
You can read the whole file and split lines using str.splitlines:
test = t.read().splitlines()
Your test variable is a list, because readlines() returns a list of all lines read.
Since you said the file only contains this one line, you probably wish to perform the replace on only the first line that you read:
removestrings = test[0].replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')
Where you went wrong...
file.readlines() in Python returns an array (a collection or grouping of the same variable type) of the lines in the file; arrays in Python are called lists. Here you are treating the list as a string. You must first target the string inside it, then apply the string-only function.
In this case, however, this would not work, because you are trying to change the way the Python interpreter has displayed the list so that one can understand it.
Further information...
In code it would not be a string; we just can't easily read the stack, heap and memory addresses directly. The example below works for any number of lines, but it only prints the first element, so you will need to change that.
This may be useful...
You could perhaps make the variables globally available, so that other parts of the program can read them before they go out of scope.
More useless stuff
Scope is the word for the points at which the interpreter (what runs the program) believes a variable is useful, so that it can remove the variable from memory afterwards, or, in much larger programs, so that you only have to worry about the locality of variables. For example, i is used a lot in for loops; without scope there would need to be a different name for each variable in the whole project. Scopes do get specialised, however (meaning that re-declaring a variable inside a scope that already contains it would fail, as it is already seen as being one). An easy way to understand this might be to think of scopes as branches: the tips of the branches, along with their variables, don't touch.
Solution?
e.g:
with open(logfilnavn, 'r') as file:
    lines = file.readlines() # creates a list
# an in-line for loop that goes through each item and takes off the last character: \n - the newline character
# this will work with any number of lines
strippedLines = [line[:-1] for line in lines]
# or
strippedLines = [line.replace('\n', '') for line in lines]
# you can now print the string stored within the list
print(strippedLines[0]) # this prints the first element in the list
I hope this helped!
You get the error because readlines returns a list object. Since you mentioned in the comment that there is just one line in the file, it's better to use readline() instead:
line = "" # so you can use it as a variable outside the `with` block
with open(logfilnavn, 'r') as t:
    line = t.readline().rstrip('\n') # readline() keeps the trailing newline, so strip it
print(line)
# output:
eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
readlines will return a list of lines, and you can't use replace with a list.
If you really want to use readlines, you should know that it doesn't remove the newline character from the end of each line; you'll have to do that yourself.
lines = [line.rstrip('\n') for line in t.readlines()]
Even after removing the newline character from the end of each line, you'll still have a list of lines. From the question, it looks like you only have one line, so you can just access the first line with lines[0].
Or you can leave out readlines and use read instead, which reads the entire contents of the file. Then just call rstrip.
contents = t.read().rstrip('\n')
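Putting the last variant together with the subprocess call the question mentions, a sketch could look like the following. The child command is a stand-in (a second Python process that echoes its argument), since the real command is not shown in the question; read_token and the temporary log file are likewise hypothetical.

```python
import os
import subprocess
import sys
import tempfile

def read_token(path):
    """Read the whole file and drop the trailing newline character."""
    with open(path, "r") as handle:
        return handle.read().rstrip("\n")

# Hypothetical log file holding the single identifier from the question.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n")

token = read_token(tmp.name)
os.remove(tmp.name)

# Stand-in command: a child Python process that prints its first argument.
completed = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", token],
    capture_output=True,
    text=True,
    check=True,
)
echoed = completed.stdout.strip()
```

Passing the arguments as a list means the string reaches the child process unchanged, with no brackets, quotes or newline attached.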

How to remove double quotes from strings in a list in python?

I am trying to get some data in a list of dictionaries.
The data comes from a csv file so it's all string.
The keys in the file all have double quotes, but since these are all strings, I want to remove them so they look like this in the dictionary:
{'key':value}
instead of this
{'"key"':value}
I tried simply using string = string[1:-1], but this doesn't work...
Here is my code:
csvDelimiter = ","
tsvDelimiter = "\t"
dataOutput = []
dataFile = open("browser-ww-monthly-201305-201405.csv","r")
for line in dataFile:
    line = line[:-1] # Removes \n after every line
    data = line.split(csvDelimiter)
    for i in data:
        if type(i) == str: # Doesn't work, I also tried if isinstance(i, str)
                           # but that didn't work either.
            print i
            i = i[1:-1]
            print i
    dataOutput.append({data[0] : data[1]})
dataFile.close()
print "Data output:\n"
print dataOutput
All the output I get from print i is good, without double quotes, but when I append data to dataOutput, the quotes are back!
Any idea how to make them disappear forever?
Strip it. For example:
data[0].strip('"')
However, when reading CSV files, it is best to use the built-in csv module. It takes care of this for you.
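A minimal sketch of the csv-module approach, written for Python 3. The two sample rows are hypothetical stand-ins for the contents of the question's file, read from an in-memory buffer so the example is self-contained.

```python
import csv
import io

# Hypothetical sample standing in for the question's CSV file; fields are double-quoted.
sample = io.StringIO('"Date","Chrome"\n"2013-05","41.38"\n')

data_output = []
for row in csv.reader(sample, delimiter=","):
    # csv.reader has already removed the surrounding double quotes from each field.
    data_output.append({row[0]: row[1]})
```

With a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="").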
As noted in the comments, when dealing with CSV files you truly ought to use Python's built-in csv module (linking to Python 2 docs since it seems that's what you're using).
Another thing to note is that when you do:
data = line.split(csvDelimiter)
every item in the returned list will be a string. There's no sense in doing a type check in the loop (though if there were a reason to, you would use isinstance). I don't know what "didn't work" about it, though it's possible you were using unicode strings. On Python 2 you can usually use isinstance(..., basestring), where basestring is a base class for both str and unicode. On Python 3 just use str unless you know you're dealing with bytes.
You said: "I tried simply using string = string[1:-1], but this doesn't work...". It seems to work fine for me:
In [101]: s="'word'"
In [102]: s[1:-1]
Out[102]: 'word'

What is the best way to iterate over a python list, excluding certain values and printing out the result

I am new to python and have a question:
I have checked similar questions, checked the tutorial dive into python, checked the python documentation, googlebinging, similar Stack Overflow questions and a dozen other tutorials.
I have a section of python code that reads a text file containing 20 tweets. I am able to extract these 20 tweets using the following code:
with open ('output.txt') as fp:
    for line in iter(fp.readline,''):
        Tweets=json.loads(line)
        data.append(Tweets.get('text'))
i=0
while i < len(data):
    print data[i]
    i=i+1
The above while loop iterates perfectly and prints out the 20 tweets (lines) from output.txt.
However, these 20 lines contain non-English character data like "Los ladillo a los dos, soy maaaala o maloooooooooooo", URLs like "http://t.co/57LdpK", the string "None", and photos with a URL like "Photo: http://t.co/kxpaaaaa" (I have edited this for privacy).
I would like to purge the output of this (which is a list), and exclude the following:
The None entries
Anything beginning with the string "Photo:"
It would be a bonus also if I can exclude non-unicode data
I have tried the following bits of code
Using data.remove("None:") but I get the error list.remove(x): x not in list.
Reading the items I do not want into a set and then doing a comparison on the output but no luck.
Researching into list comprehensions, but wonder if I am looking at the right solution here.
I am from an Oracle background where there are functions to chop out any wanted/unwanted section of output, so really gone round in circles in the last 2 hours on this. Any help greatly appreciated!
Try something like this:
def legit(string):
    if (string.startswith("Photo:") or "None" in string):
        return False
    else:
        return True
whatyouwant = [x for x in data if legit(x)]
I'm not sure if this will work out of the box for your data, but you get the idea. If you're not familiar with it, [x for x in data if legit(x)] is called a list comprehension.
First of all, only add Tweets.get('text') if there is a text entry:
with open ('output.txt') as fp:
    for line in iter(fp.readline,''):
        Tweets=json.loads(line)
        if 'text' in Tweets:
            data.append(Tweets['text'])
That'll not add None entries (.get() returns None if the 'text' key is not present in the dictionary).
I'm assuming here that you want to further process the data list you are building here. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements. Tweets['text'] is the same value as entry in the for entry in data loops.
Next, you are looping over python unicode values, so use the methods provided on those objects to filter out what you don't want:
for entry in data:
    if not entry.startswith("Photo:"):
        print entry
You can use a list comprehension here; the following would print all entries too, in one go:
print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])
In this case that doesn't really buy you much, as you are building one big string just to print it; you may as well just print the individual strings and avoid the string building cost.
Note that all your data is Unicode data. What you perhaps wanted is to filter out text that uses codepoints beyond the ASCII range. You could use a regular expression to detect codepoints beyond ASCII in your text:
import re
nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE) # all codepoints beyond 0x7F are non-ascii
for entry in data:
    if entry.startswith("Photo:") or nonascii.search(entry):
        continue # skip the rest of this iteration, continue to the next
    print entry
Short demo of the non-ASCII expression:
>>> import re
>>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
>>> nonascii.search(u'All you see is ASCII')
>>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
<_sre.SRE_Match object at 0x1086275e0>
with open ('output.txt') as fp:
    for line in fp.readlines():
        Tweets=json.loads(line)
        if not 'text' in Tweets: continue
        txt = Tweets.get('text')
        if txt.replace('.', '').replace('?','').replace(' ','').isalnum():
            data.append(txt)
            print txt
Small and simple.
Basic principle, one loop, if data matches your "OK" criteria add it and print it.
As Martijn pointed out, 'text' might not be in all the Tweets data.
Regexp replacement for .replace() would go something along the lines of: if re.match('^[\w-\ ]+$', txt) is not None: (it will not work for blankspace etc so yea as mentioned below..)
I'd suggest something like the following:
# use itertools.ifilter to remove items from a list according to a function
from itertools import ifilter
import json
import re
# write a function to filter out entries you don't want
def my_filter(value):
    if not value or value.startswith('Photo:'):
        return False
    # exclude unwanted chars; re.search looks anywhere in the string,
    # whereas re.match would only test the first character
    if re.search(r'[^\x00-\x7F]', value):
        return False
    return True
# Reading the data can be simplified with a list comprehension
with open('output.txt') as fp:
    data = [json.loads(line).get('text') for line in fp]
# do the filtering
data = list(ifilter(my_filter, data))
# print the output
for line in data:
    print line
Regarding unicode, assuming you're using python 2.x, the open function won't read data as unicode, it'll be read as the str type. You might want to convert it if you know the encoding, or read the file with a given encoding using codecs.open.
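On Python 3, itertools.ifilter no longer exists and print is a function, so the same filtering is usually written with a list comprehension. The following is a sketch under those assumptions; the keep helper and the sample JSON lines are hypothetical stand-ins for the contents of output.txt.

```python
import json
import re

# Matches any code point outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def keep(text):
    """Return True for tweets that have text, are not photo posts, and are pure ASCII."""
    if not text or text.startswith("Photo:"):
        return False
    return NON_ASCII.search(text) is None

# Hypothetical stand-in for the lines of output.txt: one JSON object per line.
lines = [
    '{"text": "Hello world"}',
    '{"id": 1}',
    '{"text": "Photo: http://t.co/kxpaaaaa"}',
    '{"text": "caf\\u00e9 time"}',
]
data = [json.loads(line).get("text") for line in lines]
filtered = [text for text in data if keep(text)]

for text in filtered:
    print(text)
```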
Try this:
with open ('output.txt') as fp:
    for line in iter(fp.readline,''):
        Tweets=json.loads(line)
        data.append(Tweets.get('text'))
i=0
while i < len(data):
    # these conditions will skip (continue) over the iterations
    # matching your first two conditions.
    if data[i] is None or data[i].startswith("Photo"):
        i=i+1 # advance the index before skipping, or the loop never ends
        continue
    print data[i]
    i=i+1
