I'm writing some code that reads words from a text file and sorts them into a dictionary. It actually all runs fine, but for reference here it is:
def find_words(file_name, delimiter = " "):
    """
    A function for finding the number of individual words, and the most popular words, in a given file.
    The process will stop at any line in the file that starts with the word 'finish'.
    If there is no finish point, the process will go to the end of the file.
    Inputs: file_name: Name of file you want to read from, e.g. "mywords.txt"
            delimiter: The way the words in the file are separated e.g. " " or ", "
                     : Delimiter will default to " " if left blank.
    Output: Dictionary with all the words contained in the given file, and how many times each word appears.
    """
    words = []
    dictt = {}
    with open(file_name, 'r') as wordfile:
        for line in wordfile:
            words = line.split(delimiter)
            if words[0]=="finish":
                break
            # This next part is for filling the dictionary
            # and correctly counting the amount of times each word appears.
            for i in range(len(words)):
                a = words[i]
                if a=="\n" or a=="":
                    continue
                elif dictt.has_key(a)==False:
                    dictt[words[i]] = 1
                else:
                    dictt[words[i]] = int(dictt.get(a)) + 1
    return dictt
The problem is that it only works if the arguments are given as string literals, e.g. this works:
test = find_words("hello.txt", " " )
But this doesn't:
test = find_words(hello.txt, )
The error message is undefined name 'hello'
I don't know how to alter the function arguments such that I can enter them without speech marks.
Thanks!
Simple, you define that name:
class hello:
    txt = "hello.txt"
But joking aside, all the argument values in a function call are expressions. If you want to pass a string literally you'll have to make a string literal, using the quotes. Python is not a text preprocessor like m4 or cpp, and expects the entire program text to follow its syntax.
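To make that concrete, here is a minimal sketch. The helper count_words and the names sample and undefined_name are made up for illustration and are not the asker's function. It shows that a bare name works as an argument only once it has been bound to a value:

```python
from collections import Counter

def count_words(text, delimiter=" "):
    """Count how often each non-empty word occurs in a string."""
    words = (w.strip() for w in text.split(delimiter))
    return Counter(w for w in words if w)

# A string literal is an expression that evaluates to a str object.
print(count_words("spam eggs spam"))

# A bare name is also an expression, but it must be defined first.
sample = "spam eggs spam"
print(count_words(sample))

# An undefined name fails while the argument is being evaluated,
# before the function is ever called.
try:
    count_words(undefined_name)
except NameError as exc:
    print("NameError:", exc)
```

The last call fails for the same reason find_words(hello.txt, ) does: Python tries to look up a name that nothing has defined.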
So it turns out I just misunderstood what was being asked. I've had it clarified by the course leader now.
As I am now fully aware, a function definition needs to be told when a string is being entered, hence the quote marks being required.
I admit full ignorance over my depth of understanding of how it all works - I thought you could pretty much put any assortment of letters and/or numbers in as an argument and then you can manipulate them within the function definition.
My ignorance may stem from the fact that I'm quite new to Python, having learned my coding basics on C++ where, if I remember correctly (it was well over a year ago), functions are defined with each argument being specifically set up as their type, e.g.
int max(int num1, int num2)
Whereas in Python you don't quite do it like that.
Thanks for the attempts at help (and ridicule!)
Problem is sorted now.
I am very new to Python and am looking for assistance to where I am going wrong with an assignment. I have attempted different ways to approach the problem but keep getting stuck at the same point(s):
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
Problem 2: When I try and combine the lists I keep receiving "None" for my result or Nonetype errors [which I think means I have added the None's together(?)].
The assignment is:
#8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order. You can download the sample data at http://www.py4e.com/code3/romeo.txt
My current code which is giving me a Nonetype error is:
poem = input("enter file:")
play = open(poem)
lst= list()
for line in play:
    line=line.rstrip()
    word=line.split()
    if not word in lst:
        lst= lst.append(word)
print(lst.sort())
If someone could just talk me through where I am going wrong that will be greatly appreciated!
Your problem was lst= lst.append(word). list.append() modifies the list in place and returns None, so after that line lst is None.
with open(poem) as f:
    lines = f.read().split('\n') # you can also use readlines()
lst = []
for line in lines:
    words = line.split()
    for word in words:
        if word:
            lst.append(word)
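As a sketch of the whole assignment in Python 3: the function name unique_sorted_words and the two sample lines are stand-ins I made up, so swap in the lines of your own file. The comments mark the two calls that return None:

```python
def unique_sorted_words(lines):
    """Collect each distinct word once, then return them sorted."""
    words = []
    for line in lines:
        for word in line.split():
            if word not in words:
                words.append(word)  # mutates the list, returns None
    words.sort()                    # sorts in place, also returns None
    return words

sample = ["the sun is up", "the moon is down"]
print(unique_sorted_words(sample))

# This is why `lst = lst.append(word)` and `print(lst.sort())` misbehave:
print([].append("x"))
```

Note that line.split() already yields individual words, so the inner loop is what turns "a list per line" into one list for the whole file.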
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
You are doing play = open(poem) and then for line in play:, which processes the file line by line. If you want to process the whole content at once, do:
play = open(poem)
content = play.read()
words = content.split()
Always remember to close the file after you have used it, i.e. do
play.close()
unless you use a context manager (i.e. with open(poem) as f:), which closes it for you.
Just to help you get into Python a little more:
You can:
1. Read the whole file at once (if the file is too big to fit in RAM, read it in reasonably sized chunks, one after another)
2. Split data you read into words and
3. Use set() or dict() to remove duplicates
Along the way, don't forget to pay attention to upper and lower case
if you want each word counted once regardless of capitalization, rather than just a collection of distinct strings.
This will work in Py2 and Py3, as long as in Py2 you replace input() with raw_input() or put quotes around the path when entering it, so:
path = input("Filename: ")
f = open(path)
c = f.read()
f.close()
words = set(x.lower() for x in c.split()) # This will make a set of lower case words, with no repetitions
# This equals to:
#words = set()
#for x in c.split():
# words.add(x.lower())
# set() is an unordered datatype ignoring duplicate items
# and it mimics mathematical sets and their properties (unions, intersections, ...)
# it is also fast as hell.
# Checking a list() each time for existence of the word is slow as hell
#----
# OK, you need a sorted list, so:
words = sorted(words)
# Or step-by-step:
#words = list(words)
#words.sort()
# Now words is your list
As for your errors, do not worry, they are common at the beginning in almost any object-oriented language.
Others explained them well in their comments. But so that this answer is not lacking:
Always pay attention to which functions or methods operate on the datatype in place (list.sort(), list.append(), list.insert(), set.add()...) and which ones return a new version of the datatype (sorted(), str.lower()...).
If you run into a similar situation again, use help() in the interactive shell to see exactly what the function you used does.
>>> help(list.append)
>>> help(list.sort)
>>> help(str.lower)
>>> # Or any short documentation you need
Python, especially Python 3.x, is strict about operations between different types, but some combinations have a defined meaning and will work, possibly while doing something you did not expect.
E.g. you can do:
print(40*"x")
It will print out 40 'x' characters, because it will create a string of 40 characters.
But:
print([1, 2, 3]+None)
will, logically, not work, and something like that is what is happening somewhere in the rest of your code.
In some languages, like JavaScript (terrible stuff), this will work perfectly well:
v = "abc "+123+" def";
It inserts the 123 seamlessly into the string, which is useful, but from another angle it is a programming nightmare.
Also, the Py2 assumption that you can mix unicode and byte strings and have the conversion performed automatically no longer holds in Py3.
I.e. this is a TypeError:
print(b"abc"+"def")
because b"abc" is bytes() and "def" (or u"def") is str() in Py3 (what was unicode() in Py2).
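A small Python 3 sketch of that last point. The variable names and the choice of UTF-8 are mine, so pick whatever encoding your data really uses. The conversion has to be spelled out in one direction or the other:

```python
data = b"abc"
text = "def"

# Mixing bytes and str raises instead of guessing an encoding.
try:
    print(data + text)
except TypeError as exc:
    print("TypeError:", exc)

# Convert explicitly, choosing the encoding yourself.
print(data.decode("utf-8") + text)   # abcdef
print(data + text.encode("utf-8"))   # b'abcdef'
```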
Enjoy Python, it is the best!
I found this strange problem when I was trying to add comments to my code. I used triple-quoted strings as comments, but the program crashed with the following error:
IndentationError: unexpected indent
When I use # to comment the triple-quoted strings, everything works normally. Does anyone know the reason behind this error and how I could fix it?
My Code:
#This programs show that comments using # rather than """ """
def main():
    print("let's do something")
    #Try using hashtag to comment this block to get code working
'''
Note following block gives you a non-sense indent error
The next step would be to consider how to get all the words from spam and ham
folder from different directory. My suggestion would be do it twice and then
concentrate two lists
Frist think about the most efficient way
For example, we might need to get rid off the duplicated words in the beginning
The thoughts of writing the algorithem to create the dictionary
Method-1:
1. To append all the list from the email all-together
2. Eliminate those duplicated words
cons: the list might become super large
I Choose method-2 to save the memory
Method-2:
1. kill the duplicated words in each string
2. Only append elements that is not already in the dictionary
Note:
1. In this case, the length of feature actually was determined by the
training cohorts, as we used the different English terms to decide feature
cons: the process time might be super long
'''
    def wtf_python(var1, var2):
        var3 = var1 + var2 + (var1*var2)
        return var3
    wtfRst1 = wtf_python(1,2)
    wtfRst2 = wtf_python(3,4)
    rstAll = { "wtfRst1" : wtfRst1,
               "wtfRst2" : wtfRst2
             }
    return(rstAll)
if __name__ == "__main__":
    mainRst = main()
    print("wtfRst1 is :\n", mainRst['wtfRst1'])
    print("wtfRst2 is :\n", mainRst['wtfRst2'])
The fix:
Indent the triple-quoted string so that it sits inside the function body.
The reason:
A triple-quoted string is an ordinary Python expression statement, so it has to follow the same indentation rules as every other statement in the function.
Hence:
def main():
    print("let's do something")
    #Try using hashtag to comment this block to get code working
    '''
Note following block gives you a non-sense indent error
The next step would be to consider how to get all the words from spam and ham
folder from different directory. My suggestion would be do it twice and then
concentrate two lists
Frist think about the most efficient way
For example, we might need to get rid off the duplicated words in the beginning
The thoughts of writing the algorithem to create the dictionary
Method-1:
1. To append all the list from the email all-together
2. Eliminate those duplicated words
cons: the list might become super large
I Choose method-2 to save the memory
Method-2:
1. kill the duplicated words in each string
2. Only append elements that is not already in the dictionary
Note:
1. In this case, the length of feature actually was determined by the
training cohorts, as we used the different English terms to decide feature
cons: the process time might be super long
    '''
    def wtf_python(var1, var2):
        var3 = var1 + var2 + (var1*var2)
        return var3
    wtfRst1 = wtf_python(1,2)
    wtfRst2 = wtf_python(3,4)
    rstAll = { "wtfRst1" : wtfRst1,
               "wtfRst2" : wtfRst2
             }
    return(rstAll)
if __name__ == "__main__":
    mainRst = main()
    print("wtfRst1 is :\n", mainRst['wtfRst1'])
    print("wtfRst2 is :\n", mainRst['wtfRst2'])
OUTPUT:
let's do something
wtfRst1 is :
5
wtfRst2 is :
19
You should push the indentation of your triple-quoted string one level to the right.
Although triple-quoted strings are often used as comments, they are normal Python expressions, so they have to follow the language's syntax.
Triple-quoted strings used as comments must be valid Python statements, and a statement inside a function body must be indented to match that body.
Python sees the multi-line string and evaluates it, but since you don't assign it to a variable, the string is thrown away immediately.
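You can reproduce the difference without touching your real file. This is only a sketch: the two source strings and the helper check are invented for the demonstration, and it relies on the built-in compile() to parse each one without running it:

```python
bad_source = '''def main():
    print("start")
"""a string used as a comment, at column 0"""
    return 1
'''

good_source = '''def main():
    print("start")
    """the same string, indented with the body"""
    return 1
'''

def check(source):
    """Return 'ok', or the name of the syntax error raised by compile()."""
    try:
        compile(source, "<example>", "exec")
        return "ok"
    except SyntaxError as exc:  # IndentationError is a subclass
        return type(exc).__name__

print(check(bad_source))   # IndentationError
print(check(good_source))  # ok
```

In the bad version the string at column 0 ends the function body, so the indented return that follows it is the "unexpected indent". A # comment does not have this effect because comments are ignored when indentation is worked out.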
I am new to coding and I ran into trouble while trying to make my own fastq masker. The first module is supposed to trim away the line with the +, change the sequence header (which should begin with >) to the line number, and keep the sequence and quality lines (the A,G,C,T line and the Unicode score, respectively).
class Import_file(object):
    def trim_fastq (self, fastq_file):
        f = open('path_to_file_a', 'a' )
        sanger = []
        sequence = []
        identifier = []
        plus = []
        f2 = open('path_to_file_b')
        for line in f2.readlines():
            line = line.strip()
            if line[0]=='#':
                identifier.append(line)
                identifier.replace('#%s','>[i]' %(line))
            elif line[0]==('A' or 'G'or 'T' or 'U' or 'C'):
                seq = ','.join(line)
                sequence.append(seq)
            elif line[0]=='+'and line[1]=='' :
                plus.append(line)
                remove_line = file.writelines()
            elif line[0]!='#' or line[0]!=('A' or 'G'or 'T' or 'U' or 'C') or line[0]!='+' and line[1]!='':
                sanger.append(line)
            else:
                print("Danger Will Robinson, Danger!")
        f.write("'%s'\n '%s'\n '%s'" %(identifier, sequence, sanger))
        f.close()
        return (sanger,sequence,identifier,plus)
Now for my question. I have run this and no error appears, but the target file is empty. I am wondering what I am doing wrong. Is it the way I handle the lists, or the lack of .join? I am sorry if this is a duplicate; I simply do not know what the mistake is. Also, an important note: this is not homework, I just need a masker for work. Any help is greatly appreciated and all suggestions for improving the code are welcome. Thanks.
Note (fastq format):
#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y
Edit: Still unable to get anything, but working at it.
Your problem is with your understanding of the return statement. return x means stop executing the current function and give x back to whoever called it. In your code you have:
return sanger
return sequence
return identifier
return plus
When the first one executes (return sanger) execution of the function stops and sanger is returned. The second through fourth return statements never get evaluated and neither does your I/O stuff at the end. If you're really interested in returning all of these values, move this after the file I/O and return the four of them packed up as a tuple.
f.write("'%s'\n '%s'\n '%s'" %(identifier, sequence, sanger))
f.close()
return (sanger,sequence,identifier,plus)
This should get you at least some output in the file. Whether or not that output is in the format you want, I can't really say.
Edit:
Just noticed you were using /n and probably want \n so I made the change in my answer here.
You have all sorts of errors beyond what @Brian addressed. I'm guessing that your if and elif tests are trying to check the first character of line? You'd do that with
if line[0] == '#':
etc.
You'll probably need to write more scripts soon, so I suggest you work through the Python Tutorial so you can get on top of the basics. It'll be worth your while.
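One of those errors is worth spelling out: ('A' or 'G' or 'T' or 'U' or 'C') is evaluated first and is simply 'A', so the test only ever compares against 'A'. A hedged sketch of a per-line check follows. The function name classify is my own, the '#' marker follows the sample in the question, and the rule "only A/G/T/U/C letters means sequence" is an assumption that a quality line could in principle also satisfy:

```python
# `or` returns its first truthy operand, so this is just 'A'.
print('A' or 'G' or 'T' or 'U' or 'C')

def classify(line):
    """Rough classifier for one line of the record shown in the question."""
    line = line.strip()
    if not line:
        return "empty"
    if line[0] == '#':
        return "identifier"
    if line == '+':
        return "plus"
    if set(line) <= set("AGTUC"):  # every character is a base letter
        return "sequence"
    return "quality"

record = [
    "#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50",
    "TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT",
    "+",
    "hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y",
]
print([classify(line) for line in record])
```

Since the records come in fixed groups of four lines, relying on position (line number modulo 4) may well be more robust than looking at the characters, but that is a design choice for you to make.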
I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits a specified format, so that the program that later uses the file doesn't receive unexpected input and there are no security concerns (say, an injection attack against the parsing script that does some calculations and a db insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me to towards the best way to do this? Do I match for illegal characters and reject on that? (eg. no '/' '\' '<' '>' '{' '}' etc.) or match on all legal eg. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes it would just contain a name (ie. first name, last name). And yes I forgot to add they would be double quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting but is a standalone. Decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
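import csv
import math

# check_csv is just a wrapper name, added so that return is legal here.
def check_csv(path):
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) != 4:
                print('Invalid row length.')
                return False
            try:
                # int() and float() raise ValueError on malformed input
                int(row[2])
                my_float = float(row[3])
            except ValueError:
                print('Bad int or float found')
                return False
            # float() accepts the text "nan", so check for it explicitly
            if math.isnan(my_float):
                print('Bad float found')
                return False
    print('All good!')
    return True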
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
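A sketch of what that try-except could look like in Python 3. The function name validate_rows, the error-message wording and the two sample rows are mine, not part of the snippet above. It collects every bad row instead of stopping at the first:

```python
import csv
import io

def validate_rows(lines):
    """Return (row_number, message) pairs for rows that do not fit the
    "<String>","<String>",<Int>,<Float> layout."""
    errors = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 4:
            errors.append((number, "expected 4 fields, got %d" % len(row)))
            continue
        try:
            int(row[2])
            float(row[3])
        except ValueError as exc:
            errors.append((number, str(exc)))
    return errors

# Any iterable of lines works; an open file would do as well.
sample = io.StringIO('"Ann","Lee",3,1.5\n"Bob","Ray",a3,2.0\n')
print(validate_rows(sample))
```

Returning a list of problems lets the web form report all of them to the user at once. Be aware that float() also accepts text such as "nan" and "inf", so reject those separately if they are not acceptable.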
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
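In Python that pattern could be applied roughly like this. It is a sketch: NAME_FIELD and is_valid_name_field are names I chose, and I use re.fullmatch so that the whole field has to match, not just a part of it:

```python
import re

# Double-quoted field containing only letters, apostrophes, spaces, dashes.
NAME_FIELD = re.compile(r'"[a-zA-Z\' -]+"')

def is_valid_name_field(field):
    """True if the entire field matches the whitelist pattern."""
    return NAME_FIELD.fullmatch(field) is not None

print(is_valid_name_field('"O\'Neil"'))        # True
print(is_valid_name_field('"Smith-Jones"'))    # True
print(is_valid_name_field('"x"; DROP TABLE'))  # False
print(is_valid_name_field('Neil'))             # False, not quoted
```

This checks the raw text of a field including its quotes. If you parse with the csv module first, the quotes are already stripped, and you would drop them from the pattern.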
Update: My current question is how can I get my code to read to the EOF starting from the beginning with each new search phrase.
This is an assignment I am doing and currently stuck on. Mind you this is a beginner's programming class using Python.
jargon = open("jargonFile.txt","r")
searchPhrase = raw_input("Enter the search phrase: ")
while searchPhrase != "":
    result = jargon.readline().find(searchPhrase)
    if result == -1:
        print "Cannot find this term."
    else:
        print result
    searchPhrase = raw_input("Enter the search phrase: ")
jargon.close()
The assignment is to take a user's searchPhrase, find it in a file (jargonFile.txt), and then print the result (the line on which it occurred and the character position of the occurrence). I will be using a counter to find the line number of the occurrence, but I will implement this later. For now my question is about the problem I am getting: I can't find a way for it to search the entire file.
Sample run:
Enter the search phrase: dog
16
Enter the search phrase: hack
Cannot find this term.
Enter the search phrase:
"dog" is found in the first line, but it is also found in other lines of the jargonFile (multiple times as a string), and my program only shows the first occurrence in the first line. The string hack is found numerous times in the jargonFile, but my code is set up to search only the first line. How may I go about solving this problem?
If this is not clear enough I can post up the assignment if need be.
First you open the file and read it into a string with readline(). Later on you try to readline() from the string you obtained in the first step.
You need to take care what object (thing) you're handling: open() gave you a file "jargon", readline on jargon gave you the string "jargonFile".
So jargonFile.readline does not make sense anymore
Update as answer to comment:
Okay, now that the str error problem is solved think about the program structure:
big loop
    enter a search term
    open file
    inner loop
        read a line
        print result if string found
    close file
You'd need to change your program so it follows that description.
Update II:
SD, if you want to avoid reopening the file you'd still need two loops, but this time one loop reads the file into memory, when that's done the second loop asks for the search term. So you would structure it like
create empty list
open file
read loop:
    read a line from the file
    append the line to the list
close file
query loop:
    ask the user for input
    for each line in the list:
        print result if string found
For extra points from your professor, add some comments to your solution that mention both possible solutions and say why you chose the one you did. Hint: In this case it is a classic tradeoff between execution time (memory is fast) and memory usage (what if your jargon file contains 100 million entries ... ok, you'd use something more complicated than a flat file in that case, but you can't load it in memory either.)
Oh and one more hint to the second solution: Python supports tuples ("a","b","c") and lists ["a","b","c"]. You want to use the latter one, because list can be modified (a tuple can't.)
myList = ["Hello", "SD"]
myList.append("How are you?")
for line in myList:
    print line
==>
Hello
SD
How are you?
Okay that last example contains all the new stuff (define list, append to list, loop over list) for the second solution of your program. Have fun putting it all together.
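If you want to check your own version afterwards, here is one possible shape for the second solution. It is written for Python 3 (so print() and input() rather than print and raw_input), and the function names load_lines and search are my own invention:

```python
def load_lines(path):
    """Read the whole file once and return its lines without newlines."""
    with open(path) as f:
        return f.read().splitlines()

def search(lines, phrase):
    """Return (line_number, column) for the first hit on each matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        column = line.find(phrase)
        if column != -1:
            hits.append((number, column))
    return hits

# Stand-in data; in the real program this would be load_lines("jargonFile.txt").
lines = ["a hacker's dog", "nothing here", "hot dog"]
print(search(lines, "dog"))    # [(1, 11), (3, 4)]
print(search(lines, "hack"))   # [(1, 2)]
print(search(lines, "cat"))    # []
```

The query loop then only has to call search(lines, phrase) for each phrase and print "Cannot find this term." when the returned list is empty.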
Hmm, I don't know anything at all about Python, but it looks to me like you are not iterating through all the lines of the file for the search string entered.
Typically, you need to do something like this:
enter search string
open file
if file has data
start loop
get next line of file
search the line for your string and do something
Exit loop if line was end of file
So for your code:
jargon = open("jargonFile.txt","r")
searchPhrase = raw_input("Enter the search phrase: ")
while searchPhrase != "":
<<if file has data?>>
<<while>>
result = jargon.readline().find(searchPhrase)
if result == -1:
print "Cannot find this term."
else:
print result
<<result is not end of file>>
searchPhrase = raw_input("Enter the search phrase: ")
jargon.close()
Cool, did a little research on the page DNS provided and Python happens to have the "with" keyword. Example:
with open("hello.txt") as f:
    for line in f:
        print line
So another form of your code could be:
searchPhrase = raw_input("Enter the search phrase: ")
while searchPhrase != "":
    found = False
    with open("jargonFile.txt") as f:
        for line in f:
            result = line.find(searchPhrase)
            if result != -1:
                found = True
                print result
    # report a miss once per search, not once per line
    if not found:
        print "Cannot find this term."
    searchPhrase = raw_input("Enter the search phrase: ")
Note that "with" automatically closes the file when you're done.
Your file is jargon, not jargonFile (a string). That's probably what's causing your error message. You'll also need a second loop to read each line of the file from the beginning until you find the word you're looking for. Your code currently stops searching if the word is not found in the current line of the file.
How about trying to write code that only gives the user one chance to enter a string? Input that string, search the file until you find it (or not) and output a result. After you get that working you can go back and add the code that allows multiple searches and ends on an empty string.
Update:
To avoid iterating the file multiple times, you could start your program by slurping the entire file into a list of strings, one line at a time. Look up the readlines method of file objects. You can then search that list for each user input instead of re-reading the file.
You shouldn't try to reinvent the wheel. Just use the
re module functions.
Your program could work better if you used:
result = jargon.read()
instead of:
result = jargon.readline()
Then you could use the re.findall() function
and join the strings (with the indexes) you searched for with str.join().
This could get a little messy, but if you take some time to work it out, it could fix your problem.
The Python documentation covers this well.
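A rough sketch of that idea in Python 3. Note that I have swapped in re.finditer, because re.findall returns only the matched strings while finditer yields match objects that carry the index. The name find_all and the sample text are made up:

```python
import re

def find_all(text, phrase):
    """Return the start index of every occurrence of phrase in text."""
    # re.escape makes the phrase match literally, even if it contains
    # characters that are special in regular expressions.
    return [m.start() for m in re.finditer(re.escape(phrase), text)]

# Stand-in for the result of jargon.read()
text = "dog eat dog\nhot dog\n"
print(find_all(text, "dog"))   # [0, 8, 16]
print(find_all(text, "hack"))  # []
```

These indexes count from the start of the whole file, not from the start of each line, so the line number would have to be worked out separately, for example by counting the newlines before each index.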
Every time you enter a search phrase, it looks for it on the next line, not the first one. You need to re-open the file for every search phrase if you want it to behave like you describe.
Take a look at the documentation for File objects:
http://docs.python.org/library/stdtypes.html#file-objects
You might be interested in the readlines method. For a simple case where your file is not enormous, you could use that to read all the lines into a list. Then, whenever you get a new search string, you can run through the whole list to see whether it's there.