Reading from a file / split function newline character - Python

f = open("test.txt", 'r+')
print("Name of the file: ", f.name)
str = f.read()
str = str.split(',')
print(str)
f.close()
I need to read from a file in which each line gives the name of the class I have to make and the parameters it needs to pass.
Example:
rectangle,9.7,7.3
square,6
so I have to make a rectangle object and pass those 2 parameters, then write the results to another file. I am stuck chopping up the string.
I use the split function to get rid of the commas, and it returns a list, which I am saving into the str list (probably bad, I should change the name). However, my concern is that although it does take the commas out, it keeps the \n newline character and concatenates it onto the next line. So it splits it like this: ['rectangle', '9.7', '7.3\nsquare', ...
How can I get rid of that?
Any suggestions would be welcomed. Should I read line by line instead of reading the whole thing?

Try calling strip() on each line to get rid of the newline character before splitting it.
Give this a try (EDIT - Annotated code with comments to make it easier to follow):
# Using "with open()" to open the file handle and have it automatically closed for your when the program exits.
with open("test.txt", 'r+') as f:
print "Name of the file: ", f.name
# Iterate over each line in the test.txt file
for line in f:
# Using strip() to remove newline characters and white space from each end of the line
# Using split(',') to create a list ("tokens") containing each segment of the line separated by commas
tokens = line.strip().split(',')
# Print out the very first element (position 0) in the tokens list, which should be the "class"
print "class: ", tokens[0]
# Print out all of the remaining elements in the tokens list, starting at the second element (i.e. position "1" since lists are "zero-based")
# This is using a "slice". "tokens[1:]" means return the contents of the tokens list starting at position 1 and continuing to the end
# "tokens[1:3]" Would mean give me all of the elements of the tokens list starting at position 1 and ending at position 3 (excluding position 3).
# Loop over the elements returned by the slice, assigning them one by one to the "argument" variable
for argument in tokens[1:]:
# Print out the argument
print "argument: ", argument
output:
Name of the file: test.txt
class: rectangle
argument: 9.7
argument: 7.3
class: square
argument: 6
More information on slice: http://pythoncentral.io/how-to-slice-listsarrays-and-tuples-in-python/
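Once each line is tokenised like this, one way to build the objects is to map the class name from the file to the class itself. A minimal sketch, assuming simple Rectangle and Square classes (these are stand-ins; the question doesn't show the real ones):

```python
# Hypothetical stand-ins for the classes the assignment actually provides.
class Rectangle:
    def __init__(self, width, height):
        self.area = width * height

class Square:
    def __init__(self, side):
        self.area = side * side

# Map the class name from the file to the class object itself.
shapes = {"rectangle": Rectangle, "square": Square}

def build_shape(line):
    tokens = line.strip().split(',')
    cls = shapes[tokens[0]]
    # Convert the remaining tokens to floats and unpack them as constructor arguments.
    return cls(*[float(t) for t in tokens[1:]])

shape = build_shape("rectangle,9.7,7.3\n")
print(shape.area)
```

The unpacking with `*` means the same function handles both the two-argument rectangle line and the one-argument square line.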

Related

Output using readlines in Python

I was wondering if anyone had an answer for this. Below is the output of two different lines of code. The first case has the code,
def stats():
    inFile = open('textFile.txt', 'r')
    line = inFile.readlines()
    print(line[0])
and the second case has the code,
def stats():
    inFile = open('textFile.txt', 'r')
    line = inFile.readlines()
    print(line[0:1])
Instead of going to the next element and printing it, the second version just spits out the slice itself, still populated with all the \t characters and the end-of-line character \n. Can anyone explain why this is happening?
In the first case you're printing a single line, which is a string.
In the second case you're printing a slice of a list, which is also a list. The strings contained within the list use repr instead of str when printed, which changes the representation. You should loop through the list and print each string separately to fix this.
>>> s='a\tstring\n'
>>> print(str(s))
a string
>>> print(repr(s))
'a\tstring\n'
>>> print(s)
a string
>>> print([s])
['a\tstring\n']
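So the fix is to loop over the slice and print each element yourself, which uses str on each string. For example (the lines here are stand-in data for what readlines() would return from textFile.txt):

```python
# Stand-in for inFile.readlines(); the real values come from textFile.txt.
lines = ['9874234\t12.5\t23.0\t50.0\n', '7840231\t70\t60\t85.4\n']

for line in lines[0:2]:
    # print() applies str() to each line, so the tabs are rendered as
    # whitespace; strip() drops the trailing newline to avoid blank lines.
    print(line.strip())
```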

Take tokens from a text file, calculate their frequency, and return them in a new text file in Python

After a long time researching and asking friends, I am still a dumb-dumb and don't know how to solve this.
So, for homework, we are supposed to define a function which accesses two files, the first of which is a text file with the following sentence, from which we are to calculate the word frequencies:
In a Berlin divided by the Berlin Wall , two angels , Damiel and Cassiel , watch the city , unseen and unheard by its human inhabitants .
We are also to include commas and periods: each single item has already been tokenised (individual items are surrounded by whitespaces - including the commas and periods). Then, the word frequencies must be entered into a new txt-file as "word:count", and in the order in which the words appear, i.e.:
In:1
a:1
Berlin:2
divided:1
etc.
I have tried the following:
def find_token_frequency(x, y):
    with open(x, encoding='utf-8') as fobj_1:
        with open(y, 'w', encoding='utf-8') as fobj_2:
            fobj_1list = fobj_1.split()
            unique_string = []
            for i in fobj_1list:
                if i not in unique_string:
                    unique_string.append(i)
            for i in range(0, len(unique_string)):
                fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
I am not sure I need to actually use .split() at all, but I don't know what else to do, and it does not work anyway, since it tells me I cannot split that object.
I am told:
Traceback (most recent call last):
[...]
fobj_1list = fobj_1.split()
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
When I remove the .split(), the displayed error is:
fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
Let's divide your problem into smaller problems so we can more easily solve this.
First we need to read a file, so let's do so and save it into a variable:
with open("myfile.txt") as fobj_1:
sentences = fobj_1.read()
Ok, so now we have your file as a string stored in sentences. Let's turn it into a list and count the occurrence of each word:
words = sentences.split(" ")
frequency = {word: words.count(word) for word in set(words)}
Here frequency is a dictionary where each word in the text is a key, with the value being how many times it appears. Note the usage of set(words): a set has no repeated elements, which is why we iterate over the set of words rather than the word list.
Finally, we can save the word frequencies into a file
with open("results.txt", 'w') as fobj_2:
for word in frequency: fobj_2.write(f"{word}:{frequency[word]}\n")
Here we use f strings to format each line into the desired output. Note that f-strings are available for python3.6+.
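One caveat: iterating over set(words) gives an arbitrary order, while the assignment wants words in the order they first appear. A sketch of one way to keep that order, relying on dicts preserving insertion order (Python 3.7+):

```python
sentence = ("In a Berlin divided by the Berlin Wall , two angels , Damiel "
            "and Cassiel , watch the city , unseen and unheard by its human "
            "inhabitants .")
words = sentence.split()

# dict.fromkeys keeps only the first occurrence of each word, in order.
ordered_unique = list(dict.fromkeys(words))
frequency = {word: words.count(word) for word in ordered_unique}

for word in ordered_unique[:3]:
    print(f"{word}:{frequency[word]}")
```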
I'm unable to comment as I don't have the required reputation, but the reason split() isn't working is because you're calling it on the file object itself, not a string. Try calling:
fobj_1list = fobj_1.readline().split()
instead. Also, when I ran this locally, I got an error saying that TypeError: 'encoding' is an invalid keyword argument for this function. You may want to remove the encoding argument from your function calls.
I think that should be enough to get you going.
The following script should do what you want.
#!/usr/local/bin/python3

def find_token_frequency(inputFileName, outputFileName):
    # wordOrderList to maintain order
    # dict to keep track of count
    wordOrderList = []
    wordCountDict = dict()

    # read the file
    inputFile = open(inputFileName, encoding='utf-8')
    lines = inputFile.readlines()
    inputFile.close()

    # iterate over all lines in the file
    for line in lines:
        # and split them into words
        words = line.split()
        # now, iterate over all words
        for word in words:
            # and add them to the list and dict
            if word not in wordOrderList:
                wordOrderList.append(word)
                wordCountDict[word] = 1
            else:
                # or increment their count
                wordCountDict[word] = wordCountDict[word] + 1

    # store result in outputFile
    outputFile = open(outputFileName, 'w', encoding='utf-8')
    for index in range(0, len(wordOrderList)):
        word = wordOrderList[index]
        outputFile.write(f'{word}:{wordCountDict[word]}\n')
    outputFile.close()

find_token_frequency("input.txt", "output.txt")
I changed your variable names a bit to make the code more readable.

Removing lines from a txt file based on the structure of the line

Code:
with open("filename.txt" 'r') as f: #I'm not sure about reading it as r because I would be removing lines.
lines = f.readlines() #stores each line in the txt into 'lines'.
invalid_line_count = 0
for line in lines: #this iterates through each line of the txt file.
if line is invalid:
# something which removes the invalid lines.
invalid_line_count += 1
print("There were " + invalid_line_count + " amount of invalid lines.")
I have a text file like so:
1,2,3,0,0
2,3,0,1,0
0,0,0,1,2
1,0,3,0,0
3,2,1,0,0
The valid line structure is 5 values split by commas.
For a line to be valid, it must have a 1, 2, 3 and two 0's. It doesn't matter in what position these numbers are.
An example of a valid line is 1,2,3,0,0
An example of an invalid line is 1,0,3,0,0, as it does not contain a 2 and has 3 0's instead of 2.
I would like to be able to iterate through the text file and remove invalid lines.
and maybe a little message saying "There were x amount of invalid lines."
Or maybe as suggested:
As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
I think that the csv module may help so I read the documentation and it doesn't help me.
Any ideas?
You can't remove lines from a file, per se. Rather, you have to rewrite the file, including only the valid lines. Either close the file after you've read all the data and reopen it in mode "w", or write to a new file as you process the lines (which takes less memory in the short term).
Your main problem with detecting line validity seems to be handling the input. You want to convert the input text to a list of values; this is a skill you should get from learning your tools. The ones you need here are split to divide the line, and int to convert the values. For instance:
line_vals = line.split(',')
Now iterate through line_vals, and convert each to integer with int.
Validity: you need to count the quantity of each value you have in this list. You should be able to count things by value; if not, back up to your prior lessons and review basic logic and data flow. If you want the advanced method for this, use collections.Counter, which is a convenient type of dictionary that accumulates counts from any sequence.
Does that get you moving? If you're still lost, I recommend some time with a local tutor.
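A minimal sketch of the Counter approach, assuming (as in the question) that a valid line is exactly the values 1, 2, 3, 0, 0 in any order:

```python
from collections import Counter

def is_valid(line):
    # Convert the comma-separated text to a list of ints, then compare
    # value counts: a valid line has two 0s and one each of 1, 2 and 3.
    values = [int(v) for v in line.strip().split(',')]
    return Counter(values) == Counter([0, 0, 1, 2, 3])

print(is_valid("1,2,3,0,0"))  # True
print(is_valid("1,0,3,0,0"))  # False: no 2, and three 0s
```

Comparing two Counters checks the multiset of values, so the position of each number in the line doesn't matter.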
One of the possible right approaches:
with open('filename.txt', 'r+') as f:  # opening file in read/write mode
    inv_lines_cnt = 0
    valid_list = [0, 0, 1, 2, 3]  # sorted list of valid values
    lines = f.read().splitlines()
    f.seek(0)
    f.truncate(0)  # truncating the initial file
    for l in lines:
        if sorted(map(int, l.split(','))) == valid_list:
            f.write(l + '\n')
        else:
            inv_lines_cnt += 1
    print("There were {} amount of invalid lines.".format(inv_lines_cnt))
The output:
There were 2 amount of invalid lines.
The final filename.txt contents:
1,2,3,0,0
2,3,0,1,0
3,2,1,0,0
This is a mostly language-independent problem. What you would do is open another file for writing. As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
For a line to be valid, each line must have a 1, 2, 3 and 2 0's. It doesn't matter in what position these numbers are.
CHUNK_SIZE = 65536

def _is_valid(line):
    """Check if a line is valid.

    A line is valid if it is of length 5 and contains '1', '2', '3',
    in any order, as well as '0', twice.

    :param list line: The line to check, split into its comma-separated values.
    :return: True if the line is valid, else False.
    :rtype: bool
    """
    if len(line) != 5:
        # If there aren't exactly five elements in the line, return False.
        return False
    if all(x in line for x in {"1", "2", "3"}) and line.count("0") == 2:
        # Builtin `all` checks if a condition (in this case `x in line`)
        # applies to all elements of a certain iterator.
        # `list.count` returns the amount of times a specific
        # element appears in it. If "0" appears exactly twice in the line
        # and the `all` call returns True, the line is valid.
        return True
    # If the previous block doesn't execute, the line isn't valid.
    return False

def get_valid_lines(path):
    """Get the valid lines from a file.

    The valid lines will be written back to `path`.

    :param str path: The path to the file.
    :return: None
    :rtype: None
    """
    invalid_lines = 0
    contents = []
    valid_lines = []
    with open(path, "r") as f:
        # Open the `path` parameter in reading mode.
        while True:
            chunk = f.read(CHUNK_SIZE)
            # Read `CHUNK_SIZE` bytes (65536) from the file.
            if not chunk:
                # Reaching the end of the file, we get an EOF.
                break
            contents.append(chunk)
            # If the chunk is not empty, add it to the contents.
    contents = "".join(contents).split("\n")
    # `contents` was read in chunks of size 65536, so we need to join
    # them using `str.join`. We then split all of this by newlines, to get
    # each individual line.
    for line in contents:
        # Split each line into its comma-separated values before validating,
        # since `_is_valid` expects a list of five elements.
        if not _is_valid(line=line.split(",")):
            invalid_lines += 1
        else:
            valid_lines.append(line)
    print("Found {} invalid lines".format(invalid_lines))
    with open(path, "w") as f:
        for line in valid_lines:
            f.write(line)
            f.write("\n")
I'm splitting this up into two functions, one to check if a line is valid according to your rules, and a second one to manipulate a file. If you want to return the valid lines instead, just remove the second with statement and replace it with return valid_lines.

Python printing lines from a file

Salutations, I am trying to write a function that prints data from a text file line by line. The output needs to have the number of the line followed by a colon and a space. I came up with the following code:
def print_numbered_lines(filename):
    """Function to print numbered lines from a list"""
    data = open(filename)
    line_number = 1
    for line in data:
        print(str(line_number) + ": " + line, end=' ')
        line_number += 1
The issue is that when I run this function using test text files I created, the first line is not at the same indentation level as the rest of the lines in the output, i.e. the output looks kind of like
1: 9874234,12.5,23.0,50.0
2: 7840231,70,60,85.4
3: 3845913,55.5,60.5,80.0
4: 3849511,20,60,50
Where am I going wrong? Thanks
Replace the value of the end argument with an empty string instead of a space. Because the end argument is a space, print adds a space after every line, so each subsequent line starts with a space.
def print_numbered_lines(filename):
    """Function to print numbered lines from a list"""
    data = open(filename)
    line_number = 1
    for line in data:
        print(str(line_number) + ": " + line, end='')
        line_number += 1
Another way you can do this is to strip the newlines and print without passing any value to the end argument. This removes the \n the line has at its end, and a newline will still be printed because end="\n" by default.
def print_numbered_lines(filename):
    """Function to print numbered lines from a list"""
    data = open(filename)
    line_number = 1
    for line in data:
        print(str(line_number) + ": " + line.strip("\n"))
        line_number += 1
This has to do with your print statement.
print(str(line_number)+": "+line, end=' ')
You probably saw that when printing your lines there was an extra line between them and then you tried to work around this by using end=' '.
If you want to remove the 'empty' lines you should use line.strip(). This removes them.
Use this:
print(str(line_number)+": "+line.strip())
strip can also take an argument. This is from the documentation:
str.strip([chars])
Return a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped:
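For example (the first line is the example from the Python documentation):

```python
# strip() with an argument removes any run of the listed characters
# from both ends; it is a character set, not a literal prefix/suffix.
print('www.example.com'.strip('cmowz.'))  # example
# With no argument, strip() removes leading and trailing whitespace.
print('  spacious  '.strip())             # spacious
```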
What's up with that?
The lines in your file are not separated by nothing: on Linux a newline is represented by \n. Normal editors render these by pushing the following text down into a new line.
When reading a file, Python separates lines on exactly these \n characters but doesn't throw them away. When printing, they are interpreted again, and combined with the newline that print adds, there is one newline 'too many'.
The end parameter in your print statement simply changes what print will use after printing a line. Default is \n.
Check what it does when you use end=" !":
1: aaa
!2: bbb
!3: ccc
You can see the \n after 'aaa' causing a newline (which is part of the string) and after that print adds the contents of end. So it adds a !. The next line is printed in the same line because there is no other newline that would cause a line break before printing it.
You specified the end argument as a space, so after the first line each line has this extra space.
A line that you read from the file looks something like this:
'9874234,12.5,23.0,50.0\n'
Look at the ending: the line break happens because of the newline in the original line.
So to get what you want, you just need to change the end argument of print to an empty string (not a space).
Moreover, I advise you to change the implementation of the function and use enumerate for line numbering.
def print_numbered_lines(filename):
    data = open(filename)
    for i, line in enumerate(data):
        print(str(i + 1) + ": " + line, end='')

Data Manipulation: Stemming from an inability to select lists

I am very new to Python with no real prior programming knowledge. At my current job I am being asked to take data in the form of text from about 500+ files and plot it. I understand the plotting to a degree, but I cannot seem to figure out how to manipulate the data in a way that makes it easy to select specific sections. Currently this is what I have for opening a file:
fp=open("file")
for line in fp:
words = line.strip().split()
print words
The result is that it gives me a list for each line of the file, but I can only access the last one made. Does anyone know a way that would allow me to choose different variations of lists? Thanks a lot!!
The easiest way to get a list of lines from a file is as follows:
with open('file', 'r') as f:
lines = f.readlines()
Now you can split those lines or do whatever you want with them:
lines = [line.split() for line in lines]
I'm not certain that answers your question -- let me know if you have something more specific in mind.
Since I don't understand exactly what you are asking, here are a few more examples of how you might process a text file. You can experiment with these in the interactive interpreter, which you can generally access just by typing 'python' at the command line.
>>> with open('a_text_file.txt', 'r') as f:
... text = f.read()
...
>>> text
'the first line of the text file\nthe second line -- broken by a symbol\nthe third line of the text file\nsome other data\n'
That's the raw, unprocessed text of the file. It's a string. Strings are immutable -- they can't be altered -- but they can be copied in part or in whole.
>>> text.splitlines()
['the first line of the text file', 'the second line -- broken by a symbol', 'the third line of the text file', 'some other data']
splitlines is a string method. splitlines splits the string wherever it finds a \n (newline) character; it then returns a list containing copies of the separate sections of the string.
>>> lines = text.splitlines()
Here I've just saved the above list of lines to a new variable name.
>>> lines[0]
'the first line of the text file'
Lists are accessed by indexing. Just provide an integer from 0 to len(lines) - 1 and the corresponding line is returned.
>>> lines[2]
'the third line of the text file'
>>> lines[1]
'the second line -- broken by a symbol'
Now you can start to manipulate individual lines.
>>> lines[1].split('--')
['the second line ', ' broken by a symbol']
split is another string method. It's like splitlines but you can specify the character or string that you want to use as the demarcator.
>>> lines[1][4]
's'
You can also index the characters in a string.
>>> lines[1][4:10]
'second'
You can also "slice" a string. The result is a copy of characters 4 through 9. 10 is the stop value, so the 10th character isn't included in the slice. (You can slice lists too.)
>>> lines[1].index('broken')
19
If you want to find a substring within a string, one way is to use index. It returns the index at which the first occurrence of the substring appears. (It throws an error if the substring isn't in the string. If you don't want that, use find, which returns a -1 if the substring isn't in the string.)
>>> lines[1][19:]
'broken by a symbol'
Then you can use that to slice the string. If you don't provide a stop index, it just returns the remainder of the string.
>>> lines[1][:19]
'the second line -- '
If you don't provide a start index, it returns the beginning of the string and stops at the stop index.
>>> [line for line in text.splitlines() if 'line' in line]
['the first line of the text file', 'the second line -- broken by a symbol', 'the third line of the text file']
You can also use in -- it's a boolean operation that returns True if a substring is in a string. In this case, I've used a list comprehension to get only the lines that have 'line' in them. (Note that the last line is missing from the list. It has been filtered.)
Let me know if you have any more questions.
