reading file to check for same number of delimiters

I have written a method that reads a file line by line and checks whether every line contains the same number of delimiters (see the code below). The trouble is that it works on a line-by-line basis, and some of the files I am dealing with are gigabytes in size, so it takes a while to process. Is there a better solution that will 1) validate whether all lines contain the same number of delimiters and 2) not cause any out-of-memory issues? Thanks in advance.
def isValid(fileName):
    with open(fileName,'rb') as infile:
        for lineNumber,line in enumerate(infile,1):
            count = line.count(',')
            if lineNumber > 1 and prevCount != count:
                # this line does not contain the same number of delimiters
                return False
            prevCount = count
    return True

You can use all with a generator expression instead:
with open(file_name) as your_file:
    start = your_file.readline().count(',')  # initial count
    print(all(i.count(',') == start for i in your_file))

I propose a different approach (without code):
1. read the file as binary, and in chunks of, say, 64 KB
2. count the number of end-of-line tokens in the chunk
3. count the number of delimiters in the chunk but only up to the position of the last EOL token
4. if the delimiter count is not an exact multiple of the EOL count, stop and return False
5. At EOF, return True
As you'd have to handle the 'overlap' between the last EOL token and the end of the chunk, the logic is a bit more complicated than the 'brute-force' approach. But when dealing with GBs it might pay off.
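The chunked idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `is_valid_chunked` and its parameters are made up for this example, lines are assumed to end in `\n`, and instead of the divisibility test from step 4 it still compares every line's count, because a chunk whose lines hold 1 and 3 delimiters would also divide evenly. Whether it is actually faster than plain line iteration has not been measured here.

```python
def is_valid_chunked(file_name, delimiter=b",", chunk_size=64 * 1024):
    """Return True if every line holds the same number of delimiters."""
    expected = None  # delimiter count of the first complete line
    tail = b""       # incomplete line carried over from the previous chunk
    with open(file_name, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).split(b"\n")
            tail = lines.pop()  # the last piece has no newline yet
            for line in lines:
                count = line.count(delimiter)
                if expected is None:
                    expected = count
                elif count != expected:
                    return False
    # A final line without a trailing newline is checked here.
    if tail and expected is not None and tail.count(delimiter) != expected:
        return False
    return True
```

Only one chunk plus one partial line is held in memory at a time, so memory use stays bounded as long as no single line is enormous.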

I just noticed that - if you would want to stick with simple logic - the original code can be deflated a bit:
def isValid(fileName):
    with open(fileName,'r') as infile:
        count = infile.readline().count(',')
        for line in infile:
            if line.count(',') != count:
                return False
    return True
There is no need to keep the previous line's count as one single difference will decide it. So keep only the delim count of the first line.
Then, the file needs to be opened as a text file ('r'), not as a binary.
Lastly, by prefetching the very first line just before the loop we can discard the call to enumerate.

Related

Python - count key value pairs from text file

I have the following text file:
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Each key pair records how many times the string appears in a document, in the form [docID]:[stringFq].
How could you calculate the number of key pairs in this text file?

Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements you will uncover some intermediate results.
Given
%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Code
import itertools as it
with open("foo.txt") as f:
    lines = f.readlines()
    #print(lines)
pred = lambda x: x.isalpha()
count = 0
for line in lines:
    line = line.strip("\n")
    line = "".join(it.dropwhile(pred, line))
    pairs = line.strip().split(" ")
    #print(pairs)
    count += len(pairs)
count
# 15
Details
First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:
def pred(x):
    return x.isalpha()
We initialize a count variable and start iterating each line. Every line may have a trailing newline character \n, so we first strip() them away before feeding the line to dropwhile.
dropwhile is a special itertools iterator. As it iterates a line, it will discard any leading characters that satisfy the predicate until it reaches the first character that fails the predicate. In other words, all letters at the start will be dropped until the first non-letter is found (which happens to be a space). We clean the new line again, stripping the leading space, and the remaining string is split() into a list of pairs.
Finally the length of each line of pairs is incrementally added to count. The final count is the sum of all lengths of pairs.
Summary
The code above shows how to tackle basic file handling with simple, iterative steps:
open the file
split the file into lines
while iterating each line, clean and process data
output a result
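As a shorter variation on the same steps, the pairs can be counted while iterating over the file object directly, without `readlines()` or `dropwhile`. This is a sketch, not the answer's own method: the function name `count_pairs` is hypothetical, and it assumes the first whitespace-separated field on each line is the word and every remaining field containing a colon is a pair.

```python
def count_pairs(path):
    """Count the docID:frequency pairs in a file like foo.txt."""
    total = 0
    with open(path) as f:
        for line in f:
            fields = line.split()  # split on any run of whitespace
            # Skip the leading word; count only fields shaped like "233:1".
            total += sum(1 for field in fields[1:] if ":" in field)
    return total
```

Because the file is consumed one line at a time, this also works for files too large to load in full.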

import re
with open('input.txt', 'r') as f:
    text = f.read()
# find every number in the text file
numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)
# each pair contributes two numbers, so halve the total to count pairs
numLen = len(numbers) // 2
print(numLen)

Removing lines from a txt file based on the structure of the line

Code:
with open("filename.txt", 'r') as f: #I'm not sure about reading it as r because I would be removing lines.
    lines = f.readlines() #stores each line in the txt into 'lines'.
invalid_line_count = 0
for line in lines: #this iterates through each line of the txt file.
    if line is invalid:
        # something which removes the invalid lines.
        invalid_line_count += 1
print("There were " + str(invalid_line_count) + " amount of invalid lines.")
I have a text file like so:
1,2,3,0,0
2,3,0,1,0
0,0,0,1,2
1,0,3,0,0
3,2,1,0,0
The valid line structure is 5 values split by commas.
For a line to be valid, it must have a 1, 2, 3 and two 0's. It doesn't matter in what position these numbers are.
An example of a valid line is 1,2,3,0,0
An example of an invalid line is 1,0,3,0,0, as it does not contain a 2 and has 3 0's instead of 2.
I would like to be able to iterate through the text file and remove invalid lines, and maybe print a little message saying "There were x amount of invalid lines."
Or maybe as suggested:
As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
I thought the csv module might help, so I read the documentation, but it didn't help me.
Any ideas?

You can't remove lines from a file, per se. Rather, you have to rewrite the file, including only the valid lines. Either close the file after you've read all the data and reopen it in mode "w", or write to a new file as you process the lines (which takes less memory in the short term).
Your main problem with detecting line validity seems to be handling the input. You want to convert the input text to a list of values; this is a skill you should get from learning your tools. The ones you need here are split to divide the line, and int to convert the values. For instance:
line_vals = line.split(',')
Now iterate through line_vals, and convert each to integer with int.
Validity: you need to count the quantity of each value you have in this list. You should be able to count things by value; if not, back up to your prior lessons and review basic logic and data flow. If you want the advanced method for this, use collections.Counter, which is a convenient type of dictionary that accumulates counts from any sequence.
Does that get you moving? If you're still lost, I recommend some time with a local tutor.
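To make the collections.Counter suggestion concrete, here is a minimal sketch. The names `REQUIRED` and `is_valid_line` are invented for this example, and it assumes a line that fails integer conversion should simply count as invalid.

```python
from collections import Counter

# The required multiset of values: two 0s and one each of 1, 2 and 3.
REQUIRED = Counter({0: 2, 1: 1, 2: 1, 3: 1})

def is_valid_line(line):
    """Return True if the comma-separated line holds exactly 0,0,1,2,3 in any order."""
    try:
        values = [int(v) for v in line.strip().split(",")]
    except ValueError:
        return False  # a non-numeric or empty field makes the line invalid
    # Counters compare equal only when every value has the same count.
    return Counter(values) == REQUIRED
```

For example, `is_valid_line("2,3,0,1,0")` is true, while `is_valid_line("1,0,3,0,0")` is false because it has three 0s and no 2.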

One of the possible right approaches:
with open('filename.txt', 'r+') as f: # opening file in read/write mode
    inv_lines_cnt = 0
    valid_list = [0, 0, 1, 2, 3] # sorted list of valid values
    lines = f.read().splitlines()
    f.seek(0)
    f.truncate(0) # truncating the initial file
    for l in lines:
        if sorted(map(int, l.split(','))) == valid_list:
            f.write(l+'\n')
        else:
            inv_lines_cnt += 1
print("There were {} amount of invalid lines.".format(inv_lines_cnt))
The output:
There were 2 amount of invalid lines.
The final filename.txt contents:
1,2,3,0,0
2,3,0,1,0
3,2,1,0,0

This is a mostly language-independent problem. What you would do is open another file for writing. As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
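In Python, that write-then-rename approach might look like the sketch below. The function name `filter_file` and the `is_valid` callback are assumptions made for this example, and it swaps the files in one step with `os.replace` rather than first renaming the original to a backup name.

```python
import os
import tempfile

def filter_file(path, is_valid):
    """Keep only the lines for which is_valid(line) is true; return the number dropped."""
    invalid = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as dst, open(path) as src:
            for line in src:
                if is_valid(line):
                    dst.write(line)
                else:
                    invalid += 1
        os.replace(tmp_path, path)  # swap the new file in for the original
    except BaseException:
        os.remove(tmp_path)  # clean up the partial file on any failure
        raise
    return invalid
```

Only one line is held in memory at a time, and the original file is left untouched if anything fails before the final rename.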

For a line to be valid, each line must have a 1, 2, 3 and 2 0's. It doesn't matter in what position these numbers are.
CHUNK_SIZE = 65536
def _is_valid(line):
    """Check if a line is valid.
    A line is valid if it is of length 5 and contains '1', '2', '3',
    in any order, as well as '0', twice.
    :param list line: The line to check, already split into its values.
    :return: True if the line is valid, else False.
    :rtype: bool
    """
    if len(line) != 5:
        # If there's not exactly five elements in the line, return false
        return False
    if all(x in line for x in {"1", "2", "3"}) and line.count("0") == 2:
        # Builtin `all` checks if a condition (in this case `x in line`)
        # applies to all elements of a certain iterator.
        # `list.count` returns the amount of times a specific
        # element appears in it. If "0" appears exactly twice in the line
        # and the `all` call returns True, the line is valid.
        return True
    # If the previous block doesn't execute, the line isn't valid.
    return False
def get_valid_lines(path):
    """Get the valid lines from a file.
    The valid lines will be written back to `path`.
    :param str path: The path to the file.
    :return: None
    :rtype: None
    """
    invalid_lines = 0
    contents = []
    valid_lines = []
    with open(path, "r") as f:
        # Open the `path` parameter in reading mode.
        while True:
            chunk = f.read(CHUNK_SIZE)
            # Read up to `CHUNK_SIZE` characters (65536) from the file.
            if not chunk:
                # At the end of the file, `read` returns an empty string.
                break
            contents.append(chunk)
            # If the chunk is not empty, add it to the contents.
    contents = "".join(contents).splitlines()
    # `contents` holds chunks of up to 65536 characters. We join them
    # using `str.join`, then split the result into individual lines.
    for line in contents:
        if not _is_valid(line=line.split(",")):
            # Split on commas so `_is_valid` receives a list of values.
            invalid_lines += 1
        else:
            valid_lines.append(line)
    print("Found {} invalid lines".format(invalid_lines))
    with open(path, "w") as f:
        for line in valid_lines:
            f.write(line)
            f.write("\n")
I'm splitting this up into two functions, one to check if a line is valid according to your rules, and a second one to manipulate a file. If you want to return the valid lines instead, just remove the second with statement and replace it with return valid_lines.

How to input a line word by word in Python?

I have multiple files, each with a line with, say ~10M numbers each. I want to check each file and print a 0 for each file that has numbers repeated and 1 for each that doesn't.
I am using a list for counting frequency. Because of the large amount of numbers per line I want to update the frequency after accepting each number and break as soon as I find a repeated number. While this is simple in C, I have no idea how to do this in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.

Read the line, split the line, and copy the resulting list into a set. If the size of the set is less than the size of the list, the file contains repeated elements.
with open('filename', 'r') as f:
    for line in f:
        numbers = list(map(int, line.split()))
        # a set drops duplicates, so a smaller set means a number was repeated
        print(0 if len(set(numbers)) < len(numbers) else 1)
To read the file word by word, try this:
def readWords(file_object):
    word = ""
    # iter() calls read(1) repeatedly until it returns "" at EOF
    for ch in iter(lambda: file_object.read(1), ""):
        if ch.isspace():
            if word: # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word # Handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    seen = set()
    for num in map(int, readWords(f)):
        # Store the numbers in a set, and use the set to check if the number already exists
        if num in seen:
            break
        seen.add(num)
This method should also work for streams because it only reads one character at a time and yields one whitespace-delimited word at a time from the input stream.
After giving this answer, I've updated this method quite a bit. Have a look:
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc

I don't think that is possible the way you are asking. You can't read word by word as such in Python. Something like this can be done, though it reads the whole file first:
with open('words.txt') as f:
    for word in f.read().split():
        print(word)
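For very long lines, a middle ground between reading one character at a time and reading the whole file is to read fixed-size chunks and carry any partial word over to the next chunk. The sketch below assumes a text stream (anything with a `read()` method, such as an open file or `sys.stdin`) and whitespace-separated integers; the names `iter_words` and `has_repeats` are made up for this example. On an interactive terminal `read(n)` may wait for more input before returning, so a smaller `chunk_size` may be preferable there.

```python
def iter_words(stream, chunk_size=8192):
    """Yield whitespace-separated words from a text stream, one chunk at a time."""
    tail = ""  # partial word left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split()
        if chunk[-1].isspace():
            tail = ""           # the chunk ended on a word boundary
        else:
            tail = parts.pop()  # the last word may continue in the next chunk
        for word in parts:
            yield word
    if tail:
        yield tail  # last word before EOF

def has_repeats(stream):
    """Return True as soon as a number is seen for the second time."""
    seen = set()
    for word in iter_words(stream):
        number = int(word)
        if number in seen:
            return True  # stop early; the rest of the input is never read
        seen.add(number)
    return False
```

Printing `0 if has_repeats(f) else 1` for each file gives the output the question asks for.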

Python file reading after file is all read

Using Python 2.7:
msg = self.infile.read(1500)
I am reading pieces of 1500 bytes from a file inside a possibly infinite while loop. What happens when the file is all done and I have read it all? Would I be reading again from the start? (I don't want that.)
Is there a simple way to count how many chunks of 1500-byte strings (or less for the last one) I have read in total without saving them?

You can hold a counter:
ct = 0
while 1:
    msg = self.infile.read(1500)
    if len(msg) == 0:
        break
    else:
        ct += 1
After the loop ends the ct variable will hold your number of chunks.

Update
While I'll leave my notes below, other answers here meet your requirements better.
Specifically, adding
if not msg:
    break
to the loop will take care of stopping the infinite loop when you reach EOF.
As you haven't detailed why you are grabbing 1500-byte chunks, I'm going to suggest an alternative per-line solution, assuming that self.infile is a file object.
msg = [ line for line in self.infile ]
This will give you a list with each line from the file in a separate element, with index 0 being the first line and the last element in the list being the last line.

You can read chunks using iter like this:
>>> with open('infile') as infile:
...     for chunk in iter(lambda:infile.read(1500), ''):
...         if len(chunk) == 1500: # only handle full chunks; a shorter final chunk is skipped
...             print(chunk) # you can deal with `chunk` and count it here
Or, if you just want to know the count of full chunks:
>>> import os
>>> os.stat('infile').st_size//1500
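To count every chunk, including a shorter final one, without keeping any of them, something like the following sketch could be used. The function names are hypothetical, and the file is opened in binary mode so that the chunk size is measured in bytes.

```python
import os

def count_chunks(path, chunk_size=1500):
    """Count the chunks read from a file, including a shorter final chunk."""
    count = 0
    with open(path, "rb") as infile:
        # read() returns b"" at EOF; it does not wrap around to the start.
        for _ in iter(lambda: infile.read(chunk_size), b""):
            count += 1
    return count

def count_chunks_from_size(path, chunk_size=1500):
    """Same count derived from the file size alone, rounding up."""
    size = os.path.getsize(path)
    return -(-size // chunk_size)  # ceiling division using integers only
```

The second function avoids reading the file at all, which only helps if the count is all that is needed.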

Efficiently reading a certain line in a file

I came across several different methods for reading files in Python, and I was wondering which is the fastest way to do it.
For example reading the last line of a file, one can do
input_file = open('mytext.txt', 'r')
lastLine = ""
for line in input_file:
    lastLine = line
print(lastLine) # This is the last line
Or
fileHandle = open('mytext.txt', 'r')
lineList = fileHandle.readlines()
print(lineList[-1]) #This is the last line
I'm assuming that for this particular case, discussing efficiency may not be really relevant...
Question:
1. Which method is faster for picking a random line?
2. Can we deal with concepts like "SEEK" in Python (and if so, is it faster)?

If you don't need a uniform distribution (i.e. it's okay that the chance for some line to be picked is not equal for all lines) and/or if your lines are all about the same length then the problem of picking the random line can be simplified to:
1. Determine the size of the file in bytes
2. Seek to a random position
3. Search for the last newline character if any (there may be none if there's no preceding line)
4. Pick all text up to the next newline character or the end of file, whichever comes first.
For (3) you make an educated guess about how far you have to search backwards to find the previous newline. If you can tell that a line is n bytes on average, then you could read the previous n bytes in a single step.
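A rough sketch of those steps, under some assumptions: the function name `random_line` and the `block` parameter (the backward search step, standing in for the "educated guess") are invented for this example, the file is opened in binary mode so that byte offsets are exact, and, as noted above, longer lines are more likely to be picked than short ones.

```python
import os
import random

def random_line(path, block=4096):
    """Return a line picked by seeking to a random byte offset (not uniform per line)."""
    size = os.path.getsize(path)
    if size == 0:
        return b""
    with open(path, "rb") as f:
        start = random.randrange(size)  # random position inside the file
        # Walk backwards in blocks until the previous newline is found
        # or the beginning of the file is reached.
        while start > 0:
            step = min(block, start)
            f.seek(start - step)
            data = f.read(step)
            idx = data.rfind(b"\n")
            if idx != -1:
                start = start - step + idx + 1  # first byte after that newline
                break
            start -= step
        f.seek(start)
        return f.readline()  # reads up to the next newline or EOF
```

Only a few blocks around the chosen offset are read, so the cost does not grow with the size of the file.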

I had this problem a few days ago and I used this solution. My solution is similar to @Frerich Raabe's, but with no randomness, just logic :)
def get_last_line(f):
    """ f is a file object opened in binary read mode ('rb', which Python 3 needs for end-relative seeks); I just extracted the algorithm from a bigger function """
    f.seek(0, 2) # move to the end of the file
    fsize = f.tell() # file size in bytes
    tries = 0
    offs = -512
    while tries < 5:
        # Put the cursor `offs` bytes before the end (512, then 1024, 2048, ...).
        # If we reach the max fsize, it puts the cursor at the beginning (fsize * -1 means move the cursor of -fsize from the end)
        f.seek(max(fsize * -1, offs), 2)
        lines = f.readlines()
        if len(lines) > 1: # If more than 1 line is found, then we have the last complete line
            return lines[-1] # Returns the last complete line
        offs *= 2
        tries += 1
    raise ValueError("No end line found, after 5 tries (Your file may have only 1 line or the last line is longer than %s characters)" % -(offs // 2))
The tries counter prevents the loop from running forever if the file has only one line (a very, very long last line). The algorithm tries to get the last line from the last 512 characters, then 1024, 2048... and stops if there's still no complete line at the 5th iteration.
