Getting number of lines in a text file without readlines - python

Let's say I have a program that uses a .txt file to store data it needs to operate. Because the text file holds a very large amount of data (just go with it), I want to use a generator rather than an iterator to go through it, so that my program uses as little memory as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3):
for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        # ask for a password
if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")
Assume that validpassword is set to True or False where I ask for a password. My question is: since I don't want to take up all of the RAM, I don't want to use readlines() to read the whole thing, and with the code here I only hold a very small amount in memory at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot count the lines ahead of time and add to the count as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of is not too great: it involves calling readlines one line at a time over a range so large that the file must be shorter, and then continuing when I get an error. I would prefer not to use that approach, so any suggestions would be appreciated.

You can just iterate over the file handle directly, which will then iterate over it line-by-line:
for line in file:
    if username == line.strip():
        validusername = True
        break
Other than that, you can't really tell how many lines a file has without looking at it completely. You do know how big the file is, and you could make some assumptions about the character count, for example (UTF-8 ruins that though :P); but you don't know how long each line is without reading it, so you don't know where the line breaks are and as such can't tell how many lines there are in total. You would still have to look at every character to see whether a new line begins or not.
So instead of that, we just iterate over the file and stop whenever we have read a whole line (that's when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
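If you do still need the total line count at some point, a minimal sketch (assuming the one-record-per-line layout described in the question; the filename is a placeholder) is to count lines lazily while iterating, so only one line is ever held in memory:
# Lazy line count: the generator expression yields 1 per line,
# so memory use stays constant regardless of file size.
with open("usernames.txt") as f:  # placeholder filename
    num_lines = sum(1 for _ in f)
print(num_lines)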

Yes, the good news is that you can find the number of lines in a text file without readlines, without for line in file, and so on. More specifically, in Python you can use byte functions, random access, parallel operation, and regular expressions instead of slow sequential text-line processing. A parallel line counter for text files such as CSVs is particularly well suited to SSDs, which have fast random access, combined with many processor cores. I used a 16-core system with an SSD storing the Higgs Boson dataset as an ordinary file, which you can download to test against. Even more specifically, here are fragments of working code to get you started. You are welcome to copy and use it freely, but if you do, please cite my work. Thank you:
import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file (e.g. CSV) line counter, particularly suitable for SSDs, which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False 1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1, 2, or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to the number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest = cmd['unitTest']
    fileName = cmd['fileName']
    balanceFactor = cmd['balanceFactor']
    numProcesses = cmd['numProcesses']

    # Do arithmetic to divide the file into (startByte, endByte) strips among workers (two lists of int).
    # The best number of strips is 2x to 3x the number of workers, for workload balancing.
    # import numpy as np  # long heavy import but i love numpy syntax
    def PartitionDataToWorkers(workers, items, balanceFactor=2):
        strips = balanceFactor * workers
        step = int(round(float(items) / strips))
        startPos = list(range(1, items + 1, step))
        if len(startPos) > strips:
            startPos = startPos[:-1]
        endPos = [x + step - 1 for x in startPos]
        endPos[-1] = items
        return startPos, endPos

    def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):
        # Counts the number of searchChar occurrences in the given byte range.
        with open(fileName, 'r') as f:
            f.seek(startByte - 1)  # seek() starts at byte 0 and moves forward the specified amount, so seek(5) points at the 6th byte.
            bytes = f.read(endByte - startByte + 1)
            cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
        return cnt

    if 0 == unitTest:
        # Run the app, not the unit tests.
        fileBytes = stat(fileName).st_size  # Ask the OS how many bytes the text file contains.
        startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
        p = Pool(numProcesses)
        partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte and endByte are already lists; fileName is repeated into a same-length sequence.
        globalSum = sum(partialSum)
        print(globalSum)
    else:
        print("Running unit tests")  # Bash commands like `head --bytes 96 beer.csv` are how I found the expected values.
        fileName = 'beer.csv'  # byte 98 is a newline
        assert(8 == ReadFileSegment(1, 288, fileName))
        assert(1 == ReadFileSegment(1, 100, fileName))
        assert(0 == ReadFileSegment(1, 97, fileName))
        assert(1 == ReadFileSegment(97, 98, fileName))
        assert(1 == ReadFileSegment(98, 99, fileName))
        assert(0 == ReadFileSegment(99, 99, fileName))
        assert(1 == ReadFileSegment(98, 98, fileName))
        assert(0 == ReadFileSegment(97, 97, fileName))
        print("OK")
The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance test results. That said, if you changed some of this code to use Cython or similar, you might get even more speed.
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Conclusion: the speed is good for a pure Python program compared to the C program. However, it is not good enough to prefer the pure Python program over the C program.
I wondered whether compiling the regex just one time and passing it to all the workers would improve speed. Answer: regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.
One more thing: does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well, there you go!
Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need for doing machine learning where you generally need to partition a dataset into training, dev, and testing examples.
Higgs Boson dataset:
https://archive.ics.uci.edu/ml/datasets/HIGGS

If you want the number of lines in a file so badly, why don't you just use len?
with open("filename") as f:
num = len(f.readlines())
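Keep in mind that readlines() materializes every line in memory, which is what the asker wanted to avoid. A memory-bounded sketch (not from the answer above) is to read the file in fixed-size binary chunks and count newline bytes; the 1 MiB chunk size is an arbitrary choice:
def count_lines(path, chunk_size=1024 * 1024):
    # Count newline bytes chunk by chunk; memory use is bounded by chunk_size.
    # Note: a final line with no trailing newline is not counted.
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

print(count_lines("filename"))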

Related

Counter unable to run in PowerShell/IDLE not responding to large list?

I am a complete beginner, so forgive the probably-obvious question. I have a list of roughly ~800,000 items that I am trying to run through Counter. When I try to open the script in IDLE, it stops responding, and when I try to run the script through PowerShell, it throws back an error in Line 9 (the line the large list is populated on). Is there a cap on the number of items that Counter can run?
For brevity's sake, I am not including my whole list here of course, but this is my basic script:
#!/usr/bin/env python3
import json
from itertools import count
from urllib.request import urlopen
from collections import Counter
list1 = [list, items, here, et cetera]
print(Counter(list1))
This is the complete script -- Full script with list data.
The full code took 20-30 seconds to load. IDLE quits responding because the text widget of the tk GUI framework it uses freezes on extremely long lines. 10,000 characters is enough to bring it almost to a stop; hundreds of thousands or millions should freeze it completely.
The error in PowerShell has nothing to do with IDLE. Posting a sample list with a few unquoted items would have been enough to expose that error.
Given that the recommended maximum line length for Python is 79 characters (https://www.python.org/dev/peps/pep-0008/#maximum-line-length), your expectations should have been moderate at best.
It's generally a bad practice to keep your data in your code. If you must, you should at least properly quote and escape each string in the list, e.g.:
list1 = ['sherlock holmes', 'something\\else', 'a \'quote\' here', ...]
But it's a lot easier and more robust to just put your data in a text file:
sherlock holmes
star wars
star wars sequel trilogy
...
ya lit
books
The text file needs no escaping, although you may need something to deal with the line endings, which appear to have been escaped as \xa0 in your data.
And then read the file from code:
with open('myfile.txt') as f:
    list1 = f.read().splitlines()
From the partial escaping, it seems likely that something generated your 'code' to begin with - you may want to generate it again without the escaping, output a clean text file, and deal with the line endings in a sensible way; a small sketch of that follows.
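For example, a minimal sketch of that generation step, assuming the items are already available in a Python list (the list contents and file name here are hypothetical):
# Hypothetical example: write one item per line, replacing the stray
# non-breaking spaces (\xa0) seen in the original data with normal spaces.
items = ['sherlock holmes', 'star wars', 'ya lit', 'books']
with open('myfile.txt', 'w') as f:
    for item in items:
        f.write(item.replace('\xa0', ' ').strip() + '\n')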

How can I skip n lines of a binary stdin using Python?

I'm piping binary data to a Python script on a Hadoop cluster using the Hadoop CLI. The binary data have terminators that identify where new documents begin. The records are sorted by a unique identifier which starts at 1000000001 and increments by 1.
I am trying to save the data only for a subset of these IDs which I have in a dictionary.
My current process is to select the data from the CLI using:
hadoop select "Database" "Collection" | cut -d$'\t' -f2 | python script.py
and process it in script.py which looks like this:
import json
import sys
member_mapping = json.load(open('member_mapping.json'))
output = []
for line in sys.stdin:
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
The problem is that there are 6.5M IDs in this binary data and it takes almost 2 hours to scan. I know the min() and max() IDs in my dictionary and you can see in my code that I stop early when I have saved n documents where n is the length of my mapping file.
I want to make this process more efficient by skipping as many reads as possible. If the ID starts at 1000000001 and the first ID I want to save is 1000010001, can I simply skip 10,000 lines?
Due to system issues, I'm not currently able to use spark or any other tools that may improve this process, so I need to stick to solutions that utilize Python and the Hadoop CLI for now.
You could try using enumerate and a threshold, and then skip any input that isn't in the range you care about. This isn't a direct fix, but it should run much faster and throw those first 10,000 lines away pretty quickly.
for lineNum, line in enumerate(sys.stdin):
    if lineNum < 10000:
        continue
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
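Equivalently, itertools.islice can do the skipping for you; a sketch of the same idea, where member_mapping and output are the objects from the original script and 10000 stands in for the offset you would compute from the known starting ID:
import json
import sys
from itertools import islice

# islice consumes and discards the first 10000 lines without JSON-decoding them.
for line in islice(sys.stdin, 10000, None):
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break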

Python: How to Average Ping Times From File

I am looking to write two python scripts; one to ping an IP, and store the ping results to a file, and one to extract and average the ping times from the created .txt file. (Please note that all I really need to log are the ping times) (My platform is the Pi if it helps)
Below is the line of code which I plan to use to store the ping results in a text file (obviously in my program I have put this in an infinite loop with a delay so it doesn't ping too often):
command = os.system('ping 127.0.0.1 >> pingresults.txt')
I am stuck on how to access this file, and then how to parse it down to just the useful data. (Please bear in mind that I am a serious amateur.)
I am wondering if when I initially log the data into the file, if I can filter it for just the ping time. That would make averaging them later much easier.
If you have any suggestions of commands of interest, or tricks, or implementations, that would be great!
I'll take this in basic steps, entirely in Python, ignoring Python "tricks":
Open the file:
f = open("pingresults.txt", "r")
time_list = []
for line in f:
# Ping time is in the penultimate field, 5 chars in.
field_list = line.split(' ')
ping_time = field_list[-2][5:]
time_list.append(field_list[-1]) # add last field to end of time_list
print sum(time_list) / float(len(time_list))
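If the field positions vary between ping implementations, a slightly more defensive sketch is to pull the number out with a regular expression (the pattern assumes the usual "time=0.045 ms" form of Linux ping output):
import re

times = []
with open("pingresults.txt") as f:
    for line in f:
        # Capture the number following "time=" on reply lines only.
        match = re.search(r'time=([\d.]+)', line)
        if match:
            times.append(float(match.group(1)))

if times:
    print(sum(times) / len(times))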

Python caching file reads? File reads are faster on second execution

I wrote a simple caching script that searches through a directory for certain files, hashes that directory and then reads and processes the files found in that directory in a loop. It then writes the hash plus whatever data I was able to pull to a file.
On the first execution it takes about twenty seconds which is expected as there are lots of big files to regex match line by line. However, on subsequent executions, it runs MUCH faster, around 4 seconds. This seems to persist even if I start a new instance of the python interpreter or close out my IDE (Spyder) entirely.
What's interesting though is that if I run the script on a different folder, it takes about 20 seconds again, then back to 4 seconds every time after. Also interesting, is that running it with Python 3 on those same two folders also takes 20 seconds the first time, then again back to 4 seconds every time after. If I rename the folder, it only takes 4 seconds on the first run. So far I haven't been able to recreate the 20 second runtime at all.
It seems to me that Python or the OS must be internally caching the file reads (just an educated guess, as I think the open() calls are the most expensive part).
I'm not using anything fancy here btw, just Python's built-in open() and looping through the lines, the re module, and a very minimal hash module that can be found here.
Here's one of the functions that opens and processes these files in a loop (the others are virtually the same, just searching in different paths)
def createGlobalDataDictionary(self, path):
    dict1 = {}
    dict2 = {}
    start = timeit.default_timer()
    self.filelist[path] = hashfunc.dirhash(path, 'sha1')
    for i in os.listdir(path):
        if i.endswith(".m"):
            if re.match('.*foo.*', i):
                openi = open(os.path.join(path, i), "r")
                for line in openi:
                    if re.match('.*bar.*', line):
                        key = line.split('.')[0]
                        value = line.partition('\'')[-1].rpartition('\'')[0]
                        dict1[key] = value
                openi.close()
            elif re.match('.*baz.*', i):
                openi = open(os.path.join(path, i), "r")
                for line in openi:
                    if re.match(".*qux.*", line):
                        key = line.split('.')[0]
                        value = line.partition('\'')[-1].rpartition('\'')[0]
                        dict2[key] = value
                openi.close()
    return (dict1, dict2)
I'm using WinPython 32-Bit 2.7.6.3 and WinPython 64-Bit 3.4.2.4
I've been trying to figure out what's causing this behavior all day and I'm pretty stumped so any help is appreciated.
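One way to test that guess is to time the open() calls separately from the read-and-regex work; a rough sketch (the directory path is a placeholder, and the regex is one of the patterns from the function above):
import os
import re
import timeit

path = "C:/some/folder"  # placeholder path

def time_opens():
    # Only open and close each .m file; no reading at all.
    for name in os.listdir(path):
        if name.endswith(".m"):
            open(os.path.join(path, name), "r").close()

def time_reads():
    # Open, read every line, and run a regex match, as the real function does.
    for name in os.listdir(path):
        if name.endswith(".m"):
            with open(os.path.join(path, name), "r") as f:
                for line in f:
                    re.match('.*bar.*', line)

print("open only:", timeit.timeit(time_opens, number=1))
print("open + read + regex:", timeit.timeit(time_reads, number=1))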

Python: how to capture output to a text file? (only 25 of 530 lines captured now)

I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y .   The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to have this part: lines = min(lines, len(offsets)), so you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see how it could have worked before. text.concordance does not return anything; it prints everything to stdout. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Close file
fileconcord.close()
# Reset stdout in case you need something else to print
sys.stdout = tmpout
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done; a rough sketch of that approach follows.
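For instance, a rough sketch using nltk.text.ConcordanceIndex directly (this mirrors what the Text class does internally; tokens is the list from the question, the 10-token context window is an arbitrary choice, and every match is written rather than only the first 25):
from nltk.text import ConcordanceIndex

# Build the index over the same tokens, matching case-insensitively.
ci = ConcordanceIndex(tokens, key=lambda s: s.lower())

with open('ccord-cultural.txt', 'w') as out:
    for offset in ci.offsets('cultural'):
        # Join a window of tokens around each hit into one output line.
        left = ' '.join(tokens[max(0, offset - 10):offset])
        right = ' '.join(tokens[offset + 1:offset + 11])
        out.write(left + ' ' + tokens[offset] + ' ' + right + '\n')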
Update:
I found the thread "write text.concordance output to a file" on the nltk users group.
It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the italicized part: did you examine the output file in different editors? Perhaps the data is there but not being rendered correctly due to missing end-of-line separators.
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.
