How can I skip n lines of a binary stdin using Python? - python

I'm piping binary data to a Python script on a Hadoop cluster using the Hadoop CLI. The binary data have terminators that identify where new documents begin. The records are sorted by a unique identifier which starts at 1000000001 and increments by 1.
I am trying to save the data only for a subset of these IDs which I have in a dictionary.
My current process is to select the data from the CLI using:
hadoop select "Database" "Collection" | cut -d$'\t' -f2 | python script.py
and process it in script.py which looks like this:
import json
import sys
member_mapping = json.load(open('member_mapping.json'))
output = []
for line in sys.stdin:
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
The problem is that there are 6.5M IDs in this binary data and it takes almost 2 hours to scan. I know the min() and max() IDs in my dictionary and you can see in my code that I stop early when I have saved n documents where n is the length of my mapping file.
I want to make this process more efficient by skipping as many reads as possible. If the ID starts at 1000000001 and the first ID I want to save is 1000010001, can I simply skip 10,000 lines?
Due to system issues, I'm not currently able to use spark or any other tools that may improve this process, so I need to stick to solutions that utilize Python and the Hadoop CLI for now.

You could try using enumerate and a threshold, then skip any input that isn't in the range you care about. This isn't a direct fix, but it should run much faster and throw those first 10,000 lines away pretty quickly.
for lineNum, line in enumerate(sys.stdin):
    if lineNum < 10000:
        continue
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
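If the IDs really are dense and strictly sequential, you could go one step further and compute the number of lines to skip from the smallest ID in your mapping rather than hard-coding 10,000. A minimal sketch, assuming the stream starts at 1000000001 with no gaps and that the mapping keys parse as integers; the skipped lines are still read from the pipe, but they are discarded without the json.loads overhead:
import itertools
import json
import sys

member_mapping = json.load(open('member_mapping.json'))

STREAM_START_ID = 1000000001                           # first ID in the stream (from the question)
first_wanted_id = min(int(k) for k in member_mapping)  # smallest ID you actually need
lines_to_skip = max(first_wanted_id - STREAM_START_ID, 0)

output = []
# islice reads and discards the first lines_to_skip lines without decoding them
for line in itertools.islice(sys.stdin, lines_to_skip, None):
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break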

Related

Import multiple files output from bash script into Python lists

I have a bash script that connects to multiple compute nodes and pulls data from each one depending on some arguments entered after the bash script is called. For simplicity's sake, I'm essentially doing this:
for h in node{0..7}; do
    ssh $h 'fold -w 80 /program.headers | grep "RA" | head -600 | tr -d "RA =" > '$h'filename'
done
I'm trying to take the 8 files that come out of this (each has 600 pieces of information) and save them each as a list in Python. I then need to manipulate them in Python (split and convert to float) to be able to plot the data with Matplotlib.
For a bash script that only outputs one file, I can easily make a variable name equal to check_output and then manipulate from there:
test = subprocess.check_output("./bashscript")
test_list = test.split()
test = [float(a) for a in test_list]
I am also able to read a saved file from my bash script by using:
test = subprocess.check_output(['cat', '/path/filename'])
test_list = test.split()
test = [float(a) for a in test_list]
The problem is, I'm working with over 80 files after I get all that I need. Is there some way in Python to say, "for every file made store the contents of that as a list"?
Instead of capturing data with subprocess, you can use os.popen() to execute scripts. The benefit is that you can read the output of a command/script the same way you read a file, so you can use read(), readline(), or readlines() as needed; readlines() gives you a list of lines. Using it, you can execute the script and capture the output like this:
import os

output = os.popen("./bashscript").readlines()  # output now holds the output of bashscript, with each line as a separate list item
See the documentation for os.popen() for more info on how to use it, and the file-object documentation for the difference between read(), readline(), readlines(), and xreadlines().
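Tying that back to the question, a minimal sketch of turning that output into the list of floats you want (the script name is the one from the question; the splitting assumes whitespace-separated numbers):
import os

# read the script's stdout line by line, then flatten into a list of floats
lines = os.popen("./bashscript").readlines()
values = [float(x) for line in lines for x in line.split()]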
Define a simple interface between your bash script and your python script
It looks like the simple interface used to be printing out the contents of the file, but that did not scale to multiple files. Instead, I recommend the interface be the names of the files created, printed one per line. It would look something like this:
import subprocess

filenames = subprocess.check_output("./bashscript").split()
for filename in filenames:
    with open(filename) as file_obj:
        file_data = [float(a) for a in file_obj.readlines()]
It looks like you are familiar with bash but not yet with Python, so you are programming hobbled on bash crutches. Instead, embrace Python and use it throughout your application; you probably do not need the bash script at all.
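If you do want to drop the bash script entirely, here is a minimal sketch of doing it all from Python, running the remote pipeline with subprocess over ssh and keeping one list per node (the pipeline string is copied from the question; the dict name and the .decode() handling are my own assumptions):
import subprocess

remote_cmd = 'fold -w 80 /program.headers | grep "RA" | head -600 | tr -d "RA ="'

data_by_node = {}
for i in range(8):
    host = 'node{}'.format(i)
    # run the pipeline on the remote host and capture its stdout directly, no intermediate files
    out = subprocess.check_output(['ssh', host, remote_cmd]).decode()
    data_by_node[host] = [float(a) for a in out.split()]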

Efficient way to find a string based on a list

I'm new to scripting and have been reading up on Python for about 6 weeks. The script below is meant to read a log file and send an alert if one of the keywords defined in srchstring is found. It works as expected and doesn't alert on strings previously found. However, the file it processes is actively being written to by an application, and the script is too slow on files around 500 MB; under 200 MB it works fine, i.e. within 20 seconds.
Could someone suggest a more efficient way to search for a string within a file based on a pre-defined list?
import os

srchstring = ["Shutdown", "Disconnecting", "Stopping Event Thread"]
if os.path.isfile(r"\\server\\share\\logfile.txt"):
    with open(r"\\server\\share\\logfile.txt","r") as F:
        for line in F:
            for st in srchstring:
                if st in line:
                    print line,
                    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
                    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
                    if os.path.isfile("file_dd/mm/yy hh:mm:ss:ms"): # check if a file already exists named file_dd/mm/yy hh:mm:ss:ms
                        print "string previously found- ignoring, continuing search" # marker file exists
                    else:
                        open("file_dd/mm/yy hh:mm:ss:ms", 'a') # create file_dd/mm/yy hh:mm:ss:ms
                        print "error string found--creating marker file sending email alert" # no marker file, create it then send email
else:
    print "file not exist"
Refactoring the search expression to a precompiled regular expression avoids the (explicit) innermost loop.
import os, re

regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')
if os.path.isfile(r"\\server\\share\\logfile.txt"):
    # Indentation fixed as per comment
    with open(r"\\server\\share\\logfile.txt","r") as F:
        for line in F:
            if regex.search(line):
                # ...
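If it helps, here is a sketch of how the question's marker-file logic could slot in where the "# ..." is; the marker-file name is left as the same placeholder the question uses, and the email step is omitted:
import os, re

regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')
with open(r"\\server\\share\\logfile.txt", "r") as F:
    for line in F:
        if regex.search(line):
            print line,
            # slice the timestamp out of line here to build the real marker name
            marker = "file_dd/mm/yy hh:mm:ss:ms"  # placeholder, as in the question
            if os.path.isfile(marker):
                print "string previously found - ignoring, continuing search"
            else:
                open(marker, 'a').close()
                print "error string found - creating marker file, sending email alert"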
I assume here that you use Linux. If you don't, install MinGW on Windows and the solution below will become suitable too.
Just leave the hard part to the most efficient tools available. Filter your data before it reaches the Python script: use the grep command to get the lines containing "Shutdown", "Disconnecting" or "Stopping Event Thread"
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt
and redirect the lines to your script
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt | python log.py
Edit: Windows solution. You can create a .bat file to make it executable.
findstr /c:"Shutdown" /c:"Disconnecting" /c:"Stopping Event Thread" \server\share\logfile.txt | python log.py
In 'log.py', read from stdin. It's file-like object, so no difficulties here:
import sys

for line in sys.stdin:
    print line,
    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
    # and so on
This solution will reduce the amount of work your script has to do. Since Python isn't a fast language, this may speed up the task considerably. I suspect the whole thing could be rewritten purely in bash and would be even faster (20+ years of optimization of a C program is not something you compete with easily), but I don't know bash well enough.

Python: How to Average Ping Times From File

I am looking to write two python scripts; one to ping an IP, and store the ping results to a file, and one to extract and average the ping times from the created .txt file. (Please note that all I really need to log are the ping times) (My platform is the Pi if it helps)
Below is the line of code which I plan to use to store the ping results in a text file (obviously in my program I have put this in an infinite loop with a delay so it doesn't ping too often):
command = os.system('ping 127.0.0.1 >> pingresults.txt')
I am stuck on how to access this file and then parse it into just the useful data. (Please bear in mind that I am a serious amateur.)
I am wondering if when I initially log the data into the file, if I can filter it for just the ping time. That would make averaging them later much easier.
If you have any suggestions of commands of interest, or tricks, or implementations, that would be great!
I'll take this in basic steps, entirely in Python, ignoring Python "tricks":
Open the file:
f = open("pingresults.txt", "r")
time_list = []
for line in f:
# Ping time is in the penultimate field, 5 chars in.
field_list = line.split(' ')
ping_time = field_list[-2][5:]
time_list.append(field_list[-1]) # add last field to end of time_list
print sum(time_list) / float(len(time_list))
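If you want something less dependent on exact field positions (ping output differs slightly between platforms), a regular expression over the log is an alternative; a minimal sketch, assuming the file was produced by the os.system() line above:
import re

with open("pingresults.txt") as f:
    # pull every "time=<number>" out of the log, whatever else is on the line
    times = [float(t) for t in re.findall(r'time=([\d.]+)', f.read())]

if times:
    print sum(times) / len(times)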

Getting number of lines in a text file without readlines

Let's say I have a program that uses a .txt file to store the data it needs to operate. Because it's a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through it, so that my program uses as little memory as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3):
for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        # ask for a password

if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")
Assume that validpassword is set to True or False where I ask for a password. My question is this: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only use a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot count the lines as new users are added). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of is not too great: it involves using readlines one line at a time with a range so big the text file must be smaller, then continuing when I get an error. I would prefer not to do it that way, so any suggestions would be appreciated.
You can just iterate over the file handle directly, which will then iterate over it line-by-line:
for line in file:
    if username == line.strip():
        validusername = True
        break
Other than that, you can’t really tell how many lines a file has without looking at it completely. You do know how big a file is, and you could make some assumptions on the character count for example (UTF-8 ruins that though :P); but you don’t know how long each line is without seeing it, so you don’t know where the line breaks are and as such can’t tell how many lines there are in total. You still would have to look at every character one-by-one to see if a new line begins or not.
So instead of that, we just iterate over the file and pause whenever we have read a whole line (that is when the loop body executes), then continue looking from that position in the file for the next line break, and so on.
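If you do also need the total line count at some point (say, for progress reporting), you can get it without loading the file into memory by counting as you iterate; a minimal sketch (count_lines is just an illustrative helper name):
def count_lines(path):
    # iterate line by line, so only one line is held in memory at a time
    with open(path) as f:
        return sum(1 for _ in f)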
Yes, the good news is that you can find the number of lines in a text file without readlines, for line in file, etc. More specifically, in Python you can use byte functions, random access, parallel operation, and regular expressions instead of slow sequential text-line processing. A parallel line counter for text files such as CSVs is particularly well suited to SSD devices, which have fast random access, combined with many processor cores. I used a 16-core system with an SSD storing the Higgs Boson dataset as a standard file, which you can download to test on. Here are fragments from working code to get you started. You are welcome to copy and use them freely, but if you do, please cite my work. Thank you.
import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file like CSV file line counter is particularly suitable for SSD which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False 1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest = cmd['unitTest']
    fileName = cmd['fileName']
    balanceFactor = cmd['balanceFactor']
    numProcesses = cmd['numProcesses']

    # Do arithmetic to divide partitions into startbyte, endbyte strips among workers (2 lists of int)
    # Best number of strips to use is 2x to 3x number of workers, for workload balancing
    # import numpy as np  # long heavy import but i love numpy syntax
    def PartitionDataToWorkers(workers, items, balanceFactor=2):
        strips = balanceFactor * workers
        step = int(round(float(items)/strips))
        startPos = list(range(1, items+1, step))
        if len(startPos) > strips:
            startPos = startPos[:-1]
        endPos = [x + step - 1 for x in startPos]
        endPos[-1] = items
        return startPos, endPos

    def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
        with open(fileName, 'r') as f:
            f.seek(startByte-1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
            bytes = f.read(endByte - startByte + 1)
            cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
            return cnt

    if 0 == unitTest:
        # Run app, not unit tests.
        fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
        startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
        p = Pool(numProcesses)
        partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
        globalSum = sum(partialSum)
        print(globalSum)
    else:
        print("Running unit tests")  # Bash commands like: head --bytes 96 beer.csv are how I found the correct values.
        fileName = 'beer.csv'  # byte 98 is a newline
        assert(8 == ReadFileSegment(1, 288, fileName))
        assert(1 == ReadFileSegment(1, 100, fileName))
        assert(0 == ReadFileSegment(1, 97, fileName))
        assert(1 == ReadFileSegment(97, 98, fileName))
        assert(1 == ReadFileSegment(98, 99, fileName))
        assert(0 == ReadFileSegment(99, 99, fileName))
        assert(1 == ReadFileSegment(98, 98, fileName))
        assert(0 == ReadFileSegment(97, 97, fileName))
        print("OK")
The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance test results. That said, if you change some of this code to use Cython or similar you might get even more speed.
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Conclusion: the speed is good for a pure Python program compared to a C program, but not good enough to choose the pure Python program over the C program.
I wondered if compiling the regex just one time and passing it to all workers will improve speed. Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers is dominating.
One more thing. Does parallel CSV file reading even help, I wondered? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need for doing machine learning where you generally need to partition a dataset into training, dev, and testing examples.
Higgs Boson dataset:
https://archive.ics.uci.edu/ml/datasets/HIGGS
If you want the number of lines in a file so badly, why don't you just use len?
with open("filename") as f:
num = len(f.readlines())

awk in python: How to use awk scripts in a python class?

I am trying to run an awk script using python, so I can process some data.
Is there any way to get an awk script to run in a Python class without using the system class to invoke it as a shell process? The framework where I run these Python scripts does not allow the use of a subprocess call, so I am stuck either figuring out a way to convert my awk script into Python or, if it is possible, running the awk script in Python.
Any suggestions? My awk script basically reads a text file, isolates the blocks of proteins that contain a specific chemical compound (the output is generated by our framework; I've added an example of what it looks like below), and prints those blocks out to a different file.
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value
Other Compounds X: Value Y: value Z:value
[...another similar block]
So, for example, if I build a protein I need to check whether CH3COOH appears among the compounds in the final result line. If it does, I have to take the whole block, starting from the "buildProtein" command until the beginning of the next block, and save it to a file; then I move on to the next block and check again for the compound I am looking for. If a block does not have it, I skip to the next one, until the end of the file. (The file has multiple occurrences of the compound I search for; sometimes the matching blocks are contiguous, while other times they alternate with blocks that do not have the compound.)
Any help is more than welcome; banging my head for weeks now and after finding out this site I decided to ask for some help.
Thanks in advance for your kindness!
If you can't use the subprocess module, the best bet is to recode your AWK script in Python. To that end, the fileinput module is a great transition tool with an AWK-like feel.
Python's re module can help, or, if you can't be bothered with regular expressions and just need some quick field separation, you can use the built-in str.split() and str.find() methods.
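For a taste of that AWK-like feel, here is a minimal sketch of the usual fileinput pattern (the file name and the compound token are placeholders, not taken from your framework):
import fileinput

# print every "Final result" line that mentions CH3COOH, awk-style
for line in fileinput.input(['framework_output.txt']):
    if line.startswith('Final result') and 'CH3COOH:' in line.split():
        print(line, end='')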
I have barely started learning AWK, so I can't offer any advice on that front. However, here is some Python code that does what you need:
class ProteinIterator():
    def __init__(self, file):
        self.file = open(file, 'r')
        self.first_line = self.file.readline()
    def __iter__(self):
        return self
    def __next__(self):
        "returns the next protein build"
        if not self.first_line:  # reached end of file
            raise StopIteration
        file = self.file
        protein_data = [self.first_line]
        while True:
            line = file.readline()
            if line.startswith('buildProtein ') or not line:
                self.first_line = line
                break
            protein_data.append(line)
        return Protein(protein_data)

class Protein():
    def __init__(self, data):
        self._data = data
        for line in data:
            if line.startswith('buildProtein '):
                self.initial_compounds = tuple(line[13:].split())
            elif line.startswith('Final result - '):
                pieces = line[15:].split()[::2]  # every other piece is a name
                self.final_compounds = tuple([p[:-1] for p in pieces])
            elif line.startswith('Other Compounds '):
                pieces = line[16:].split()[::2]  # every other piece is a name
                self.other_compounds = tuple([p[:-1] for p in pieces])
    def __repr__(self):
        return "Protein(%s)" % self._data[0]
    @property
    def data(self):
        return ''.join(self._data)
What we have here is an iterator for the buildProtein text file which returns one protein at a time as a Protein object. This Protein object is smart enough to know its inputs, final results, and other results. You may have to modify some of the code if the actual text in the file is not exactly as represented in the question. Below is a short test of the code with example usage:
if __name__ == '__main__':
    test_data = """\
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value
Other Compounds X: Value Y: value Z: value"""
    open('testPI.txt', 'w').write(test_data)
    for protein in ProteinIterator('testPI.txt'):
        print(protein.initial_compounds)
        print(protein.final_compounds)
        print(protein.other_compounds)
        print()
        if 'CO2' in protein.final_compounds:
            print(protein.data)
I didn't bother saving values, but you can add that in if you like. Hopefully this will get you going.
