I am trying to run an AWK script from Python so I can process some data.
Is there any way to run an awk script from a Python class without invoking it as a shell process? The framework where I run these Python scripts does not allow subprocess calls, so I am stuck either figuring out a way to convert my awk script to Python or, if it is possible, running the awk script from within Python.
Any suggestions? My awk script basically reads a text file, isolates the blocks of proteins that contain a specific chemical compound (the output is generated by our framework; I've added an example of what it looks like below), and prints those blocks out to a different file.
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value
Other Compounds X: Value Y: value Z:value
[...another similar block]
So, for example, if I build a protein I need to check whether CH3COOH appears among the compounds in the final result line. If it does, I have to take the whole block, starting from the command "buildProtein" up to the beginning of the next block, and save it to a file; then I move on to the next block and check again for the compound I am looking for. If a block does not contain it, I skip to the next one, and so on until the end of the file (the file has multiple occurrences of the compound I search for; sometimes the matching blocks are contiguous, other times they alternate with blocks that do not contain the compound).
Any help is more than welcome; I've been banging my head on this for weeks, and after finding this site I decided to ask for some help.
Thanks in advance for your kindness!
If you can't use the subprocess module, the best bet is to recode your AWK script in Python. To that end, the fileinput module is a great transition tool with an AWK-like feel.
Python's re module can help, or, if you can't be bothered with regular expressions and just need some quick field separation, you can use the built-in str methods .split() and .find().
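For instance, here is a minimal sketch of an AWK-style pattern/action loop using fileinput and str.split; the file name input.txt and the field index are just placeholders for your real data:

import fileinput

# Roughly the AWK idiom: for each line matching a pattern, act on its fields.
for line in fileinput.input(files=['input.txt']):
    if line.startswith('Final result'):
        fields = line.split()   # whitespace-separated fields, like AWK's $1, $2, ...
        print(fields[3])        # adjust the index to the field you actually need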
I have barely started learning AWK, so I can't offer any advice on that front. However, here is some Python code that does what you need:
class ProteinIterator():
    def __init__(self, file):
        self.file = open(file, 'r')
        self.first_line = self.file.readline()

    def __iter__(self):
        return self

    def __next__(self):
        "returns the next protein build"
        if not self.first_line:   # reached end of file
            raise StopIteration
        file = self.file
        protein_data = [self.first_line]
        while True:
            line = file.readline()
            if line.startswith('buildProtein ') or not line:
                self.first_line = line
                break
            protein_data.append(line)
        return Protein(protein_data)


class Protein():
    def __init__(self, data):
        self._data = data
        for line in data:
            if line.startswith('buildProtein '):
                self.initial_compounds = tuple(line[13:].split())
            elif line.startswith('Final result - '):
                pieces = line[15:].split()[::2]   # every other piece is a name
                self.final_compounds = tuple([p[:-1] for p in pieces])
            elif line.startswith('Other Compounds '):
                pieces = line[16:].split()[::2]   # every other piece is a name
                self.other_compounds = tuple([p[:-1] for p in pieces])

    def __repr__(self):
        return ("Protein(%s)" % self._data[0])

    @property
    def data(self):
        return ''.join(self._data)
What we have here is an iterator for the buildProtein text file which returns one protein at a time as a Protein object. This Protein object is smart enough to know its inputs, final results, and other results. You may have to modify some of the code if the actual text in the file is not exactly as represented in the question. Following is a short test of the code with example usage:
if __name__ == '__main__':
    test_data = """\
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value
Other Compounds X: Value Y: value Z: value"""
    open('testPI.txt', 'w').write(test_data)
    for protein in ProteinIterator('testPI.txt'):
        print(protein.initial_compounds)
        print(protein.final_compounds)
        print(protein.other_compounds)
        print()
        if 'CO2' in protein.final_compounds:
            print(protein.data)
I didn't bother saving values, but you can add that in if you like. Hopefully this will get you going.
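For the original goal of writing every block whose final result contains the compound to an output file, here is a minimal sketch on top of the classes above; the file names and the CH3COOH check are placeholders for your real data:

# Collect all blocks that contain CH3COOH in their "Final result" line.
with open('matches.txt', 'w') as out:
    for protein in ProteinIterator('framework_output.txt'):
        if 'CH3COOH' in protein.final_compounds:
            out.write(protein.data)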
Let's say I have a folder which contains the following files:
f1.py
f2.py
f3.py
In f1.py I have this code:
#O = "Random string"
print("ABCD")
#P = "Random string"
But in f2.py and f3.py I have this code:
#M = "Random string"
print("EFGH")
#Z = "Random string"
And I want to change the strings in the print function in f2.py and f3.py to the string I have in the print in f1.py, and then run all the files in the folder after changing the strings, using f1.py.
It would be best to have more context on why you want to do this.
This is possible, but in 99% of cases it's not a good idea to write self-modifying code, though it can be a lot of fun.
In fact, you would not really be writing self-modifying code, but rather one piece of code that modifies other files. But this, too, is rarely recommended.
What's more usual is that one script analyzes/parses f1.py, extracts the data, and writes it to a file (e.g. a JSON file),
and f2.py and f3.py then read the data from that file and print it.
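A minimal sketch of that data-file approach, assuming made-up names extract_message.py, shared_config.json, and a single key called message:

# extract_message.py -- extract the printed string from f1.py into a JSON file
import json
import re

with open('f1.py') as src:
    match = re.search(r'print\("([^"]*)"\)', src.read())

with open('shared_config.json', 'w') as out:
    json.dump({'message': match.group(1)}, out)

f2.py and f3.py would then simply read it back:

# f2.py (and f3.py)
import json

with open('shared_config.json') as cfg:
    print(json.load(cfg)['message'])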
Is there a particular reason you want to have code that modifies other Python files?
If you really want to have f2.py and f3.py modified, then there is another solution, which is called templating (you can, for example, use Jinja).
In this case you have two template files, f2.py.template and f3.py.template.
You write a script that parses f1.py, extracts the data, and creates f2.py from f2.py.template and the extracted data (same for f3.py.template and f3.py).
If you're really 100% sure that you want what you ask for:
Yes, it is possible.
You write a script that opens and reads f1.py line by line, looks for the line #O = "Random string", and then memorizes the next line.
Then it reads f2.py line by line and writes each line to another file (e.g. next_version_of_f2.py). It reads in a line and writes it out until it encounters the line #M = "Random string" in f2.py. In that case the line is written out, then the desired print line is written out, the print line from f2.py is read and ignored, and then all the remaining lines are read and written through.
Then you close f2.py and next_version_of_f2.py, rename f2.py to f2.py.old, and rename next_version_of_f2.py to f2.py.
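A minimal sketch of that procedure, assuming the markers and file names from the example above and that the print() line immediately follows each marker:

# rewrite_f2.py -- copy f2.py, swapping in the print() line found in f1.py
import os

# 1. Find the print() line that follows the '#O = ' marker in f1.py.
with open('f1.py') as src:
    lines = src.readlines()
new_print = None
for i, line in enumerate(lines):
    if line.startswith('#O = '):
        new_print = lines[i + 1]
        break

# 2. Write next_version_of_f2.py, replacing the print() line after the '#M = ' marker.
with open('f2.py') as old, open('next_version_of_f2.py', 'w') as new:
    replace_next = False
    for line in old:
        if replace_next:
            new.write(new_print)      # write the desired print() line instead
            replace_next = False
            continue
        new.write(line)
        if line.startswith('#M = '):
            replace_next = True

# 3. Swap the files.
os.rename('f2.py', 'f2.py.old')
os.rename('next_version_of_f2.py', 'f2.py')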
This is certainly possible but probably inadvisable.
Editing code should typically be a separate action from executing it (even when we use a single tool that can do both, like a lot of modern IDEs).
It suggests a poor workflow. If you want f2.py to print("ABCD"), then it should be written that way.
It's confusing. In order to understand what f2.py does, you have to mentally model the entirety of f1.py and f2.py, and there's no indication of this in f2.py.
It invites all kinds of difficult-to-debug situations. What happens if f1.py is run twice at the same time? What if two different versions of f1.py are run at the same time? Or if I happen to be reading f2.py when you run f1.py? Or if I'm editing f2.py and save my changes while you're running f1.py?
It's a security problem. For f1.py to edit f2.py, the user (shell, web-server, or other surface) calling f1.py has to have edit permissions on f2.py. That means that if they can get f1.py to do something besides what you intended (specifically, if they can get their own text in place of "ABCD"), then they can get arbitrary code execution in everyone else's runtime!
Note that it's perfectly fine to have code that generates or edits other code. The problem is when a program (possibly spanning multiple files) edits its own source.
gelonida discusses some options, which are fine and appropriate for certain contexts such as managing user-specific configuration or building documents. That said, if you're familiar with functions, variables, imports, and other basics of computer science, then you may not even need a config.json file or a template engine.
Reconsider the end result you're trying to accomplish. Do some more research/reading, and if you're still stuck start a new question about the bigger-picture task.
Complaints about XY problems are old hat, so here's how to do what you want, even though it's awful and you really shouldn't.
f1.py
import os
import re
import sys

#O = "Random string"
print("ABCD")
#P = "Random string"

selector = re.compile(
    r'#([A-Z]) = \"Random string\"\nprint\(\"([A-Z]{4})\"\)\n#([A-Z]) = \"Random string\"')
template = '''#{} = "Random string"
print("{}")
#{} = "Random string"'''

own_file_name = os.path.abspath(__file__)
own_directory = os.path.dirname(own_file_name)

def read_file(name: str) -> str:
    with open(name, 'r') as f:
        return f.read()

# search() rather than match(), since the marker block is not at the start of this file
find_replacement = selector.search(read_file(own_file_name))
# group(2) is the four-letter string inside print(); group(1)/group(3) are the marker letters
replacement = find_replacement.group(2) if find_replacement else False
if not replacement:
    sys.exit(-1)

def make_replacement(reg_match) -> str:
    return template.format(reg_match.group(1), replacement, reg_match.group(3))

# os.scandir() yields DirEntry objects, which have .is_file() and .path
for dir_entry in os.scandir(own_directory):
    if dir_entry.is_file():
        original = read_file(dir_entry.path)
        with open(dir_entry.path, 'w') as out_file:
            out_file.write(selector.sub(make_replacement, original))

# this will cause an infinite loop, but you technically asked for it :)
for dir_entry in os.scandir(own_directory):
    if dir_entry.is_file():
        exec(read_file(dir_entry.path))
I want to be clear that the above is a joke. I haven't tested it, and I desperately hope it won't actually solve any problems for you.
I'm piping binary data to a Python script on a Hadoop cluster using the Hadoop CLI. The binary data have terminators that identify where new documents begin. The records are sorted by a unique identifier which starts at 1000000001 and increments by 1.
I am trying to save the data only for a subset of these IDs which I have in a dictionary.
My current process is to select the data from the CLI using:
hadoop select "Database" "Collection" | cut -d$'\t' -f2 | python script.py
and process it in script.py which looks like this:
import json
import sys

member_mapping = json.load(open('member_mapping.json'))

output = []
for line in sys.stdin:
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
The problem is that there are 6.5M IDs in this binary data and it takes almost 2 hours to scan. I know the min() and max() IDs in my dictionary and you can see in my code that I stop early when I have saved n documents where n is the length of my mapping file.
I want to make this process more efficient by skipping as many reads as possible. If the ID starts at 1000000001 and the first ID I want to save is 1000010001, can I simply skip 10,000 lines?
Due to system issues, I'm not currently able to use spark or any other tools that may improve this process, so I need to stick to solutions that utilize Python and the Hadoop CLI for now.
You could try using enumerate and a threshold, and then skip any input that isn't in the range you care about. This isn't a direct fix, but it should run much faster and throw those first 10,000 lines away pretty quickly.
for lineNum, line in enumerate(sys.stdin):
    if lineNum < 10000:
        continue
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
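An alternative sketch: itertools.islice can drop the leading lines without the explicit counter; the 10,000 offset is the same assumption as above:

from itertools import islice

# Skip the first 10,000 lines, then process the rest exactly as before.
for line in islice(sys.stdin, 10000, None):
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break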
I am trying to write some results I get from a function for a range, but I don't understand why the file is empty. The function is working fine, because I can see the results in the console when I use print. First, I create the file, which works, because it is created; the output file name is taken from a string, and that part works too. So the following creates the file in the given path:
report_strategy = open(output_path+strategy.partition("strategy(")[2].partition(",")[0]+".txt", "w")
It creates a text file with the name taken from a string named "strategy", for example:
strategy = "strategy(abstraction,Ent_parent)"
a file called "abstraction.txt" is created in the output path folder. So far so good. But I can't get anything written to this file. I have a range of a few integers:
maps = (175,178,185)
This is the function:
def strategy_count(map_path, map_id):
The following loop does the counting for each item in the range "maps" to return an integer:
for i in maps:
    report_strategy.write(str(i), ",", str(strategy_count(maps_path, str(i))))
and the file is closed at the end:
report_strategy.close()
Now the following:
for i in maps:
    print str(i), "," , strategy_count(maps_path, str(i))
does give me what I want in the console:
175 , 3
178 , 0
185 , 1
What am I missing?! The function works, the file is created, and I see the output in the console as I want, but I can't write the same thing to the file. And of course, I close the file.
This is part of a program that reads text files (actually Prolog files) and runs an Answer Set Programming solver called Clingo. The output is then read to find instances of occurring strategies (a series of actions with specific rules). The whole code:
import pmaps
import strategies
import generalization

# select the strategy to count:
strategy = strategies.abstraction_strategy

import subprocess

def strategy_count(path, name):
    p = subprocess.Popen([pmaps.clingo_path, "0", ""],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT, stdin=subprocess.PIPE)
    #
    ## write input facts and rules to clingo
    with open(path + name + ".txt", "r") as source:
        for line in source:
            p.stdin.write(line)
    source.close()
    # some generalization rules added
    p.stdin.write(generalization.parent_of)
    p.stdin.write(generalization.chain_parent_of)
    # add the strategy
    p.stdin.write(strategy)
    p.stdin.write("#hide.")
    p.stdin.write("#show strategy(_,_).")
    #p.stdin.write("#show parent_of(_,_,_).")
    # close the input to clingo
    p.stdin.close()
    lines = []
    for line in p.stdout.readlines():
        lines.append(line)
    counter = 0
    for line in lines:
        if line.startswith('Answer'):
            answer = lines[counter + 1]
            break
        if line.startswith('UNSATISFIABLE'):
            answer = ''
            break
        counter += 1
    strategies = answer.count('strategy')
    return strategies

# select which data set (from the "pmaps" file) to count strategies for:
report_strategy = open(pmaps.hw3_output_path + strategy.partition("strategy(")[2].partition(",")[0] + ".txt", "w")
for i in pmaps.pmaps_hw3_fall14:
    report_strategy.write(str(i), ",", str(strategy_count(pmaps.path_hw3_fall14, str(i))))
report_strategy.close()

# the following is for testing the code. It is working and there is the right output in the console
#for i in pmaps.pmaps_hw3_fall14:
#    print str(i), "," , strategy_count(pmaps.path_hw3_fall14, str(i))
write takes one argument, which must be a string. It doesn't take multiple arguments like print, and it doesn't add a line terminator.
If you want the behavior of print, there's a "print to file" option:
print >>whateverfile, stuff, to, print
Looks weird, doesn't it? The function version of print, active by default in Python 3 and enabled with from __future__ import print_function in Python 2, has nicer syntax for it:
print(stuff, to, print, file=whateverfile)
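Applied to the loop from the question (with the same variable names), that would look roughly like this:

from __future__ import print_function   # only needed on Python 2

for i in maps:
    # print() adds the separators and the newline for you
    print(i, ",", strategy_count(maps_path, str(i)), file=report_strategy)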
The problem was with write, which, as @user2357112 mentioned, takes only one argument. The solution could also be joining the strings with + or join():
for i in maps:
    report.write(str(i) + "," + str(strategy_count(pmaps.path_hw3_fall14, str(i))) + "\n")
@user2357112, your answer might have the advantage that if your test debug in the console produces the right answer, you just need to write that. Thanks.
Let's say I have a program that uses a .txt file to store the data it needs to operate. Because it's a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through the data so that my program uses as little memory as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3):
for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        #ask for a password
if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")
Assume that validpassword is set to True or False where I ask for a password. My question is: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only take a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot find the number of lines and add to it as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of to do this is not too great: it involves using readlines one line at a time over a range so big the text file must be smaller, and then continuing when I get an error. I would prefer not to use this approach, so any suggestions would be appreciated.
You can just iterate over the file handle directly, which will then iterate over it line-by-line:
for line in file:
    if username == line.strip():
        validusername = True
        break
Other than that, you can’t really tell how many lines a file has without looking at it completely. You do know how big a file is, and you could make some assumptions on the character count for example (UTF-8 ruins that though :P); but you don’t know how long each line is without seeing it, so you don’t know where the line breaks are and as such can’t tell how many lines there are in total. You still would have to look at every character one-by-one to see if a new line begins or not.
So instead of that, we just iterate over the file and stop whenever we have read a whole line (that's when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
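If all you really need is the number of lines, here is a minimal sketch that still reads the whole file once, but never holds more than one line in memory (the file name is a placeholder):

# Count lines lazily; the file is read line by line, never stored whole.
with open('usernames.txt') as f:
    num_lines = sum(1 for _ in f)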
Yes, the good news is that you can find the number of lines in a text file without readlines, for line in file, etc. More specifically, in Python you can use byte-level functions, random access, parallel operation, and regular expressions instead of slow, sequential, line-by-line text processing. A parallel line counter for text files such as CSVs is particularly well suited to SSD devices, which have fast random access, combined with many processor cores. I used a 16-core system with an SSD to store the Higgs Boson dataset as a standard file, which you can download to test on. Even more specifically, here are fragments of working code to get you started. You are welcome to copy and use it freely, but if you do, please cite my work. Thank you:
import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file like CSV file line counter is particularly suitable for SSD which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False 1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest = cmd['unitTest']
    fileName = cmd['fileName']
    balanceFactor = cmd['balanceFactor']
    numProcesses = cmd['numProcesses']

# Do arithmetic to divide partitions into startbyte, endbyte strips among workers (2 lists of int)
# Best number of strips to use is 2x to 3x number of workers, for workload balancing
# import numpy as np  # long heavy import but i love numpy syntax
def PartitionDataToWorkers(workers, items, balanceFactor=2):
    strips = balanceFactor * workers
    step = int(round(float(items) / strips))
    startPos = list(range(1, items + 1, step))
    if len(startPos) > strips:
        startPos = startPos[:-1]
    endPos = [x + step - 1 for x in startPos]
    endPos[-1] = items
    return startPos, endPos

def ReadFileSegment(startByte, endByte, fileName, searchChar=b'\n'):  # counts number of searchChar appearing in the byte range
    # binary mode, so seek()/read() work on raw bytes rather than decoded characters
    with open(fileName, 'rb') as f:
        f.seek(startByte - 1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        buf = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, buf))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

if 0 == unitTest:
    # Run app, not unit tests.
    fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
    startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
    p = Pool(numProcesses)
    partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
    globalSum = sum(partialSum)
    print(globalSum)
else:
    print("Running unit tests")  # Bash commands like: head --bytes 96 beer.csv are how I found the correct values.
    fileName = 'beer.csv'  # byte 98 is a newline
    assert(8 == ReadFileSegment(1, 288, fileName))
    assert(1 == ReadFileSegment(1, 100, fileName))
    assert(0 == ReadFileSegment(1, 97, fileName))
    assert(1 == ReadFileSegment(97, 98, fileName))
    assert(1 == ReadFileSegment(98, 99, fileName))
    assert(0 == ReadFileSegment(99, 99, fileName))
    assert(1 == ReadFileSegment(98, 98, fileName))
    assert(0 == ReadFileSegment(97, 97, fileName))
    print("OK")
The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance testing results. That said, if you change some of this code to use Cython or something, you might get even more speed.
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program.
I wondered if compiling the regex just one time and passing it to all workers would improve speed. Answer: regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.
One more thing. Does parallel CSV file reading even help, I wondered? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need for doing machine learning where you generally need to partition a dataset into training, dev, and testing examples.
Higgs Boson dataset:
https://archive.ics.uci.edu/ml/datasets/HIGGS
If you want the number of lines in a file so badly, why don't you just use len?
with open("filename") as f:
num = len(f.readlines())
I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y . The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to have this part: lines = min(lines, len(offsets)). Therefore you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see a way it could have worked before. text.concordance does not return anything, but outputs everything to stdout using print. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Close file
fileconcord.close()
# Reset stdout in case you need something else to print
sys.stdout = tmpout
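On Python 3 the same redirection can be written more compactly with contextlib.redirect_stdout; note that sys.maxint no longer exists there, so sys.maxsize is the closest substitute. A minimal sketch:

import contextlib
import sys

# Everything printed inside the with-block goes to the file instead of the console
with open('ccord-cultural.txt', 'w') as fileconcord:
    with contextlib.redirect_stdout(fileconcord):
        text.concordance("cultural", 200, sys.maxsize)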
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
Update:
I found this thread, "write text.concordance output to a file",
in the nltk user group. It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the italicized part: did you examine the output file in different editors? Perhaps the data is there, but not being rendered correctly due to missing end-of-line separators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.