I'm working on a project where I am given raw log data and need to parse it into a readable state. I know enough Python to strip off all the unneeded parts and be left with raw data that needs to be split and formatted, but I can't figure out a way to break it apart when multiple records are put on the same line, which does not always happen.
This is the string value I have so far:
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,
I need to break it apart so that each line starts with the number value, but since I can't assume there will only be one or two of them, I can't think of a way to do it, and the number of comma-separated values after each header varies, so I can't go by length. This is what I am looking to get, for use in further operations, from the above example:
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
output = list()
i = 0
x = txt.split("*")
while i < len(x):
    if len(x[i]) == 0:
        i += 1
        continue
    print ("*{0}*{1}".format(x[i], x[i+1]))
    output.append("*{0}*{1}".format(x[i], x[i+1]))
    i += 2
Use split to tokenize the words between the * characters, then print two consecutive tokens at a time.
You can use a regex:
([*][0-9]*[*])
You can capture the header part with this and then split according to it.
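For instance, re.split with a capturing group keeps the headers in the result, and the pieces can then be paired back up. A sketch on a shortened sample of the string:

```python
import re

# Shortened sample of the original string
txt = ("*190205*12,6000,0000000,12,6000,0000000"
       "*190206*01,2050,0100550,01,4999,0000000,,")

# A capturing group in re.split keeps the *header* delimiters in the output:
# ['', '*190205*', '12,...', '*190206*', '01,...']
parts = re.split(r'(\*[0-9]+\*)', txt)

# parts[0] is the (empty) text before the first header; after that, headers
# and payloads alternate, so glue each header to the payload that follows it
records = [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
for record in records:
    print(record)
```

This keeps working no matter how many `*header*` groups land on one line.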
Same answer as #mujiga, but I thought a dict might be better for further operations:
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
datadict=dict()
i=0
x=txt.split("*")
while i < len(x):
    if len(x[i]) == 0:
        i += 1
        continue
    datadict[x[i]] = x[i+1]
    i += 2
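The same dictionary can also be built in one pass with re.findall, pairing each numeric header with the payload that follows it. A sketch on a shortened sample:

```python
import re

# Shortened sample of the original string
txt = "*190205*12,6000,0000000*190206*01,2050,0100550,,"

# Each match captures a numeric header and everything up to the next '*'
pairs = re.findall(r'\*([0-9]+)\*([^*]*)', txt)
datadict = dict(pairs)
```

Headers become the keys and the comma-separated payloads become the values.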
Adding on to #Ali Nuri Seker's suggestion to use regex, here's a simple one lacking lookarounds (which might actually hurt it in this case):
>>> import re
>>> string = '''*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,'''
>>> print(re.sub(r'([\*][0-9,]+[\*]+[0-9,]+)', r'\n\1', string))
#Output
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (the file number that I extracted from my document), sentence ID (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by (1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on a row) to a table and assign sentence IDs as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)
for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough of to refactor, but which feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which is best avoided. In the smaller version, you can see how I substituted string literals in places where they are only used once, and called print statements immediately instead of storing the results for later. The result is usually more concise and easier to understand.
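As an aside, the zero-padded SentenceID column from the question can be produced with enumerate and str.format. A sketch using a made-up two-sentence abstract (the real text would come from your files):

```python
abstract_id = 'a9001755'  # hypothetical file number
abstract = ('Myxococcus xanthus development is regulated by X. '
            'The C signal appears to be Y.')

# Split on '.', drop empty pieces, then number the survivors from 1
sentences = [s.strip() for s in abstract.split('.') if s.strip()]
rows = [(abstract_id, '{:07d}'.format(i + 1), s)
        for i, s in enumerate(sentences)]

for row in rows:
    print('\t'.join(row))  # one tab-separated table row per sentence
```

Each row carries the abstract ID, a seven-digit sentence ID, and the sentence itself, matching the table layout in the question.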
I'm working on some of the old Google Code Jam problems as practice in Python, since we don't use this language at my school. Here is the current problem I am working on; it is supposed to just reverse the order of the words in a string.
Here is my code:
import sys
f = open("B-small-practice.in", 'r')
T = int(f.readline().strip())
array = []
for i in range(T):
    array.append(f.readline().strip().split(' '))

for ar in array:
    if len(ar) > 1:
        count = len(ar) - 1
        while count > -1:
            print ar[count]
            count -= 1
The problem is that instead of printing:
test a is this
My code prints:
test
a
is
this
Please let me know how to format my loop so that it prints all in one line. Also, I've had a bit of a learning curve when it comes to reading input from a .txt file and manipulating it, so if you have any advice on different methods to do so for such problems, it would be much appreciated!
print by default appends a newline. If you don't want that behavior, tack a comma on the end.
while count > -1:
    print ar[count],
    count -= 1
Note that a far easier method than yours to reverse a list is just to specify a step of -1 in a slice (and join it into a string):
for ar in array:
    print ' '.join(ar[::-1])  # this replaces the body of your "for ar in array" loop
There are multiple ways you can alter your output format. Here are a few:
One way is by adding a comma at the end of your print statement:
print ar[count],
This makes print output a space at the end instead of a newline.
Another way is by building up a string for each line, then printing it after the inner loop:
for ar in array:
    if len(ar) > 1:
        reversed_words = ''
        count = len(ar) - 1
        while count > -1:
            reversed_words += ar[count] + ' '
            count -= 1
        print reversed_words
A third way is by rethinking your approach, and using some useful built-in Python functions and features to simplify your solution:
for ar in array:
    print ' '.join(ar[::-1])
(See str.join and slicing).
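For reference, here is what the slice-and-join combination does on its own:

```python
words = 'this is a test'.split(' ')

# A slice with step -1 walks the list backwards; join stitches it
# back into a single space-separated line
reversed_line = ' '.join(words[::-1])
print(reversed_line)
```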
Finally, you asked about reading from a file. You're already doing that just fine, but you forgot to close the file descriptor after you were done reading, using f.close(). Even better, with Python 2.5+, you can use the with keyword:
array = list()
with open(filename) as f:
    T = int(f.readline().strip())
    for i in range(T):
        array.append(f.readline().strip().split(' '))
    # The rest of your code here
This will close the file descriptor f automatically after the with block, even if something goes wrong while reading the file.
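If you want to convince yourself of that, a quick sketch using a throwaway temporary file shows the descriptor is closed as soon as the block exits:

```python
import os
import tempfile

# Create an empty throwaway file so there is something to open
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path) as f:
    data = f.read()

# Outside the block, the file object reports itself closed,
# even though we never called f.close()
print(f.closed)
os.remove(path)
```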
T = int(raw_input())
for t in range(T):
print "Case #%d:" % (t+1),
print ' '.join([w for w in raw_input().split(' ')][::-1])
This is mostly about finding the fastest way to do it.
I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file2, which contains about 80,000 strings, and count occurrences (if a small string is found multiple times in one string, its occurrence still counts as 1). If anyone is interested in comparing performance, here is a link to download file1 and file2:
dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
What I am doing now is constructing a dictionary for file2, using the string IDs as keys and the strings as values (because the strings in file2 have duplicate values, only the string ID is unique).
My code is:
for line in file1:
    substring = line[:-1].split("\t")
    for ID in dictionary.keys():
        bigstring = dictionary[ID]
        IDlist = []
        if bigstring.find(substring) != -1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring, str(len(IDlist))))
My code will take hours to finish. Can anyone suggest a faster way to do it?
Both file1 and file2 are just around 50MB, and my PC has 8GB of memory; you can use as much memory as you need to make it faster. Any method that can finish in one hour is acceptable :)
After trying some of the suggestions from the comments below, here is a performance comparison: first comes the code, then the run time.
Some improvements were suggested by Mark Amery and others.
import sys
from Bio import SeqIO

# First I load the strings in file2 into a dictionary called var_seq
var_seq = {}
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[record.id] = str(record.seq)

# This prints 76827, which is the right number. Loading file2 into var_seq
# doesn't take long, about 1 second, so don't focus here to improve performance
print len(var_seq)

output = open(outputfilename, 'w')
icount = 0
input1 = open(file1, 'r')
for line in input1:
    icount += 1
    row = line[:-1].split("\t")
    ensp = row[0]     # ensp is just the peptide ID
    peptide = row[1]  # peptide is the substring I want to search for in file2
    num = 0
    for ID, bigstring in var_seq.iteritems():
        if peptide in bigstring:
            num += 1
    newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
    output.write(newline)
    if icount % 1000 == 0:
        break
input1.close()
handle.close()
output.close()
It takes 1m4s to finish, an improvement of 20s compared to my old version.
####### NEXT METHOD, suggested by entropy #######
from collections import defaultdict

var_seq = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[str(record.seq)] += 1

# This prints 59502; duplicates are removed, but the occurrence count
# of each duplicate is stored as the value
print len(var_seq)
handle.close()

output = open(outputfilename, 'w')
icount = 0
with open(file1) as fd:
    for line in fd:
        icount += 1
        row = line[:-1].split("\t")
        ensp = row[0]
        peptide = row[1]
        num = 0
        for varseq, num_occurrences in var_seq.items():
            if peptide in varseq:
                num += num_occurrences
        newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
        output.write(newline)
        if icount % 1000 == 0:
            break
output.close()
This one takes 1m10s, which is not faster as expected even though it avoids searching duplicates; I don't understand why.
The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the counting result for all substrings is 0, which I don't understand yet.
Here is the code in which I implemented his method:
class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1 = open(file1, 'r')
needles = {}
for line in input1:
    row = line[:-1].split("\t")
    needles[row[1]] = row[0]
print len(needles)

handle = SeqIO.parse(file2, 'fasta')
haystacks = {}
for record in handle:
    haystacks[record.id] = str(record.seq)
print len(haystacks)

for haystack_id, haystack in haystacks.iteritems():  # should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount = 0
output = open(outputfilename, 'w')
for needle in needles:
    icount += 1
    count = search_haystack_tree(needle)
    newline = "%s\t%s\t%s\n" % (needles[needle], needle, str(count))
    output.write(newline)
    if icount % 1000 == 0:
        break
input1.close()
handle.close()
output.close()
It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is my mistake that makes all the counting results 0, or whether there is a flaw in Mark's method.
Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).
For example, this line:
substring=line[:-1].split("\t")
will cause substring to be a list. But later you do:
if bigstring.find(substring)!=-1:
That would cause an error if you call str.find(list).
In any case, you are building lists uselessly in your innermost loop. This:
IDlist = []
if bigstring.find(substring) != -1:
    IDlist.append(ID)
# later
len(IDlist)
That will uselessly allocate and free lists which would cause memory thrashing as well as uselessly bogging everything down.
This is code that should work and uses more efficient means to do the counting:
from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring, num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count
PS: I am assuming that you have multiple tab-separated words per line, because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.
PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you had), you'll have to use suffix trees, as one of the comments suggested.
Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance
Edit 2: Trading maximum space for time.
Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each of the million searches should complete in the time it takes for a single hashtable lookup, that is, O(1).
Note that I have added some statements to log the time each step of the process takes. You should keep those so you know which part of the time is spent searching. Since you are testing with 1000 strings only, this matters a lot: if 90% of the cost is the build time, not the search time, then when you test with 1M strings you will still only be doing the build once, so it won't matter.
Also note that I have amended my code to parse file1 and file2 as you do, so you should be able to just plug this in and test it:
from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1, len(s) + 1):
        for offset in range(len(s) - length + 1):
            result.add(s[offset:offset + length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1

build_end = datetime.now()
print "Dictionary built in: %gs" % (build_end - build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end - search_start).total_seconds()
I'm not an algorithms whiz, but I reckon this should give you a healthy performance boost. You need to set 'haystacks' to be a list of the big words you want to look in, and 'needles' to be a list of the substrings you're looking for (either can contain duplicates), which I'll let you implement on your end. It'd be great if you could post your list of needles and list of haystacks so that we can easily compare performance of proposed solutions.
haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)
You can probably figure out what's going on by looking at the code, but just to put it in words: I construct a huge tree of substrings of the haystack words, such that given any needle, you can navigate the tree character by character and end up at a node which has attached to it the set of all haystack ids of haystacks containing that substring. Then for each needle we just go through the tree and count the size of the set at the end.
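To make the counting semantics concrete, here is the same idea run on toy haystacks and needles (not the poster's files); a haystack is counted once per needle no matter how many times the needle occurs in it:

```python
class Node(object):
    def __init__(self):
        self.words = set()   # ids of haystacks containing this substring
        self.links = {}      # next character -> child node

base = Node()
haystacks = ['abcd', 'bcde', 'xyz']  # toy data for illustration

# Insert every suffix of every haystack into the tree; each node on the
# path records which haystacks pass through it
for hid, hay in enumerate(haystacks):
    for i in range(len(hay)):
        node = base
        for ch in hay[i:]:
            node = node.links.setdefault(ch, Node())
            node.words.add(hid)

def count(needle):
    # Walk the tree character by character; a dead end means zero matches
    node = base
    for ch in needle:
        if ch not in node.links:
            return 0
        node = node.links[ch]
    return len(node.words)  # set size = number of distinct haystacks
```

For example, 'bcd' occurs inside both 'abcd' and 'bcde', so its count is 2, while 'q' occurs nowhere and counts 0.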
This code block works - it loops through a file that has a repeating number of sets of data and extracts each of the 5 pieces of information for each set.
But I know that the current factoring is not as efficient as it can be, since it loops through each key for each line found.
I'm wondering if some Python gurus can offer a better way to do this more efficiently.
def parse_params(num_of_params, lines):
    for line in lines:
        for p in range(1, num_of_params + 1, 1):
            nam = "model.paramName " + str(p) + " "
            par = "model.paramValue " + str(p) + " "
            opt = "model.optimizeParam " + str(p) + " "
            low = "model.paramLowerBound " + str(p) + " "
            upp = "model.paramUpperBound " + str(p) + " "
            keys = [nam, par, opt, low, upp]
            for key in keys:
                if key in line:
                    a, val = line.split(key)
                    if key == nam: names.append(val.rstrip())
                    if key == par: params.append(val.rstrip())
                    if key == opt: optimize.append(val.rstrip())
                    if key == upp: upper.append(val.rstrip())
                    if key == low: lower.append(val.rstrip())
    print "Names = ", names
    print "Params = ", params
    print "Optimize = ", optimize
    print "Upper = ", upper
    print "Lower = ", lower
Though this doesn't answer your question (other answers are getting at that), something that has helped me a lot in doing things similar to what you're doing is list comprehensions. They allow you to build lists in a concise and (I think) easy-to-read way.
For instance, the code below builds a 2-dimensional array with the values you're trying to get at. some_funct here would be a little regex, if I were writing it, that takes the index of the last space in the key as a parameter, looks ahead to collect the value you're trying to get in the line (the value which corresponds to the key currently being looked at), and appends it to the correct index in the seen_keys 2D array.
Wordy, yes, but if you get list-comprehension and you're able to construct the regex to do that, you've got a nice, concise solution.
keys = ["model.paramName ", "model.paramValue ", "model.optimizeParam ",
        "model.paramLowerBound ", "model.paramUpperBound "]
for line in lines:
    seen_keys = [[], [], [], [], []]
    [seen_keys[keys.index(k)].some_funct(line.index(k)) for k in keys if k in line]
It's not totally easy to see the expected format. From what I can see, the format is like:
lines = [
    "model.paramName 1 foo",
    "model.paramValue 2 bar",
    "model.optimizeParam 3 bat",
    "model.paramLowerBound 4 zip",
    "model.paramUpperBound 5 ech",
    "model.paramName 1 foo2",
    "model.paramValue 2 bar2",
    "model.optimizeParam 3 bat2",
    "model.paramLowerBound 4 zip2",
    "model.paramUpperBound 5 ech2",
]
I don't see the above code working if there is more than one value on each line, which means the digit is not really significant unless I'm missing something. In that case this works very easily:
import re

def parse_params(num_of_params, lines):
    key_to_collection = {
        "model.paramName": names,
        "model.paramValue": params,
        "model.optimizeParam": optimize,
        "model.paramLowerBound": lower,
        "model.paramUpperBound": upper,
    }
    reg = re.compile(r'(.+?) (\d) (.+)')
    for line in lines:
        m = reg.match(line)
        key, digit, value = m.group(1, 2, 3)
        key_to_collection[key].append(value)
It's not entirely obvious from your code, but it looks like each line can have one "hit" at most; if that's indeed the case, then something like:
import re

def parse_params(num_of_params, lines):
    sn = 'Names Params Optimize Upper Lower'.split()
    ks = '''paramName paramValue optimizeParam
            paramLowerBound paramUpperBound'''.split()
    vals = dict((k, []) for k in ks)
    are = re.compile(r'model\.(%s) (\d+) (.*)' % '|'.join(ks))
    for line in lines:
        mo = are.search(line)
        if not mo: continue
        p = int(mo.group(2))
        if p < 1 or p > num_of_params: continue
        vals[mo.group(1)].append(mo.group(3).rstrip())
    for k, s in zip(ks, sn):
        print '%-8s =' % s,
        print vals[k]
might work -- I exercised it with a little code as follows:
if __name__ == '__main__':
    lines = '''model.paramUpperBound 1 ZAP
model.paramLowerBound 1 zap
model.paramUpperBound 5 nope'''.splitlines()
    parse_params(2, lines)
and it emits
Names = []
Params = []
Optimize = []
Upper = ['zap']
Lower = ['ZAP']
which I think is what you want (if some details must differ, please indicate exactly what they are and let's see if we can fix it).
The two key ideas are: use a dict instead of lots of ifs; and use a regex to match "any of the following possibilities", with parenthesized groups in the pattern to capture the bits of interest (the keyword after model., the integer number after it, and the "value" which is the rest of the line), instead of lots of if x in y checks and string manipulation.
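A minimal sketch of the dict-instead-of-ifs idea on its own (toy input lines assumed):

```python
import re

# Toy lines in the format the answers assume
lines = ['model.paramName 1 foo', 'model.paramValue 1 42']

# One dict lookup replaces a whole chain of if/elif tests
buckets = {'paramName': [], 'paramValue': []}
pattern = re.compile(r'model\.(\w+) (\d+) (.*)')

for line in lines:
    m = pattern.match(line)
    if m and m.group(1) in buckets:
        buckets[m.group(1)].append(m.group(3))
```

Adding a new keyword means adding one dict entry, not another branch.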
There is a lot of duplication there, and if you ever add another key or param, you're going to have to add it in many places, which leaves you ripe for errors. What you want to do is pare down all of the places you have repeated things and use some sort of data model, such as a dict.
Some others have provided some excellent examples, so I'll just leave my answer here to give you something to think about.
Are you sure that parse_params is the bottle-neck? Have you profiled your app?
import re
from collections import defaultdict

names = ("paramName paramValue optimizeParam "
         "paramLowerBound paramUpperBound".split())
stmt_regex = re.compile(r'model\.(%s)\s+(\d+)\s+(.*)' % '|'.join(names))

def parse_params(num_of_params, lines):
    stmts = defaultdict(list)
    for m in (stmt_regex.match(s) for s in lines):
        if m and 1 <= int(m.group(2)) <= num_of_params:
            stmts[m.group(1)].append(m.group(3).rstrip())
    for k, v in stmts.iteritems():
        print "%s = %s" % (k, ' '.join(v))
The code given in the OP does multiple tests per line to try to match against the expected set of values, each of which is being constructed on the fly. Rather than construct paramValue1, paramValue2, etc. for each line, we can use a regular expression to try to do the matching in a cheaper (and more robust) manner.
Here's my code snippet, drawing from some ideas that have already been posted. This lets you add a new keyword to the key_to_collection dictionary and not have to change anything else.
import re

def parse_params(num_of_params, lines):
    pattern = re.compile(r"""
        model\.
        (\w+)    # keyword
        [ ]+     # whitespace
        (\d+)    # index to keyword
        [ ]+     # whitespace
        (.+)     # value
        """, re.VERBOSE)
    key_to_collection = {
        "paramName": names,
        "paramValue": params,
        "optimizeParam": optimize,
        "paramLowerBound": lower,
        "paramUpperBound": upper,
    }
    for line in lines:
        match = pattern.match(line)
        if not match:
            print "Invalid line: " + line
        elif match.group(1) not in key_to_collection:
            print "Invalid key: " + line
        # Not sure if you really care about enforcing this
        elif int(match.group(2)) > num_of_params:
            print "Invalid param: " + line
        else:
            key_to_collection[match.group(1)].append(match.group(3))
Full disclosure: I have not compiled/tested this.
It can certainly be made more efficient. But, to be honest, unless this function is called hundreds of times a second, or works on thousands of lines, is it necessary?
I would be more concerned about making it clear what is happening... currently, I'm far from clear on that aspect.
Just eyeballing it, the input seems to look like this:
model.paramName 1 A model.paramValue 1 B model.optimizeParam 1 C model.paramLowerBound 1 D model.paramUpperBound 1 E model.paramName 2 F model.paramValue 2 G model.optimizeParam 2 H model.paramLowerBound 2 I model.paramUpperBound 2 J
And your desired output seems to be something like:
Names = AF
Params = BG
etc...
Now, since my input certainly doesn't match yours, the output is likely off too, but I think I have the gist.
There are a few points. First, does it matter how many parameters are passed to the function? For example, if the input has two sets of parameters, do I just want to read both, or is it necessary to allow the function to read only one? For example, your code allows me to call parse_params(1, lines) and have it read only parameters ending in a 1 from the same input. If that's not actually a requirement, you can skip a large chunk of the code.
Second, is it important to ONLY read the given parameters? If I, for example, have a parameter called 'paramFoo', is it bad if I read it? You can also simplify the procedure by just grabbing all parameters regardless of their name, and extracting their value.
import re

def parse_params(input):
    parameter_list = {}
    param = re.compile(r"model\.([^ ]+) [0-9]+ ([^ ]+)")
    each_parameter = param.finditer(input)
    for match in each_parameter:
        key = match.group(1)
        value = match.group(2)
        if not key in parameter_list:
            parameter_list[key] = []
        parameter_list[key].append(value)
    return parameter_list
The output, in this instance, will be something like this:
{'paramName':[A, F], 'paramValue':[B, G], 'optimizeParam':[C, H], etc...}
Notes: I don't know Python well, I'm a Ruby guy, so my syntax may be off. Apologies.