Grep data from multiple files then feed into a log in Python

I'm new to Python; hopefully someone can help me with this.
I want to grep data from multiple files and then combine the data I grep into a single log.
My input files look like this:
Input file1 (200MHz)
Cell_a freq_100 50
Cell_a freq_200 6.8
Cell_b freq_100 70
Input file2 (100MHz)
Cell_a freq_100 100
Cell_a freq_200 10.5
Cell_b freq_100 60
This is my expected output
[cell] [freq] [value_frm_file1] [value_frm_file2] [value_frm_file3] [etc...]
Example expected output:-
Cell_a freq_100 50 100 #50 is taken from file1, 100 from file2
Cell_a freq_200 6.8 10.5
Cell_b freq_100 70 60
I guess the best way is to store the data in a Python dictionary? Can you give me an example or show me how to do this? Here is my code, but I'm only able to get one value at a time; how do I combine them according to their respective freq type?
for i in cmaxFreqList:  # this is the list based on its freq type, i.e. 200MHz, 100MHz etc
    file = path + freqfile
    with open(file) as f:
        data = f.readlines()
        for line in data:
            line = line.rstrip('\n')
            freqlength = len(line.split())
            if freqlength == 3:
                searchFreqValue = re.search("(\S+)\s+(\S+)\s+(\S+)", line)
                cell = searchFreqValue.group(1)
                freq = searchFreqValue.group(2)
                value = searchFreqValue.group(3)
                print(cell + ' ' + freq + ' ' + value)  # only able to print one value at a time
Thank you for your help!

That's a relatively simple task, provided your files are not extremely huge (i.e. their combined data can fit into working memory while concatenating them). All you need is to create a (cell_name, freq) map (you can use a dict for that) and then append the matching values to it. Once you've gone through all your files, just write the map's key->values elements to the combined output file and Bob's your uncle:
import os
import collections

path = "."  # current folder
freq_list = ["100.dat", "200.dat"]  # a list of files to concatenate
result = collections.defaultdict(list)  # a map to hold a list of our results

for file_name in freq_list:  # go through each file name
    with open(os.path.join(path, file_name), "r") as f:  # open the file
        for line in f:  # go through it line by line
            try:
                cell, freq, value = line.split()  # split it by whitespace into 3 elements
            except ValueError:  # invalid line - it didn't have exactly 3 elements
                continue  # ignore the current line and continue with the next
            result[(cell, freq)].append(value)  # append the value to our result map

with open(os.path.join(path, "combined.dat"), "w") as f:  # open our output file for writing
    # Python dictionaries are unsorted (<v3.6), sort the keys when looping through them
    for element in sorted(result):  # loop through each key in our result map
        # write the key (cell name and frequency) separated by space, add space,
        # write the values separated by space and finally add a new line:
        f.write("{} {}\n".format(" ".join(element), " ".join(result[element])))
It's unclear from your code what cmaxFreqList contains, but in my example it (freq_list) holds the actual file names - you can of course construct your input file names any way you want, just make sure that os.path.join(path, file_name) constructs a valid path (a small sketch of one option follows the example below). For example, if the above-listed 100.dat contained:
Cell_a freq_100 50
Cell_a freq_200 6.8
Cell_b freq_100 70
and the 200.dat contained:
Cell_a freq_100 100
Cell_a freq_200 10.5
Cell_b freq_100 60
the "combined.dat" file will end up as:
Cell_a freq_100 50 100
Cell_a freq_200 6.8 10.5
Cell_b freq_100 70 60
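As an aside, since the question's cmaxFreqList apparently holds frequency labels such as 200MHz, here is a minimal, hypothetical sketch of how such labels could be turned into the file names used above (the "<label>.dat" naming scheme is an assumption, not something given in the question):
cmaxFreqList = ["100MHz", "200MHz"]  # assumed contents of the question's list
# hypothetical mapping: "100MHz" -> "100.dat", "200MHz" -> "200.dat"
freq_list = [label.replace("MHz", "") + ".dat" for label in cmaxFreqList]
print(freq_list)  # ['100.dat', '200.dat']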

I don't fully understand the question due to the readability of your expected output; however, here are some tips you could use to iterate through the parameters and the values.
For searching for a type of value (i.e. cell, freq, etc.) you could use the list index method:
parameters = ['Cell_', 'freq_', 'etc']  # names of the parameters you are looking for
for parameter in parameters:
    for line in data:
        new_list = line.split()
        position_of_the_value = new_list.index(parameter) + 1
If you
print(new_list[position_of_the_value])
you get the value for that parameter in that line; you can then store it in a list:
parameter1_list = list()
parameter1_list.append(new_list[position_of_the_value])
Finally, you construct the string you want to print:
print('Parameter_1 '+ ' '.join(parameter1_list))
and this will print something like
Parameter_1 100 50 200 300
You just have to construct loops to iterate over every parameter and every list in order to get them all printed (see the combined sketch below).
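Putting those pieces together, here is a rough, untested sketch of that approach for the question's data. Note that list.index() needs an exact token match, so the parameter names below are exact tokens and a try/except guards lines where a parameter does not appear; both of those details are my assumptions, not part of the answer above.
parameters = ['Cell_a', 'freq_100']          # hypothetical exact parameter names
with open('input_file.txt') as f:            # hypothetical input file
    data = f.readlines()

for parameter in parameters:
    parameter_values = []
    for line in data:
        new_list = line.split()
        try:
            position_of_the_value = new_list.index(parameter) + 1
        except ValueError:                   # parameter not on this line
            continue
        if position_of_the_value < len(new_list):
            parameter_values.append(new_list[position_of_the_value])
    print(parameter + ' ' + ' '.join(parameter_values))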

Related

How do you split a list by space in python?

How do you split a list by space? With the code below, it reads a file with 4 lines of 7 numbers separated by spaces. When it takes the file and then splits it, it splits it by number, so if I print item[0], 5 will print instead of 50. Here is the code:
def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        print(data)
        item = data.split()
        print(data[0])

main()
The file looks like this:
50 60 15 100 60 15 40 /n
100 145 20 150 145 20 45 /n
50 245 25 120 245 25 50 /n
100 360 30 180 360 30 55 /n
split takes as an argument the character you want to split your string on.
I invite you to read the documentation of the methods you are using. :)
EDIT: By the way, readline returns a string, not a list.
However, split does return a list.
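For illustration, a quick example of the difference (calling split with no argument splits on any whitespace):
line = "50 60 15 100 60 15 40"
print(line[0])          # '5'  -- indexing a string gives a single character
print(line.split())     # ['50', '60', '15', '100', '60', '15', '40']
print(line.split()[0])  # '50'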
import nltk
tokens = nltk.word_tokenize(TextInTheFile)
Try this once you have opened that file.
TextInTheFile is a variable
There's not a lot wrong with what you are doing, except that you are printing the wrong thing.
Instead of
print(data[0])
use
print(item[0])
data[0] is the first character of the string you read from the file. You split this string into a variable called item, so that's what you should print.
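For clarity, the corrected loop body would then look like this (the same code as in the question, with only the final print changed):
for i in range(4):
    data = infile.readline()
    print(data)
    item = data.split()
    print(item[0])  # first number on the line, e.g. '50'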

Print match and line after match

I have this file containing 82 pairs of IDs:
EmuJ_000063620.1 EgrG_000063620.1 253 253
EmuJ_000065200.1 EgrG_000065200.1 128 128
EmuJ_000081200.1 EgrG_000081200.1 1213 1213
EmuJ_000096200.1 EgrG_000096200.1 295 298
EmuJ_000114700.1 EgrG_000114700.1 153 153
EmuJ_000133800.1 EgrG_000133800.1 153 153
EmuJ_000139900.1 EgrG_000144400.1 2937 2937
EmuJ_000164600.1 EgrG_000164600.1 167 167
and I have two other files with the sequences for EmuJ_* IDs and EgrG_* IDs as follows:
EgrG_sequences.fasta:
>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*
and so on. The same for EmuJ_sequences.fasta
I need to get the sequences for each pair and write one after the other maintaining the order like this:
>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*
And so on.
I wrote a script in bash to do this and it worked like I wanted; it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time doing the same in a Pythonic way.
I've tried this, but I got only the first pair and then it stopped:
rbh = open('rbh_res_eg-not-sec.txt', 'r')
ems = open('em_seq.fasta', 'r')
egs = open('eg_seq.fasta', 'r')
for l in rbh:
    emid = l.split('\t')[0]
    egid = l.split('\t')[1]
    # ids = emid + '\n' + egid
    # print ids  # just to check if split worked
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
I've tried some variations but I've got only the IDs, without the sequences.
So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)?
Please, let me know if I explained it clearly.
Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems and egs files are exhausted.
The quick & dirty workaround would be to reset the ems and egs file pointers to zero after each inner loop, i.e.:
for line in rbh:
    # no need to split twice
    parts = line.split("\t")
    emid, egid = parts[0].strip(), parts[1].strip()
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    ems.seek(0)  # reset the file pointer
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
    egs.seek(0)  # reset the file pointer
Note that calling next(iterator) while already iterating over iterator will consume one more item of the iterator, as illustrated here:
>>> it = iter(range(20))
>>> for x in it:
...     print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
As you can see, we don't iterate over every element of our range here... Given your file format it should not be a huge problem, but I thought I'd still warn you about it.
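As an aside, if you would rather make the two-lines-at-a-time reading explicit instead of a side effect of next(), a small sketch (assuming each record is exactly one header line followed by one sequence line, as in the files shown above) can pair the file iterator with itself:
with open('em_seq.fasta') as f:
    # zip the iterator with itself to get (header, sequence) pairs;
    # this assumes every record is exactly two lines
    # (in Python 2, itertools.izip would avoid building the full list)
    for header, sequence in zip(f, f):
        print(header.strip())
        print(sequence.strip())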
Now, your algorithm is far from efficient - for each line of the rbh file it will re-scan the whole ems and egs files again and again.
NB: the following assumes that each emid / egid will appear at most once in the fasta files.
If your ems and egs files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operations in Python):
# warning: totally untested code
def fastamap(path):
    d = dict()
    with open(path) as f:
        for num, line in enumerate(f, 1):
            line = line.strip()
            # skip empty lines.
            if not line:
                continue
            # sanity check: we should only see
            # lines starting with ">", the "value"
            # lines being consumed by the `next(f)` call
            if not line.startswith(">"):
                raise ValueError(
                    "in file %s: line %s doesn't start with '>'" % (
                        path, num
                    ))
            # ok, proceed
            d[line.lstrip(">")] = next(f).strip()
    return d
ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')

with open('rbh_res_eg-not-sec.txt') as rhb:
    for line in rhb:
        parts = line.split("\t")
        emid, egid = parts[0].strip(), parts[1].strip()
        if emid in ems:
            print emid
            print ems[emid]
        if egid in egs:
            print egid
            print egs[egid]
If this doesn't fly because of memory issues, well, bad luck, you're stuck with a sequential scan (unless you want to use some database system, but that might be a bit overkill), but - always assuming each emid/egid appears only once in the fasta files - you can at least exit the inner loops once you've found your target:
for l in rbh:
    # no need to split twice, you can just unpack
    emid, egid = l.split('\t')
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
            break  # no need to go further
    ems.seek(0)  # reset the file pointer
    # etc...

Python Overwrite Dictionary write to text file issue

In my last question Rewriting my scores text file to make sure it only has the Last 4 scores (python) I managed to get a dictionary to be printed out like this:
Adam 150 140 130 50
Dave 120 110 80 60
Jack 100 90 70 40
But I also wanted the dictionary to show the last 4 results in alphabetical order, as such:
Adam:150
Adam:140
Adam:130
Adam:50
Dave:120
Dave:110
Dave:80
Dave:60
Jack:100
Jack:90
Jack:70
Jack:40
I have tried to use for k, v in d.iteritems(): but that does not work either since there are 4 values to every key in one dictionary (4 scores to every 1 name).
How can I fix this so that the dictionary is written out like the block of text above into my text file (guess_scores.txt)?
The code that writes the scores into the text file is this, if you require it:
writer = open('Guess Scores.txt', 'wt')
for key, value in scores_guessed.items():
    output = "{}:{}\n".format(key, ','.join(map(str, scores_guessed[key])))
    writer.write(output)
writer.close()
But I can't use the 3rd line (output = ...) because that writes the scores to the text file similar to the block of text right at the top. I did try to use iteritems but that returned an error in IDLE.
Thanks, Delbert.
>>> d = {'Adam': [150, 140, 130, 50], 'Dave': [120, 110, 80, 60], 'Jack': [100, 90, 70, 40]}
>>> for k, v in sorted(d.items()):
...     for item in v:
...         print str(k) + ":" + str(item)
...
Adam:150
Adam:140
Adam:130
Adam:50
Dave:120
Dave:110
Dave:80
Dave:60
Jack:100
Jack:90
Jack:70
Jack:40
And to write it to a file:
with open('Guess Scores.txt', 'wt') as f:
    for k, v in sorted(d.items()):
        for item in v:
            f.write(str(k) + ":" + str(item) + '\n')

Find and print same elements in a loop

I have a huge input file that looks like this,
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
c2135 TRIUR3_3-P1 124 210 89.66
c2135 EMT17965 25 117 70.97
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
I would like to iterate inside each c, see if it has the same elements in line[1], and print to 2 separate files:
c's that contain the same elements, and
c's that do not have the same elements.
In the case of c1914, since 2 of its elements are the same and 1 is not, it goes to file 2. So the desired 2 output files will look like this, file1.txt:
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
file2.txt
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c2135 TRIUR3_3-P1 124 210 89.66
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
This is what I tried,
oh1 = open('result.txt', 'w')
oh2 = open('result2.txt', 'w')
f = open('file.txt', 'r')
lines = f.readlines()
for line in lines:
    new_list = line.split()
    protein = new_list[1]
    for i in range(1, len(protein)):
        (p, c) = protein[i-1], protein[i]
        if c == p:
            new_list.append(protein)
            oh1.write(line)
        else:
            oh2.write(line)
If I understand you correctly, you want to send all lines of your input file that have a given first element txt1 to your first output file if the second element txt2 of all those lines is the same; otherwise all those lines go to the second output file. Here is a program that does that.
from collections import defaultdict

# Read in file line-by-line for the first time
# Build up dictionary of txt1 to set of txt2 s
txt1totxt2 = defaultdict(set)
f = open('file.txt', 'r')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    txt2 = lst[1]
    txt1totxt2[txt1].add(txt2)

# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.
# Read in file for second time, sending each line
# to the appropriate output file
f.seek(0)
oh1 = open('result1.txt', 'w')
oh2 = open('result2.txt', 'w')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    if len(txt1totxt2[txt1]) == 1:
        oh1.write(line)
    else:
        oh2.write(line)
The program logic is very simple. For each txt1 it builds up a set of the txt2s that it sees. When you're done reading the file, if the set has just one element, then you know that the txt2s are unique; if the set has more than one element, then there are at least two different txt2s. Note that this means that if you only have one line in the input file with a particular txt1, it will always be sent to the first output file. There are ways around this if this is not the behaviour you want (see the sketch below).
Note also that because the file is large, I've read it in line-by-line: lines=f.readlines() in your original program reads the whole file into memory at a time. I've stepped through it twice: the second time does the output. If this increases the run time then you can restore the lines=f.readlines() instead of reading it a second time. However the program as is should be much more robust to very large files. Conversely if your files are very large indeed, it would be worth looking at the program to reduce the memory usage further (the dictionary txt1totxt2 could be replaced with something more optimal, albeit more complicated, if necessary).
Edit: there was a good point in comments (now deleted) about the memory cost of this algorithm. To elaborate, the memory usage could be high, but on the other hand it isn't as severe as storing the whole file: rather txt1totxt2 is a dictionary from the first text in each line to a set of the second text, which is of the order of (size of unique first text) * (average size of unique second text for each unique first text). This is likely to be a lot smaller than the file size, but the approach may require further optimization. The approach here is to get something simple going first -- this can then be iterated to optimize further if necessary.
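For instance, if you wanted groups that consist of a single line to go to the second file instead, a minimal, untested sketch of that variation (my addition, not part of the answer above) could also count the lines per txt1:
from collections import defaultdict

txt1totxt2 = defaultdict(set)
line_count = defaultdict(int)   # how many lines share each txt1

with open('file.txt') as f:
    for line in f:
        lst = line.split()
        txt1totxt2[lst[0]].add(lst[1])
        line_count[lst[0]] += 1

with open('file.txt') as f, open('result1.txt', 'w') as oh1, open('result2.txt', 'w') as oh2:
    for line in f:
        txt1 = line.split()[0]
        # "same" only if at least two lines share txt1 and they all agree on txt2
        if line_count[txt1] >= 2 and len(txt1totxt2[txt1]) == 1:
            oh1.write(line)
        else:
            oh2.write(line)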
Try this...
import collections

parsed_data = collections.OrderedDict()

with open("input.txt", "r") as fd:
    for line in fd.readlines():
        line_data = line.split()
        key = line_data[0]
        key2 = line_data[1]
        if not parsed_data.has_key(key):
            parsed_data[key] = collections.OrderedDict()
        if not parsed_data[key].has_key(key2):
            parsed_data[key][key2] = [line]
        else:
            parsed_data[key][key2].append(line)

# now process the parsed data and write result files
fsimilar = open("similar.txt", "w")
fdifferent = open("different.txt", "w")
for key in parsed_data:
    if len(parsed_data[key]) == 1:
        f = fsimilar
    else:
        f = fdifferent
    for key2 in parsed_data[key]:
        for line in parsed_data[key][key2]:
            f.write(line)

fsimilar.close()
fdifferent.close()
Hope this helps

Search and sort data from several files

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows and each row has 7 columns, like:
ccc245 1 4 5 5 3 -12.3
For me the most important values are those from the first and seventh columns: the pairs ccc245, -12.3.
What I need to do is to find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using Python and bash for this purpose, but so far I have not found a practical approach. All I know how to do is:
concatenate all in_ files in IN.TXT
search the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files, using grep -n VALUE in_s*, so I get for each value the name of the file
It works, but it is a bit tedious. I wonder about a faster approach using only bash or Python or both, or another language better suited for this.
Thanks
In Python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
I would:
take the first 10 items,
sort them, and then
for every line read from the files, insert the element into that top 10 in case its value is lower than the highest one in the current top 10 (keeping the list sorted for performance).
I wouldn't post the complete program here as it looks like homework, but see the sketch below.
Yes, if it wasn't ten, this would not be optimal.
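A minimal, untested sketch of that idea (the file globbing and column positions are assumptions taken from the question), keeping a sorted list of at most 10 (value, file, id) entries with the bisect module:
import bisect
import glob

top10 = []  # sorted list of (value, file_name, first_column), at most 10 entries

for file_name in glob.glob("in_s*.txt"):
    with open(file_name) as f:
        for line in f:
            columns = line.split()
            entry = (float(columns[6]), file_name, columns[0])
            # insert while we have fewer than 10 entries, or when this
            # value beats the largest one currently kept
            if len(top10) < 10 or entry < top10[-1]:
                bisect.insort(top10, entry)
                if len(top10) > 10:
                    top10.pop()  # drop the largest to keep the list at 10

for value, file_name, first_col in top10:
    print(file_name + " " + first_col + " " + str(value))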
Try something like this in Python:
import os

min_values = []

def add_to_min(file_name, one, seven):
    # add the tuple if we hold fewer than 10 values, or if the 7th
    # column is lower than the largest value currently kept
    if len(min_values) < 10 or seven < max(min_values)[0]:
        # let's remove the biggest value once we already hold 10
        min_values.sort()
        if len(min_values) >= 10:
            min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))

# print answers
for (seven, file_name, one) in min_values:
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []

# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))

# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
Just re-read your question - with millions of rows, you might run out of RAM....
A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
This might be close to what you're looking for:
for file in *; do sort -k6n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out
If your files are millions of lines, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with the values in the buffer. If a value is smaller than one of those in the buffer, that buffer entry is replaced by the new lower value.
for file in in_*.txt
do
    awk -vt=$t 'NR<=10{
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        for(o=1;o<=c;o++){
            if ( $7 <= val[o] ){
                val[o]=$7
                tag[o]=$1
                break
            }
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort"
        }
    }' $file
done
