Find and print same elements in a loop - python

I have a huge input file that looks like this,
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
c2135 TRIUR3_3-P1 124 210 89.66
c2135 EMT17965 25 117 70.97
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
I would like to iterate over each c group, check whether all of its lines have the same element in line[1], and print to 2 separate files:
the c groups whose lines all contain the same element, and
the c groups whose lines do not.
In the case of c1914, since 2 of its elements are the same and 1 is not, it goes to file 2. So the desired 2 output files will look like this, file1.txt
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
file2.txt
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c2135 TRIUR3_3-P1 124 210 89.66
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
This is what I tried,
oh1=open('result.txt','w')
oh2=open('result2.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=line.split()
    protein=new_list[1]
    for i in range(1,len(protein)):
        (p, c) = protein[i-1], protein[i]
        if c == p:
            new_list.append(protein)
            oh1.write(line)
        else:
            oh2.write(line)

If I understand you correctly, you want to send all lines of your input file that have a given first element txt1 to your first output file if the second element txt2 of all those lines is the same; otherwise all those lines go to the second output file. Here is a program that does that.
from collections import defaultdict

# Read in the file line-by-line for the first time,
# building up a dictionary of txt1 to the set of its txt2s
txt1totxt2 = defaultdict(set)
f = open('file.txt', 'r')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    txt2 = lst[1]
    txt1totxt2[txt1].add(txt2)

# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.
# Read in the file for a second time, sending each line
# to the appropriate output file
f.seek(0)
oh1 = open('result1.txt', 'w')
oh2 = open('result2.txt', 'w')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    if len(txt1totxt2[txt1]) == 1:
        oh1.write(line)
    else:
        oh2.write(line)
f.close()
oh1.close()
oh2.close()
The program logic is very simple. For each txt1 it builds up the set of txt2s that it sees. When you're done reading the file, if the set has just one element, then you know all the txt2s were identical; if the set has more than one element, then there are at least two distinct txt2s. Note that this means that if you only have one line in the input file with a particular txt1, it will always be sent to the first output file. There are ways round this if this is not the behaviour you want.
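For example, a sketch (untested) that also counts the lines per txt1, so a txt1 that appears on only one line can be routed to the second file instead:

from collections import defaultdict

txt1totxt2 = defaultdict(set)
linecount = defaultdict(int)  # number of lines seen for each txt1
f = open('file.txt', 'r')
for line in f:
    lst = line.split()
    txt1totxt2[lst[0]].add(lst[1])
    linecount[lst[0]] += 1

# a txt1 goes to the first file only if it has several lines
# and all of them agree on txt2
def all_same(txt1):
    return linecount[txt1] > 1 and len(txt1totxt2[txt1]) == 1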
Note also that because the file is large, I've read it in line-by-line: lines=f.readlines() in your original program reads the whole file into memory at once. I've stepped through the file twice: the second pass does the output. If this increases the run time too much, you can restore the lines=f.readlines() approach instead of reading the file a second time. However, the program as it stands should be much more robust to very large files. Conversely, if your files are very large indeed, it would be worth reworking the program to reduce the memory usage further (the dictionary txt1totxt2 could be replaced with something more compact, albeit more complicated, if necessary).
Edit: there was a good point in comments (now deleted) about the memory cost of this algorithm. To elaborate: the memory usage could be high, but it isn't as severe as storing the whole file. txt1totxt2 maps the first text of each line to the set of second texts, so its size is of the order of (number of unique first texts) * (average number of unique second texts per first text). This is likely to be much smaller than the file itself, but the approach may need further optimization. The idea here is to get something simple working first; it can then be iterated on to optimize further if necessary.
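In the sample input the lines for each c-id happen to be contiguous; if that holds for the whole file, a single pass with itertools.groupby keeps only one group in memory at a time. A sketch, untested, assuming contiguous groups and no blank lines:

from itertools import groupby

oh1 = open('result1.txt', 'w')
oh2 = open('result2.txt', 'w')
with open('file.txt') as f:
    # group consecutive lines that share the same first column
    for key, group in groupby(f, key=lambda ln: ln.split()[0]):
        block = list(group)  # only this group's lines are held in memory
        seconds = set(ln.split()[1] for ln in block)
        out = oh1 if len(seconds) == 1 else oh2
        for ln in block:
            out.write(ln)
oh1.close()
oh2.close()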

Try this...
import collections

parsed_data = collections.OrderedDict()

with open("input.txt", "r") as fd:
    for line in fd:
        line_data = line.split()
        key = line_data[0]
        key2 = line_data[1]
        if key not in parsed_data:
            parsed_data[key] = collections.OrderedDict()
        if key2 not in parsed_data[key]:
            parsed_data[key][key2] = [line]
        else:
            parsed_data[key][key2].append(line)

# now process the parsed data and write the result files
fsimilar = open("similar.txt", "w")
fdifferent = open("different.txt", "w")
for key in parsed_data:
    # a single distinct second column means all lines in the group match
    if len(parsed_data[key]) == 1:
        f = fsimilar
    else:
        f = fdifferent
    for key2 in parsed_data[key]:
        for line in parsed_data[key][key2]:
            f.write(line)
fsimilar.close()
fdifferent.close()
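For the sample input from the question, the nested structure this builds looks roughly like this (an illustration, with the stored lines abbreviated):

# parsed_data == {
#   'c651':  {'OS05T0-00': [line], 'OS01T0-00': [line]},        # 2 inner keys -> different
#   'c1638': {'MLOC_8.3': [line, line]},                        # 1 inner key  -> similar
#   'c2135': {'TRIUR3_3-P1': [line], 'EMT17965': [line]},       # 2 inner keys -> different
#   'c1914': {'OS02T0-00': [line, line], 'OS08T0-00': [line]},  # 2 inner keys -> different
# }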
Hope this helps

Related

Check if file line starts with a character

I want to read some files with Python that contain certain data I need.
The structure of the files is like this:
NAME : a280
COMMENT : drilling problem (Ludwig)
TYPE : TSP
DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
2 288 129
3 270 133
4 256 141
5 256 157
6 246 157
7 236 169
8 228 169
9 228 161
So, the file starts with a few lines that contain data I need, then there are some random lines I do NOT need, and then there are lines with numerical data that I do need. I read everything that I need to read just fine.
However, my problem is that I cannot find a way to skip over the random number of lines that are sandwiched between the data I need. The number of such lines varies from file to file: 1, 2 or more. It would be silly to hardcode some f.readline() calls to bypass them.
I have thought of using a regular expression to check whether a line starts with a string, in order to bypass it, but I'm failing.
In other words, there can be more lines like "NODE_COORD_SECTION" that I don't need in my data.
Any help is highly appreciated.
Well, you can simply check whether every line is valid (stuff you need), and if it is not, just skip it. For example:
for line in f:
    line_list = line.split()
    # skip blank lines outright
    if not line_list:
        continue
    if line_list[0] in ['NAME', 'COMMENT', 'TYPE', ...]:
        pass  # a header line we need
    elif len(line_list) == 3 and all(s.isdigit() for s in line_list):
        pass  # a coordinate line we need (three integer fields)
    else:
        continue  # anything else gets skipped
It would be nice if you added some formatting to the lines of your file and showed some code, but here's what I would try.
I would first define a list of strings indicating a valid line, then split the current line into a list of strings and check if the first element corresponds to any of the elements in the list of valid strings.
In case the first string doesn't correspond to any of the strings in the list of valid strings, I would check if the first element is an integer, and so on...
current_line = 'LINE OF TEXT FROM FILE'
VALID_WORDS = ['VALID_STR1','VALID_STR2','VALID_STR3']
elems = current_line.split(' ')
valid_line = False
if elems[0] in VALID_WORDS:
    # If the first str is in the list of valid words,
    # continue...
    valid_line = True
elif len(elems) == 3:
    # If it's not in the list of valid words BUT has 3
    # elements, check if it's an int
    try:
        valid_line = isinstance(int(elems[0]), int)
    except Exception:
        valid_line = False
if valid_line:
    # Your thing
    pass
else:
    # Not a valid line; inside a loop over the file's
    # lines you would `continue` here to skip it
    pass

Extract rows between two identical strings?

Since I have a file which is huge (several GBs), I would not like to load the whole thing into memory, and instead want to use generators to load it line by line. My file is something like this:
# millions of lines
..................
..................
keyw 28899
2233 121 ee 0o90 jjsl
2321 232 qq 0kj9 jksl
keyw 28900
3433 124 rr 8hu9 jkas
4532 343 ww 3ko9 aslk
1098 115 uy oiw8 rekl
keyw 29891
..................
..................
# millions more
So far I have found a similar answer here, but I am lost as to how to implement it, because that answer uses specific Start and Stop identifiers, whereas my file has an identical keyword followed by an incrementing number. I would like some help regarding this.
Edit: Generators not iterators
If you want to adapt that answer this may help:
bucket = []
for line in infile:
    if line.startswith('keyw'):
        # reached the next 'keyw' line: flush the collected block
        for strings in bucket:
            outfile.write(strings + '\n')
        bucket = []
        continue
    bucket.append(line.strip())
# flush the block following the last 'keyw' line
for strings in bucket:
    outfile.write(strings + '\n')
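Since you asked for generators specifically, here is a sketch (untested; the file name is a placeholder) that wraps the same idea in a generator function, yielding one block at a time so only a single block is ever held in memory:

def blocks(infile):
    """Yield (header, lines) for each 'keyw N' block, lazily."""
    header, bucket = None, []
    for line in infile:
        if line.startswith('keyw'):
            if header is not None:
                yield header, bucket
            header, bucket = line.strip(), []
        elif header is not None:
            # lines before the first 'keyw' are skipped
            bucket.append(line.strip())
    if header is not None:
        yield header, bucket  # don't lose the final block

with open('myfile.txt') as infile:
    for header, lines in blocks(infile):
        pass  # process one block at a time here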

Print match and line after match

I have this file containing 82 pairs of IDs:
EmuJ_000063620.1 EgrG_000063620.1 253 253
EmuJ_000065200.1 EgrG_000065200.1 128 128
EmuJ_000081200.1 EgrG_000081200.1 1213 1213
EmuJ_000096200.1 EgrG_000096200.1 295 298
EmuJ_000114700.1 EgrG_000114700.1 153 153
EmuJ_000133800.1 EgrG_000133800.1 153 153
EmuJ_000139900.1 EgrG_000144400.1 2937 2937
EmuJ_000164600.1 EgrG_000164600.1 167 167
and I have two other files with the sequences for EmuJ_* IDs and EgrG_* IDs as follows:
EgrG_sequences.fasta:
>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*
and so on. The same for EmuJ_sequences.fasta
I need to get the sequences for each pair and write one after the other maintaining the order like this:
>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*
And so on.
I wrote a script in bash to do this and it worked like I wanted; it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time doing the same in a pythonic way.
I've tried this, but I've got only the first pair and then it stopped:
rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')
for l in rbh:
    emid=l.split('\t')[0]
    egid=l.split('\t')[1]
    # ids=emid+'\n'+egid
    # print ids # just to check if split worked
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
I've tried some variations but I've got only the IDs, without the sequences.
So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)?
Please, let me know if I explained it clearly.
Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems and egs files are exhausted.
The quick&dirty workaround would be to reset the ems and egs pointers to zero at the end of the outer loop, ie:
for line in rbh:
    # no need to split twice
    parts = line.split("\t")
    emid, egid = parts[0].strip(), parts[1].strip()
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    ems.seek(0) # reset the file pointer
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
    egs.seek(0) # reset the file pointer
Note that calling next(iterator) while already iterating over iterator will consume one more item of the iterator, as illustrated here:
>>> it = iter(range(20))
>>> for x in it:
... print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
As you can see, we don't iter on each element of our range here... Given your file format it should not be a huge problem but I thought I'd still warn you about it.
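Incidentally, this consuming behaviour can be put to work: since each record in the fasta files shown here is exactly two lines (a '>' header, then one sequence line), zipping an iterator with itself pairs them up. A sketch, untested, assuming no blank lines between records:

from itertools import izip  # on Python 3, use the built-in zip

with open('em_seq.fasta') as f:
    it = iter(f)
    for header, seq in izip(it, it):  # each step consumes two lines
        print header.strip()
        print seq.strip()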
Now your algorithm is far from efficient - for each line of the rbh file it will re-scan the whole ems and egs files again and again.
NB: the following assumes that each emid / egid will appear at most once in the fasta files.
If your ems and egs files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operation in Python)
# warning: totally untested code
def fastamap(path):
    d = dict()
    with open(path) as f:
        for num, line in enumerate(f, 1):
            line = line.strip()
            # skip empty lines.
            if not line:
                continue
            # sanity check: we should only see
            # lines starting with ">", the "value"
            # lines being consumed by the `next(f)` call
            if not line.startswith(">"):
                raise ValueError(
                    "in file %s: line %s doesn't start with '>'" % (
                    path, num
                    ))
            # ok, proceed
            d[line.lstrip(">")] = next(f).strip()
    return d
ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')

with open('rbh_res_eg-not-sec.txt') as rhb:
    for line in rhb:
        parts = line.split("\t")
        emid, egid = parts[0].strip(), parts[1].strip()
        if emid in ems:
            print emid
            print ems[emid]
        if egid in egs:
            print egid
            print egs[egid]
If this doesn't fly because of memory issues, then bad luck, you're stuck with the sequential scan (unless you want to use some database system, but that might be a bit overkill). Still assuming the emid/egid each appear only once in the fasta files, you can at least exit the inner loops once you've found your target:
for l in rbh:
    # no need to split twice, you can just unpack
    # (take the first two of the four columns)
    emid, egid = l.split('\t')[:2]
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
            break # no need to go further
    ems.seek(0) # reset the file pointer
    # etc...

Why does file_object.tell() give the same byte for a file at different positions?

I'm just starting my way into Python and I can't get my head around the basic file navigation methods.
When I read the tell() documentation it states that it returns my current position in the file (in bytes).
My reasoning is that each character of the file adds to the byte coordinate, right? This would mean that after a new line, which is just a string of characters terminated by the \n character, my byte coordinate would change... but that seems to be incorrect.
I generate a quick toy text file on bash
$ for i in {1..10}; do echo "# this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt
and now I will iterate through this file, printing the line number and, on each cycle, the result of the tell() call. The # characters are there to mark some lines that delimit blocks of the file, which I want to locate (see below).
My guess is that the for loop iterates over the file object first, reaching its end, which is why the reported position always remains the same.
This is a toy example; in my real problem the file is gigabytes in length, and applying the same method I get tell() results in blocks, which I imagine reflect how the for loop iterated over the file object.
Is this correct? Could you please shed some light on the concepts I am missing?
My final goal is to be able to locate specific coordinates in a file and then process these huge files in parallel from distributed starting points, which I cannot do with the way I am currently screening for them.
os.path.getsize("toy.txt")
451

fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()

count = 0
for line in fa:
    if line.startswith("#"):
        print line,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1
Output:
# this is the 1th line
tell 451 count 0
# this is the 2th line
tell 451 count 1
# this is the 3th line
tell 451 count 2
# this is the 4th line
tell 451 count 3
# this is the 5th line
tell 451 count 4
# this is the 6th line
tell 451 count 5
# this is the 7th line
tell 451 count 6
# this is the 8th line
tell 451 count 7
# this is the 9th line
tell 451 count 8
# this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19
You are using a for loop to read the file line by line:
for line in fa:
Files don't normally do this; you read blobs of data, usually chunks. In order for Python to give you lines instead, you need to read until the next newline. Only, reading byte by byte to find newlines is not very efficient.
So a buffer is used; you read a large chunk, then find the newlines in that chunk and yield a line for each one you find. Once the buffer is exhausted, you read a new chunk.
Your file is not big enough to need more than one chunk; it is only 451 bytes small, while a buffer is usually measured in kilobytes. If you were to create a larger file, you'd see the file position jump in large steps as you iterate.
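To make the mechanism concrete, here is a rough sketch of what the read-ahead logic does; this is a simplification, not the actual CPython implementation:

def buffered_lines(f, bufsize=8192):
    # read a large chunk at a time, carve complete lines out of it,
    # and keep the trailing partial line for the next chunk; the real
    # file position therefore advances a chunk at a time
    pending = ''
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            break
        pending += chunk
        lines = pending.split('\n')
        pending = lines.pop()  # last piece may be an incomplete line
        for ln in lines:
            yield ln + '\n'
    if pending:
        yield pending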
See the file.next documentation (next is the method responsible for producing the next line when iterating, which is what the for loop uses):
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.
If you need to keep track of the absolute file position while looping over the lines, you'll have to use binary mode if on Windows (to prevent newline translation taking place), and keep track of the line lengths yourself:
position = 0
for line in fa:
    position += len(line)
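Building on that, and given your stated goal of jumping to specific coordinates later, here is a sketch (untested) that records the byte offset of each '#' line so a worker can later seek() straight to it; binary mode keeps the offsets exact:

offsets = []  # byte offsets of the lines starting with '#'
position = 0
with open("toy.txt", "rb") as fa:
    for line in fa:
        if line.startswith("#"):
            offsets.append(position)
        position += len(line)

# later, jump straight to the fourth marked block:
fa = open("toy.txt", "rb")
fa.seek(offsets[3])
print fa.readline(),  # first line of that block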
The alternative is to use the io library; this is the framework used in Python 3 to handle files. The file.tell() method takes the buffer into account and will produce an accurate file position even when iterating.
Take into account that when you use io.open() to open a file in text mode, you'll get unicode strings. In Python 2 you could just use binary mode (open with 'rb') if you must have str bytestrings. In fact, only in binary mode do you get access to IOBase.tell() while iterating; in text mode an exception is thrown:
>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'# this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
In binary mode, you get accurate output for file.tell():
>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> count = 0
>>> for line in fa:
...     if line.startswith("#"):
...         print line,
...         print "tell {} count {}".format(fa.tell(), count)
...     else:
...         if count < 32775:
...             print line,
...             print "tell {} count {}".format(fa.tell(), count)
...     count += 1
...
# this is the 1th line
tell 23 count 0
# this is the 2th line
tell 46 count 1
# this is the 3th line
tell 69 count 2
# this is the 4th line
tell 92 count 3
# this is the 5th line
tell 115 count 4
# this is the 6th line
tell 138 count 5
# this is the 7th line
tell 161 count 6
# this is the 8th line
tell 184 count 7
# this is the 9th line
tell 207 count 8
# this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19
When you iterate over the file, it uses an internal buffer to minimize expensive IO operations, so the file isn't necessarily positioned at the last character the loop saw.

Using multiple genfromtxt on a single file

I'm fairly new to Python and am currently having problems with handling my input file reads. Basically I want my code to take an input file, where the relevant info is contained in blocks of 4 lines. For my specific purpose, I only care about the info in lines 1-3 of each block.
A two-block example of the input I'm dealing with, looks like:
#Header line 1
#Header line 2
'Mn 1', 5130.0059, -2.765, 5.4052, 2.5, 7.8214, 1.5, 1.310, 2.390, 0.500, 8.530,-5.360,-7.630,
' LS 3d6.(5D).4p z6F*'
' LS 3d6.(5D).4d e6F'
'K07 A Kurucz MnI 2007 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 Mn '
'Fe 2', 5130.0127, -5.368, 7.7059, 2.5, 10.1221, 2.5, 1.030, 0.860, 0.940, 8.510,-6.540,-7.900,
' LS 3d6.(3F2).4p y4F*'
' LS 3d5.4s2 2F2'
'RU Kurucz FeII 2013 4 K13 5 RU 4 K13 4 K13 4 K13 4 K13 4 K13 4 K13 4 K13 Fe+ '
I would prefer to store the info from each of these three lines in separate arrays. Since the entries are a mix of strings and floats, I'm using Numpy.genfromtxt to read the input file, as follows:
import itertools
import numpy as np

with open(input_file) as f_in:
    # Opening file, reading every fourth line starting with line 2.
    data = np.genfromtxt(itertools.islice(f_in, 2, None, 4), dtype=None, delimiter=",")
    # Storing lower transition designation:
    low = np.genfromtxt(itertools.islice(f_in, 3, None, 4), dtype=str)
    # Storing upper transition designation:
    up = np.genfromtxt(itertools.islice(f_in, 4, None, 4), dtype=str)
Upon executing the code, genfromtxt correctly reads the information from the file the first time. However, for the second and third call to genfromtxt, I get the following warning
UserWarning: genfromtxt: Empty input file: "<itertools.islice object at 0x102d7a1b0>"
warnings.warn('genfromtxt: Empty input file: "%s"' % fname)
While this is only a warning, the arrays returned by the second and third calls to genfromtxt are empty, not containing strings as expected. If I comment out the second and third calls to genfromtxt, the code behaves as expected.
As far as I understand, the above should be working, and I'm a bit at a loss as to why it doesn't. Ideas?
After the first genfromtxt (well, really the islice), the file iterator has reached the end of the file. Hence the warnings and the empty arrays: the second two islice calls are working on an exhausted iterator.
You'll want to either read the whole file into memory with f_in.readlines() as in hpaulj's answer, or add f_in.seek(0) before each subsequent read to reset the file pointer back to the beginning of the input. The latter is a slightly more memory-friendly solution, which could be important if those files are really huge.
# Note: Untested code follows
with open(input_file) as f_in:
    data = np.genfromtxt(itertools.islice(f_in, 2, None, 4), dtype=None, delimiter=",")
    f_in.seek(0) # Set the file pointer back to the beginning
    low = np.genfromtxt(itertools.islice(f_in, 3, None, 4), dtype=str)
    f_in.seek(0) # Set the file pointer back to the beginning
    up = np.genfromtxt(itertools.islice(f_in, 4, None, 4), dtype=str)
Try this:
with open(input_file) as f_in:
    # Opening file, reading every fourth line starting with line 2.
    lines = f_in.readlines()
    data = np.genfromtxt(lines[2::4], dtype=None, delimiter=",")
    # Storing lower transition designation:
    low = np.genfromtxt(lines[3::4], dtype=str)
    # Storing upper transition designation:
    up = np.genfromtxt(lines[4::4], dtype=str)
I haven't used islice much, but the itertools functions return iterators, which are consumed as you step through them to the end. You have to be careful when calling them repeatedly on the same underlying file. You might be able to make islice work with tee or repeat, but the simplest approach, I think, is to get a list of lines and select the relevant ones with ordinary indexing.
Example with tee:
with open('myfile.txt') as f:
    its = itertools.tee(f, 2)
    print(list(itertools.islice(its[0], 0, None, 2)))
    print(list(itertools.islice(its[1], 1, None, 2)))
Now the file is read once, but can be iterated through twice. Note that tee buffers the items one copy has consumed and the other hasn't, so exhausting one copy completely before touching the other effectively holds the whole file in memory.
