Troubling shooting after combining two python scripts - python

Here is my first code. Using this code I extracted a list of (6800) random elements and saved my results as a text file. (The file that this code is reading from has over 10,000 lines so every time I run it, I get a new set of random elements).
import random
with open('filename.txt') as fin:
lines = fin.readlines()
random.shuffle(lines)
for i, line in enumerate(lines):
if i >= 0 and i < 6800:
print(line, end='')
Here is my second code. Using that saved text file from my previous step, I then use this code to compare the file to another text file. My results are as you can see, is the count; which always varies, e.g 2390 or 4325 etc..
import csv
with open ("SavedTextFile_RandomElements.txt") as f:
dict1 = {}
r = csv.reader(f,delimiter="\t")
for row in r:
a, b, v = row
dict1.setdefault((a,b),[]).append(v)
#for key in dict1:
#print(key[0])
#print(key[1])
#print(d[key][0]])
with open ("filename2.txt") as f:
dict2 = {}
r = csv.reader(f,delimiter="\t")
for row in r:
a, b, v = row
dict2.setdefault((a,b),[]).append(v)
#for key in dict2:
#print(key[0])
count = 0
for key1 in dict1:
for key2 in dict2:
if (key1[0] == key2[0]) and abs((float(key1[1].split(" ")[0])) - (float(key2[1].split(" ")[0]))) < 10000:
count += 1
print(count)
I decided to combine the two, because I want to skip the extracting and saving process and just have the first code run straight into the second having the random elements read automatically. Here is the combined two:
import csv
import random
with open('filename.txt') as fin:
lines = fin.readlines()
random.shuffle(lines)
str_o = " "
for i, line in enumerate(lines):
if i >= 0 and i < 6800:
str_o += line
r = str_o
dict1 = {}
r = csv.reader(fin,delimiter="\t")
for row in r:
a, b, v = row
dict1.setdefault((a,b),[]).append(v)
with open ("filename2.txt") as f:
dict2 = {}
r = csv.reader(f,delimiter="\t")
for row in r:
a, b, v = row
dict2.setdefault((a,b),[]).append(v)
count = 0
for key1 in dict1:
for key2 in dict2:
if (key1[0] == key2[0]) and abs((float(key1[1].split(" ")[0])) - (float(key2[1].split(" ")[0]))) < 1000:
count += 1
print(count)
However, now when I run the code. I always get a count of 0. Even if I change (less than one thousand):
< 1000:
to for example (less than ten thousand):
< 10000:
I am only receiving a count of zero. And I should only receive a count of zero when I write of course less than zero:
< 0:
But no matter what number I put in, I always get zero. I went wrong somewhere. Can you guys help me figure out where that was? I am happy to clarify anything.
[EDIT]
Both of my files are in the following format:
1 10045 0.120559958
1 157465 0.590642951
1 222471 0.947959795
1 222473 0.083341617
1 222541 0.054014337
1 222588 0.060296547

You wanted to combine the two codes, so you really don't need to read from SavedTextFile_RandomElements.txt, right?
In that case you need to read from somewhere, and I think you intended to store those in variable 'r'. But you overwrote 'r' using this:
r = csv.reader(fin,delimiter="\t")
BTW, was there a typo there 2 lines above that line? You didn't have any file open statement for 'fin'. The above combined code must have not been able to run properly (exception thrown).
To fix, simply remove the csv.reader line, like so (and reduce indentation starting dict1={}
r = str_o
dict1 = {}
for row in r:
a, b, v = row.split()
dict1.setdefault((a,b),[]).append(v)
EDIT: another issue, causing ValueError exception
I missed this earlier. You are concatenating the whole file contents into a single string, but you later on loops over r to read each line and break it into a,b,v.
The error to unpack comes from this loop because you are looping over a single string, meaning you are getting each character, instead of each line, per loop.
To fix this, you just need a single list 'r', no need for string:
r = []
for i, line in enumerate(lines):
if i >= 0 and i < 6800:
r.append(line)
dict1 = {}
for row in r:
a, b, v = row.split()
dict1.setdefault((a,b),[]).append(v)
EDIT: Reading 'row' or line into variables
Since input file is split line by line into string separated by whitespaces, you need to split the 'row' var:
a, b, v = row.split()

Can't tell where exactly you went wrong. In your combined code:
You create str_o and assign it to r but you never use it
A few lines later you assign a csv.reader to r - it is hard to tell from your indentation whether this is still within the with block.
You want to be doing something like this (I didn't use the csv module):
import collections, random
d1 = collections.defaultdict(list)
d2 = collections.defaultdict(list)
with open('filename.txt') as fin:
lines = fin.readlines()
lines = random.sample(lines, 6800)
for line in lines:
line = line.strip()
try:
a, b, v = line.split('\t')
d1[(a,b)].append(v)
except ValueError as e:
print 'Error:' + line
with open ("filename2.txt") as f:
for line in f:
line = line.strip()
try:
a, b, v = line.split('\t')
d2[(a,b)].append(v)
except ValueError as e:
print 'Error:' + line

Related

Finding missing lines in file

I have a 7000+ lines .txt file, containing description and ordered path to image. Example:
abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png
abnormal /Users/alex/Documents/X-ray-classification/data/images/2.png
normal /Users/alex/Documents/X-ray-classification/data/images/3.png
normal /Users/alex/Documents/X-ray-classification/data/images/4.png
Some lines are missing. I want to somehow automate the search of missing lines. Intuitively i wrote:
f = open("data.txt", 'r')
lines = f.readlines()
num = 1
for line in lines:
if num in line:
continue
else:
print (line)
num+=1
But of course it didn't work, since lines are strings.
Is there any elegant way to sort this out? Using regex maybe?
Thanks in advance!
the following should hopefully work - it grabs the number out of the filename, sees if it's more than 1 higher than the previous number, and if so, works out all the 'in-between' numbers and prints them. Printing the number (and then reconstructing the filename later) is needed as line will never contain the names of missing files during iteration.
# Set this to the first number in the series -1
num = lastnum = 0
with open("data.txt", 'r') as f:
for line in f:
# Pick the digit out of the filename
num = int(''.join(x for x in line if x.isdigit()))
if num - lastnum > 1:
for i in range(lastnum+1, num):
print("Missing: {}.png".format(str(i)))
lastnum = num
The main advantage of working this way is that as long as your files are sorted in the list, it can handle starting at numbers other than 1, and also reports more than one missing number in the sequence.
You can try this:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
if str(num) not in lines[i]:
missing.append(num)
else:
i += 1
print(missing)
Or if you want to check for the line ending with XXX.png:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
if not lines[i].endswith(str(num) + ".png"):
missing.append(num)
else:
i += 1
print(missing)
Example: here

Python: read line after string is found

I have a file which contains blocks of lines that I would like to separate. Each block contains a number identifier in the block's header: "Block X" is the header line for the X-th block of lines. Like this:
Block X
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
Block Y
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
I can use "enumerate" to find the header line of the block as follows:
with open(filename,'r') as indata:
for num, line in enumerate(indata):
if 'Block X' in line:
startblock=num
print startblock
This will yield the line number of the first line of block #X.
However, my problem is identifying the last line of the block. To do that, I could find the next occurrence of a header line (i.e., the next block) and subtract a few numbers.
My question: how can I find the line number of a the next occurrence of a condition (i.e., right after a certain condition was met)?
I tried using enumerate again, this time indicating the starting value, like this:
with open(filename,'r') as indata:
for num, line in enumerate(indata,startblock):
if 'Block Y ' in line:
endscan=num
break
print endscan
That doesn't work, because it still begins reading the file from line 0, NOT from the line number "startblock". Instead, by starting the "enumerate" counter from a different number, the resulting value of the counter, in this case "endscan" is shifted from 0 by the amount "startblock".
Please, help! How can tell python to disregard the lines previous to "startblock"?
If you want the groups using Block as the delimiter for each section, you can use itertools.groupby:
from itertools import groupby
with open('test.txt') as f:
grp = groupby(f,key=lambda x: x.startswith("Block "))
for k,v in grp:
if k:
print(list(v) + list(next(grp, ("", ""))[1]))
Output:
['Block X\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264\n']
['Block Y\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264']
If Block can appear elsewhere but you want it only when followed by a space and a single char:
import re
with open('test.txt') as f:
r = re.compile("^Block \w$")
grp = groupby(f, key=lambda x: r.search(x))
for k, v in grp:
if k:
print(list(v) + list(next(grp, ("", ""))[1]))
You can use the .tell() and .seek() methods of file objects to move around. So for example:
with open(filename, 'r') as infile:
start = infile.tell()
end = 0
for line in infile:
if line.startswith('Block'):
end = infile.tell()
infile.seek(start)
# print all the bytes in the block
print infile.read(end - start)
# now go back to where we were so we iterate correctly
infile.seek(end)
# we finished a block, mark the start
start = end
If the difference between the header lines is uniform throughout the file, just use the distance to increase the indexing variable accordingly.
file1 = open('file_name','r')
lines = file1.readlines()
numlines = len(lines)
i=0
for line in file:
if line == 'specific header 1':
line_num1 = i
if line == 'specific header 2':
line_num2 = i
i+=1
diff = line_num2 - line_num1
Now that we know the difference between the line numbers we use for loops to acquire the data.
k=0
array = np.zeros([numlines, diff])
for i in range(numlines):
if k % diff == 0:
for j in range(diff):
array[i][j] = lines[i+j]
k+=1
% is the mod operator which returns 0 only when k is a multiple of the difference in line numbers between the two header lines in the file, which will only occur when the line corresponds to the a header line. Once the line is fixed we go on to the second for loop that fills the array so that we have a matrix that is numlines number of rows and a diff number of columns. The nonzeros rows will contain the data inbetween the header lines.
I have not tried this out, I am just writing off the top of my head. Hopefully it helps!

Do something to line and next lines until a symbol is hit

I have data, that is set up as the following:
//Name_1 * *
>a xyzxyzyxyzyxzzxy
>b xyxyxyzxyyxzyxyz
>c xyzyxzyxyzyxyzxy
//Name_2
>a xyzxyzyxyzxzyxyx
>b zxyzxyzxyyzxyxzx
>c zxyzxyzxyxyzyzxy
//Name_3 * *
>a xyzxyzyxyzxzyxyz
>b zxyzxyzxzyyzyxyx
>c zxyzxyzxyxyzyzxy
...
The //-line refers to an ID for the following group of sequences until the next //-line is reached.
I have been working on writing a program, that reads the position of the asterix, and print the characters on the given position for the sequences.
To simplifiy things for myself, I have been working on a subset of my data, containing only one group of sequences, so e.g.:
//Name_1 * *
>a xyzxyzyxyzyxzzxy
>b xyxyxyzxyyxzyxyz
>c xyzyxzyxyzyxyzxy
My program does what I want on this subset.
import sys
import csv
datafile = open(sys.argv[1], 'r')
outfile = open(sys.argv[1]+"_FGT_Data", 'w')
csv_out = csv.writer(outfile, delimiter=',')
csv_out.writerow(['Locus', 'Individual', 'Nucleotide', 'Position'])
with (datafile) as searchfile:
var_line = [line for line in searchfile if '*' in line]
LocusID = [line[2:13].strip() for line in var_line]
poslist = [i for line in var_line for i, x in enumerate(line) if x =='*']
datafile = open(sys.argv[1], 'r')
with (datafile) as getsnps:
lines = [line for line in getsnps.readlines() if line.startswith('>')]
for pos in poslist:
for line in lines:
snp = line[pos]
individual = line[0:7]
indistr = individual.strip()
csv_out.writerow((LocusID[0], indistr, line[pos], str(pos)))
datafile.close()
outfile.close()
However, now I am trying to modify it to work on the full dataset. I am having trouble finding a way to iterate over the data in the correct way.
I need to search through the file, and when a line containing '' is reached, I need to do as in the above code for the sequences corresponding to the given line, and then continue to the next line containing an ''. Do I need to split up my data with regards to the //-lines or what is the best approach?
I have uploaded a sample of my data to dropbox:
Data_Sample.txt contains several groups, and is the kind of data, I am trying to get the program to work on.
Data_One_Group.txt contains only one group, and is the data I have gotten the program to work on so far.
https://www.dropbox.com/sh/3j4i04s2rg6b63h/AADkWG3OcsutTiSsyTl8L2Vda?dl=0
--------EDIT---------
I am trying to implement the suggestion by #Julien Spronck below.
However, I am having trouble processing the produced block. How would I be able to search through the block line for line. E.g., why does the below not work as intended? It just prints the asterix' and not the line itself.
block =''
with open('onelocus.txt', 'r') as searchfile:
for line in searchfile:
if line.startswith('//'):
#print line
if block:
for line in block:
if '*' in line:
print line
block = line
else:
block += line
---------EDIT 2----------
I am getting closer. I understand that fact, that I need to split the string into line, to be able to search through them. The below works on one group, but when I try to itereate over several, it prints the information for the first group only. But does it for as many groups, as there are. I have tried clearing LocusID and poslist before next iteration, but this does not seem to be the solution.
block =''
with (datafile) as searchfile:
for line in searchfile:
if line.startswith('//'):
if block:
var_line = [line for line in block.splitlines() if '*' in line]
LocusID = [line[2:13].strip() for line in var_line]
print LocusID
poslist = [i for line in var_line for i, x in enumerate(line) if x == '*']
print poslist
block = line
else:
block += line
Can't you do something like:
block =''
with open(filename, 'r') as fil:
for line in fil:
if line.startswith('//'):
if block:
do_something_with(block)
block = line
else:
block += line
if block:
do_something_with(block)
In this code, I just append the lines of the file to a variable block. Once I find a line that starts with //, I process the previous block and reinitialize the block for the next iteration.
The last two lines will take care of processing the last block, which would not be processed otherwise.
do_something_with(block) could be something like this:
def do_something_with(block):
lines = block.splitlines()
j = 0
first_line = lines[j]
while first_line.strip() == '':
j += 1
first_line = lines[j]
pos = []
position = first_line.find('*')
while position != -1:
pos.append(position)
position = first_line.find('*', position+1)
for k, line in enumerate(lines):
if k > j:
for p in pos:
print line[p],
print
## prints
## z y
## x z
## z y
I have created a way to make this work with the data you provided.
You should run it with 2 file locations, 1 should be your input.txt and 2 should be your output.csv
explanation
first we create a dictionary with the locus as key and the sequences as values.
We iterate over this dictionary and get the * locations in the locus and append these to a list indexes.
We iterate over the values belonging to this key and extract the sequence
per iteration we iterate over indexes so that we gather the snps.
per iteration we append to our csv file.
We empty the indexes list so we can go to the next key.
Keep in mind
This method is highly dependant on the amount of spaces you have inside your input.txt.
You should know that this will not be the fastest way to get it done. but it does get it done.
I hope this helped, if you have any questions, feel free to ask them, and if I have time, I will happily try to answer them.
script
import sys
import csv
sequences = []
dic = {}
indexes = []
datafile = sys.argv[1]
outfile = sys.argv[2]
with open(datafile,'r') as snp_file:
lines = snp_file.readlines()
for i in range(0,len(lines)):
if lines[i].startswith("//"):
dic[lines[i].rstrip()] = sequences
del sequences[:]
if lines[i].startswith(">"):
sequences.append(lines[i].rstrip())
for key in dic:
locus = key.split(" ")[0].replace("//","")
for i, x in enumerate(key):
if x == '*':
indexes.append(i-11)
for sequence in dic[key]:
seq = sequence.split(" ")[1]
seq_id = sequence.split(" ")[0].replace(">","")
for z in indexes:
position = z+1
nucleotide = seq[z]
with open(outfile,'a')as handle:
csv_out = csv.writer(handle, delimiter=',')
csv_out.writerow([locus,seq_id,position,nucleotide])
del indexes[:]
input.txt
//Locus_1 * *
>Safr01 AATCCGTTTTAAACCAGNTCYAT
>Safr02 TTAATCCGTTTTAAACCAGNTCY
//Locus_2 * *
>Safr01 AATCCGTTTTAAACCAGNTCYAT
>Safr02 TTAATCCGTTTTAAACCAGNTCY
output.csv
Locus_1,Safr01,1,A
Locus_1,Safr01,22,A
Locus_1,Safr02,1,T
Locus_1,Safr02,22,C
Locus_2,Safr01,5,C
Locus_2,Safr01,19,T
Locus_2,Safr02,5,T
Locus_2,Safr02,19,G
This is how I ended up solving the problem:
def do_something_with(block):
lines = block.splitlines()
for line in lines:
if '*' in line:
hit = line
LocusID = hit[2:13].strip()
for i, x in enumerate(hit):
if x=='*':
poslist.append(i)
for pos in poslist:
for line in lines:
if line.startswith('>'):
individual = line[0:7].strip()
snp = line[pos]
print LocusID, individual, snp, pos,
csv_out.writerow((LocusID, individual, snp, pos))
with (datafile) as searchfile:
for line in searchfile:
if line.startswith('//'):
if block:
do_something_with(block)
poslist = list()
block = line
else:
block += line
if block:
do_something_with(block)

Extract from current position until end of file

I want to pull all data from a text file from a specified line number until the end of a file. This is how I've tried:
def extract_values(f):
line_offset = []
offset = 0
last_line_of_heading = False
if not last_line_of_heading:
for line in f:
line_offset.append(offset)
offset += len(line)
if whatever_condition:
last_line_of_heading = True
f.seek(0)
# non-functioning pseudocode follows
data = f[offset:] # read from current offset to end of file into this variable
There is actually a blank line between the header and the data I want, so ideally I could skip this also.
Do you know the line number in advance? If so,
def extract_values(f):
line_number = # something
data = f.readlines()[line_number:]
If not, and you need to determine the line number based on the content of the file itself,
def extract_values(f):
lines = f.readlines()
for line_number, line in enumerate(lines):
if some_condition(line):
data = lines[line_number:]
break
This will not be ideal if your files are enormous (since the lines of the file are loaded into memory); in that case, you might want to do it in two passes, only storing the file data on the second pass.
Your if clause is at the wrong position:
for line in f:
if not last_line_of_heading:
Consider this code:
def extract_values(f):
rows = []
last_line_of_heading = False
for line in f:
if last_line_of_heading:
rows.append(line)
elif whatever_condition:
last_line_of_heading = True
# if you want a string instead of an array of lines:
data = "\n".join(rows)
you can use enumerate:
f=open('your_file')
for i,x in enumerate(f):
if i >= your_line:
#do your stuff
here i will store line number starting from 0 and x will contain the line
using list comprehension
[ x for i,x in enumerate(f) if i >= your_line ]
will give you list of lines after specified line
using dictionary comprehension
{ i:x for i,x in enumerate(f) if i >= your_line }
this will give you line number as key and line as value, from specified line number.
Try this small python program, LastLines.py
import sys
def main():
firstLine = int(sys.argv[1])
lines = sys.stdin.read().splitlines()[firstLine:]
for curLine in lines:
print curLine
if __name__ == "__main__":
main()
Example input, test1.txt:
a
b
c
d
Example usage:
python LastLines.py 2 < test1.txt
Example output:
c
d
This program assumes that the first line in a file is the 0th line.

Reading coordinates from text file in python

I have a text file (coordinates.txt):
30.154852145,-85.264584254
15.2685169645,58.59854265854
...
I have a python script and there is a while loop inside it:
count = 0
while True:
count += 1
c1 =
c2 =
For every run of the above loop, I need to read each line (count) and set c1,c2 to the numbers of each line (separated by the comma). Can someone please tell me the easiest way to do this?
============================
import csv
count = 0
while True:
count += 1
print 'val:',count
for line in open('coords.txt'):
c1, c2 = map(float, line.split(','))
break
print 'c1:',c1
if count == 2: break
The best way to do this would be, as I commented above:
import csv
with open('coordinates.txt') as f:
reader = csv.reader(f)
for count, (c1, c2) in enumerate(reader):
# Do what you want with the variables.
# You'll probably want to cast them to floats.
I have also included a better way to use the count variable using enumerate, as pointed out by #abarnert.
f=open('coordinates.txt','r')
count =0
for x in f:
x=x.strip()
c1,c2 = x.split(',')
count +=1

Categories

Resources