Get non-matching line numbers in Python

Hi, I wrote some simple code in Python to do the following.
I have two files summarizing genomic data. The first file has the names of loci I want to get rid of; it looks something like this:
File_1:
R000002
R000003
R000006
The second file has the names and position of all my loci and looks like this:
File_2:
R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123
What I wish to do is get the line numbers of all loci in File_2 that are not in File_1, so the end result should look like this:
Result:
1
2
3
9
10
11
12
13
I wrote the following simple code, and it gets the job done:
#!/usr/bin/env python
import sys

File1 = sys.argv[1]
File2 = sys.argv[2]
F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')

Loci = []
for line in F1:
    Loci.append(line.strip())

for x, y in enumerate(F2):
    y2 = y.strip().split()
    if y2[0] not in Loci:
        F3.write(str(x+1) + '\n')
However, when I run this on my real data set, where the first file has 58,470 lines and the second has 12,881,010 lines, it seems to take forever. I am guessing that the bottleneck is the
if y2[0] not in Loci:
part, where the code has to scan the whole Loci list once for every line of File_2, but I have not been able to find a speedier solution.
Can anybody help me out and show a more Pythonic way of doing things?
Thanks in advance.

Here's some slightly more Pythonic code that doesn't care whether your files are ordered. I'd prefer to just print everything out and redirect it to a file (./myscript.py > outfile.txt), but you could also pass in another filename and write to that.
#!/usr/bin/env python
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

with open(ignore_f) as f:
    ignore = set(x.strip() for x in f)

with open(loci_f) as f:
    for n, line in enumerate(f, start=1):
        if line.split()[0] not in ignore:
            print(n)
Searching for something in a list is O(n), while it takes only O(1) on average for a set. If order doesn't matter and the items are unique, use a set over a list. While this isn't optimal, it should be roughly O(n) in the total input size instead of O(n × m) like your code, where n and m are the line counts of the two files.
You're also not closing your files; when reading that isn't a big deal, but when writing it is. I use context managers (with statements) so Python does that for me.
Style-wise, use descriptive variable names, and avoid UpperCase names; those are typically used for classes (see PEP 8).
If your files are ordered, you can step through them together, ignoring lines where the loci names are the same; when they differ, take another step in your ignore file and recheck.
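For illustration, here is a minimal sketch of that lockstep walk, assuming both files are sorted by locus name in the same lexicographic order (the variable names follow the code above):
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

with open(ignore_f) as f_ignore, open(loci_f) as f_loci:
    current = next(f_ignore, '').strip()  # next locus name to ignore
    for n, line in enumerate(f_loci, start=1):
        name = line.split()[0]
        # Advance the ignore file past names that sort before this locus
        while current and current < name:
            current = next(f_ignore, '').strip()
        if name != current:
            print(n)
On the sample data this prints 1, 2, 3, 9, 10, 11, 12, 13, and it never holds more than one ignore entry in memory.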

To make the searching for matches more efficient, you can simply use a set instead of a list:
Loci = set()
for line in F1:
    Loci.add(line.strip())
The rest should work the same, but faster.
Even more efficient would be to walk down the files in a sort of lockstep, since they're both sorted, but that will require more code and maybe isn't necessary.

Related

python merge files by rules

I need to write a script in Python that accepts and merges 2 files into a new file according to the following rules:
1) Take 1 word from the 1st file followed by 2 words from the second file.
2) When we reach the end of one file, copy the rest of the other file to the merged file without change.
I wrote the script below, but I only managed to read 1 word from each file.
A complete script would be nice, but I really want to understand in words how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines
Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]
lstFile2b = lstFile2[1::2]
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2) // 2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2*i])
    outfile.write(lstFile2[2*i + 1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
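For completeness, here is a sketch of the pairing approach mentioned above, including one way to handle the leftovers; the helper name merge_words is my own, not from the original answer:
def merge_words(words1, words2):
    # Pair up the second list: [a, b, c, d] -> [(a, b), (c, d)]
    pairs = list(zip(words2[0::2], words2[1::2]))
    merged = []
    for w, (a, b) in zip(words1, pairs):
        merged.extend([w, a, b])
    # Copy whatever is left of the longer input, per rule 2
    used = min(len(words1), len(pairs))
    merged.extend(words1[used:])
    merged.extend(words2[used * 2:])
    return merged

# e.g. merge_words(['x', 'y'], ['a', 'b', 'c']) -> ['x', 'a', 'b', 'y', 'c']
How exactly to interleave the tail when the second file ends on an odd word is a judgment call; adjust the two extend calls to taste.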

Add missing lines in file with python

I am a beginner when it comes to programming and Python.
So apologies if this is kind of a simple question.
But I have large files that for example contain lines like this:
10000 7
20000 1
30000 2
60000 3
What I want to have, is a file that also contains the 'missing' lines, like this:
10000 7
20000 1
30000 2
40000 0
50000 0
60000 3
The files are rather large, as I am working with whole-genome sequence data. The first column is basically a position in the genome and the second column is the number of SNPs I find within that 10 kb window. However, I don't think this information is even relevant; I just want to write simple Python code that will add these lines to the file using if/else statements.
So if a position does not match the position of the previous line + 10000, the 'missing line' is written; otherwise the normal occurring line is written.
I just foresee one problem in this, namely when several lines in a row are missing (as in my example).
Does anyone have a smart solution for this simple problem?
Many thanks!
How about this:
# Replace lines.txt with your actual file
with open("lines.txt", "r") as file:
    last_line = 0
    lines = []
    for line in file:
        # split on any whitespace (tabs or spaces)
        num1, num2 = [int(i) for i in line.split()]
        while num1 != last_line + 10000:
            # A line is missing
            lines.append((last_line + 10000, 0))
            last_line += 10000
        lines.append((num1, num2))
        last_line = num1

for num1, num2 in lines:
    # You should print to a different file here
    print(num1, num2)
Instead of the last print statement you would write the values to a new file.
Edit: I ran this code on this sample. Output below.
lines.txt
10000 7
20000 1
30000 2
60000 3
Output
10000 7
20000 1
30000 2
40000 0
50000 0
60000 3
I would suggest a program along the following lines. You keep track of the genome position you saw last (it would be 0 at the start, I guess). Then you read lines from the input file, one by one. For each one, you output first any missing lines (from the previous genome position + 10kb, in 10kb steps, to 10kb before the new line you've read) and then the line you have just read.
In other words, the tiny thing you're missing is that when "the position does not match the position of the previous line + 10000", you should have a little loop to generate the missing output, rather than just writing out one line. (The following remark may make no sense until you actually start writing the code: You don't actually need to test whether the position matches; if you write it right, you will find that when it matches your loop outputs no extra lines)
For various good reasons, the usual practice here is not to write the code for you :-), but I hope the above will help.
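For readers who want to check their attempt afterwards, here is a minimal sketch of the loop described above; the file names are placeholders:
STEP = 10000

with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    last_pos = 0
    for line in infile:
        pos, count = (int(x) for x in line.split())
        # Emit any missing windows; if nothing is missing, this runs zero times
        for missing in range(last_pos + STEP, pos, STEP):
            outfile.write('{} 0\n'.format(missing))
        outfile.write('{} {}\n'.format(pos, count))
        last_pos = pos
Note that there is no explicit "does the position match" test: when nothing is missing, the inner range is empty and no extra lines are written.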
from collections import defaultdict

d = defaultdict(int)
with open('file1.txt') as infile:
    for l in infile:
        pos, count = l.split()
        d[int(pos)] = int(count)

with open('file2.txt', 'w') as outfile:
    for i in range(10000, int(pos) + 1, 10000):
        outfile.write('{}\t{}\n'.format(i, d[i]))
Here's a quick version. We read the file into a defaultdict. When we access the values later, any key that doesn't have an associated value will get the default value of zero. Then we take every number from 10000 to pos, where pos is the last position read from the first file, in steps of 10000. We look these values up in the defaultdict and write them to the second file.
I would use a defaultdict, which will use 0 as the default value. You read your file into this defaultdict, then iterate over the keys manually and write the result back to the file.
It will look somewhat like this:
from collections import defaultdict

x = defaultdict(int)
with open(filename) as f:
    for line in f:
        data = line.split()
        x[int(data[0])] = int(data[-1])

with open(filename, 'w') as f:
    for i in range(10000, max(x.keys()) + 1, 10000):
        f.write('{}\t{}\n'.format(i, x[i]))

Line has wrong number of columns, but I can't find which line

I have this very big text file (about 2.5 GB), which I need to load and put into a numpy array of 2 columns using Python. Somewhere in the text file the number of columns seems to be wrong, so numpy can't load it.
I am trying to find out where exactly this happens, so I can fix it. However, the line number I get is not much help. I would like to get the first value of the line.
The file looks like this:
1.001 1
1.002 0
1.003 3
1.004 1
etc...
I am opening the file like this:
import numpy as np

with open('paths 8_10.txt', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            t = data[:, 0]
            x = data[:, 1]
So I would like the t value at the location where the program crashes.
I was thinking about a for-loop which prints the value up until where it stops loading, but I can't get it to work.
If speed is not an issue, I suggest you write a small test harness as follows:
import csv

with open('paths 8_10.txt') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip()) as data_file:
            csv_reader = csv.reader(data_file, delimiter=' ')
            for line_number, line in enumerate(csv_reader, start=1):
                if len(line) != 2:
                    print("Line {} has {} columns: {}".format(line_number, len(line), line))
This would let you identify which entries need fixing for use in your main script.
If needed, this approach could easily be extended to skip over erroneous lines or truncate the extra columns and write out the file automatically, thus fixing it for future use.
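Since the asker specifically wants the first value (t) of the offending line, a plain-Python scan without csv works too. A sketch, with the function name find_bad_lines being my own:
def find_bad_lines(path):
    # Report the line number and first field of every row
    # that does not have exactly two columns.
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            fields = line.split()
            if len(fields) != 2:
                first = fields[0] if fields else '<empty line>'
                print('Line {}: first value {}, {} columns'.format(
                    line_number, first, len(fields)))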

Read multiple files with fileinput at a certain line

I have multiple files which I need to open and read (I thought it might be easier with fileinput.input()). Those files start with irrelevant information; what I need is everything below a specific header line, ID[tab]NAME[tab]GEO[tab]FEATURE (sometimes from line 32, but unfortunately sometimes at any other line). I then want to store those rows in a list ("entries"):
ID[tab]NAME[tab]GEO[tab]FEATURE
1 aa us A1
2 bb ko B1
3 cc ve C1
.
.
.
Right now I read from line 32 onwards (see code below), but I would like to start reading from the header line instead. Is it possible to do this with fileinput, or am I going the wrong way? Is there another, simpler way to do this? Here is my code so far:
entries = list()
for line in fileinput.input():
    if fileinput.filelineno() > 32:
        entries.append(line.strip().split("\t"))
I'm trying to implement this idea with Python 3.2
UPDATE:
Here is how my code looks now, but it still goes out of range. I need to add some of the entries to a dictionary. Am I missing something?
filelist = fileinput.input()
entries = []
for fn in filelist:
    for line in fn:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            break
    entries.extend(line.strip().split("\t") for line in fn)

dic = collections.defaultdict(set)
for e in entries:
    dic[e[1]].add(e[3])
Error:
dic[e[1]].add(e[3])
IndexError: list index out of range
Just iterate through the file looking for the marker line and add everything after that to the list.
EDIT Your second problem happens because not all of the lines in the original file split to at least 4 fields; a blank line, for instance, results in an empty list, so e[1] is invalid. I've updated the example with a nested iterator that filters out lines that are not the right size. You may want to do something different (maybe strip empty lines but otherwise assert that the remaining lines must split to exactly 4 columns), but you get the idea:
entries = []
for fn in filelist:
    with open(fn) as fp:
        for line in fp:
            if line.strip() == 'ID\tNAME\tGEO\tFEATURE':
                break
        # entries.extend(line.strip().split('\t') for line in fp)
        entries.extend(items for items in (line.strip().split('\t') for line in fp) if len(items) >= 4)
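Since the asker started from fileinput, here is a hedged sketch of the same marker-line idea that stays with fileinput; isfirstline() resets the skip flag whenever a new file begins:
import fileinput

entries = []
in_data = False
for line in fileinput.input():
    if fileinput.isfirstline():
        in_data = False  # a new file started: wait for its header again
    if not in_data:
        if line.strip() == "ID\tNAME\tGEO\tFEATURE":
            in_data = True
        continue
    fields = line.strip().split("\t")
    if len(fields) >= 4:  # skip blank or short lines
        entries.append(fields)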

Trying to figure out the correct loop for processing multiple lines into a dictionary

grocery_stock.txt contains
lemonade 6 1.1
bread 34 1.43
chips 47 3.76
banans 16 0.79
pizza 15 5.0
This is the code I have written for it so far:
infile = open("grocery_stock.txt", 'r+')
lines = infile.readlines()
line = infile.readline()
none = ' '
count = 0
index = 0
while line != none:
    line1 = infile.readline()
    while line1 in lines:
        line1 = line1.split()
        name1 = str(line1[:0])
        quant1 = (str(line1[:1]))
        price1 = [(str(line1[:2]))]
        grocerystock[name1[0]] = (quant1, name1)
        print(grocerystock)
        line2 = infile.readline()
        for line2 in line:
            line1 = line2.split()
            name1 = str(line1[0])
            quant1 = (str(line1[1]))
            price1 = [(str(line1[2]))]
            grocerystock[name1[1]] = (quant1, name1)
            print(line1[1], line[2], line1[0])
            print(grocerystock)
    line3 = infile.readline()
    line4 = infile.readline()
    line5 = infile.readline()
infile.close()
grocerystock = {}
The reason I am doing this is because later in my project I'm going to have to remove some keys and change some values, so I want a function that I can call anywhere in my program to read a file and convert the data into a dictionary.
My loops might look crazy to you, but I was at the point where I was just trying anything that popped into my head.
Also, as you can see, I haven't finished going through line5; I thought it would be better to figure out the correct loop rather than type random loops and see what happens.
Thank you in advance.
When you use open() and you get a file object, you can just iterate on the file object. Like so:
for line in f:
    # do something with the line
I recommend the above rather than calling f.readline() in a loop.
f.readlines() will read the entire file into memory, and build a list of input lines. For very large files, this can cause performance problems. For a small file such as you are using here, it will work, but if you learn the standard Python idiom you can use it for small files or for large ones.
Perhaps this would help:
with open('grocery_stock.txt') as f:
    lines = f.readlines()

# filter out empty lines
lines = [line for line in lines if line.strip() != '']
# split all lines
lines = [line.split() for line in lines]
# convert to a dictionary
grocerystock = dict((a, (b, c)) for a, b, c in lines)
# print
for k, v in grocerystock.items():
    print(k, v)
The only loop you need is the loop to move the data in line-by-line. That can be accomplished with this:
with open("grocery_stock.txt", "r+") as f:
for line in f: # Changed to be a more elegant solution; no need to use while here.
# Here we would split (hint) the line of text.
dict[split_text[0]] = [split_text[1], split_text[2]]
Since it's homework, I encourage you to look into this solution as well as others. I can't just give you the answer, now can I?
Just some hints, since this is HW:
try to use a context manager to open files, that is, a with statement;
even if the while is not wrong, I'd prefer a for loop in this case (the number of iterations is fixed: the number of lines). I sometimes think of while loops as the gateway to the halting problem ...
try readlines(), since the number of lines is probably small, or use something like this (which just looks natural):
for line in f:
    # do something with line
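Putting the hints above together, here is a minimal sketch of the reusable helper the asker describes; the function name read_grocery_stock is my own, and it assumes three whitespace-separated columns per line:
def read_grocery_stock(filename):
    grocerystock = {}
    with open(filename) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:  # skip blank or malformed lines
                name, quantity, price = fields
                grocerystock[name] = (int(quantity), float(price))
    return grocerystock

grocerystock = read_grocery_stock("grocery_stock.txt")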
