Add missing lines in file with Python

I am a beginner when it comes to programming and Python, so apologies if this is kind of a simple question.
I have large files that, for example, contain lines like this:
10000 7
20000 1
30000 2
60000 3
What I want to have, is a file that also contains the 'missing' lines, like this:
10000 7
20000 1
30000 2
40000 0
50000 0
60000 3
The files are rather large as I am working with whole genome sequence data. The first column is basically a position in the genome and the second column is the number of SNPs I find within that 10kb window. However, I don't think this information is even relevant; I just want to write a simple Python script that will add these lines to the file using if/else statements.
So if the position does not match the position of the previous line + 10000, the 'missing line' is written; otherwise the normally occurring line is written.
I just foresee one problem in this, namely when several lines in a row are missing (as in my example).
Does anyone have a smart solution for this simple problem?
Many thanks!

How about this:
# Replace lines.txt with your actual file
with open("lines.txt", "r") as file:
    last_line = 0
    lines = []
    for line in file:
        num1, num2 = [int(i) for i in line.split("\t")]
        while num1 != last_line + 10000:
            # A line is missing
            lines.append((last_line + 10000, 0))
            last_line += 10000
        lines.append((num1, num2))
        last_line = num1

for num1, num2 in lines:
    # You should print to a different file here
    print(num1, num2)
Instead of the last print statement you would write the values to a new file.
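For example, writing to a new file could look like this (the output file name here is only an illustration):
# Write the filled-in lines to a new file instead of printing them
with open("lines_filled.txt", "w") as out:
    for num1, num2 in lines:
        out.write("{}\t{}\n".format(num1, num2))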
Edit: I ran this code on this sample. Output below.
lines.txt
10000 7
20000 1
30000 2
60000 3
Output
10000 7
20000 1
30000 2
40000 0
50000 0
60000 3

I would suggest a program along the following lines. You keep track of the genome position you saw last (it would be 0 at the start, I guess). Then you read lines from the input file, one by one. For each one, you output first any missing lines (from the previous genome position + 10kb, in 10kb steps, to 10kb before the new line you've read) and then the line you have just read.
In other words, the tiny thing you're missing is that when "the position does not match the position of the previous line + 10000", you should have a little loop to generate the missing output, rather than just writing out one line. (The following remark may make no sense until you actually start writing the code: you don't actually need to test whether the position matches; if you write it right, you will find that when it matches, your loop outputs no extra lines.)
For various good reasons, the usual practice here is not to write the code for you :-), but I hope the above will help.

from collections import defaultdict

d = defaultdict(int)
with open('file1.txt') as infile:
    for l in infile:
        pos, count = l.split()
        d[int(pos)] = int(count)

# pos still holds the last position read; open the output file for writing
with open('file2.txt', 'w') as outfile:
    for i in range(10000, int(pos) + 1, 10000):
        outfile.write('{}\t{}\n'.format(i, d[i]))
Here's a quick version. We read the file into a defaultdict. When we access the values later, any key that doesn't have an associated value will get the default value of zero. Then we take every number in the range 10000 to pos where pos is the last position in the first file, taken in steps of 10000. We access these values in the defaultdict and write them to the second file.

I would use a defaultdict, which will use 0 as the default value.
So you read your file into this defaultdict, then iterate over the expected keys manually and write the result back to a file.
It will look somewhat like this:
from collections import defaultdict

x = defaultdict(int)
with open(filename) as f:
    for line in f:
        data = line.split()
        x[int(data[0])] = int(data[-1])

with open(filename, 'w') as f:
    for i in range(10000, max(x.keys()) + 1, 10000):
        f.write('{}\t{}\n'.format(i, x[i]))

Related

Too many values to unpack in python: Caused by the file format

I have two files, which have two columns as following:
file 1
------
main 46
tag 23
bear 15
moon 2
file 2
------
main 20
rocky 6
zoo 4
bear 2
I am trying to compare the first 2 rows of each file and, if some words are the same, sum up the numbers and write them to a new file.
I read the file and used a for loop to go through each line, but it returns a ValueError: too many values to unpack.
import os
from itertools import islice

DIR = r'dir'
for filename in os.listdir(DIR):
    with open(os.path.sep.join([DIR, filename]), 'r') as f:
        for i in range(2):
            line = f.readline().strip()
            word, freq = line.split()
            print(word)
            print(freq)
In the file, there is an extra empty line after each line of text. I searched for the \n, but nothing was there.
Then I removed the empty lines manually and it worked.
If you don't know how many items you have in the line, then you can't use the nice unpack facility. You'll need to split and check how many you got. For instance:
with open(os.path.sep.join([DIR, filename]), 'r') as f:
    for line in f:
        data = line.split()
        if len(data) >= 2:
            word, count = data[:2]
This will get you the first two fields of any line containing at least that many. Since you haven't specified what to do with other lines or extra fields, I'll leave that (the else part) up to you. I've also left out the strip part to match the existing code; line input and split will get rid of newlines and spaces, but not necessarily all whitespace.
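If the end goal is the comparison described above (summing the numbers for words that appear in the first two rows of both files), a sketch along the following lines might help; the file names here are placeholders:
from collections import Counter

counts = []
for filename in ('file1.txt', 'file2.txt'):
    per_file = Counter()
    with open(filename) as f:
        for _ in range(2):              # only the first two rows of each file
            data = f.readline().split()
            if len(data) >= 2:
                per_file[data[0]] += int(data[1])
    counts.append(per_file)

# Words present in both files, with their numbers summed
common = counts[0].keys() & counts[1].keys()
with open('summed.txt', 'w') as out:
    for word in common:
        out.write('{} {}\n'.format(word, counts[0][word] + counts[1][word]))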

Python multiple pairs replace in txt file

I've got 2 .txt files, the first one is organized like this:
1:NAME1
2:NAME2
3:NAME3
...
and the second one like this:
1
1
1
2
2
2
3
What I want to do is to substitute every line in .txt 2 according to pairs in .txt 1, like this:
NAME1
NAME1
NAME1
NAME2
NAME2
NAME2
NAME3
Is there a way to do this? I was thinking of cleaning up the first txt by deleting the 1:, 2:, 3: and reading it as an array, then making a loop for i in range(1, number of lines in txt 1), and then finding the lines in txt 2 containing i and substituting the i-th element of the array. But of course I have no idea how to do this.
As Rodrigo commented, there are many ways to implement it, but storing the names in a dictionary is probably the way to go.
# Read the names
with open('names.txt') as f_names:
    names = dict(line.strip().split(':') for line in f_names)

# Read the numbers
with open('numbers.txt') as f_numbers:
    numbers = list(line.strip() for line in f_numbers)

# Replace numbers with names
with open('numbers.txt', 'w') as f_output:
    for n in numbers:
        f_output.write(names[n] + '\n')
This should do the trick. It reads the first file and stores the key/value pairs in a dict. That dict is then used to output the value for every key found in the second file.
But... if you want to prevent those downvotes, it is better to post a code snippet of your own to show what you have tried. Right now your question is a red flag for the horde of SO'ers who downvote everything that has no code in it. They even downvote answers because there is no code in the question...
lookup = {}
with open("first.txt") as fifile:
    for line in fifile:
        key, name = line.strip().split(":")
        lookup[key] = name

with open("second.txt") as sifile:
    with open("output.txt", "w") as ofile:
        for line in sifile:
            ofile.write("{}\n".format(lookup[line.strip()]))

What is the most efficient way of looping over each line of a file?

I have a file, dataset.nt, which isn't too large (300Mb). I also have a list, which contains around 500 elements. For each element of the list, I want to count the number of lines in the file which contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/" + re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total + 1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
Performance didn't improve; the script still ran indefinitely. So I googled around and tried this:
mydict = {}
file = open("dataset.nt", "rb")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]
with open("dataset.nt", "r") as input:   # text mode, so the str patterns apply
    for line in input:
        # any match has to contain the "/Main/" part -> check it's there
        # that may help a lot or not at all, depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As @matsjoyce already suggested, this avoids re-compiling the regex on each iteration.
If you really need that many different regex patterns, then I don't think there's much you can do.
Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare that against your list. That may help reduce the number of "real" regex searches.
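A rough sketch of that capture idea, assuming everything you care about runs from "/Main/" up to the next whitespace character (adjust the pattern if your data looks different):
import re

main_re = re.compile(r"/Main/(\S+)")
targets = set(mylist)               # set membership tests are O(1)
mydict = dict.fromkeys(mylist, 0)

with open("dataset.nt") as infile:
    for line in infile:
        m = main_re.search(line)
        if m and m.group(1) in targets:
            mydict[m.group(1)] += 1
Note that this counts exact tokens only; if your list entries can be prefixes of longer tokens, the per-pattern search above is the safer option.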
Looks like a good candidate for some map/reduce-like parallelisation... You could split your dataset file into N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results; a rough sketch of this follows the code below.
This of course doesn't prevent you from first optimizing the scan, i.e. (based on sebastian's code):
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)
with open("dataset.nt", "r") as input:   # text mode, so the str patterns apply
    for line in input:
        # any match has to contain the "/Main/" part -> check it's there
        # that may help a lot or not at all, depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be better optimized if you posted a sample from your dataset. If, for example, your dataset can be sorted on "/Main/{i}" (using the system sort program, say), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
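As for the map/reduce suggestion above, here is a very rough sketch of how it could look with multiprocessing; the helper names are made up, and it assumes the file's lines fit comfortably in memory (300Mb should):
import re
from collections import Counter
from multiprocessing import Pool, cpu_count

def count_chunk(args):
    # Count, for one chunk of lines, how many lines match each target
    lines, targets = args
    patterns = [(t, re.compile(r"/Main/" + re.escape(t))) for t in targets]
    counts = Counter()
    for line in lines:
        if '/Main/' not in line:
            continue
        for t, regex in patterns:
            if regex.search(line):
                counts[t] += 1
    return counts

def parallel_count(path, targets, n_procs=None):
    n_procs = n_procs or cpu_count()
    with open(path) as f:
        lines = f.readlines()
    chunk_size = len(lines) // n_procs + 1
    chunks = [(lines[i:i + chunk_size], targets)
              for i in range(0, len(lines), chunk_size)]
    with Pool(n_procs) as pool:
        partial = pool.map(count_chunk, chunks)
    total = Counter()
    for part in partial:
        total.update(part)
    return total

if __name__ == '__main__':
    mylist = [...]  # your ~500 elements, as in the question
    mydict = parallel_count("dataset.nt", mylist)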
The other solutions are very good. But since there is a regex for each element, and it is not important if the element appears more than once per line, you could count the lines containing the target expression using re.findall.
Also, past a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...] # A list with 500 items

# Optimizing calls
findall = re.findall # so Python doesn't have to resolve these attributes on every call
escape = re.escape

with open("dataset.nt", "r") as input:
    # Read the file once and keep it in memory instead of reading line by line.
    # If the number of lines is big, this is faster.
    text = input.read()
    for elem in mylist:
        # Count the lines where the target regex appears
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))
I tested this with a file of 800Mb (I wanted to see how long it takes to load a file that big into memory; it is faster than you would think).
I didn't test the whole code with real data, just the findall part.

get non-matching line numbers python

Hi, I wrote some simple code in Python to do the following:
I have two files summarizing genomic data. The first file has the names of the loci I want to get rid of; it looks something like this:
File_1:
R000002
R000003
R000006
The second file has the names and position of all my loci and looks like this:
File_2:
R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123
What I wish to do is get all the corresponding line numbers of loci from File2 that are not in File1, so the end result should look like this:
Result:
1
2
3
9
10
11
12
13
I wrote the following simple code and it gets the job done
#!/usr/bin/env python
import sys

File1 = sys.argv[1]
File2 = sys.argv[2]

F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')

Loci = []
for line in F1:
    Loci.append(line.strip())

for x, y in enumerate(F2):
    y2 = y.strip().split()
    if y2[0] not in Loci:
        F3.write(str(x+1) + '\n')
However, when I run this on my real data set, where the first file has 58470 lines and the second file has 12881010 lines, it seems to take forever. I am guessing that the bottleneck is in the
if y2[0] not in Loci:
part, where the code has to search through the whole Loci list for every line of File_2, but I have not been able to find a speedier solution.
Can anybody help me out and show a more Pythonic way of doing things?
Thanks in advance
Here's some slightly more Pythonic code that doesn't care if your files are ordered. I'd prefer to just print everything out and redirect it to a file ./myscript.py > outfile.txt, but you could also pass in another filename and write to that.
#!/usr/bin/env python
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

with open(ignore_f) as f:
    ignore = set(x.strip() for x in f)

with open(loci_f) as f:
    for n, line in enumerate(f, start=1):
        if line.split()[0] not in ignore:
            print(n)
Searching for something in a list is O(n), while it takes only O(1) for a set. If order doesn't matter and you have unique items, use a set over a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code.
You're also not closing your files; when reading that isn't that big of a deal, but when writing it is. I use context managers (with) so Python does that for me.
Style-wise, use descriptive variable names, and avoid UpperCase names; those are typically used for classes (see PEP 8).
If your files are ordered, you can step through them together, ignoring lines where the loci names are the same, then when they differ, take another step in your ignore file and recheck.
To make the searching for matches more efficient, you can simply use a set instead of a list:
Loci = set()
for line in F1:
    Loci.add(line.strip())
The rest should work the same, but faster.
Even more efficient would be to walk down the files in a sort of lockstep, since they're both sorted, but that will require more code and maybe isn't necessary.
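For what it's worth, a rough sketch of that lockstep walk (it assumes both files are sorted by locus name, as in the samples above, and that the zero-padded names compare correctly as strings):
def non_matching_line_numbers(ignore_path, loci_path):
    with open(ignore_path) as f_ignore, open(loci_path) as f_loci:
        current_ignore = next(f_ignore, '').strip()
        for n, line in enumerate(f_loci, start=1):
            locus = line.split()[0]
            # advance the ignore file until its entry is >= the current locus
            while current_ignore and current_ignore < locus:
                current_ignore = next(f_ignore, '').strip()
            if locus != current_ignore:
                print(n)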

How to count the number of characters in a file (not using the len function)?

Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
    print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
There are two problems.
First, for each line in the file, you're passing f itself—that is, a sequence of lines—to length. That's why it's printing the number of lines in the file. The length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data—that is, a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
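When it isn't, one middle ground is to read the file in fixed-size chunks rather than line by line or all at once (the chunk size below is arbitrary):
total_length = 0
with open(filename) as f:
    while True:
        chunk = f.read(65536)   # read up to 64 KB at a time
        if not chunk:
            break
        total_length += length(chunk)
print(total_length)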
When you iterate over a file (opened in text mode), you are iterating over its lines.
for data in f: could be rewritten as for line in f:, which makes it easier to see what it is doing.
Your length function looks like it should work, but you are sending the open file to it instead of each line.
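Putting that together, a corrected version of the original script might look roughly like this (keeping the hand-written length function):
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
total = 0
with open(filename) as f:
    for line in f:
        total += length(line)
print(total)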
