Print lines between line numbers from a large file - python

I have a very large text file, more than 30 GB in size. For various reasons, I want to read lines between 1000000 and 2000000 and compare each with a user-input string. If it matches, I need to write the line content to another file.
I know how to read a file line by line.
input_file = open('file.txt', 'r')
for line in input_file:
    print line
But if the file is large, reading it this way really affects performance, right? How can I address this in an optimized way?

You can use itertools.islice:
from itertools import islice
with open('file.txt') as fin:
    lines = islice(fin, 1000000, 2000000)  # or whatever ranges
    for line in lines:
        pass  # do something with each line
Of course, if your lines are a fixed length, you can use that to fin.seek() directly to the start of a given line. Otherwise, the approach above still has to read the first n lines before islice starts producing output; it is just a convenient way to limit the range.
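For the fixed-length case, here is a rough sketch (assuming, purely for illustration, that every line is exactly RECORD_LEN bytes including its newline; the constant is hypothetical):
RECORD_LEN = 80                    # hypothetical fixed line length, newline included
start, stop = 1000000, 2000000
with open('file.txt', 'rb') as fin:
    fin.seek(start * RECORD_LEN)   # jump straight to the first wanted line
    for _ in range(stop - start):
        line = fin.read(RECORD_LEN)
        if not line:               # hit end of file early
            break
        # compare line with the user's string here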

You could use linecache.
Let me cite from the docs: "The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.":
import linecache
for i in xrange(1000000, 2000000):
    print linecache.getline('file.txt', i)

Do all your lines have the same size? If that were the case, you could probably use seek() to jump directly to the first line you are interested in. Otherwise, you're going to have to iterate through the entire file because there is no way of telling in advance where each line starts:
input_file = open('file.txt', 'r')
for index, line in enumerate(input_file):
    # Assuming you start counting from zero
    if 1000000 <= index <= 2000000:
        print line
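If nothing after line 2000000 is needed, a small variant of the same loop (just an extra break, not part of the original answer) stops reading as soon as the range has been passed:
with open('file.txt', 'r') as input_file:
    for index, line in enumerate(input_file):
        if index > 2000000:
            break               # past the range: stop reading the rest of the file
        if index >= 1000000:
            print(line)         # or compare with the user's string here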
For small files, the linecache module can be useful.

If you're on Linux, have you considered using the os.system or commands Python modules to directly execute shell commands like sed, awk, head or tail to do this?
Running the command os.system("tail -n+50000000 test.in | head -n10") will read lines 50,000,000 to 50,000,010 from the file test.in. This post on Stack Overflow discusses different ways of calling commands; if performance is key, there may be more efficient methods than os.system.
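For example, a sketch using the subprocess module instead of os.system, so the selected lines come back into Python for the comparison step (the command and file name mirror the example above):
import subprocess

# Run the same pipeline, but capture its output so the lines can be
# compared against the user's string or written to another file.
output = subprocess.check_output("tail -n+50000000 test.in | head -n10", shell=True)
for line in output.decode().splitlines():
    print(line)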
This discussion on unix.stackexchange discusses in-depth how to select specific ranges of a text file using the command line:
The setup: a 100,000,000-line file generated by seq 100000000 > test.in, reading lines 50,000,000-50,000,010. Tests were run in no particular order; times are real time as reported by bash's builtin time. The combination of tail and head, or using sed, seems to offer the quickest solutions:
4.373 4.418 4.395 tail -n+50000000 test.in | head -n10
5.210 5.179 6.181 sed -n '50000000,50000010p;57890010q' test.in
5.525 5.475 5.488 head -n50000010 test.in | tail -n10
8.497 8.352 8.438 sed -n '50000000,50000010p' test.in
22.826 23.154 23.195 tail -n50000001 test.in | head -n10
25.694 25.908 27.638 ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574 awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127 awk 'NR >= 57890000 && NR <= 57890010' test.in

Generally, you cannot just jump to line number x in a file, because text lines have variable length, so each one can occupy anything between one byte and a gazillion.
However, if you expect to seek in those files very often, you can index them, recording in a separate file the byte offset at which, say, every thousandth line starts. Then you can open the file, use file.seek() to jump to the part you are interested in, and start iterating from there.
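A rough sketch of that idea (the chunk size of 1000 lines and the file name are just placeholders):
CHUNK = 1000
offsets = []                       # offsets[k] = byte offset where line k*CHUNK starts
with open('file.txt', 'rb') as f:
    offset = 0
    for lineno, line in enumerate(f):
        if lineno % CHUNK == 0:
            offsets.append(offset)
        offset += len(line)

# Later, jump close to line N and iterate only from there.
N = 1000000
with open('file.txt', 'rb') as f:
    f.seek(offsets[N // CHUNK])
    for lineno, line in enumerate(f, start=(N // CHUNK) * CHUNK):
        if lineno < N:
            continue               # still before the range; skip ahead
        if lineno > 2000000:
            break                  # past the range; stop
        # do something with line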

The best way I found is:
lines_data = []
text_arr = multilinetext.split('\n')
for i in range(line_number_begin, line_number_end):
    lines_data.append(text_arr[i])


How to concatenate sequences in the same multiFASTA files and then print result to a new FASTA file?

I have a folder with over 50 FASTA files, each with anywhere from 2-8 FASTA sequences within them. Here's an example:
testFOR.id_AH004930.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGA
>AH004930|2:237-401_Miopithecus_talapoin
GGGT
>AH004930|2:502-580_Miopithecus_talapoin
CTTTGCT
>AH004930|2:681-747_Miopithecus_talapoin
GGTG
testFOR.id_M95099.fasta
>M95099|1:1-90_Homo_sapien
TCTTTGC
>M95099|1:100-243_Homo_sapien
ATGGTCTTTGAA
They're all grouped based on their ID number (in this case AH004930 and M95099), which I've managed to extract from the original raw multiFASTA file using the very handy seqkit code found HERE.
What I am aiming to do is:
Use cat to put these sequences together within the file like this:
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
(I'm not fussed about the nucleotide position, I'm fussed about the ID and species name!)
Print this result out into a new FASTA file.
Ideally I'd really like to have all of these 50 files condensed into 1 FASTA that I can then go ahead and filter/align:
GENE_L.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
....
So far I have found a way to achieve what I want, but only one file at a time (using this code: cat myfile.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' > output.fasta, for which I've sadly lost the link to give credit). But a lot of these file names are very similar, so it's inevitable that if I did it manually I'd miss some, or it would be way too slow.
I have tried to put this into a loop and it's kind of there! But what it does is cat each FASTA file and put it into a new one, keeping only the first header, which leaves me with a massive stitched-together sequence:
for FILE in *; do cat *.fasta| sed -e '1!{/^>.*/d;}'| sed ':a;N;$!ba;s/\n//2g' > output.fasta; done
output.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTGTCTTTGCATGGTCTTTGAAGGTCTTTGAAATGAGTGGT...
I wondered if making a loop similar to the one HERE would be any good but I am really unsure how to get it to print each header once it opens a new file.
How can I cat these sequences, print them into a new file and still keep these headers?
I would really appreciate any advice on where I've gone wrong in the loop and any solutions suitable for a zsh shell! I'm open to any python or linux solution. Thank you kindly in advance
This might work for you (GNU sed):
sed -s '1h;/>/d;H;$!d;x;s/\n//2g' file1 file2 file3 ...
Set -s to treat each file separately.
Copy the first line.
Delete any other lines containing >.
Append all other lines to the first.
Delete these lines except for the last.
At the end of each file, swap in the collected copy and remove all newlines except the first.
Repeat for all files.
Alternative for non-GNU seds:
for file in *.fasta; do sed '1h;/>/d;H;$!d;x;s/\n//2g' "$file"; done
N.B. MacOS sed may need the commands to be put into a script and invoked using the -f option, or split into several pieces using the -e option (less the ; commands); your luck may vary.
Or perhaps:
for file in file?; do sed $'1h;/>/d;H;$!d;x;s/\\n/#/;s/\\n//g;s/#/\\n/' "$file"; done
Not sure I understand exactly your issue, but if you simply want to concatenate the contents of many files into a single file, I believe the (Python) code below should work:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'output.fasta'

with open(output_file, 'w') as outfile:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith('.fasta'):  # skip anything that isn't a FASTA file
            continue
        file_path = os.path.join(input_folder, file_name)
        with open(file_path, 'r') as inpfile:
            outfile.write(inpfile.read())
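If the goal is also to merge each file's sequences under that file's first header (as described in the question), a sketch along these lines might be closer; it assumes every .fasta file in the working directory should contribute exactly one merged record, and the output name is a placeholder:
import glob

with open('GENE_L.fasta', 'w') as out:
    for fasta_path in sorted(glob.glob('*.fasta')):
        header = None
        pieces = []
        with open(fasta_path) as fin:
            for line in fin:
                line = line.strip()
                if line.startswith('>'):
                    if header is None:
                        header = line       # keep only the first header per file
                elif line:
                    pieces.append(line)     # collect every sequence line
        if header:
            out.write(header + '\n' + ''.join(pieces) + '\n')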

How can I concatenate multiple text or xml files but omit specific lines from each file?

I have a number of xml files (which can be considered as text files in this situation) that I wish to concatenate. Normally I think I could do something like this from a Linux command prompt or bash script:
cat somefile.xml someotherfile.xml adifferentfile.xml > out.txt
Except that in this case, I need to copy the first file in its entirety EXCEPT for the very last line, but in all subsequent files omit exactly the first four lines and the very last line (technically, I do need the last line from the last file but it is always the same, so I can easily add it with a separate statement).
In all these files the first four lines and the last line are always the same, but the contents in between varies. The names of the xml files can be hardcoded into the script or read from a separate data file, and the number of them may vary from time to time but always will number somewhere around 10-12.
I'm wondering what would be the easiest and most understandable way to do this. I think I would prefer either a bash script or maybe a python script, though I generally understand bash scripts a little better. What I can't get my head around is how to trim off just those first four lines (on all but the first file) and the last line of every file. My suspicion is there's some Linux command that can do this, but I have no idea what it would be. Any suggestions?
sed '$d' firstfile > out.txt
sed --separate '1,4d; $d' file2 file3 file4 >> out.txt
sed '1,4d' lastfile >> out.txt
It's important to use the --separate (or shorter -s) option so that the range statements 1,4 and $ apply to each file individually.
From GNU sed manual:
-s, --separate
By default, sed will consider the files specified on the command line as a single continuous long stream. This GNU sed extension allows the user to consider them as separate files.
Do it in two steps:
Use the head command (to get the lines you want).
Use cat to combine them.
You could use temp files or bash trickery.
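For comparison, a rough Python sketch of the same trimming logic (the file names are placeholders, and it assumes the shared closing line can simply be re-added from the last file):
# Placeholders for your actual XML file names.
files = ['somefile.xml', 'someotherfile.xml', 'adifferentfile.xml']

with open('out.txt', 'w') as out:
    for position, name in enumerate(files):
        with open(name) as f:
            lines = f.readlines()
        body = lines[:-1]        # every file: drop its last line
        if position > 0:
            body = body[4:]      # all but the first file: also drop the first four lines
        out.writelines(body)
    out.write(lines[-1])         # re-add the shared closing line from the last file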

Count columns of gzipped tsv without loading

I have a large tab-delimited file that has been gzipped, and I would like to know how many columns it has. For small files I can just unzip and read it into Python, but for large files this is slow. Is there a way to count the columns quickly without loading the file into Python?
Efficiently counting number of columns of text file is almost identical, but since my files are gzipped, just reading the first line won't work. Is there a way to make Python efficiently unzip just enough to read the first line?
... but since my files are gzipped just reading the first line won't work.
Yes it will.
import csv
import gzip
with gzip.open('file.tsv.gz', 'rt') as gzf:
    reader = csv.reader(gzf, dialect=csv.excel_tab)
    print(len(next(reader)))
This can be done with traditional unix command line tools. For example:
$ zcat file.tsv.gz | head -n 1 | tr $'\t' '\n' | wc -l
zcat (or gunzip -c) unzips and outputs to standard output, without modifying the file. 'head -n 1' reads exactly one line and outputs it. The 'tr' command replaces tabs with newlines, and 'wc -l' counts the number of lines. Because 'head -n 1' exits after one line, this has the effect of terminating the zcat command as well. It's quite fast. If the first line of the file is a header, simply omit the 'wc -l' to see what the headers are.

How to use awk if statement and for loop in subprocess.call

Trying to print filename of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time", so I googled and added ^ within the parentheses. The next error was "unexpected newline or end of string", so I googled again and added the quotes around NR==1 && NF!=12. With the current code it's printing many lines in each file, so I suspect something is wrong with the if statement. I've used awk and for loops before in this style in subprocess.call, but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through each entire file until you find any line that has other than 12 fields, rather than only checking the first line (NR==1). In that case, the condition would be just NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, of course changes the field separator from the default whitespace to a comma. On the first line of each file, AWK checks whether the number of fields NF is 12; if it isn't, it executes the block of code, otherwise it simply moves on. The block prints the FILENAME of the file AWK is currently processing, then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes make it easy to embed the awk program's single quotes in the Python string. The output of AWK goes to stdout, and I'm assuming you know how to use this in Python with the subprocess module.
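For instance, if you want the matching file names back in Python rather than just printed, capturing stdout with check_output is one option (a sketch, not the only way):
import subprocess

# Capture AWK's stdout so the matching file names end up in a Python list.
out = subprocess.check_output(
    "awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv",
    shell=True)
bad_files = out.decode().splitlines()
print(bad_files)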
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob

files_found = []
for filename in glob.glob('*dim*.csv'):
    with open(filename, 'r') as f:
        firstline = f.readline()
    if len(firstline.split(',')) != 12:
        files_found.append(filename)

print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, the file name is added to the list files_found.

Simple regex problem: Removing all new lines from a file

I'm becoming acquainted with python and am creating problems in order to help myself learn the ins and outs of the language. My next problem comes as follows:
I have copied and pasted a huge slew of text from the internet, but the copy and paste added several new lines to break up the huge string. I wish to programmatically remove all of these and return the string to a giant blob of characters. This is obviously a job for regex (I think), and parsing through the file and removing all instances of the newline character sounds like it would work, but it doesn't seem to be going all that well for me.
Is there an easy way to go about this? It seems rather simple.
The two main alternatives: read everything in as a single string and remove newlines:
clean = open('thefile.txt').read().replace('\n', '')
or, read line by line, removing the newline that ends each line, and join it up again:
clean = ''.join(l[:-1] for l in open('thefile.txt'))
The former alternative is probably faster, but, as always, I strongly recommend you MEASURE speed (e.g., use python -mtimeit) in cases of your specific interest, rather than just assuming you know how performance will be. REs are probably slower, but, again: don't guess, MEASURE!
So here are some numbers for a specific text file on my laptop:
$ python -mtimeit -s"import re" "re.sub('\n','',open('AV1611Bible.txt').read())"
10 loops, best of 3: 53.9 msec per loop
$ python -mtimeit "''.join(l[:-1] for l in open('AV1611Bible.txt'))"
10 loops, best of 3: 51.3 msec per loop
$ python -mtimeit "open('AV1611Bible.txt').read().replace('\n', '')"
10 loops, best of 3: 35.1 msec per loop
The file is a version of the KJ Bible, downloaded and unzipped from here (I do think it's important to run such measurements on one easily fetched file, so others can easily reproduce them!).
Of course, a few milliseconds more or less on a file of 4.3 MB, 34,000 lines, may not matter much to you one way or another; but as the fastest approach is also the simplest one (far from an unusual occurrence, especially in Python;-), I think that's a pretty good recommendation.
I wouldn't use a regex for simply replacing newlines - I'd use string.replace(). Here's a complete script:
f = open('input.txt')
contents = f.read()
f.close()
new_contents = contents.replace('\n', '')
f = open('output.txt', 'w')
f.write(new_contents)
f.close()
import re
re.sub(r"\n", "", file_contents_here)
I know this is a python learning problem, but if you're ever trying to do this from the command-line, there's no need to write a python script. Here are a couple of other ways:
cat $FILE | tr -d '\n'
awk '{printf("%s", $0)}' $FILE
Neither of these has to read the entire file into memory, so if you've got an enormous file to process, they might be better than the python solutions provided.
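For completeness, a Python variant in the same streaming spirit, which also avoids holding the whole file in memory (file names are placeholders):
# Copy the file line by line, dropping each trailing newline as we go,
# so only one line is ever held in memory at a time.
with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        dst.write(line.rstrip('\n'))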
Old question, but since it came up in my search results for a similar query, and no one has mentioned the Python string functions strip() / lstrip() / rstrip(), I'll just add that for posterity (and for anyone who prefers not to use re when it's not necessary):
old = open('infile.txt')
new = open('outfile.txt', 'w')
stripped = [line.strip() for line in old]
old.close()
new.write("".join(stripped))
new.close()
All the examples using <string>.replace('\n', '') show the correct way to remove all newlines.
If instead you are interested in removing only redundant newlines (for debugging output, etc.), here is how:
import re
re.sub(r"(\n)\1{2,}", "", _your_string).strip()
