Python: what is the quickest way to split a file into two files, each containing half of the lines of the original file, with the lines assigned to each file at random?
For example, if the file is
1
2
3
4
5
6
7
8
9
10
it could be split into:
3
2
10
9
1
4
6
8
5
7
This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.
Given that definition, you can do this:
import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

lines = open("file.txt").readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)
Note that this won't necessarily exactly split the file in two, but it will on average.
You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):
def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])
random.shuffle shuffles lines in-place and does pretty much all the work here. Python's slice syntax (e.g. lines[len(lines) // 2:]) makes splitting the result really convenient.
I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit fancier, probably using the linecache module to read random line numbers from your input file. You would probably want to generate two lists of line numbers, using a similar technique to what's shown above.
Update: changed / to // to avoid issues when from __future__ import division is enabled.
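For the genuinely huge-file case described above, here is a rough, untested sketch of that two-lists-of-line-numbers idea using linecache (shuffle_split_big is just an illustrative name; note that linecache still caches the file's contents internally):
import linecache
import random

def shuffle_split_big(infilename, outfilename1, outfilename2):
    # count the lines without keeping them in memory
    with open(infilename) as f:
        total = sum(1 for _ in f)

    # shuffle the (1-based) line numbers and split them into two halves
    numbers = list(range(1, total + 1))
    random.shuffle(numbers)
    halves = numbers[:total // 2], numbers[total // 2:]

    for outname, half in zip((outfilename1, outfilename2), halves):
        with open(outname, 'w') as out:
            for n in half:
                out.write(linecache.getline(infilename, n))
    linecache.clearcache()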
import random

data = open("file").readlines()
random.shuffle(data)
c = 1
f = open("test." + str(c), "w")
for n, i in enumerate(data):
    if n == len(data) // 2:
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(i)
f.close()
Another version:
from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        f.write('\n'.join(lines))
Related
I have a train_file.txt which has 3 columns in each row. For example:
1 10 1
1 12 1
2 64 2
6 17 1
...
I am reading this txt file with
train_data = open("train_file.txt", 'r').readlines()
Then I am trying to get each value with a for loop:
for eachline in train_data:
    uid, lid, x = eachline.strip().split()
Question: the train data is a huge file, which is why I want to get just the first 1000 rows.
I was trying to execute the following code, but I am getting an error ('list' object cannot be interpreted as an integer):
for eachline in range(train_data, 1000):
    uid, lid, x = eachline.strip().split()
It is not necessary to read the entire file at all. You could use enumerate on the file directly and break early or use itertools.islice:
from itertools import islice
train_data = list(islice(open("train_file.txt", 'r'), 1000))
You can also keep using the same file handle to read more data later:
f = open("train_file.txt", 'r')
train_data = list(islice(f, 1000)) # reads first 1000
test_data = list(islice(f, 100)) # reads next 100
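If you prefer the enumerate-and-break variant mentioned above, a minimal untested sketch:
train_data = []
with open("train_file.txt", 'r') as f:
    for i, line in enumerate(f):
        if i == 1000:
            break  # stop after the first 1000 lines
        train_data.append(line)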
Maybe try changing this line:
train_data = open("train_file.txt", 'r').readlines()
To:
train_data = open("train_file.txt", 'r').readlines()[:1000]
train_data is a list, use slicing:
for eachline in train_data[:1000]:
As the file is "huge" in your words, a better approach is to read just the first 1000 rows (readlines() will read the whole file into memory):
train_data = []
with open("train_file.txt", 'r') as f:
    for idx, line in enumerate(f, start=1):
        train_data.append(line.strip().split())
        if idx == 1000:
            break
Note that data will be str, not int. You probably want to convert them to int.
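For example, assuming every row really does hold three integer fields, the append above could become:
train_data.append([int(value) for value in line.split()])
or you could unpack them directly with uid, lid, x = map(int, line.split()).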
You could use enumerate and a break:
for k, line in enumerate(lines):
    if k >= 1000:
        break  # exit the loop after the first 1000 lines
    # do stuff on the line
I would recommend using the built-in csv library, since the data is csv-like (or pandas if you're already using it), and using with. So something like this:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    csv_reader = csv.reader(input_file, delimiter=' ')
    rows = list(islice(csv_reader, 1000))

# Use rows
print(rows)
You don't need it right now but it will make escaped characters or multiline entries way easier to parse. Also, if there are headers you can use csv.DictReader to include them.
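For instance, a small hypothetical sketch with csv.DictReader, assuming the file started with a header row such as uid lid x:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    reader = csv.DictReader(input_file, delimiter=' ')
    rows = list(islice(reader, 1000))  # each row is a dict keyed by the header names

print(rows[0]['uid'])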
Regarding your original code:
The call to readlines() reads all of the lines at that point, so any filtering you do afterwards won't avoid reading the whole file.
If you did read it that way, to get the first 1000 lines your for loop should be:
for eachline in train_data[:1000]:
    ...
I have a FASTA file that contains two gene sequences. What I want to do is remove the FASTA headers (lines starting with ">"), concatenate the remaining lines, and output the resulting sequence at 50 characters per line. I made some progress but got stuck at the end.
Here is my FASTA sequence:
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
And the output that I want is something like this:
>conc
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
TCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGG
CATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCC
TTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATA
AGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAAACCATATGGC
ATTTTGCATCCATTTGTGCATTTCATTTAGTTTACTTGCATTCATTCAGG
My script so far is:
final = list()
with open("test.fa", 'r') as fh_in:
    for line in fh_in:
        line = line.strip()
        if not line.startswith(">"):
            final.append(line)

final2 = "".join(final)

with open("testconcat.fa", 'w') as fh_out:
    fh_out.write(">con")
    fh_out.write("\n")
    fh_out.write(final2)
How can I make sure that I only write 50 characters on each line?
You can use the built-in textwrap library:
import textwrap

final2 = "".join(final)
print '\n'.join(textwrap.wrap(final2, 50))
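If you want to write the wrapped sequence back to the output file rather than printing it, a small untested sketch reusing the names from the question:
with open("testconcat.fa", 'w') as fh_out:
    fh_out.write(">conc\n")
    fh_out.write('\n'.join(textwrap.wrap(final2, 50)))
    fh_out.write('\n')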
When dealing with large files, doing the joining and slicing in memory can cause problems: you end up holding the whole sequence while appending every line and then splitting it again into equally sized chunks before anything is actually written to the file.
I think the best way to avoid such issues is to operate on the file rather than in memory; in other words, write as you read.
>>> with open('test.fa', 'r') as r, open('testconcat.fa', 'w') as w:
...     for line in r:
...         if not line.startswith(">"):
...             w.write(line.strip())
>>> with open('testconcat.fa', 'r+') as w:
...     chunk = 50
...     i = 0
...     while next(w, None):
...         w.seek(((i + 1) * chunk) + i)
...         w.write('\n')
...         i = i + 1
>>> cat testconcat.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
CAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGC
TGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTT
GTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGT
AAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAG
ACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAAC
ATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGA
TTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Hope this helps.
I am trying to figure out a way to split a big txt file with columns of data into smaller files for uploading purposes. The big file has 4000 lines and I am wondering if there is a way to divide it into four parts, such as
file 1 (lines 1-1000)
file 2 (lines 1001-2000)
file 3 (lines 2001-3000)
file 4 (lines 3001-4000)
I appreciate the help.
This works (you could use a for loop rather than a while loop, but it makes little difference, and this way it does not assume how many files will be necessary):
with open('longFile.txt', 'r') as f:
    lines = f.readlines()

threshold = 1000
fileID = 0
while fileID < len(lines) / float(threshold):
    with open('fileNo' + str(fileID) + '.txt', 'w') as currentFile:
        for currentLine in lines[threshold * fileID:threshold * (fileID + 1)]:
            currentFile.write(currentLine)
    fileID += 1
Hope this helps. Try to use open in a with block, as suggested in the Python docs.
Give this a try:
fhand = open(filename, 'r')
all_lines = fhand.readlines()
for x in xrange(4):
    new_file = open(new_file_names[x], 'w')
    new_file.writelines(all_lines[x * 1000:(x + 1) * 1000])
    new_file.close()
I like Aleksander Lidtke's answer, but with a for loop and a pop() twist for fun. I also like to keep some of the file's original naming when I do this, since it usually produces multiple files, so the new file names are built from the original name via split().
with open('Data.txt', 'r') as f:
    lines = f.readlines()

limit = 1000
for o in range(len(lines)):
    if lines != []:
        with open(f.name.split(".")[0] + "_" + str(o) + '.txt', 'w') as NewFile:
            for i in range(limit):
                if lines != []:
                    NewFile.write(lines.pop(0))
I want to skip the first 17 lines while reading a text file.
Let's say the file looks like:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
good stuff
I just want the good stuff. What I'm doing is a lot more complicated, but this is the part I'm having trouble with.
Use a slice, like below:
with open('yourfile.txt') as f:
    lines_after_17 = f.readlines()[17:]
If the file is too big to load in memory:
with open('yourfile.txt') as f:
    for _ in range(17):
        next(f)
    for line in f:
        # do stuff
Use itertools.islice, starting at index 17. It will automatically skip the first 17 lines.
import itertools

with open('file.txt') as f:
    for line in itertools.islice(f, 17, None):  # start=17, stop=None
        # process lines
You can also use itertools.dropwhile to skip lines based on their content rather than by counting them:
for line in dropwhile(isBadLine, lines):
    # process as you see fit
Full demo:
from itertools import dropwhile

def isBadLine(line):
    return line.strip() == '0'

with open(...) as f:
    for line in dropwhile(isBadLine, f):
        # process as you see fit
Advantages: This is easily extensible to cases where your prefix lines are more complicated than "0" (but not interdependent).
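For example, a hypothetical predicate that treats blank lines and comment lines as the prefix to drop:
def isBadLine(line):
    stripped = line.strip()
    return stripped == '' or stripped.startswith('#')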
Here are the timeit results for the top 2 answers. Note that "file.txt" is a text file containing 100,000+ lines of random strings, with a file size of 1 MB+.
Using itertools:
import itertools
from timeit import timeit

timeit("""with open("file.txt", "r") as fo:
    for line in itertools.islice(fo, 90000, None):
        line.strip()""", setup="import itertools", number=100)
>>> 1.604976346003241
Using two for loops:
from timeit import timeit

timeit("""with open("file.txt", "r") as fo:
    for i in range(90000):
        next(fo)
    for j in fo:
        j.strip()""", number=100)
>>> 2.427317383000627
Clearly, the itertools method is more efficient when dealing with large files.
If you don't want to read the whole file into memory at once, you can use a few tricks:
With next(iterator) you can advance to the next line:
with open("filename.txt") as f:
next(f)
next(f)
next(f)
for line in f:
print(f)
Of course, this is slightly ugly, so itertools has a better way of doing this:
from itertools import islice

with open("filename.txt") as f:
    # start at line 17 and never stop (None), i.e. read until the end
    for line in islice(f, 17, None):
        print(line)
This solution helped me to skip the number of lines specified by the linetostart variable. You also get the index (int) and the line (string), if you want to keep track of those.
In your case, you would substitute linetostart with 18, or assign 18 to the linetostart variable.
f = open("file.txt", 'r')
for i, line in enumerate(f, linetostart):
#Your code
If it's a table:
import pandas as pd
pd.read_table("path/to/file", sep="\t", index_col=0, skiprows=17)
You can use a list comprehension to make it a one-liner:
[fl.readline() for i in xrange(17)]
More about list comprehension in PEP 202 and in the Python documentation.
Here is a method to get lines between two line numbers in a file:
import sys

def file_line(name, start=1, end=sys.maxint):
    lc = 0
    with open(name) as f:
        for line in f:
            lc += 1
            if lc >= start and lc <= end:
                yield line

s = '/usr/share/dict/words'
l1 = list(file_line(s, 235880))
l2 = list(file_line(s, 1, 10))
print l1
print l2
Output:
['Zyrian\n', 'Zyryan\n', 'zythem\n', 'Zythia\n', 'zythum\n', 'Zyzomys\n', 'Zyzzogeton\n']
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
Just call it with one parameter to read from line n to EOF.
I'm pretty new to Python - just wondering if there was a library function or easy way to truncate a file to the first 100 lines or less?
with open("my.file", "r+") as f:
[f.readline() for x in range(100)]
f.truncate()
EDIT: A 5% speed increase can be had by instead using the xrange iterator and not storing the entire list:
with open("my.file", "r+") as f:
for x in xrange(100):
f.readline()
f.truncate()
Use one of the solutions here: Iterate over the lines of a string and just grab the first hundred, i.e.
import itertools
lines = itertools.islice(f, 100)  # f is the open file (or any other iterator over lines)
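If you want islice to actually truncate the file in place as well, a rough, untested sketch that mirrors the readline-based answer above:
import itertools

with open("my.file", "r+") as f:
    # advance past the first 100 lines; iter(f.readline, '') keeps calling readline until EOF
    for _ in itertools.islice(iter(f.readline, ''), 100):
        pass
    f.truncate()  # cut the file off at the current position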