File division into 2 random files - python

I want to divide a file into two random halves with Python. I have a small script, but it does not divide the file exactly in two. Any suggestions?
import random
fin = open("test.txt", 'rb')
f1out = open("test1.txt", 'wb')
f2out = open("test2.txt", 'wb')
for line in fin:
    r = random.random()
    if r < 0.5:
        f1out.write(line)
    else:
        f2out.write(line)
fin.close()
f1out.close()
f2out.close()

The notion of randomness means that you cannot rely on the random number to deterministically produce an equal number of results below 0.5 and above 0.5.
You could use a counter and check if it is even or odd after shuffling all the lines in a list:
file_lines = [line for line in fin]
random.shuffle(file_lines)
counter = 0
for line in file_lines:
    counter += 1
    if counter % 2 == 0:
        f1out.write(line)
    else:
        f2out.write(line)
You can use this pattern with any number (10 in this example):
counter = 0
for line in file_lines:
    counter += 1
    if counter % 10 == 0:
        f1out.write(line)
    elif counter % 10 == 1:
        f2out.write(line)
    elif counter % 10 == 2:
        f3out.write(line)
    elif counter % 10 == 3:
        f4out.write(line)
    elif counter % 10 == 4:
        f5out.write(line)
    elif counter % 10 == 5:
        f6out.write(line)
    elif counter % 10 == 6:
        f7out.write(line)
    elif counter % 10 == 7:
        f8out.write(line)
    elif counter % 10 == 8:
        f9out.write(line)
    else:
        f10out.write(line)
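If you'd rather not spell out one elif branch per file, the same round-robin idea can be written with a list of output files indexed by the counter modulo the number of files. This is only a sketch, assuming the shuffled file_lines list from above and output names test1.txt ... test10.txt:
n_files = 10
outs = [open("test{}.txt".format(i + 1), 'w') for i in range(n_files)]  # test1.txt ... test10.txt
for counter, line in enumerate(file_lines):
    outs[counter % n_files].write(line)   # line 0 -> test1.txt, line 1 -> test2.txt, ...
for f in outs:
    f.close()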

random will not give you exactly half each time. If you flip a coin 10 times, you don't necessarily get 5 heads and 5 tails.
One approach would be to use the partitioning method described in Python: Slicing a list into n nearly-equal-length partitions, but shuffling the lines beforehand.
import random
N_FILES = 2
fin = open("test.txt", 'rb')
lines = fin.readlines()
random.shuffle(lines)
n = len(lines)
out = [open("test{}.txt".format(i), 'wb') for i in range(min(N_FILES, n))]
size = n / float(N_FILES)
partitions = [lines[int(round(size * i)): int(round(size * (i + 1)))] for i in range(N_FILES)]
for f, part in zip(out, partitions):
    for line in part:
        f.write(line)
fin.close()
for f in out:
    f.close()
The code above splits the input file into N_FILES (defined as a constant at the top) of approximately equal size, never creating more output files than there are lines. Handling things this way lets you put it into a function that takes a variable number of files to split into, without altering the code for each case.
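A sketch of what that function might look like, reusing the logic above (the name split_random and the output name pattern are my own illustrative choices):
import random

def split_random(in_name, n_files, out_pattern="test{}.txt"):
    """Shuffle the lines of in_name and spread them over up to n_files output files."""
    with open(in_name, 'rb') as fin:
        lines = fin.readlines()
    random.shuffle(lines)
    n = len(lines)
    k = min(n_files, n)                 # never create more files than there are lines
    size = n / float(k) if k else 0
    for i in range(k):
        chunk = lines[int(round(size * i)): int(round(size * (i + 1)))]
        with open(out_pattern.format(i), 'wb') as fout:
            fout.writelines(chunk)

split_random("test.txt", 2)             # same result as the script above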

Random doesn't choose 8-character words

I used the random module in my project. I'm not a programmer, but I need to write a script that creates a randomized database from another database. In my case, the script should select random words from a column in the "wordlist.txt" file, line them up in completely random order in lines of 12 words, write each line to another file (for example, "result1.txt"), move to a new line, and repeat indefinitely. Everything seems to be working, but I noticed that it adds absolutely all words except words consisting of 8 letters.
I would also like to improve the performance of this code.
Code:
import random
# Initial wordlists
fours = []
fives = []
sixes = []
sevens = []
eights = []
# Fill above lists with corresponding word lengths from wordlist
with open('wordlist.txt') as wordlist:
    for line in wordlist:
        if len(line) == 4:
            fours.append(line.strip())
        elif len(line) == 5:
            fives.append(line.strip())
        elif len(line) == 6:
            sixes.append(line.strip())
        elif len(line) == 7:
            sevens.append(line.strip())
        elif len(line) == 8:
            eights.append(line.strip())
# Create new lists and fill with number of items in fours
fivesLess = []
sixesLess = []
sevensLess = []
eightsLess = []
fivesCounter = 0
while fivesCounter < len(fours):
    randFive = random.choice(fives)
    if randFive not in fivesLess:
        fivesLess.append(randFive)
        fivesCounter += 1
sixesCounter = 0
while sixesCounter < len(fours):
    randSix = random.choice(sixes)
    if randSix not in sixesLess:
        sixesLess.append(randSix)
        sixesCounter += 1
sevensCounter = 0
while sevensCounter < len(fours):
    randSeven = random.choice(sevens)
    if randSeven not in sevensLess:
        sevensLess.append(randSeven)
        sevensCounter += 1
eightsCounter = 0
while eightsCounter < len(fours):
    randEight = random.choice(eights)
    if randEight not in eightsLess:
        eightsLess.append(randEight)
        eightsCounter += 1
choices = [eights]
# Generate n number of seeds and print
seedCounter = 0
while seedCounter < 1:
    seed = []
    while len(seed) < 12:
        wordLengthChoice = random.choice(choices)
        wordChoice = random.choice(wordLengthChoice)
        seed.append(wordChoice)
    seedCounter += 0
    with open("result1.txt", "a") as f:
        f.write(' '.join(seed))
        f.write('\n')
If I'm understanding you correctly, something like
import random
from collections import defaultdict

words_by_length = defaultdict(list)

def generate_line(*, n_words, word_length):
    return ' '.join(random.choice(words_by_length[word_length]) for _ in range(n_words))

with open('wordlist.txt') as wordlist:
    for line in wordlist:
        line = line.strip()
        words_by_length[len(line)].append(line)

for x in range(10):
    print(generate_line(n_words=12, word_length=8))
should be enough – you can just use a single dict-of-lists to contain all of the words, no need for separate variables. (Also, I suspect your original bug stemmed from not strip()ing the line before looking at its length.)
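To see why stripping matters: a line read from a file still carries its trailing newline, so an 8-letter word measures 9 characters and never matches len(line) == 8, while 7-letter words do. A tiny illustration:
line = "aardvark\n"          # an 8-letter word as it comes out of the file
print(len(line))             # 9 -> never caught by `elif len(line) == 8`
print(len(line.strip()))     # 8 -> caught once the newline is stripped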
If you need a single line to never repeat a word, you'll want
def generate_line(*, n_words, word_length):
    return ' '.join(random.sample(words_by_length[word_length], n_words))
instead.
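And if you want the lines appended to a file rather than printed (result1.txt is the name from your question), something along these lines should work with the generate_line above:
with open("result1.txt", "a") as f:
    for _ in range(10):      # or keep looping for as long as you need
        f.write(generate_line(n_words=12, word_length=8) + '\n')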

How to divide a file?

I tried to split a file into multiple chunk files by dividing the total number of lines of the text file equally. However, it does not split the sizes equally. Is there a way to divide the file into multiple equal chunks without writing any incomplete lines to the files, using Python 3.x? For example, a 100 MB text file would be divided into 33 MB, 33 MB and 34 MB chunks.
Here is what I have so far:
chunk = 3
my_file = 'file.txt'
NUM_OF_LINES = -(-(sum(1 for line in open(my_file))) // chunk) + 1
print(NUM_OF_LINES)
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * NUM_OF_LINES
    left = len(hold_lines) - increment
    file_name = "text.txt_" + str(outer_count * NUM_OF_LINES) + ".txt"
    hold_new_lines = []
    if left < NUM_OF_LINES:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < NUM_OF_LINES:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
If maintaining sequential order of lines is not important, https://stackoverflow.com/a/30583482/783836 is a pretty straightforward solution.
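If order doesn't matter, one simple pattern of that kind is round-robin writing, where line i goes to output file i % n. A minimal sketch, with illustrative file names chunk0.txt ... chunk2.txt:
n = 3
outs = [open('chunk{}.txt'.format(i), 'w') for i in range(n)]   # chunk0.txt, chunk1.txt, chunk2.txt
with open('file.txt') as f:
    for i, line in enumerate(f):
        outs[i % n].write(line)      # spreads lines evenly, but not in contiguous blocks
for out in outs:
    out.close()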
This code attempts to equalize, as faithfully as possible, the size of the subfiles (not the number of lines; the two criteria cannot be met simultaneously). I use some numpy tools to improve conciseness and reliability. np.searchsorted finds the line numbers where the splits occur in the original file.
import numpy as np, math

lines = []
lengths = []
n = 6
with open('file.txt') as f:
    for line in f:
        lines.append(line)
        lengths.append(len(line))
cumlengths = np.cumsum(lengths)
totalsize = cumlengths[-1]
chunksize = math.ceil(totalsize / n)  # round up to the next integer
starts = np.searchsorted(cumlengths, range(0, (n + 1) * chunksize, chunksize))  # places to split
for k in range(n):
    with open('out' + str(k + 1) + '.txt', 'w') as f:
        s = slice(starts[k], starts[k + 1])
        f.writelines(lines[s])
        print(np.sum(lengths[s]))  # check the size
Without external modules, starts can also be built by:
chunksize = (sum(lengths) - 1) // n + 1
starts = []
split = 0
cumlength = 0
for k, length in enumerate(lengths):
    cumlength += length
    if cumlength >= split:
        starts.append(k)
        split += chunksize
starts.append(k + 1)
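With starts built that way, the writing loop is the same as in the numpy version, just without numpy (a sketch reusing the same variable names):
for k in range(n):
    with open('out' + str(k + 1) + '.txt', 'w') as f:
        f.writelines(lines[starts[k]:starts[k + 1]])
        print(sum(lengths[starts[k]:starts[k + 1]]))   # check the chunk size without numpy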

Splitting a large json file into multiple json files using python [duplicate]

I have a text file say really_big_file.txt that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt to have lines 1-300, small_file_600 to have lines 301-600, and so on until there are enough small files made to contain all the lines from the big file.
I would appreciate any suggestions on the easiest way to accomplish this using Python
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
Using itertools grouper recipe:
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300
with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)
The advantage of this method, as opposed to storing each line in a list, is that it works with iterables line by line, so it doesn't have to store each small_file in memory at once.
Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='', meaning I write out nothing to the file when I don't have any more lines left to write because a group size doesn't divide equally. You can fix this by writing to a temp file and then renaming it after instead of naming it first like I have. Here's how that can be done.
import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
This time fillvalue=None, and I go through each line checking for None; when it occurs, I know the process has finished, so I subtract 1 from j to not count the filler and then write the file.
I do this in a more understandable way, using fewer shortcuts, to give you a better sense of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.
Because you posted no code, I decided to do it this way, since the way you phrased the question suggests you may only be familiar with basic Python syntax.
Here are the steps to do this in basic python:
First you should read your file into a list for safekeeping:
my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need to set up a way of creating the new files by name! I would suggest a loop along with a couple counters:
outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that will save the correct rows into an array:
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
Last thing: back in your outer loop, you need to write the new file and add your final counter increment, so the loop will go through again and write a new file.
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Note: if the number of lines is not divisible by 300, the last file will have a name that does not correspond to its last line number.
It is important to understand why these loops work. You have it set so that on the next loop, the name of the file that you write changes because you have the name dependent on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing etc.
In case you could not follow what was in what loop, here is the entirety of the function:
my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created
with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (lines already end with '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
# After for-loop has finished
if lines_counter:  # There are still some lines not written on a file?
    idx = lines_per_file * (created_files + 1)
    with open('small_file_%s.txt' % idx, 'w') as small_file:
        # Write them on a last small file
        small_file.write(''.join(lines))
    created_files += 1
print('%s small files (with %s lines each) were created.' % (created_files,
                                                              lines_per_file))
import csv
import os
import re

MAX_CHUNKS = 300

def writeRow(idr, row):
    with open("file_%d.csv" % idr, 'ab') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'rb') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            temp = i + 1
            if not (temp % (MAX_CHUNKS + 1)):
                idr += 1
            writeRow(idr, x)

if __name__ == "__main__": main()
with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
I had to do the same with 650,000-line files.
Use the enumerate index and integer-divide it (//) by the chunk size.
When that number changes, close the current file and open a new one.
This is a Python 3 solution using format strings (f-strings).
chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')
with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read.readlines()):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a little
        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)
this_small_file.close()
Set files to the number of files you want to split the master file into; in my example I want to get 10 files from my master file.
files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = int(len(emails) / files)
    for id, log in enumerate(emails):
        fileid = id // batchs
        with open("minifile{file}.txt".format(file=int(fileid) + 1), 'a+') as file:
            file.write(log)
A very easy way, if you want to split it into 2 files for example:
with open("myInputFile.txt",'r') as file:
lines = file.readlines()
with open("OutputFile1.txt",'w') as file:
for line in lines[:int(len(lines)/2)]:
file.write(line)
with open("OutputFile2.txt",'w') as file:
for line in lines[int(len(lines)/2):]:
file.write(line)
making that dynamic would be:
with open("inputFile.txt",'r') as file:
lines = file.readlines()
Batch = 10
end = 0
for i in range(1,Batch + 1):
if i == 1:
start = 0
increase = int(len(lines)/Batch)
end = end + increase
with open("splitText_" + str(i) + ".txt",'w') as file:
for line in lines[start:end]:
file.write(line)
start = end
In Python, files are simple iterators. That gives us the option to iterate over them multiple times, always continuing from where the previous iteration stopped. Keeping this in mind, we can use islice to get the next 300 lines of the file on each pass of a continuous loop. The tricky part is knowing when to stop. For this we "sample" the file for the next line, and once it is exhausted we break the loop:
from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file - 1):
                out_file.write(line)
        i += 1

How to recover only the second instance of a string in a text file?

I have a large number of text files (>1000) with the same format for all.
The part of the file I'm interested in looks something like:
# event 9
num: 1
length: 0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
# event 10
num: 1
length: 0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
I need only the second index value of the second instance of the defined variable, in this case length. Is there a pythonic way of doing this (i.e. not sed)?
My code looks like:
with open(wave_cat, 'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])
If you can read the entire file into memory, just do a regex against the file contents:
import re

pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in [list of your files, maybe from a glob]:
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]
        except IndexError:
            nm = ''
    print nm
For larger files, use mmap:
import re, mmap

nth = 1
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in [list of your files, maybe from a glob]:
    with open(fn, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print m.group(1)
                break
Use a counter:
with open(wave_cat, 'r') as catID:
    ct = 0
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    Length = float(line[2])
                    ct = 0
You're on the right track. It'll probably be a bit faster to defer the splitting until you actually need it. Also, if you're scanning lots of files and only want the second length entry, it will save a lot of time to break out of the loop once you've seen it.
length_seen = 0
elements = []
with open(wave_cat, 'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
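If you only need that single second length: entry per file (rather than one per event), the loop can stop as soon as it finds it, as suggested above. A hedged variant:
second_length = None
length_seen = 0
with open(wave_cat, 'r') as catID:
    for line in catID:
        if line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                second_length = float(line.split()[2])
                break                 # no need to read the rest of the file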

