Function to divide a text file into two files - python

I wrote a function that takes a text file and a ratio (e.g. 80%) and divides the first 80% of the lines into one file and the remaining 20% into another file. The first part is correct, but the second part is empty. Can someone take a look and let me know my mistake?
def splitFile(inputFilePatheName, outputFilePathNameFirst, outputFilePathNameRest, splitRatio):
    lines = 0
    buffer = bytearray(2048)
    with open(inputFilePatheName) as f:
        while f.readinto(buffer) > 0:
            lines += buffer.count('\n')
    print lines
    line80 = int(splitRatio * lines)
    print line80
    with open(inputFilePatheName) as originalFile:
        firstNlines = originalFile.readlines()[0:line80]
        restOfTheLines = originalFile.readlines()[(line80+1):lines]
    print len(firstNlines)
    print len(restOfTheLines)
    with open(outputFilePathNameFirst, 'w') as outputFileNLines:
        for item in firstNlines:
            outputFileNLines.write("{}".format(item))
    with open(outputFilePathNameRest, 'w') as outputFileRest:
        for word in restOfTheLines:
            outputFileRest.write("{}".format(word))

I believe this is your problem:
firstNlines = originalFile.readlines()[0:line80]
restOfTheLines=originalFile.readlines()[(line80+1):lines]
When you call readlines() the second time, you don't get anything, because you've already read all the lines from the file. Try:
allLines = originalFile.readlines()
firstNLines, restOfTheLines = allLines[:line80], allLines[line80:]  # line80, not line80+1, or one line would be skipped
Of course, for very large files there is a problem that you are reading the entire file into memory.
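If memory is a concern, one alternative (just a sketch, reusing the question's variable names) is to make a second pass over the file and write each line straight to the appropriate output instead of collecting the lines in lists:
with open(inputFilePatheName) as originalFile, \
        open(outputFilePathNameFirst, 'w') as firstPart, \
        open(outputFilePathNameRest, 'w') as restPart:
    # Stream the lines: everything before the cut-off goes to the first file,
    # the rest goes to the second file.
    for lineNumber, line in enumerate(originalFile):
        if lineNumber < line80:
            firstPart.write(line)
        else:
            restPart.write(line)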


Write all lines for each set of a range to new file each time the range changes Python 3.6

Trying to find a way of making this process work pythonically, or at all. Basically, I have a really long text file that is split into lines. Every x number of lines there is one that is mainly uppercase, which should roughly be the title of that particular section. Ideally, I'd want the title and everything after it to go into a text file, using the title as the name for the file. This would have to happen 3039 times in this case, as that is how many titles there are.
My process so far is this: I created a function that reads through a line of text and tells me whether it's mostly uppercase.
import numpy as np

def mostly_uppercase(text):
    threshold = 0.7
    isupper_bools = [character.isupper() for character in text]
    isupper_ints = [int(val) for val in isupper_bools]
    try:
        upper_percentage = np.mean(isupper_ints)
    except:
        return False
    if upper_percentage >= threshold:
        return True
    else:
        return False
Afterwards, I made a counter so that I could create an index and then I combined it:
counter = 0
headline_indices = []
for line in page_text:
    if mostly_uppercase(line):
        print(line)
        headline_indices.append(counter)
    counter += 1
headlines_with_articles = []
headline_indices_expanded = [0] + headline_indices + [len(page_text)-1]
for first, second in list(zip(headline_indices_expanded, headline_indices_expanded[1:])):
    article_text = page_text[first:second]
    headlines_with_articles.append(article_text)
All of that seems to be working fine as far as I can tell. But when I try to write the pieces that I want into files, all I manage to do is write the entire text into every one of the txt files.
for i in range(100):
    out_pathname = '/sharedfolder/temp_directory/' + 'new_file_' + str(i) + '.txt'
    with open(out_pathname, 'w') as fo:
        fo.write(articles_filtered[2])
Edit: This got me halfway there. Now, I just need a way of naming each file with the first line.
for i, text in enumerate(articles_filtered):
    open('/sharedfolder/temp_directory' + str(i + 1) + '.txt', 'w').write(str(text))
One conventional way of processing a single input file involves using a Python with statement and a for loop, in the following way. I have also adapted a good answer from someone else for counting uppercase characters, to get the fraction you need.
def mostly_upper(text):
    threshold = 0.7
    ## adapted from https://stackoverflow.com/a/18129868/131187
    if not text:  # guard against blank lines, which would otherwise divide by zero
        return False
    upper_count = sum(1 for c in text if c.isupper())
    return upper_count / len(text) >= threshold

first = True
out_file = None
with open('some_uppers.txt') as some_uppers:
    for line in some_uppers:
        line = line.rstrip()
        if first or mostly_upper(line):
            first = False
            if out_file: out_file.close()
            out_file = open(line + '.txt', 'w')  # start a new file named after the title line
        print(line, file=out_file)
out_file.close()
In the loop, we read each line, asking whether it's mostly uppercase. If it is we close the file that was being used for the previous collection of lines and open a new file for the next collection, using the contents of the current line as a title.
I allow for the possibility that the first line might not be a title. In this case the code creates a file with the contents of the first line as its name, and proceeds to write everything it finds to that file until it does find a title line.
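One practical caveat, not raised in the original answer: a title line may contain characters that are not legal in file names (a '/', for example). A small hypothetical helper could sanitize the title before it is used as a name:
import re

def safe_filename(title):
    # Keep letters, digits, underscores, dashes and spaces; replace anything else.
    cleaned = re.sub(r'[^\w\- ]+', '_', title).strip()
    return cleaned or 'untitled'

# e.g. out_file = open(safe_filename(line) + '.txt', 'w')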

Reading CSV file with Python

import os
import time

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(0, os.SEEK_END)
while 1:
    time.sleep(1)
    where = mycsv.tell()
    line = mycsv.readline()
    if not line:
        mycsv.seek(where)
    else:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print (var3)
I have this Python code which reads values from a CSV file every time a new line is printed into the CSV by an external program. My problem is that the CSV file is periodically completely rewritten, and then Python stops reading the new lines. My guess is that Python is stuck on some line number, and the new update can put maybe 50 lines more or less. So, for example, Python is now waiting for a new line at line 70 and the new line has come at line 95. I think the solution is to let mycsv.seek(0, os.SEEK_END) be updated, but I'm not sure how to do that.
What you want to do is difficult to accomplish without rewinding the file every time to make sure that you are truly on the last line. If you know approximately how many characters there are on each line, then there is a shortcut you could take using mycsv.seek(-end_buf, os.SEEK_END), as outlined in this answer. So your code could work something like this:
import os
import time

avg_len = 50  # use an appropriate number here
end_buf = 3 * avg_len / 2

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(-end_buf, os.SEEK_END)
last = mycsv.readlines()[-1]
while 1:
    time.sleep(1)
    mycsv.seek(-end_buf, os.SEEK_END)
    line = mycsv.readlines()[-1]
    if not line == last:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print (var3)
        last = line  # remember the newest line so it is only printed once
Here, in each iteration of the while loop, you seek to a position close to the end of the file, just far back enough that you know for sure the last line will be contained in what remains. Then you read in all the remaining lines (this will probably include a partial amount of the second or third to last lines) and check if the last line of these is different to what you had before.
There is a simpler way of reading lines in your program. Instead of trying to use seek in order to get what you need, try using readlines on the file object mycsv.
You can do the following:
mycsv = open('NTS.csv', 'r')
csv_lines = mycsv.readlines()
for line in csv_lines:
    arr_line = line.split(',')
    var3 = arr_line[3]
    print(var3)
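A side note on both snippets, not part of either answer: splitting each line on ',' by hand will misparse any field that itself contains a quoted comma. The standard csv module handles the quoting for you, for example:
import csv

with open('NTS.csv', 'r') as mycsv:
    for arr_line in csv.reader(mycsv):
        if len(arr_line) > 3:  # guard against short rows
            print(arr_line[3])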

Need help cropping a large text file to multiple small text files with a header

I need to crop a large text file, which contains over 10,000 lines of numbers in addition to a header of the format (number_of_lines, number_difference, "Sam"), into multiple small text files that each need their own header.
Number_difference is the difference between the first and last number.
For example, if the file looks like this:
10
12
13.5
17
20
Then, the header should be:
5 10 Sam
The problem is that the flags meant to keep a header from being written more than once do not work, and the big file's header carries over to the 1st small file.
The headers will never be the same for each file.
How do I add a changing header to each file?
def TextCropper():
    lines_per_file = 1000
    smallfile = None
    with open(inputFileName) as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_per_file == 0:
                if smallfile:
                    smallfile.close()
                small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
                smallfile = open(small_filename, "w")
                if (flags[counter] == False):
                    smallfile.write(lines_per_file)
                    flags[counter] = True
                smallfile.write(line)
            elif smallfile:
                smallfile.close()

TextCropper()
You're reading and writing the lines one at a time, which is inefficient. By doing that, you also don't know what the last line will be, so you can't write your header in advance.
Just read up to N lines, if available. islice() will do exactly that for you. If the list comes back empty, there were no lines left to read, otherwise you can proceed to write the current chunk into a file.
Since each line is read as a number with a trailing newline ('\n'), strip that, convert the first and last numbers into floats and calculate the difference. Writing the actual numbers to the file is straightforward by joining the elements of the list.
To make the function reusable, include the variables that are likely to change as arguments. That way you can name any big file, any output small file and any number of lines you want without changing hardcoded values.
from itertools import islice

def number_difference(iterable):
    return float(iterable[-1].strip('\n')) - float(iterable[0].strip('\n'))

def file_crop(big_fname, chunk_fname, no_lines):
    with open(big_fname, 'r') as big_file:
        ifile = 0
        while True:
            data = list(islice(big_file, no_lines))
            if not data:
                break
            with open('{}_{}.txt'.format(chunk_fname, ifile), 'w') as small_file:
                small_file.write('{} {} Sam\n'.format(len(data), number_difference(data)))
                small_file.write(''.join(data))
            ifile += 1
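As a usage sketch (the file names here are made up, not taken from the question), splitting into chunks of 1000 lines would look like this:
# Produces small_file_0.txt, small_file_1.txt, ..., each holding up to 1000
# numbers preceded by its own '<line count> <difference> Sam' header.
file_crop('big_numbers.txt', 'small_file', 1000)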

Splitting a CSV file into equal parts?

I have a large CSV file that I would like to split into a number of parts equal to the number of CPU cores in the system. I then want to use multiprocessing to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over Google and I found some sample code that appears to do what I want. Here is what I have so far:
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
            files[-1].write(infile.readline())  # get the possible remainder
            files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)
for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row
The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).
However, the last section of the code that prints all the rows in each of the pieces gives the error:
for row in reader:
_csv.Error: line contains NULL byte
I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.
Does anyone know if this a correct and fast method to split the file? I just want resulting pieces that can be read successfully by csv.reader.
As I said in a comment, csv files need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one, which I suspect is the cause of your _csv.Error.
The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone, in the sense that it divided the sample file up into approximately equal-size chunks (only approximately, because it's unlikely that a whole number of rows will fit exactly into a chunk).
Update
This is a substantially faster version of the code than I originally posted. The improvement is that it now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminates the need to flush() the file and call os.fsync() on it after each row is written.
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
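If you want to double-check that the split did not drop or break any rows, one sketch (assuming the temporary files returned by split() are still open; they disappear once closed) is to rewind them and compare row counts against the original:
# Rewind each chunk (the printing loop above read them to the end),
# then compare the total row count with the original CSV.
for ifile in files:
    ifile.seek(0)
chunk_rows = sum(1 for ifile in files for _ in csv.reader(ifile))
with open("sample_simple.csv", 'rb') as original:
    original_rows = sum(1 for _ in csv.reader(original))
print 'rows match:', chunk_rows == original_rows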

How to count the number of characters in a file (not using the len function)?

Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
    print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
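Equivalently, the loop can be collapsed into a single generator expression (the same idea, just more compact):
total = sum(length(line) for line in f)
print(total)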
There are two problems.
First, for each line in the file, you're passing f itself—that is, a sequence of lines—to length. That's why it's printing the number of lines in the file. The length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data—that is, a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
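If the file really is too big to read at once, a middle ground (a sketch, not part of the original answer) is to read fixed-size chunks and add up their lengths with the same length helper:
# Count characters without holding the whole file in memory,
# reading 64 KB at a time (the chunk size is an arbitrary choice).
total_length = 0
while True:
    chunk = f.read(64 * 1024)
    if not chunk:  # an empty string means end of file
        break
    total_length += length(chunk)
print(total_length)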
When you iterate over a file (opened in text mode) you are iterating over its lines.
for data in f: could be rewritten as for line in f: and it is easier to see what it is doing.
Your length function looks like it should work but you are sending the open file to it instead of each line.
