Sorting large text data - Python

I have a large file (100 million lines of tab-separated values, about 1.5 GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.

Have you considered using the *nix sort program? In raw terms, it will probably be faster than most Python scripts.
Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
This will sort input.txt on its 4th field and output the result to sorted.txt.
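If you would rather drive this from Python than type it in a shell, one option (a sketch, not part of the answer above; the field number and file names simply mirror the example) is to call sort through subprocess:
import subprocess

# Hand the heavy lifting to GNU sort; adjust the field number and file names as needed.
subprocess.run(
    ['sort', '-t', '\t', '-k', '4', '-o', 'sorted.txt', 'input.txt'],
    check=True,  # raise CalledProcessError if sort exits with a non-zero status
)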

You want to build an in-memory index for the file:
create an empty list
open the file
read it line by line (using f.readline()), and store in the list a tuple consisting of the value on which you want to sort (extracted with line.split('\t')[sort_col].strip()) and the offset of the line in the file (which you can get by calling f.tell() before calling f.readline())
close the file
sort the list
Then to print the sorted file, reopen the file and for each element of your list, use f.seek(offset) to move the file pointer to the beginning of the line, f.readline() to read the line and print the line.
Optimization: you may want to store the length of the line in the list, so that you can use f.read(length) in the printing phase.
Sample code (optimized for readability, not speed):
def build_index(filename, sort_col):
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[sort_col].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index

def print_sorted(filename, col_sort):
    index = build_index(filename, col_sort)
    f = open(filename)
    for col, offset, length in index:
        f.seek(offset)
        print f.read(length).rstrip('\n')

if __name__ == '__main__':
    filename = 'somefile.txt'
    sort_col = 2
    print_sorted(filename, sort_col)

Split the input up into files that can be sorted in memory, sort each file in memory, then merge the resulting files.
Merge by reading a portion of each of the files to be merged: the same amount from each file, leaving enough space in memory for the merged result. Once a block is merged, append it to the output file, and repeat until all the sorted chunk files are exhausted.
This minimises the file I/O and the amount of seeking around the disk.
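A minimal sketch of that split/sort/merge idea, assuming the key is a tab-separated field and letting heapq.merge (Python 3.5+ for the key argument) do the k-way merge of the already-sorted chunks; the chunk size and the temp-file handling are illustrative choices, not part of the suggestion above:
import heapq
import itertools
import tempfile

def external_sort(in_path, out_path, sort_col, chunk_size=1_000_000):
    key = lambda line: line.split('\t')[sort_col]
    chunk_files = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))  # read one chunk of lines
            if not chunk:
                break
            chunk.sort(key=key)                            # sort the chunk in memory
            tmp = tempfile.TemporaryFile(mode='w+')
            tmp.writelines(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)
    with open(out_path, 'w') as out:
        # heapq.merge lazily merges the sorted chunk files line by line
        out.writelines(heapq.merge(*chunk_files, key=key))
    for tmp in chunk_files:
        tmp.close()
This assumes every input line ends with a newline; if the file's last line does not, it can run together with the following line in the merged output.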

I would store the file in a good relational database, index it on the field you are interested in and then read the ordered items.
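For instance, with SQLite from the Python standard library (my choice here, not something the answer specifies; a rough sketch that assumes four tab-separated columns, sorts on the second one, and uses made-up table and file names):
import csv
import sqlite3

con = sqlite3.connect('data.db')
con.execute('CREATE TABLE IF NOT EXISTS records (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT)')

# bulk-load the tab-separated file
with open('input.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    con.executemany('INSERT INTO records VALUES (?, ?, ?, ?)', reader)

# index the sort column, then read the rows back in order
con.execute('CREATE INDEX IF NOT EXISTS idx_c2 ON records (c2)')
con.commit()

with open('sorted.txt', 'w') as out:
    for row in con.execute('SELECT * FROM records ORDER BY c2'):
        out.write('\t'.join(row) + '\n')
con.close()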

Related

Read from one file and write to another in Python

I have a file with contents as given below,
to-56 Olive 850.00 10 10
to-78 Sauce 950.00 25 20
to-65 Green 100.00 6 10
If the 4th column of data is less than or equal to the 5th column, the data should be written to a second file.
I tried the following code, but only 'to-56 Olive' is saved in the second file. I can't figure out what I'm doing wrong here.
file1=open("inventory.txt","r")
file2=open("purchasing.txt","w")
data=file1.readline()
for line in file1:
items=data.strip()
item=items.split()
qty=int(item[3])
reorder=int(item[4])
if qty<=reorder:
file2.write(item[0]+"\t"+item[1]+"\n")
file1.close()
file2.close()
You're reading only one line of input. So, you can have at most one line of output.
I see that your code is a bit "old school". Here's a more "modern" and Pythonic version.
# Modern way to open files. The closing is handled cleanly
with open('inventory.txt', mode='r') as in_file, \
     open('purchasing.txt', mode='w') as out_file:
    # A file is iterable
    # We can read each line with a simple for loop
    for line in in_file:
        # Tuple unpacking is more Pythonic and readable
        # than using indices
        ref, name, price, quantity, reorder = line.split()
        # Turn strings into integers
        quantity, reorder = int(quantity), int(reorder)
        if quantity <= reorder:
            # Use f-strings (Python 3) instead of concatenation
            out_file.write(f'{ref}\t{name}\n')
I've changed your code a tiny bit; all you need to do is iterate over the lines in your file, like this:
file1=open("inventory.txt","r")
file2=open("purchasing.txt","w")
# Iterate over each line in the file
for line in file1.readlines():
# Separate each item in the line
items=line.split()
# Retrieve important bits
qty=int(items[3])
reorder=int(items[4])
# Write to the file if conditions are met
if qty<=reorder:
file2.write(items[0]+"\t"+items[1]+"\n")
# Release used resources
file1.close()
file2.close()
Here is the output in purchasing.txt:
to-56 Olive
to-65 Green

More efficient way than zipping arrays for transposing a table in Python?

I have been trying to transpose my table of 2000000+ rows and 300+ columns on a cluster, but it seems that my Python script is getting killed due to lack of memory. I would just like to know if anyone has any suggestions on a more efficient way to store my table data other than using the array, as shown in my code below?
import sys

Seperator = "\t"
m = []

f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]

for line in lines:
    m.append(line.strip().split("\t"))

for i in zip(*m):
    for j in range(len(i)):
        if j != len(i):
            print(i[j] + Seperator)
        else:
            print(i[j])
    print("\n")
Thanks very much.
The first thing to note is that you've been careless with your variables. You're loading a large file into memory as a single string, then a list of strings, then a list of lists of strings, before finally transposing said list. This will result in you storing all the data in the file three times before you even begin to transpose it.
If each individual string in the file is only about 10 characters long then you're going to need 18 GB of memory just to store that (2e6 rows * 300 columns * 10 bytes * 3 duplicates). This is before you factor in all the overhead of Python objects (~27 bytes per string object).
You have a couple of options here.
Create each new transposed row incrementally by reading over the file once for each old row and appending each new row one at a time (this sacrifices time efficiency).
Create one file for each new row and combine these row files at the end (this sacrifices disk-space efficiency, and is possibly problematic if you have a lot of columns in the initial file due to a limit on the number of open files a process may have).
Transposing with a limited number of open files
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as input:
    old_cols = input.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

with open(input_filename) as input:
    for line in input:
        parts = line.rstrip().split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(',')

# combine temp files
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as input:
            line = input.read().rstrip()[:-1] + '\n'
            output.write(line)
As the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them together.
import sys

Separator = "\t"
f = open(sys.argv[1], 'r')

for line in f:
    for i, c in enumerate(line.strip().split("\t")):
        dest = column_file[i]  # you should open 300+ file handles, one for each column
        dest.write(c)
        dest.write(Separator)

# all you need to do after that is combine the content of your "row" files
If you cannot store all of the file in memory, you can read it n times:
import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line:  # remove empty lines
                yield line.strip().split('\t')[n]

for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))
Note that an exception will be raised if the file does not have the right format. You could catch it if necessary.
When reading files, use the with construct to ensure that your file will be closed, and iterate directly over the file instead of reading the whole content first. It is more readable and more efficient.

Count of unique column values from a large CSV file using Python or PHP

I have a CSV file that is 217 GB. How can I get the count of unique column values using a Python or PHP script without a timeout?
Not sure what you mean by timeout; for big files like this it will always take a long time.
tokens = {}
idx = 0  # your desired column index

with open("your.csv") as infile:
    for line in infile:
        columns = line.split(',')
        if columns[idx] not in tokens:
            tokens[columns[idx]] = 1
        else:
            tokens[columns[idx]] += 1

print(tokens)
This processes the file line by line, so your computer doesn't crash from loading the whole 217 GB into RAM. You can try this first to see if the dictionary fits in your computer's memory. Otherwise you might want to consider splitting the file into smaller chunks in a divide-and-conquer approach.
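For what it's worth, a variant of the same idea using collections.Counter (again a sketch: it assumes a plain comma-separated file with no quoted fields, otherwise reach for the csv module, and idx is your column index):
from collections import Counter

idx = 0  # your desired column index
counts = Counter()
with open('your.csv') as infile:
    for line in infile:
        counts[line.rstrip('\n').split(',')[idx]] += 1

print(len(counts))             # number of distinct values in the column
print(counts.most_common(10))  # the ten most frequent values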
You could try to increase the field_size_limit
import csv
csv.field_size_limit(1000000000)
r = csv.reader(open('doc.csv', 'rb'))
for row in r:
    print(row)  # do the processing

How do I efficiently crossmatch two ASCII catalogs?

I have two ASCII text files with columnar data. The first column of both files is a 'name' that is consistent across both files. One file has some 6000 rows, the other only has 800. Without doing a for line in file.readlines() approach, e.g.,
with open('big_file.txt') as catalogue:
    with open('small_file.txt') as targets:
        for tline in targets.readlines()[2:]:
            name = tline.split()[0]
            for cline in catalogue.readlines()[8:]:
                if name == cline.split()[0]:
                    print cline
                    catalogue.seek(0)
                    break
is there an efficient way to return only the rows (or lines) from the larger file that also appear in the smaller file (using the 'name' as the check)?
It's okay if it is one row at a time, say with a file.write(matching_line); the idea would be to create a third file with all the info from the large file for only the objects that are in the small file.
for line in file.readlines() is not inherently bad. What's bad are the nested loops you have there. You can use a set to keep track of and check all the names in the smaller file:
s = set()
for line in targets:
    s.add(line.split()[0])
Then, just loop through the bigger file and check if the name is in s:
for line in catalogue:
    if line.split()[0] in s:
        print line
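Putting the two steps together and writing the matches to a third file, as the question asks (a sketch: the [2:] and [8:] header skips are carried over from the snippet in the question, and the output name matches.txt is an assumption):
with open('small_file.txt') as targets:
    names = {line.split()[0] for line in targets.readlines()[2:] if line.strip()}

with open('big_file.txt') as catalogue, open('matches.txt', 'w') as out:
    for i, line in enumerate(catalogue):
        if i < 8:                  # skip the catalogue header lines
            continue
        parts = line.split()
        if parts and parts[0] in names:
            out.write(line)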

Find the last occurrence of a word in a large file with Python

I have a very large text file. I want to search for the last occurrence of a specific word and then perform certain operations on the lines that follows it.
I can do something like:
if "word" in line.split():
do something
I am only interested in the last occurrence of "word" however.
Well, an easier and quicker solution would be to read the file in reverse order and then search for the first occurrence of the word.
In Python 2.6 you can do something like this (where word is the string you are looking for):
for line in reversed(open("filename").readlines()):
    if word in line:
        # Do the operations here when you find the line
Try it like this:
f = open('file.txt', 'r')
lines = f.read()
answer = lines.rfind('word')
This reads the whole file into one string, and answer is the index of the last occurrence of 'word' in it.
You may also use str.rfind
str.rfind(sub[, start[, end]])
Return the highest index in the string where substring sub is found,
such that sub is contained within s[start:end]. Optional arguments
start and end are interpreted as in slice notation. Return -1 on
failure.
If the file is hundreds of megabytes or even gigabytes in size, then you may want to use mmap so you don't have to read the entire file into memory. The rfind method finds the last occurrence of a string in the file.
import mmap

with open('large_file.txt', 'r') as f:
    # memory-map the file, size 0 means whole file
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # prot argument is *nix only

    i = m.rfind('word')  # search for last occurrence of 'word'
    m.seek(i)            # seek to the location
    line = m.readline()  # read to the end of the line
    print line

    nextline = m.readline()
Just keep calling readline() to read following lines.
If the file is extremely large (like tens of gigabytes) then you can map it in chunks with the length and offset arguments of mmap(), as sketched below.
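A rough sketch of that chunked idea (the 64 MB window and the file name are assumptions; note that the offset passed to mmap must be a multiple of mmap.ALLOCATIONGRANULARITY): map a window at the end of the file, search it with rfind, and widen the window towards the start of the file until the word turns up.
import mmap
import os

word = b'word'
filename = 'large_file.txt'
window = 64 * 1024 * 1024                    # search the last 64 MB first
gran = mmap.ALLOCATIONGRANULARITY

file_size = os.path.getsize(filename)
start = max(file_size - window, 0)
start -= start % gran                        # align the offset as mmap requires

with open(filename, 'rb') as f:
    while True:
        m = mmap.mmap(f.fileno(), file_size - start,
                      offset=start, access=mmap.ACCESS_READ)
        i = m.rfind(word)                    # last occurrence within the window
        if i != -1 or start == 0:
            break
        m.close()
        start = max(start - window, 0)       # widen the window and try again
        start -= start % gran

    if i != -1:
        m.seek(i)
        print(m.readline().decode())         # the line containing the last occurrence
    m.close()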
You can open your file, transform it into a list, reverse its order and iterate looking for your word.
with open('file.txt','r') as file_:
    line_list = list(file_)

line_list.reverse()
for line in line_list:
    if line.find('word') != -1:
        # do something
        print line
Optionally, you can specify the size of the file buffer by passing the buffer size (in bytes) as the third parameter of open. For instance: with open('file.txt', 'r', 1024) as file_:
