python similar string removal from multiple files

I have crawled txt files from different websites, and now I need to merge them into one file. Many lines from the various websites are similar to each other, and I want to remove the repetitions.
Here is what I have tried:
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source and write line by line to the same file. The result is that even if I run it on the same file multiple times, every line is still appended to the destination file.
EDIT:
I have tried the code from the answer. It's still very slow.
Even if I minimize the I/O, I still need O(n^2) comparisons, which hurts once there are 1000+ lines. I have 10,000 lines per file on average.
Are there any other ways to remove the duplicates?

Here is a short version that does minimal I/O and cleans up after itself.
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
# 'a+' keeps the existing contents (unlike 'w+', which truncates the file)
# and appends every write to the end of the file.
with open('%s.txt' % destname, 'a+') as destfile:
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines.
    destfile.seek(0)  # 'a+' may start positioned at the end of the file
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print(ratio)
                    print(line)
                    print(known)
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines each time from the file, we save them to a set, which we use for comparison against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates, rather than fuzzy matches, you could use a simple lookup in a set, with average complexity O(1), making your entire solution have O(n) complexity.
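For the exact-duplicate case described above, a minimal sketch could look like the following. The function name `unique_lines` is only an illustrative choice and is not part of the original code.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:  # set membership is O(1) on average
            seen.add(line)
            yield line


merged = list(unique_lines(["a\n", "b\n", "a\n", "c\n", "b\n"]))
print(merged)
```

Because every line is looked up once and stored at most once, the whole pass is linear in the number of lines, but it only catches lines that are identical character for character.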
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is, however, both out of scope for a Stack Overflow answer and beyond my expertise. It is an active research area, so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
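One way to use the cheaper methods without losing accuracy is to treat them as filters: the difflib documentation describes both quick_ratio() and real_quick_ratio() as upper bounds on ratio(), so a pair that fails a cheap check cannot pass the expensive one. The helper name `is_similar` and the default threshold are assumptions for this sketch.

```python
import difflib


def is_similar(a, b, threshold=0.8):
    """Return True if ratio() exceeds threshold, trying cheap upper bounds first."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # `and` short-circuits, so ratio() only runs when both cheaper
    # upper-bound estimates are already above the threshold.
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)


print(is_similar("hello world\n", "hello world!\n"))
print(is_similar("abc", "xyz"))
```

This does not change the O(n^2) number of comparisons; it only makes most of the non-matching comparisons cheaper.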

Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.

Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data
##bindresult.txt
##--------------
##a website line
##this is data
##and more data
from difflib import SequenceMatcher
sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()
destfile = open('bindresult.txt', 'a+')
destfile.seek(0)  # 'a+' may start at the end of the file; rewind before reading
destlines = destfile.readlines()
has_matches = {k: False for k in sourcelines}
# Source lines form the outer loop so that `break` only stops the search
# for the current source line once a match has been found.
for s_line in sourcelines:
    for d_line in destlines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break
for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)
destfile.close()
This will add the line "radically different thing" to the destination file.

Related

"pygame.error: Out of memory" when loading level with a large area [duplicate]

Is there a memory limit for Python? I've been using a Python script to calculate the average values from a file that is at least 150 MB in size.
Depending on the size of the file, I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)
    count = 0
    for j in list_of_lines:
        count +=1
    for k in range(0,count):
        list_of_lines[k].remove('\n')
    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')

(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second. Hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years, most of them not too major. This is so that if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical RAM and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError: # Python 3
    from itertools import zip_longest as izip_longest
GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w') # left in, but nothing written
for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print(' {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
    file_write.write('\n')
file_write.close()
mutation_average.close()

You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
It is better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
This is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removing of \ns etc. for you.
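As a rough illustration of the csv suggestion, the sketch below streams a tab-separated file and keeps only one running total per column in memory. It is written for Python 3, the function name `column_averages` is made up for this example, and it assumes every data row has the same number of numeric fields (with an optional trailing tab, as the question's `remove('\n')` hints at).

```python
import csv


def column_averages(path):
    """Stream a tab-separated file and return the mean of each column."""
    totals = []
    row_count = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # A trailing tab leaves an empty last field; drop empty fields.
            values = [float(field) for field in row if field.strip()]
            if not values:
                continue  # skip blank lines
            if not totals:
                totals = [0.0] * len(values)
            for index, value in enumerate(values):
                totals[index] += value
            row_count += 1
    return [total / row_count for total in totals]
```

Memory use depends on the number of columns rather than the number of rows, so the same function should cope with a 150 MB file and a 20 GB file alike.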

Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On Jython 2.5 it crashes earlier:
239000 [MiB]
I can probably configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys
sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read the file line by line.

No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u: # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
        pass

Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1

Compare all the CSV files in a folder and print duplicate rows

I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))

What about something like:
import csv
def duplines(csv_least_cols, csv_most_cols):
    rowset = set()
    with open(csv_least_cols) as csv1:
        r = csv.reader(csv1)
        csv1_cols = next(r)
        for row in r:
            rowset.add(tuple(row))
    with open(csv_most_cols) as csv2:
        dr = csv.DictReader(csv2)
        for drow in dr:
            refcols = tuple(drow[c] for c in csv1_cols)
            if refcols in rowset: yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
    out_start = '{},{} ('.format(csv_least, csv_most)
    for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
        out_middle = []
        for c, v in zip(c1cols, refvals):
            out_middle.append('{}:{}'.format(c, v))
        out_end = []
        for c in c2dict:
            if c in c1cols: continue
            out_end.append('{}:{}'.format(c, c2dict[c]))
        out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
        print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).

Python - Reducing Import and Parse Time for Large CSV Files

My first post:
Before beginning, I should note I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows and larger likely to come). The two limitations that I've run into repeatedly have been runtime and memory (32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? I am splitting the file as I import and performing interim summaries due to memory limitations and using pandas for the summarization:
Parsing and Summarization:
def ParseInts(inString):
    try:
        return int(inString)
    except:
        return None
def TextToYearMo(inString):
    try:
        return 100*inString[0:4]+int(inString[5:7])
    except:
        return 100*inString[0:4]+int(inString[5:6])
def ParseAllElements(elmValue,elmPos):
    if elmPos in [0,2,5]:
        return elmValue
    elif elmPos == 3:
        return TextToYearMo(elmValue)
    else:
        if elmPos == 18:
            return ParseInts(elmValue.strip('\n'))
        else:
            return ParseInts(elmValue)
def MakeAndSumList(inList):
    df = pd.DataFrame(inList, columns = ['x1','x2','x3','x4','x5',
                                         'x6','x7','x8','x9','x10',
                                         'x11','x12','x13','x14'])
    return df[['x1','x2','x3','x4','x5',
               'x6','x7','x8','x9','x10',
               'x11','x12','x13','x14']].groupby(
                   ['x1','x2','x3','x4','x5']).sum().reset_index()
Function Calls:
def ParsedSummary(longString,delimtr,rowNum):
    keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]
    #Do some other stuff that takes very little time
    return [pse.ParseAllElements(longString.split(delimtr)[i],i) for i in keepColumns]
def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        enumFile = enumerate(f)
        listEnumFile = set(enumFile)
        for lineCount, l in enumFile:
            pass
    maxSplit = math.floor(lineCount / 10) + 1
    counter = 0
    Summary = pd.DataFrame({}, columns = ['x1','x2','x3','x4','x5',
                                          'x6','x7','x8','x9','x10',
                                          'x11','x12','x13','x14'])
    for counter in range(0,10):
        startRow = int(counter * maxSplit)
        endRow = int((counter + 1) * maxSplit)
        includedRows = set(range(startRow,endRow))
        listOfRows = [ParsedSummary(row,delimtr,rownum)
                      for rownum, row in listEnumFile if rownum in includedRows]
        Summary = pd.concat([Summary,pse.MakeAndSumList(listOfRows)])
        listOfRows = []
        counter += 1
    return Summary
(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size-range in <5 mins (though I am right at its 2GB lim). I'd hope I can get comparable results in Python - presently I'm estimating ~30 min run time for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.
Edit: Updated parsing code. Was able to shave off five minutes (est. runtime at 25m) by changing some conditional logic to try/except. Also - runtime estimate doesn't include pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.

If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's from_file and load impressively fast.
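To make the "use the standard csv module" suggestion concrete, here is a hedged sketch that lets csv.reader do the splitting and sums selected columns per key in a single pass, so the file never has to be held in memory. It targets Python 3; the function name `summarize`, its parameters, and the choice to skip unparseable values are assumptions for illustration. It also assumes there is no header row and that every row is long enough for the requested column indexes.

```python
import csv
from collections import defaultdict


def summarize(path, key_columns, value_columns, delimiter=","):
    """Sum value_columns for each distinct key_columns tuple while streaming."""
    sums = defaultdict(lambda: [0] * len(value_columns))
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            key = tuple(row[i] for i in key_columns)
            running = sums[key]
            for slot, column in enumerate(value_columns):
                try:
                    running[slot] += int(row[column])
                except ValueError:
                    pass  # treat unparseable values as missing
    return dict(sums)
```

Whether this beats a pandas-based approach on a given file is something to measure rather than assume; the sketch is mainly meant to show that no hand-written line splitting is needed.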

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300 MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except that a few lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.

Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300 MB files easily fit into memory.

This is a little bit of a naive implementation but will deal with unsorted data:
import csv
file1_dict = {}
file2_dict = {}
# Values are stored as tuples so they can be concatenated with the key tuple below.
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])
with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.items():
        if key in file2_dict:
            #deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key+val+('Same',))
            else:
                writer.writerow(key+file2_dict[key]+('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key+val+('Removed',))
    #deal with added keys!
    for key, val in file2_dict.items():
        writer.writerow(key+val+('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory ... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added and modified cases to make sense:
#assuming both files are sorted by columns 1 and 2
#(Python 2 code: on Python 3 use the built-in map instead of imap and next(gen2) instead of gen2.next())
import csv
import datetime
from itertools import imap
def str2date(text):  # renamed from `in`, which is a reserved word
    return datetime.date(*map(int, text.split('-')))
def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val
with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key+gen1val+('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key+gen2val+('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key>gen2key:
                        writer.writerow(gen2key+gen2val+('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key+gen1val+('Removed',))
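For comparison, here is one possible Python 3 sketch of the same sorted-merge idea that also handles either file running out first. It is not the answer's code: the names `diff_sorted` and `write_diff` are made up for this example, and it assumes both inputs are sorted by their first two columns as plain strings (ISO dates such as 2008-01-01 sort correctly that way).

```python
import csv


def diff_sorted(old_rows, new_rows, key_len=2):
    """Yield (status, row) for two row iterables sorted by their first key_len fields."""
    old_iter, new_iter = iter(old_rows), iter(new_rows)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        old_key = tuple(old[:key_len]) if old is not None else None
        new_key = tuple(new[:key_len]) if new is not None else None
        if new is None or (old is not None and old_key < new_key):
            yield "Removed", old          # key exists only in the old file
            old = next(old_iter, None)
        elif old is None or new_key < old_key:
            yield "Added", new            # key exists only in the new file
            new = next(new_iter, None)
        else:                             # same key in both files
            yield ("Same" if old == new else "Modified"), new
            old, new = next(old_iter, None), next(new_iter, None)


def write_diff(old_path, new_path, out_path):
    """Write every row of the comparison to out_path with its status appended."""
    with open(old_path, newline="") as f1, open(new_path, newline="") as f2, \
            open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for status, row in diff_sorted(csv.reader(f1), csv.reader(f2)):
            writer.writerow(list(row) + [status])
```

Only one row from each file is held in memory at a time, so this variant should keep working when the files grow beyond what fits in RAM, provided they are sorted first.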

Most efficient way to concatenate and rearrange files

I am reading from several files. Each file is divided into 2 pieces: first a header section of a few thousand lines, followed by a body of a few thousand. My problem is that I need to concatenate these files into one file where all the headers are at the top, followed by the bodies.
Currently I am using two loops: one to pull out all the headers and write them, and the second to write the body of each file (I also include a tmp_count variable to limit the number of lines loaded into memory before dumping to file).
This is pretty slow: about 6 min for a 13 GB file. Can anyone tell me how to optimize this, or if there is a faster way to do this in Python?
Thanks!
Here is my code:
def cat_files_sam(final_file_name,work_directory_master,file_count):
    final_file = open(final_file_name,"w")
    if len(file_count) > 1:
        file_count=sort_output_files(file_count)
    # only for # headers
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("#"):
                if tmp_count == 1000000:
                    final_file.writelines(tmp_list)
                    tmp_list = []
                    tmp_count = 0
                tmp_list.append(line)
                tmp_count += 1
            else:
                final_file.writelines(tmp_list)
                break
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("#"):
                continue
            if tmp_count == 1000000:
                final_file.writelines(tmp_list)
                tmp_list = []
                tmp_count = 0
            tmp_list.append(line)
            tmp_count += 1
        final_file.writelines(tmp_list)
    final_file.close()

How fast would you expect it to be to move 13 GB of data around? This problem is I/O bound and not a problem with Python. To make it faster, do less I/O. This means that you are either (a) stuck with the speed you've got or (b) should retool later elements of your toolchain to handle the files in place rather than requiring one giant 13 GB file.

You can save the time it takes the 2nd time to skip the headers, as long as you have a reasonable amount of spare disk space: as well as the final file, also open (for 'w+') a temporary file temp_file, and do:
import shutil
hdr_list = []
bod_list = []
dispatch = {True: (hdr_list, final_file),
            False: (bod_list, temp_file)}
for bowtie_file in file_count:
    with open(os.path.join(work_directory_master,bowtie_file)) as f:
        for line in f:
            L, fou = dispatch[line[0]=='#']
            L.append(line)  # append the line itself, not the file object
            if len(L) == 1000000:
                fou.writelines(L)
                del L[:]
# write final parts, if any
for L, fou in dispatch.values():  # the (list, file) pairs are the dict values
    if L: fou.writelines(L)
temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)
This should enhance your program's performance. Fine-tuning that now-hard-coded 1000000, or even completely doing away with the lists and writing each line directly to the appropriate file (final or temporary), are other options you should benchmark (but if you have unbounded amounts of memory, then I expect that they won't matter much -- however, intuitions about performance are often misleading, so it's best to try and measure!-).

There are two gross inefficiencies in the code you meant to write (which is not the code presented):
1. You are building up huge lists of header lines in the first major for block instead of just writing them out.
2. You are skipping the headers of the files again in the second major for block line by line when you've already determined where the headers end in (1). See file.seek and file.tell.
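The two points above could be sketched roughly as follows, in Python 3. The function name `concat_headers_then_bodies` and its parameters are illustrative, not taken from the question. Files are opened in binary mode and read with readline() so that tell() can be called between lines. Unlike the question's second loop, this treats everything after the first non-header line as body, even if a later line starts with "#".

```python
import shutil


def concat_headers_then_bodies(input_paths, output_path, header_prefix=b"#"):
    """Write all leading header lines first, then every body, to output_path."""
    body_offsets = []
    with open(output_path, "wb") as out:
        # Pass 1: write header lines straight away and remember where each body starts.
        for path in input_paths:
            with open(path, "rb") as src:
                while True:
                    line_start = src.tell()
                    line = src.readline()
                    if not line or not line.startswith(header_prefix):
                        break
                    out.write(line)
                body_offsets.append(line_start)
        # Pass 2: seek directly to each body and copy it in large chunks.
        for path, offset in zip(input_paths, body_offsets):
            with open(path, "rb") as src:
                src.seek(offset)
                shutil.copyfileobj(src, out)
```

The headers are never buffered in a list, and the second pass does no per-line work at all, though the total amount of data read and written stays the same, so the job remains I/O bound.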
