I have a very big file with millions of paths to various executables on Windows systems. A simple example would be the following:
C:\windows\ccmcache\1d\Deploy-Application.exe
C:\WINDOWS\ccmcache\7\Deploy-Application.exe
C:\windows\ccmcache\2o\Deploy-Application.exe
C:\WINDOWS\ccmcache\6\Deploy-Application.exe
C:\WINDOWS\ccmcache\15\Deploy-Application.exe
C:\WINDOWS\ccmcache\m\Deploy-Application.exe
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe
C:\windows\ccmcache\2r\Deploy-Application.exe
C:\windows\ccmcache\1l\Deploy-Application.exe
C:\windows\ccmcache\2s\Deploy-Application.exe
or
C:\Users\user23452345\temp\test\1\Another1-Application.exe
C:\Users\user1324asdf\temp\Another-Applicatiooon.exe
C:\Users\user23452---5\temp\lili\Another-Application.exe
C:\Users\user23hkjhf_5\temp\An0ther-Application.exe
As a human, I can recognize that these strings are similar and match them fairly easily with some regex in code. My problem, however, is finding these patterns in the first place: there are far too many of them, they are completely unknown to me, and they change frequently.
My goal is to write a Python script that finds these similar strings with a degree of certainty and groups them for me.
Which methods, libraries, keywords etc. should I look into to solve this problem?
One possible way is to approach this by calculating the distance between strings. For that, you could use the textdistance lib.
Hope this helps!
Edit:
Two starting points to get more familiarized with the subject:
https://en.wikipedia.org/wiki/Edit_distance
https://en.wikipedia.org/wiki/Levenshtein_distance
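To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance described in the links above. It is not the textdistance implementation; the function names levenshtein and similarity are made up for this illustration, and a library will be much faster on millions of paths.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to the range 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


first = r'C:\windows\ccmcache\1d\Deploy-Application.exe'.lower()
second = r'C:\WINDOWS\ccmcache\7\Deploy-Application.exe'.lower()
print(levenshtein(first, second), similarity(first, second))
```

Lower-casing before comparing keeps the differing case of "windows" from counting as seven edits.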
Try fuzzywuzzy, a fuzzy string matcher. It makes a difference whether you keep the strings as they are or lower-case them first:
from fuzzywuzzy import fuzz
import itertools
# Note: these are ordinary (non-raw) string literals, so '\1' is read as an
# escape character; use raw strings (r'...') when pasting real Windows paths.
lines = [
    'C:\windows\ccmcache\1d\Deploy-Application.exe',
    'C:\WINDOWS\ccmcache\m\Deploy-Application.exe',
    'user5323\A-different-Application.bat',
]
for line1, line2 in itertools.combinations(lines, r=2):
    case_match = fuzz.ratio(line1, line2)
    insensitive_case_match = fuzz.ratio(line1.lower(), line2.lower())
    print(line1[:10], '...', line1[:-10])
    print(line2[:10], '...', line2[:-10])
    print(case_match, insensitive_case_match)
    print()
C:\windows ... C:\windows\ccmcached\Deploy-Appli
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
80 95
C:\windows ... C:\windows\ccmcached\Deploy-Appli
user5323\A ... user5323\A-different-Appli
42 45
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
user5323\A ... user5323\A-different-Appli
40 45
One fairly straightforward way is to check "how much" a pair of strings differs. Like so:
import difflib
from collections import defaultdict
grouping_requirement = 0.75 # (0;1), the closer to 1, the stronger the equality needs to be to be grouped
s = r'''C:\windows\ccmcache\1d\Deploy-Application.exe
C:\WINDOWS\ccmcache\7\Deploy-Application.exe
C:\windows\ccmcache\2o\Deploy-Application.exe
C:\WINDOWS\ccmcache\6\Deploy-Application.exe
C:\WINDOWS\ccmcache\15\Deploy-Application.exe
C:\WINDOWS\ccmcache\m\Deploy-Application.exe
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe
C:\windows\ccmcache\2r\Deploy-Application.exe
C:\windows\ccmcache\1l\Deploy-Application.exe
C:\windows\ccmcache\2s\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe
C:\Users\user1324asdf\temp\Another-Applicatiooon.exe
C:\Users\user23452---5\temp\lili\Another-Application.exe
C:\Users\user23hkjhf_5\temp\An0ther-Application.exe'''
groups = defaultdict(list)
def match_ratio(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()
for line in set(s.splitlines()):
    for group in groups:
        if match_ratio(group, line) > grouping_requirement:
            groups[group].append(line)
            break
    else:
        # No existing group was similar enough, so this line starts a new group.
        groups[line].append(line)
for group in groups.values():
    print(', '.join(group))
    print()
The output of this little application is:
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe, C:\WINDOWS\ccmcache\m\Deploy-Application.exe, C:\windows\ccmcache\1l\Deploy-Application.exe, C:\WINDOWS\ccmcache\15\Deploy-Application.exe, C:\WINDOWS\ccmcache\7\Deploy-Application.exe, C:\WINDOWS\ccmcache\6\Deploy-Application.exe, C:\windows\ccmcache\2s\Deploy-Application.exe, C:\windows\ccmcache\1d\Deploy-Application.exe, C:\windows\ccmcache\2o\Deploy-Application.exe, C:\windows\ccmcache\2r\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe, C:\Users\user23hkjhf_5\temp\An0ther-Application.exe, C:\Users\user1324asdf\temp\Another-Applicatiooon.exe, C:\Users\user23452---5\temp\lili\Another-Application.exe
At the top of the code snippet there is a constant, grouping_requirement, which I arbitrarily set to 0.75. If you reduce that value towards 0.0, more paths will be grouped together; if you raise it towards 1.0, fewer paths will be grouped. Good luck!
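The effect of the threshold can be tried out with a small function version of the snippet above. This is only a sketch: group_similar and its normalise parameter are names invented here, and lower-casing before the comparison is an added assumption, not part of the original answer.

```python
import difflib


def group_similar(lines, threshold=0.75, normalise=str.lower):
    """Greedily put each line into the first group whose representative is similar enough."""
    groups = {}  # representative line -> members of its group
    for line in sorted(set(lines)):  # sorted so the result is reproducible
        for representative in groups:
            ratio = difflib.SequenceMatcher(
                None, normalise(representative), normalise(line)).ratio()
            if ratio > threshold:
                groups[representative].append(line)
                break
        else:  # no existing group matched: the line starts a new group
            groups[line] = [line]
    return list(groups.values())


paths = [
    r'C:\windows\ccmcache\1d\Deploy-Application.exe',
    r'C:\WINDOWS\ccmcache\7\Deploy-Application.exe',
    r'C:\Users\user1324asdf\temp\Another-Applicatiooon.exe',
    r'C:\Users\user23hkjhf_5\temp\An0ther-Application.exe',
]
for threshold in (0.0, 0.75, 1.0):
    print(threshold, len(group_similar(paths, threshold)))
```

Each line is compared against one representative per group, so the cost grows with the number of groups rather than with the number of lines squared.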
Related
I'm trying to compare two files containing key-value pairs. Here is an example of such pairs:
ORIGINAL:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
FILE 2:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
When I run both files through difflib.Differ().compare(), I get the following output, which is wrong:
'- z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n', '+ z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n'
If you look closely you'll see that the line from both files is extremely similar with some minor differences. For some reason difflib doesn't recognize the similar characters and just gives the whole line as a difference.
Does anyone have a solution for this problem?
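If the goal is to find which numbers differ rather than which characters, one option that sidesteps difflib is to parse both lines and compare the values with a tolerance. The sketch below assumes every line has the form "name = [float, float, ...]"; parse_values, differing_indices and the default tolerance are hypothetical choices for this illustration, and the sample lines are shortened from the data above.

```python
import ast
import math


def parse_values(line):
    """Turn a line such as 'z = [1.0e-06, 2.0e-06]' into a (name, list of floats) pair."""
    name, _, literal = line.partition('=')
    return name.strip(), [float(v) for v in ast.literal_eval(literal.strip())]


def differing_indices(line_a, line_b, rel_tol=1e-12):
    """Indices at which the two lists differ by more than the relative tolerance."""
    _, values_a = parse_values(line_a)
    _, values_b = parse_values(line_b)
    if len(values_a) != len(values_b):
        raise ValueError('lists have different lengths')
    return [i for i, (a, b) in enumerate(zip(values_a, values_b))
            if not math.isclose(a, b, rel_tol=rel_tol)]


original = 'z = [1.304856846498851e-06, 1.2584599050733914e-06, 1.0044863475043152e-06]'
file_2 = 'z = [1.3048568464988513e-06, 1.2584599050733912e-06, 1.0044863475043152e-06]'
print(differing_indices(original, file_2))
```

Passing rel_tol=0.0 turns the check into an exact comparison, which reports every value whose floating-point representation differs at all.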
I'm trying to do some fuzzy matching on some OCR results, and I want to be able to factor in common OCR errors. In particular, I'm matching streets to a database of streets. I figured out how to down-weight common single-character substitution errors using the weighted-levenshtein package, but it seems to only work on single characters, when many of the most common errors are things like "li" to "h".
Right now, "Mam" matches most closely to "MAY ST," when I'd really like it to match to "MAIN ST" instead. I'd like to be able to build something in that knows that "IN" and "M" often correspond because "in" gets read as "m" by the OCR.
Here's the current code I'm working with (I'm down-weighting inserts because some of the streets I'm working with are abbreviated or missing "St", "Ave", etc.):
import numpy as np
from weighted_levenshtein import lev, osa, dam_lev
def lratio(str1, str2):
    insert_costs = 0.5 * np.ones(128, dtype=np.float64)
    delete_costs = np.ones(128, dtype=np.float64)
    substitute_costs = np.ones((128, 128), dtype=np.float64)
    substitute_costs[ord('O'), ord('0')] = 0.25
    substitute_costs[ord('0'), ord('O')] = 0.25
    substitute_costs[ord('I'), ord('T')] = 0.5
    substitute_costs[ord('T'), ord('I')] = 0.5
    ldistance = lev(str1, str2, insert_costs=insert_costs, delete_costs=delete_costs, substitute_costs=substitute_costs)
    return (1.0 - float(ldistance) / float(len(str1) + len(str2))) * 100.0
I don't think there's a way to modify weighted-levenshtein for multi-character substitution. But if there were, it would be great. And I bet there's a package out there that has this capacity--possibly a package that has a library of common errors already built in.
Any ideas?
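For illustration, the kind of multi-character substitution asked about here can be written as a small dynamic program in plain Python. This is a sketch only, not a feature of the weighted-levenshtein package: ocr_distance, the multi_costs table and the example costs are all assumptions made up for this example, and the flat per-operation costs replace the per-character cost arrays used above.

```python
def ocr_distance(source, target, multi_costs, insert_cost=0.5,
                 delete_cost=1.0, substitute_cost=1.0):
    """Edit distance from source to target with extra multi-character substitutions.

    multi_costs maps (source_fragment, target_fragment) to the cost of replacing
    one fragment by the other in a single step.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] is the cheapest way to turn source[:i] into target[:j].
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + delete_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            best = min(
                dist[i - 1][j] + delete_cost,
                dist[i][j - 1] + insert_cost,
                dist[i - 1][j - 1] + (0.0 if same else substitute_cost),
            )
            # Try every fragment pair that ends exactly at positions i and j.
            for (src, dst), cost in multi_costs.items():
                if (len(src) <= i and len(dst) <= j
                        and source[i - len(src):i] == src
                        and target[j - len(dst):j] == dst):
                    best = min(best, dist[i - len(src)][j - len(dst)] + cost)
            dist[i][j] = best
    return dist[-1][-1]


# Hypothetical cost table: the OCR output "M" often stands for a real "IN".
costs = {('M', 'IN'): 0.25}
ocr_text = 'Mam'.upper()
for street in ('MAIN ST', 'MAY ST'):
    print(street, ocr_distance(ocr_text, street, costs))
```

With this table the cheap "M" to "IN" step makes "MAIN ST" the closer candidate. The inner loop over the table runs for every cell, so a large table of common errors would call for indexing the fragments by their last character.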
I would like to go through a gene and get a list of 10 bp long sequences containing the exon/intron borders from each feature.type == 'mRNA'. It seems like I need to use CompoundLocation and the locations used in 'join', but I cannot figure out how to do it or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile(r"\[\d+:\d+\]")  # raw string, so the backslashes reach the regex engine unchanged
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    start, end = jct.replace('[', '').replace(']', '').split(':')
    # seq is the parent sequence (e.g. record.seq). Slicing never raises
    # IndexError; a negative lower bound would instead count from the end of
    # the sequence, so clamp it at 0 (e.g. where start = 0).
    start_20_20 = seq[max(0, int(start) - 20):int(start) + 20]
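Put together, the steps above might look like the following sketch. It works on a plain Python string so that it runs without Biopython; junction_windows and flank are names chosen for this example, and the toy sequence and location string are made up.

```python
import re


def junction_windows(location, seq, flank=20):
    """Return (position, surrounding sequence) for every exon start and end in a location string."""
    windows = []
    for start, end in re.findall(r'\[(\d+):(\d+)\]', location):
        for position in (int(start), int(end)):
            lower = max(0, position - flank)  # clamp so a negative index never wraps around
            windows.append((position, seq[lower:position + flank]))
    return windows


toy_seq = 'ACGT' * 10
toy_location = 'join{[0:8](+), [20:30](+)}'
for position, window in junction_windows(toy_location, toy_seq, flank=3):
    print(position, window)
```

The first start and the last end are the ends of the feature rather than exon/intron borders, so they may need to be dropped from the result.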
I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result, and it works well. But how can this calculation be implemented in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile(r'.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
# A single formatted argument prints the same way on Python 2 and Python 3.
print('Line coverage: {:.2%}'.format(covered_lines / total_lines))
print('Branch coverage: {:.2%}'.format(covered_branches / total_branches))
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is partly because it is easier to design your code when you can think in terms of the things you are modelling, which is closely related to functional programming philosophies, and partly because it can eliminate a few bugs where you think you are doing one thing but are actually doing another.
Related to the previous point: the first thing I do is convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions, one for each of those tasks, and you apply them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
  "d>11/23d>34/89d",
  "e>25/65e>13/25e",
  "f>36/92f>19/76"
  ]
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)
-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)
main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
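Since the question was asked about Python, the same three passes (convert, filter, fold) can be sketched there as well. This is a rough Python 3 translation rather than part of the Haskell answer: the Coverage tuple, PATTERN, convert and add are names chosen for this illustration, and the regex assumes the toy data format in which the package name directly precedes the first '>'.

```python
import re
from collections import namedtuple
from functools import reduce

Coverage = namedtuple(
    'Coverage', 'name covered_lines total_lines covered_branches total_branches')
PATTERN = re.compile(r'(\w+)>(\d+)/(\d+).*>(\d+)/(\d+)')


def convert(lines):
    """Pass 1: keep only the lines that parse, as Coverage records."""
    matches = (PATTERN.match(line) for line in lines)
    return [Coverage(m.group(1), *map(int, m.groups()[1:])) for m in matches if m]


def add(a, b):
    """Combine two records by summing their counters."""
    return Coverage(a.name + b.name, *(x + y for x, y in zip(a[1:], b[1:])))


datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

relevant = filter(lambda c: c.name in core_pkgs, convert(datafile))  # pass 2
total = reduce(add, relevant, Coverage('', 0, 0, 0, 0))              # pass 3
print('Line coverage: {:.2%}'.format(total.covered_lines / total.total_lines))
print('Branch coverage: {:.2%}'.format(total.covered_branches / total.total_branches))
```

Here functools.reduce plays the role of foldl', and the built-in filter corresponds to the Haskell filter.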
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
totals = np.zeros(4, dtype=int)
for pkg in core_pkgs:
    ptn = re.compile(r'.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)  # map() takes the function first
    statsList = [list(map(int, match.groups())) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList, dtype=int).reshape(-1, 4)
    totals += stats.sum(axis=0)  # axis=0 sums each column over the matched lines
covered_lines, total_lines, covered_branches, total_branches = totals
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))
(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
  according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))
(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))
(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))
(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})
(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])
I have crawled txt files from different websites, and now I need to glue them into one file. Many lines from the various websites are similar to each other. I want to remove the repetitions.
Here is what I have tried:
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source and write line by line to the same file. The result is that, even if I run it for the same file multiple times, every line is always appended to the destination file.
EDIT:
I have tried the code from the answer. It's still very slow.
Even if I minimize the IO, I still need O(n^2) comparisons, which hurts once there are 1000+ lines. I have 10,000 lines per file on average.
Any other ways to remove the duplicates?
Here is a short version that does minimal IO and cleans up after itself.
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
with open('%s.txt' % destname, 'a+') as destfile:
    # 'a+' keeps the existing contents (unlike 'w+', which truncates the file).
    # Rewind first so that lines written by earlier runs of this script are
    # read back in and are not duplicated.
    destfile.seek(0)
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print(ratio)
                    print(line)
                    print(known)
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines each time from the file, we save them to a set, which we use for comparison against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates, rather than fuzzy matches, you could use a simple lookup in a set, with complexity O(1), making your entire solution have O(n) complexity.
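For the exact-duplicate case just mentioned, a minimal sketch of the set-based approach could look like this (unique_lines is a name made up for the example):

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving the original order."""
    seen = set()
    for line in lines:
        if line not in seen:  # average O(1) membership test
            seen.add(line)
            yield line


sample = ['this is data\n', 'and more data\n', 'this is data\n']
print(list(unique_lines(sample)))
```

Running it as a cheap first pass also shrinks the input that the slower fuzzy comparison has to handle.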
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is, however, both out of scope for a Stack Overflow answer and beyond my expertise. It is an active research area, so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
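Because difflib documents quick_ratio() and real_quick_ratio() as upper bounds on ratio(), they can also serve as cheap filters in front of the exact check without changing the result. A small sketch, with is_similar and its threshold default being names assumed for this example:

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """True when ratio() exceeds the threshold, checking the cheap upper bounds first."""
    matcher = SequenceMatcher(None, a, b)
    # Each test is an upper bound on the next one, so `and` stops as soon as a
    # cheap bound already shows that the pair cannot reach the threshold.
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)


print(is_similar('this is data\n', 'this is data\n'))
print(is_similar('this is data\n', 'radically different thing\n'))
```

The number of comparisons is still quadratic; only the cost of each comparison that fails early goes down.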
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data
##bindresult.txt
##--------------
##a website line
##this is data
##and more data
from difflib import SequenceMatcher
sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()
destfile = open('bindresult.txt', 'a+')
destfile.seek(0)  # 'a+' may start at the end of the file; rewind before reading
destlines = destfile.readlines()
has_matches = {k: False for k in sourcelines}
# Check every source line against the destination lines and stop at its first match.
for s_line in sourcelines:
    for d_line in destlines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break
for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)
destfile.close()
This will add the line "radically different thing" to the destination file.