How to speed up a search in a long document using Python?

I was wondering if it is possible to search in Vim using Python in order to speed up a search in a long document.
I have a text document of 140,000 lines.
I have a list (mysearches) with 115 different search patterns.
I want to put all lines with matches in a list (hits)
This is what I do now:
for i in range(0, len(mysearches)-1)
    for line in range(1, line("$"))
        let idx = match(getline(line), mysearches[i])
        if idx >= 0
            call add(hits, line)
        endif
    endfor
endfor
"remove duplicate line numbers:
let unduplist = filter(copy(hits), 'index(hits, v:val, v:key+1)==-1')
The problem is that this search takes over 5 minutes.
How can I adapt the above search to a Python search?

How about this:
let pattern=join(mysearches, '\|')
let mylist = systemlist('grep -n "'.pattern.'" '. shellescape(fnamemodify(@%, ':p')). ' | cut -d: -f1')
This works by joining your patterns with \| (i.e. ORing all your different patterns), shelling out and using grep to process the combined pattern. Grep should be pretty fast, a lot faster than Vim and possibly also faster than either Python or even Perl (of course this depends on the pattern).
The return value is a list containing all matching lines. Since we used the -n switch of grep, we receive the matching line numbers, which are in turn cut out using cut.
systemlist() then contains the output split at \n. So mylist should contain all line numbers matching your pattern. This of course depends on your pattern, but if you use standard BRE or ERE (-E) or even Perl REs (-P switch) you should be okay. Depending on the flavor of RE desired, the joining part needs to be adjusted.
Note however that this is basically untested; for a really robust solution, one would probably add more error handling (and possibly preprocess the pattern) and split the whole thing up a bit so that it is easier to read.
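Since the question specifically asked about doing this from Python inside Vim, here is a minimal, untested sketch of the same single-pass idea using Vim's embedded Python. It assumes Vim is compiled with +python3, that mysearches is a global variable, and, importantly, that the entries of mysearches are valid Python regular expressions rather than Vim regexes; it could be run with :py3file or wrapped in a python3 << EOF block:
import re
import vim  # the "vim" module only exists inside Vim's embedded Python
# OR all patterns together, compile once, then scan the buffer a single time.
patterns = vim.eval("mysearches")
combined = re.compile("|".join(patterns))
hits = [i + 1 for i, line in enumerate(vim.current.buffer) if combined.search(line)]
# Hand the line numbers (already unique and in order) back to Vimscript.
vim.command("let hits = " + str(hits))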

XY problem indeed.
You can use the :vimgrep command like so:
execute "vim /\\(" . join(mysearches, "\\|") . "\\)/ %"
cwindow
I just tested with the content of the 4017-line .less file I'm working on, pasted 34 times into a new 136579-line file, and a list of only 13 searches:
:let foo = ["margin", "padding", "width", "height", "bleu", "gris", "none", "auto", "background", "color", "line", "border", "overflow"]
It took 3 seconds to find the 47634 matching lines which are now conveniently listed in the quickfix window.
YMMV, of course, because the search will take more time as you add items to mysearches and make them more complex, but I'm fairly sure you'll be able to beat your current timing easily.
You could also use :grep:
execute "grep -snH " . shellescape(join(foo, '\\|')) . " %"

Related

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the absolute quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and then using nested for loops to iterate through each line, using .split() to find specific parts of each line. I have many files where I need to scan for specific things on each line, and so far I've only implemented a couple of features into my Qt program and it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all of which is due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to show the speed of each code change. After asking around, I've been told that using a regex is the best option for me, as it would remove the need for many of my for loops. Problem is, I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this it'll be much appreciated.
Thanks.
Edit: just to point out that when scanning each line the filename is unknown. ".csv" is the only thing that isn't unknown. So I basically need the regex to grab every filename before .csv, but of course without grabbing the crap before the filename.
I'm currently looking for .csv using .split('/') & .split('|'), then checking if .csv is in the list index to grab the 'unknown' filename. Some lines will only have 1 filename whereas others will have 2+, so I need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO

Regular expression in Python module looking for asterisk character is extremely slow

My working environment:
OS: Windows 10 (64 bits)
RAM: 32 GB
Processor: Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50 GHz
Python version: 3.7.4 (64 bits)
Problem description:
I'm working on a RESTful API log file. Users of this API can specify variableName:value in their query URLs and, based on the variable name and the specified value, the search engine behind this API returns the result. There is also a wildcard functionality allowing queries to be created using patterns which can have one of the following forms:
variableName:va*
variableName:*lue
variableName:*alu*
variableName:*
The purpose is to read a log file, then extract and count the number of lines containing at least one occurrence of one of the above-mentioned patterns. This gives us an estimate of what percentage of our users use the wildcard functionality when querying our API.
For our analysis, it doesn't really matter how many occurrences of different variables (or even of the same variable) appear on each line in the file (each line in the log file = one user query). As soon as one occurrence of one of the above-mentioned patterns has been detected, the line is selected and our counter is incremented, indicating that the wildcard functionality has been used in the query.
For the purpose of this analysis, I've developed a Python module with the following regular expression:
import re

regexp_wildcard_asterisk = r"".join(
    [
        r"[a-zA-Z][a-zA-Z0-9]*([:]|%3A)",
        r"(([*]|%2A)|[^=*]+([*]|%2A)|([*]|%2A)[^=*]+|",
        r"([*]|%2A)[^=*]+([*]|%2A))"
    ]
)
regexp_wildcard_asterisk_prog = re.compile(
    regexp_wildcard_asterisk, re.IGNORECASE
)
Queries are actually HTTP URLs, which is why you can see %3A and %2A in the above regular expression: depending on the encoding on the client side, : and * can also arrive encoded as %3A and %2A respectively.
Then all I need to do is read the log file line by line inside a loop and check if there is a match:
import csv

with open(
    src_file,
    "r",
    encoding="UTF-8"
) as srcfile_desc:
    csv_reader = csv.reader(srcfile_desc, delimiter="|")
    wildcard_asterisk_func_counter = 0
    for tokens in csv_reader:
        # The pattern matching is done on the 5th column of each
        # line, that's why I've written tokens[4]
        if regexp_wildcard_asterisk_prog.search(tokens[4]):
            wildcard_asterisk_func_counter += 1
Well, this does the job, but it is extremely slow! I have to admit that my log files are sometimes quite huge, but the size of the file still doesn't explain the very long execution time of this program. The last time, I ran the above program on a log file with only 890 lines and roughly 240,000 characters on each line (only a few lines with 1,100,000 characters). It took more than 24 hours, and when I checked it was still running.
Now I know that regular expressions can indeed impact performance, yet I've done pattern matching on other API log files with millions of lines, and sometimes millions of characters on each line, looking for other characters such as ?, [, ], {, }, and the execution time never exceeded a few hours. So I thought maybe there is some bug in the definition of my regular expression looking for the asterisk.
Reading my code, could you tell me where you think I've made a mistake (or mistakes)?
I updated the regex to remove capture groups and to not do anything that would attempt to search beyond line boundaries (I am only searching individual lines anyway). I am also using a regex that looks for the minimal match that will recognize a wildcard query. I timed my code twice using the OP's regex and the one below and found no appreciable difference for these 4 test cases. I removed the re.IGNORECASE flag as it was not serving any useful purpose.
I tried different variations for matching zero or more characters that were neither newline nor one of the asterisk characters (* or %2A) and there wasn't any appreciable timing change.
import re

# matches enough of the query to detect whether there was a wildcard:
regexp_wildcard_asterisk = r"""(?x)   # Verbose mode
    [a-zA-Z][a-zA-Z0-9]+              # Match variable
    (?::|%3A)                         # Match ':'
    (?:(?!\*|%2A).)*                  # Match zero or more non-'*' non-newline characters
    (?:\*|%2A)                        # Match '*'
    """
regexp_wildcard_asterisk_prog = re.compile(
    regexp_wildcard_asterisk
)

lines = [
    'variableName:va*',
    'variableName:*lue',
    'variableName:*alu*',
    'variableName:*',  # 1000 lines of these
]

from time import time

t1 = time()
for _ in range(300):
    wildcard_asterisk_func_counter = 0
    for line in lines:
        if regexp_wildcard_asterisk_prog.search(line):
            wildcard_asterisk_func_counter += 1
t2 = time()
print(t2 - t1)
See Regex demo
The fastest way I found is
^.*\w+(?::|%3A)[^=*]*?\*
Demo & explanation (3 matches/83 steps/2 ms)
to be compared with yours on the same example text:
Demo & explanation (22 matches/476 steps/4 ms)

Python regex on multiple src to destination

I have been reading through thousands of posts trying to find the best solution.
I apologize if the nature of this question has been asked multiple times before.
I have a file that I put placeholders in. The file is 200 lines long, and in this file there is a section where I have propertyNames and corresponding propertyValues. The propertyValues are placeholders that I want to find and substitute actual values for.
I think I will use the fileinput and re modules to do this, but I do not want to parse line by line multiple times to fill in multiple propertyValues. Instead, I was thinking it would be more efficient to have multiple search strings, each with its corresponding replacement text, so that while scanning through the lines, any instance found is replaced with its corresponding replacement.
What would be the best way to do this? Can it be done in a simple way with fileinput and re?
I would use jinja for that. It's a templating engine that allows you to do that and much more (like having for loops inside your templates, and so on).
Take a look at: http://jinja.pocoo.org/docs/dev/templates/
Of course, this would require changing the input file format. If you are allowed to do that, I think this is the way to go.
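For example, a minimal sketch with Jinja2 (the file names and property names here are made up), assuming the placeholders in the file are rewritten in Jinja's {{ ... }} syntax:
from jinja2 import Template
# The template file uses {{ propertyName }} placeholders instead of the
# current custom markers.
with open("config.template") as f:
    template = Template(f.read())
rendered = template.render(db_host="localhost", db_port=5432)
with open("config.properties", "w") as f:
    f.write(rendered)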
As I understand your question, there are two cases.
First: Search and replace line by line
$place_holders = []
find_and_replace():
    for $line in $file:
        for $text in $line:
            if $text == "Target text":
                $place_holders.add($text.get_place_holder)
        if $place_holders.size != 0:
            for $place_holder in $place_holders:
                replace "New text" at position $place_holder
            $place_holders = []
Second: Search all lines, then replace
find_and_replace():
    for $line in $file:
        for $text in $line:
            if $text == "Target text":
                $place_holders.add($text.get_place_holder)
    if $place_holders.size != 0:
        for $place_holder in $place_holders:
            replace "New text" at position $place_holder
        $place_holders = []
What is the difference between the two versions above?
Just how many times you ask the question "is the place_holders list empty or not?": the first case asks it file.number_of_lines times, while the second case asks it only once. I think this has only a very small effect on the speed of the regex part.
Note that the code above is just a simple demonstration of the scenario in your problem; there is no guarantee that a regex engine will work in exactly this way.
BUT
If you want another way to optimize the speed of your program, I suggest:
Do parallel computing,
Use a regex engine which provides JIT compilation (in case you have a complex regex).
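For the fileinput/re route the question asks about, a single pass with one combined pattern is also possible. Here is a minimal sketch; the placeholder markers and file names are hypothetical:
import re
# Hypothetical placeholder -> value mapping.
replacements = {
    "@@DB_HOST@@": "localhost",
    "@@DB_PORT@@": "5432",
}
# One alternation that matches any placeholder; the callback looks up the
# replacement, so the file is scanned only once.
pattern = re.compile("|".join(re.escape(k) for k in replacements))
with open("template.properties") as src:
    text = pattern.sub(lambda m: replacements[m.group(0)], src.read())
with open("output.properties", "w") as dst:
    dst.write(text)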

To print only one occurrence of matching pattern using pcregrep

Is there any option in pcregrep that allows me to print only one occurrence of the matched string pattern? I came across the --match-limit option, but pcregrep is not recognizing it. Is there any specific version that supports this option?
I assume that --match-limit=1 prints only one occurrence of the matched pattern.
You can also let me know about other possible ways. I am executing the pcregrep command from a Python script via Python's commands module.
Before we look into --match-limit, let's review two options that almost do what you want to do.
Option 1. When you only want to know if you can find a match in a file, but you don't care what the match is, you can use the -l option like so:
pcregrep -l '\d\d\d' test.txt
where \d\d\d is the pattern and test.txt contains the strings.
Option 2. To count the number of matches, use
pcregrep -c '\d\d\d' test.txt
This may be the closest we can get to what you want to do.
What is --match-limit?
--match-limit=1 does work, but it doesn't do what you want it to do.
From the documentation:
The --match-limit option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic
example is a pattern that uses nested unlimited repeats. Internally,
PCRE uses a function called match() which it calls repeatedly
(sometimes recursively). The limit set by --match-limit is imposed on
the number of times this function is called during a match, which has
the effect of limiting the amount of backtracking that can take place.
So --match-limit is about memory, not about the number of matches.
Let's try this out:
If you make a file called test.txt and add some lines with three digits, like so:
111
123
456
Then running pcregrep --match-limit=1 '\d\d\d' test.txt will match all these lines.
But if you run pcregrep --match-limit=1 '\d{3}' test.txt you will get an error that the resource limit was exceeded.
Looking at the full documentation, I don't see any option to limit the number of matches. Of course you could design your regex to do so.
For more info
You probably know this, but for the short documentation type pcregrep --help
The full documentation can be downloaded in the pcre package from pcre.org
For usage examples, see grep in PCRE
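Since the question mentions calling pcregrep from a Python script via the commands module (which only exists in Python 2), here is a minimal sketch of the same call with subprocess on Python 3, using the -c counting option from above as an example:
import subprocess
# Passing the pattern as its own argument avoids any shell quoting issues.
result = subprocess.run(
    ["pcregrep", "-c", r"\d\d\d", "test.txt"],
    capture_output=True,
    text=True,
)
# pcregrep exits non-zero when nothing matches, so don't use check=True here.
match_count = int(result.stdout.strip() or 0)
print(match_count)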

Unable to search for names which contain three 7s in random order using AWK/Python/Bash

I need to find names which contain three 7s in random order.
My attempt
First, we need to find names which do not contain a seven:
ls | grep [^7]
Then, we could remove these matches from the whole space
ls [remove] ls | grep [^7]
The problem is that my pseudo-code quickly starts to repeat itself.
How can you find names which contain three 7s in random order using AWK/Python/Bash?
[edit]
The name can contain any number of letters, and it contains exactly three 7s.
I don't understand the part about "random order". How do you differentiate between the "order" when it's the same token that repeats? Is "a7b7" different from "c7d7" in the order of the 7s?
Anyway, this ought to work:
ls *7*7*7*
It just lets the shell solve the problem, but maybe I didn't understand properly.
EDIT: The above is wrong; it includes cases with more than three 7s, which is not wanted. Assuming this is bash, and extended globbing is enabled, this works:
ls *([^7])7*([^7])7*([^7])7*([^7])
This reads as "zero or more characters which are not sevens, followed by a seven, followed by zero or more characters that are not sevens", and so on. It's important to understand that the asterisk is a prefix operator here, operating on the expression ([^7]) which means "any character except 7".
I'm guessing you want to find files that contain exactly three 7's, but no more. Using GNU grep with the extended regexp switch (-E):
ls | grep -E '^([^7]*7){3}[^7]*$'
Should do the trick.
Basically that matches 3 occurrences of "any number of non-7s followed by a 7", then only non-7s to the end of the string (the ^ and $ at the beginning and end of the pattern anchor it to the whole name).
Something like this, using 7 as the awk field separator so that names with exactly three 7s split into exactly four fields:
printf '%s\n' * | awk -F7 'NF==4'
A Perl solution:
$ ls | perl -ne 'print if (tr/7/7/ == 3)'
3777
4777
5777
6777
7077
7177
7277
7377
7477
7577
7677
...
(I happen to have a directory with 4-digit numbers. 1777 and 2777 don't exist. :-)
Or instead of doing it in a single grep, use one grep to find files with 3-or-more 7s and another to filter out 4-or-more 7s.
ls -f | egrep '7.*7.*7' | grep -v '7.*7.*7.*7'
You could move some of the work into the shell glob with the shorter
ls -f *7*7*7* | grep -v '7.*7.*7.*7'
though if there are a large number of files which match that pattern then the latter won't work because of built-in limits to the glob size.
The '-f' in the 'ls' is to prevent 'ls' from sorting the results. If there is a huge number of files in the directory then the sort time can be quite noticeable.
This two-step filter process is, I think, more understandable than using the [^7] patterns.
Also, here's the solution as a Python script, since you asked for that as an option.
import os
for filename in os.listdir("."):
    # exactly three 7s in the name
    if filename.count("7") == 3:
        print(filename)
This will handle a few cases that the shell commands won't, like (evil) filenames which contain a newline character. Though even here the output in that case would likely still be wrong, or at least unprepared for by downstream programs.
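And as a regex-based Python variant of the grep -E answer above (the same anchored "exactly three 7s" idea, applied per file name):
import os
import re
# Three groups of "any non-7s then a 7", followed by only non-7s to the end.
three_sevens = re.compile(r"^(?:[^7]*7){3}[^7]*$")
for name in os.listdir("."):
    if three_sevens.match(name):
        print(name)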
