Python regex on multiple src to destination - python

I have been reading through thousands of posts trying to find the best solution.
I apologize if the nature of this question has been asked multiple times before.
I have a file in which I put placeholders. The file is 200 lines long, and in it there is a section with propertyNames and corresponding propertyValues. The propertyValues are placeholders that I want to find and substitute actual values for.
I think I will use the fileinput and re modules to do this, but I do not want to parse the file line by line multiple times to fill in multiple propertyValues. Instead, I was thinking it would be more efficient to have multiple search strings, each with its corresponding replacement text, so that while scanning through the lines, any instance found is replaced with its corresponding replacement.
What would be the best way to do this? Can it be done in a simple way with fileinput and re?
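For reference, the single-pass idea with re and fileinput can be sketched roughly like this (the placeholder names and the file name below are purely illustrative):

import fileinput
import re

# Hypothetical placeholder-to-value mapping; use your real propertyValues here.
replacements = {
    "@@HOSTNAME@@": "server01",
    "@@PORT@@": "8080",
}

# One combined pattern that matches any of the placeholders.
pattern = re.compile("|".join(re.escape(key) for key in replacements))

# fileinput with inplace=True rewrites the file; whatever is printed becomes the new content.
for line in fileinput.input("config.template", inplace=True):
    print(pattern.sub(lambda m: replacements[m.group(0)], line), end="")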

I would use jinja for that. It's a templating engine that allows you to do that and much more (like having for loops inside your templates, and so on).
Take a look at: http://jinja.pocoo.org/docs/dev/templates/
Of course, this would need to change the input file format. If you are allowed to do that, I think this is the way to go.
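For illustration, a minimal jinja2 sketch of what this answer describes (the template text and variable names are made up):

from jinja2 import Template

# Hypothetical template: each placeholder becomes a {{ variable }}.
template = Template("host = {{ hostname }}\nport = {{ port }}\n")

# render() substitutes every placeholder in a single pass.
print(template.render(hostname="server01", port=8080))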

As I understand your question, there are two cases.
First: Search and replace line by line
place_holders = []
find_and_replace():
    for line in file:
        for text in line:
            if text == "Target text":
                place_holders.add(text.get_place_holder)
        # the check runs after every line
        if place_holders.size != 0:
            for place_holder in place_holders:
                replace "New text" at position place_holder
            place_holders = []
Second: Search all lines then replace
find_and_replace():
    for line in file:
        for text in line:
            if text == "Target text":
                place_holders.add(text.get_place_holder)
    # the check runs only once, after all lines have been scanned
    if place_holders.size != 0:
        for place_holder in place_holders:
            replace "New text" at position place_holder
        place_holders = []
What is the difference between the two versions above?
Only how many times you ask the question "is the place_holders list empty?": the first version asks it file.number_of_lines times, while the second asks it only once. I think this should make only a very small difference to the speed of the regex.
Note that the code above is just a simple demonstration of the scenario in your problem; there is no guarantee that the regex engine actually works this way.
BUT
If you want another way to optimize the speed of your program, I suggest:
Do parallel computing,
Use a regex engine that provides JIT compilation (in case you have a complex regex).

Related

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the absolute quickest method - which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and then using nested for loops to iterate through each line, using .split() to find specific parts of each line. I have many files where I need to scan for specific things on each line, and at the moment I've only implemented a couple of features into my Qt program and it's already taking up to 5 seconds to load some things and up to 10 seconds for others. All of which is due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to measure the speed of each code change. After asking around, I've been told that using a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this it'll be much appreciated.
Thanks.
Edit: just to point out that when scanning each line the filename is unknown; ".csv" is the only thing that isn't unknown. So I basically need the regex to grab every filename before .csv, but of course without grabbing the junk before the filename.
I'm currently looking for .csv using .split('/') & .split('|'), then checking whether .csv is in the list index to grab the 'unknown' filename. Some lines will only have 1 filename whereas others will have 2+, so I need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash (note that, unlike ., a negated character class also matches newlines)
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO
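Applied to the file described in the question (the file name below is an assumption), this might look like:

import re

pattern_csv_basename = re.compile(r'[^/]*\.csv')

# Scan each line and collect every .csv basename found on it;
# findall naturally handles lines with one, two, or more filenames.
with open("zone_source_list.csv") as fh:
    for line in fh:
        names = pattern_csv_basename.findall(line)
        if names:
            print(names)   # e.g. ['dlc3.csv', 'spongebob.csv']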

To Split text based on words using python code

I have a long text like the one below. I need to split it based on certain words, say ("In", "On", "These").
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code, as I have 1000 rows in a CSV file?
As per my comment, I think a good option would be to use a regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
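For example, applied to a shorter made-up string:

import re

text = "On Monday it rained. In the evening it cleared. These things happen."

parts = re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', text)
print(parts)
# ['On Monday it rained. ', 'In the evening it cleared. ', 'These things happen.']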
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method of strings. For example:
with open(filename, 'r') as file:
    lines = file.read()
lines = lines.split('These')
# lines is now a list of strings, split wherever the string 'These' was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample Python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))

How to speed up a search in a long document using python?

I was wondering if it is possible to search in Vim using Python in order to speed up a search in a long document.
I have a text document of 140,000 lines.
I have a list (mysearches) with 115 different search patterns.
I want to put all lines with matches in a list (hits)
This is what I do now:
for i in range(0, len(mysearches)-1)
    for line in range(1, line("$"))
        let idx = match(getline(line), mysearches[i])
        if idx >= 0
            call add(hits, line)
        endif
    endfor
endfor
"remove double linenumbers:
let unduplist=filter(copy(hits), 'index(hits, v:val, v:key+1)==-1')
The problem is that this search takes over 5 minutes.
How can I adapt the above search to a Python search?
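For reference, the loop in the question translated into plain Python (joining all the patterns into one alternation and scanning the file once) might look roughly like this, assuming the patterns are valid Python regular expressions; the file name and pattern list are placeholders:

import re

mysearches = ["pattern1", "pattern2"]   # the 115 search patterns

# Combine all patterns into one alternation so the file is scanned only once.
combined = re.compile("|".join("(?:%s)" % p for p in mysearches))

hits = []
with open("document.txt") as fh:
    for lineno, line in enumerate(fh, start=1):
        if combined.search(line):
            hits.append(lineno)
# each matching line number appears only once, so no deduplication is needed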
How about this:
let pattern=join(mysearches, '\|')
let mylist = systemlist('grep -n "'.pattern.'" '. shellescape(fnamemodify(@%, ':p')). ' | cut -d: -f1')
This works by joining your patterns with \| (i.e. ORing all your different patterns together), shelling out, and using grep to process the combined pattern. grep should be pretty fast, a lot faster than Vim and possibly also faster than either Python or even Perl (of course this depends on the pattern).
The return value is a list containing all matching lines. Since we used the -n switch of grep, we receive the matching line numbers, which are in turn cut out using cut.
systemlist() then contains the output split at \n. So mylist should contain the numbers of all lines matching your pattern. This of course depends on your pattern, but if you use standard BRE or ERE (-E) or even Perl RE (the -P switch) you should be okay. Depending on the flavor of RE desired, the joining part needs to be adjusted.
Note, however, that this is basically untested; for a really robust solution one would probably add some more error handling (and possibly preprocessing of the pattern) and split the whole thing up a little so that it is easier to read.
XY problem indeed.
You can use the :vimgrep command like so:
execute "vim /\\(" . join(mysearches, "\\|") . "\\)/ %"
cwindow
I just tested with the content of the 4017-line .less file I'm working on, pasted 34 times into a new 136,579-line file, and a list of only 13 searches:
:let foo = ["margin", "padding", "width", "height", "bleu", "gris", "none", "auto", "background", "color", "line", "border", "overflow"]
It took 3 seconds to find the 47,634 matching lines, which are now conveniently listed in the quickfix window.
YMMV, of course, because the search will take more time as you add items to mysearches and make them more complex, but I'm fairly sure you'll be able to beat your current timing easily.
You could also use :grep:
execute "grep -snH " . shellescape(join(foo, '\\|')) . " %"

re.findall regex hangs or very slow

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the books themselves and filter out other stuff such as disclaimers, etc. So I have around 100 documents in my large text file (around 50 MB).
I have identified the start and end markers of the contents themselves, and decided to use a Python regex to find everything between the start and end markers. To sum it up, the regex should look for the start marker, then match everything after it, stop looking once the end marker is reached, and then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100kb sized file into it:
import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file, however, it does not do anything; the program just hangs. I tested it overnight to see if it was just slow, and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this freezing everything up? I know the texts between the delimiters are not small, but a 50 MB file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that the sequence .*, which appears more than once, is causing problems. The issue is that the regex engine tries many possible combinations of .*, leading to a result known as catastrophic backtracking.
The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:
`[^\n]*(.*)`
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression may not be the best approach, and instead to use a context-free parser (such as pyparsing), or to first break the input up into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
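One way to act on the "break the input into chunks" suggestion is to drop the single giant regex and walk the file line by line with a small state machine; the marker patterns below follow the question, while the file names are assumptions:

import re

start_marker = re.compile(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')
end_marker = re.compile(r'END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')

inside = False
with open("infile.txt", encoding="utf-8-sig") as infile, \
     open("outfile.txt", "w", encoding="utf-8-sig") as outfile:
    for line in infile:
        if not inside:
            if start_marker.search(line):
                inside = True          # skip the marker line itself
        elif end_marker.search(line):
            inside = False             # stop copying until the next start marker
        else:
            outfile.write(line)

This avoids backtracking entirely, because each regex is only ever applied to one line at a time.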
Another workaround to this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quick fix.

findall/finditer on a stream?

Is there a way to get re.findall or, better yet, re.finditer functionality applied to a stream (i.e. a file handle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further up ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
It would be possible to implement this for regexps with a known maximum match length: either no +/*, or ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks, match on those, and yield the results. You would also run the regexp on an overlapping chunk, which covers the case where a match would otherwise be cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
matched = {}
for chunk in file.stream(chunk_size):
    # calculate chunk_start
    for result in finditer(match, overlap_tail + chunk):
        if chunk_start + result.start() not in matched:
            yield result
            matched[chunk_start + result.start()] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
Just an idea, but I hope you get what I'm trying to achieve. You'd need to handle the case where the file (stream) ends, plus some other edge cases. But I think it can be done (if the maximum length of a match is limited and known).
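A runnable sketch of that idea, assuming a known maximum match length (the function and variable names are made up, and it shares the pseudocode's caveat that a match straddling the overlap boundary may be reported in truncated form):

import re

def finditer_stream(pattern, fileobj, chunk_size=1 << 16, max_match_len=256):
    # Yield regex matches from a text stream, assuming no match is longer
    # than max_match_len characters.
    regex = re.compile(pattern)
    tail = ''            # overlap carried over from the previous window
    tail_offset = 0      # absolute stream position where `tail` begins
    seen = set()         # absolute start positions already yielded
    while True:
        chunk = fileobj.read(chunk_size)
        window = tail + chunk
        for m in regex.finditer(window):
            start = tail_offset + m.start()
            if start not in seen:
                seen.add(start)
                yield m
        if not chunk:    # end of stream
            return
        tail = window[-max_match_len:]
        tail_offset += len(window) - len(tail)
        # drop start positions that can no longer recur ("delete old results")
        seen = {s for s in seen if s >= tail_offset}

It could then be used like: for m in finditer_stream(r'\d+', open('big.txt')): print(m.group()).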
