How to find filenames with a specific extension using regex? - python

How can I grab 'dlc3.csv' and 'spongebob.csv' from the string below via the quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only raise more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and using nested for loops to iterate through each line, calling .split() to find specific parts of it. I have many files where I need to scan for specific things on each line, and so far I've only implemented a couple of features in my Qt program, yet it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all because of the nested loops. I've looked at where to use range and where not to, and where to use enumerate, and I use time.time() and logging.info() to measure how each code change affects speed. After asking around, I've been told that a regex is the best option for me, as it would remove the need for many of my for loops. Problem is, I have no clue how to use regex. Of course I plan on learning it, but if someone could help me out with this it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line, the filename is unknown; ".csv" is the only thing that is known. So I basically need the regex to grab every filename ending in .csv, but without grabbing the path junk before the filename.
I'm currently looking for .csv using .split('/') and .split('|'), then checking whether .csv is in a list index to grab the 'unknown' filename. Some lines will have only one filename, whereas others will have two or more, so the regex needs to account for this too.

You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash (unlike ., a negated character class also matches newlines, though that doesn't matter within a single line)
* - Zero or more of them
\. - A literal dot. (The backslash is necessary because an unescaped dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO
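If you're scanning the whole 6,000-line file from the question, one pass with the compiled pattern should be enough. A minimal sketch (the filename infile.csv and the loop structure are assumptions, not your actual code):

import re

pattern_csv_basename = re.compile(r'[^/]*\.csv')

# One pass over the file; findall returns every match in a line,
# so lines with one, two, or more filenames are all handled.
with open('infile.csv') as f:
    for line in f:
        for filename in pattern_csv_basename.findall(line):
            print(filename)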

Related

Two regexes combined with | (OR) work together; when run independently both return empty lists

I have written two regexes that I was originally combining with | (either/or). I need them to run separately, but what should be a simple change is not working the way I expected. I have tested both regexes with online tools, and they both work 100%. When run in the code, they both return [].
For reference, stringSoup is an HTML string.
Here was the original:
re.findall(r"(\(#([^)\s]+)\))|//.*instagram\.com/(\w+.*?)/(?:p)/g")
I need to run each re separately like so:
re.findall(r"(\(#([^)\s]+)\))/g", stringSoup)
re.findall(r"//.*instagram\.com/(\w+.*?)/(?:p)/g", stringSoup)
The first regex is to find usernames written as (#username); the second is to find usernames written as instagram.com/username.
The original combined regex was working fine.
After separation, both of these return an empty [].
I'm not really certain I understand your question and some of the inputs, but I made a sample to hopefully re-create what you're trying to do:
\(#(?P<username1>[^)]+)\) # username is after '(#' and is everything up until ')'
| # or
.*instagram\.com\/(?P<username2>[^\/]+)\/p # username is between 'instagram.com/' and the next '/'
You can remove the top half or the bottom half of the pattern and see that each regex will only match its specific item. Note that using something like [^\/] might be a bit crude and you can make it more specific, but the above should give you what you need in a general sense.
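A runnable version of that verbose pattern, with a made-up sample for stringSoup (note that # must be escaped inside a re.VERBOSE pattern, since an unescaped # would start a comment):

import re

pattern = re.compile(r"""
    \(\#(?P<username1>[^)]+)\)               # username is after '(#', up until ')'
    |                                        # or
    instagram\.com/(?P<username2>[^/]+)/p    # username between 'instagram.com/' and '/p'
    """, re.VERBOSE)

stringSoup = 'follow (#alice) or https://instagram.com/bob/p/xyz123'
for m in pattern.finditer(stringSoup):
    print(m.group('username1') or m.group('username2'))  # -> alice, then bob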

Identify a dot in an aiml pattern in Python

In a project of mine, I am trying to identify file names in a given sentence. For example, given "Could you please open abc.txt", I need to fetch the keyword "open" in order to know the kind of action that is expected, and I also need to identify the file name, for obvious reasons. A simple AIML category for this is:
<aiml>
  <category>
    <pattern>* OPEN *</pattern>
    <template>open <star index="2"/></template>
  </category>
</aiml>
Here, in the template tag, I am just giving an information about the operation to be performed and the file name. My python code on the other hand takes care of performing the required action.
Now the problem is the '.' character. That character divides the sentence into two parts (in the example above, "Could you please open abc" and "txt"), which are individually matched against the defined AIML patterns. In my case, though, I don't want the '.' character to act as a delimiter: basically, I want to identify file names that may or may not include an extension. Could anyone please help me out with this?
Thanks in advance!
By default, AIML allows multi-sentence input. This means full stops, exclamation marks and question marks are treated as separators between sentences. For example, if you asked:
Good morning. My name is George. How are you today?
this is interpreted as 3 separate inputs. Normally this is a good thing, as it means the AIML interpreter can re-use existing patterns for GOOD MORNING, MY NAME IS *, and HOW ARE YOU *.
But in your case that's not helping, as the full stop before the extension causes unwanted splitting. Sentence splitting is done in a pre-processing stage before the input is sent to the interpreter, and some AIML interpreters have a configuration file that lets you define the sentence-splitting characters, so you may simply be able to remove the full stop from the list of separators.
A better approach may be to pre-process the input and replace the full stop in filenames with the word DOT; you can then detect this in your pattern * OPEN *.
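A minimal pre-processing sketch of that idea (the filename regex here is an assumption about what counts as a filename, and it would also hit things like version numbers):

import re

# Replace the dot in anything that looks like 'name.ext' with ' DOT '
# before handing the sentence to the AIML interpreter.
def protect_filenames(sentence):
    return re.sub(r'(\w+)\.(\w+)', r'\1 DOT \2', sentence)

print(protect_filenames("Could you please open abc.txt"))
# -> Could you please open abc DOT txt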
As a final comment, * OPEN * is a very wide-ranging pattern: it will also be invoked if someone says WHAT TIME IS THE SHOP OPEN TODAY, or gives any other input containing the word OPEN surrounded by text.

Python regex on multiple src to destination

I have been reading through thousands of posts trying to find the best solution.
I apologize if this question has been asked multiple times before.
I have a file in which I put placeholders. The file is 200 lines, and in it there is a section with propertyNames and corresponding propertyValues. The propertyValues are placeholders that I want to find and substitute with actual values.
I think I will use the fileinput and re modules to do this, but I do not want to parse the file line by line multiple times to fill in multiple propertyValues. Instead, I was thinking it would be more efficient to have multiple search strings, each with a corresponding replacement text, and, while scanning through the lines, replace any instance found with its corresponding replacement.
What would be the best way to do this? Can it be done in a simple way with fileinput and re?
I would use Jinja for that. It's a templating engine that lets you do exactly that and much more (like having for loops inside your templates, and so on).
Take a look at: http://jinja.pocoo.org/docs/dev/templates/
Of course, this would require changing the input file format. If you are allowed to do that, I think this is the way to go.
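A minimal sketch with jinja2 (the template and output filenames and the property names are made up):

from jinja2 import Template

# 'config.tmpl' would contain placeholders like {{ property_value1 }}.
with open('config.tmpl') as f:
    template = Template(f.read())

rendered = template.render(property_value1='foo', property_value2='bar')

with open('config.out', 'w') as f:
    f.write(rendered)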
As I understand your question, there are two cases.
First: search and replace line by line

place_holders = []
find_and_replace():
    for line in file:
        for text in line:
            if text == "Target text":
                place_holders.add(text.get_place_holder)
        if place_holders.size != 0:
            for place_holder in place_holders:
                replace "New text" at position place_holder
            place_holders = []

Second: search all lines, then replace

find_and_replace():
    for line in file:
        for text in line:
            if text == "Target text":
                place_holders.add(text.get_place_holder)
    if place_holders.size != 0:
        for place_holder in place_holders:
            replace "New text" at position place_holder
        place_holders = []
What is the difference between the two versions above?
Just how many times you ask the question "is the place_holders list empty or not?": the first version asks file.number_of_lines times, while the second asks only once. I think this should have only a very small effect on speed.
Note that the code above is just a simple demonstration of the scenario in your problem; there is no guarantee that a regex engine will work this way.
BUT
If you want another way to optimize the speed of your program, I suggest:
Do the computation in parallel,
Use a regex engine that provides JIT compilation (in case you have a complex regex).
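For the fileinput/re part of the question, a common single-pass approach is to combine all placeholders into one alternation and let a callback choose the replacement. A minimal sketch (the placeholder names and filenames are made up):

import re

replacements = {'@HOST@': 'db01.example.com', '@PORT@': '5432'}

# One alternation of all placeholders, escaped so they match literally;
# the callback picks the replacement for whichever one matched.
pattern = re.compile('|'.join(re.escape(k) for k in replacements))

with open('config.in') as f:
    text = pattern.sub(lambda m: replacements[m.group(0)], f.read())

with open('config.out', 'w') as f:
    f.write(text)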

re.findall regex hangs or very slow

My input file is a large txt file of concatenated texts I got from an open text library. I am now trying to extract only the content of the books themselves and filter out other stuff such as disclaimers. I have around 100 documents in my large text file (around 50 MB).
I have identified the start and end markers of the contents, and decided to use a Python regex to find everything between the start and end marker. To sum it up, the regex should look for the start marker, match everything after it, stop looking once the end marker is reached, and then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100 kB file into it:
import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file, however, it does not do anything; the program just hangs. I tested it overnight to see if it was just slow, and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex over smaller distances, the script runs fine and fast.
My question now is: why does this freeze everything? I know the texts between the delimiters are not small, but a 50 MB file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that the sequence .*, which appears more than once, is causing problems. The issue is that the engine tries many possible combinations of .*, leading to a behaviour known as catastrophic backtracking.
The usual solution is to replace the . with a character class that is much more specific, usually the thing you are trying to terminate the first .* with. Something like:
[^\n]*(.*)
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular-expression solution may not be the best approach, and to use either a grammar-based parser (such as pyparsing), or to first break the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
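A sketch of that chunking idea applied to the question's code: split on the start marker, then cut each piece at the end marker, so no single lazy .*? has to span the whole file (marker patterns adapted from the question; filecontents and outfile as defined there):

import re

start_re = re.compile(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK[^\n]*\n')
end_re = re.compile(r'END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')

# Split the corpus at each start marker; everything before the first
# marker is preamble and is dropped by the [1:].
for piece in start_re.split(filecontents)[1:]:
    m = end_re.search(piece)
    if m:  # keep only the text up to the end marker
        outfile.write(piece[:m.start()])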
Another workaround for this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quick fix.
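A quick illustration of the bounded version (the sample text is made up):

import re

# The {1,100} bounds cap how far each part can expand, which limits
# how much backtracking the engine can do on pathological input.
bounded = re.compile(r'[abc]{1,100}.{1,100}[def]{1,100}')
print(bounded.search('aaaXXXddd').group())  # -> aaaXXXddd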

findall/finditer on a stream?

Is there a way to get the re.findall or, better yet, re.finditer functionality applied to a stream (i.e. a filehandle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further up ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
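As a practical aside (not a true stream solution, but it avoids eagerly reading the whole file into memory): mmap objects are bytes-like, so re can search a memory-mapped file directly. A sketch with a made-up pattern and filename:

import mmap
import re

pattern = re.compile(rb'ERROR:[^\n]*')  # bytes pattern, since mmap yields bytes

with open('big.log', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # the OS pages the file in on demand as the engine scans it
        for match in pattern.finditer(mm):
            print(match.group().decode())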
It would be possible to implement this for regexps with a known maximum match length: either no + or * quantifiers, or ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on an overlapping tail carried over from the previous chunk, to cover the case where a match would otherwise be cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
chunk_start = 0
matched = {}
for chunk in file.stream(chunk_size):
    for result in finditer(match, overlap_tail + chunk):
        # positions are relative to overlap_tail + chunk, so shift them
        # back to absolute file offsets before de-duplicating
        abs_start = chunk_start - len(overlap_tail) + result.start()
        if abs_start not in matched:
            yield result
            matched[abs_start] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
    chunk_start += len(chunk)
Just an idea, but I hope you get what I'm trying to achieve. You'd also need to consider that the file (stream) can end, plus some other edge cases. But I think it can be done, if the maximum length of the regular expression's matches is limited (known).
