In a project of mine, I am trying to identify file names in a given sentence. For example, "Could you please open abc.txt", so I need to fetch the keywords "open" in order to know the kind of action that is expected and I also need to identify the file name, for obvious reasons. A simple AIML tag for this is:
<aiml>
<category>
<pattern>* OPEN *</pattern>
<template>open <star index="2"/></template>
</category>
</aiml>
Here, in the template tag, I am just passing along the operation to be performed and the file name. My Python code, on the other hand, takes care of performing the required action.
Now the problem is the '.' character. Using that character divides the sentence into 2 parts, (in case of the example I mentioned above, the 2 sentences would be "Could you please open abc" and "txt") which are individually mapped to any of the aiml tags defined. But, in my case I don't want the '.' character to act as a delimiter. Basically, I want to identify file names that may or may not include an extension. Could anyone please help me out with this?
Thanks in advance!
By default AIML allows multi sentence input. This means full stops, exclamation marks and question marks are treated as separators between sentences. For example if you asked:
Good morning. My name is George. How are you today?
this is interpreted as 3 separate inputs. Normally this is a good thing as it means the AIML interpreter can re-use existing patterns for GOOD MORNING, MY NAME IS *, HOW ARE YOU *.
But in your case that's not helping as the full-stop before the extension is causing unwanted splitting. Depending on your AIML interpreter, sentence splitting is done in a pre-processing stage before sending the input to the interpreter. Some AIML interpreters have a configuration file that lets you define the sentence splitting characters, so you may simply be able to remove the full stop from the list of separators.
A better approach may be to pre-process the filenames and replace the full stop with the word DOT. You can then detect this in your pattern * OPEN *.
As a final comment, * OPEN * is a very wide-ranging pattern. It will also be invoked if someone says WHAT TIME IS THE SHOP OPEN TODAY, or any other input with the word OPEN in it surrounded by text.
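The pre-processing idea can be sketched in Python, applied before the input reaches the AIML interpreter. This is only a minimal sketch and not part of any AIML library: the helper names protect_filename_dots and restore_filename_dots are hypothetical, and it assumes that a full stop with a word character on both sides belongs to a file name.

```python
import re

# Assumption: a dot with a word character on each side is part of a file name.
FILENAME_DOT = re.compile(r"(?<=\w)\.(?=\w)")
# Matched case-insensitively in case the interpreter changes the case of the input.
DOT_WORD = re.compile(r"\s+DOT\s+", re.IGNORECASE)

def protect_filename_dots(sentence):
    """Replace in-word dots with the word DOT so they no longer split sentences."""
    return FILENAME_DOT.sub(" DOT ", sentence)

def restore_filename_dots(response):
    """Turn ' DOT ' in the interpreter's response back into a literal dot."""
    return DOT_WORD.sub(".", response)

prepared = protect_filename_dots("Could you please open abc.txt")
# prepared is "Could you please open abc DOT txt"
restored = restore_filename_dots("open abc DOT txt")
# restored is "open abc.txt"
```

Note that this rule also rewrites decimals such as 3.14, so it may need tightening for inputs that contain numbers.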
How can I grab 'dlc3.csv' and 'spongebob.csv' from the string below via the absolute quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and then using nested for loops to iterate through each line, using .split() to find specific parts in each line. I have many files where I need to scan specific things on each line. At the moment I've only implemented a couple of features in my Qt program, and it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all of which is due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to show the speed of each code change. After asking around, I've been told that using a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line the filename is unknown. ".csv" is the only thing that is known. So I basically need the regex to grab every filename before .csv, but of course without grabbing the rest of the path before the filename.
I'm currently looking for .csv using .split('/') and .split('|'), then checking whether .csv is in each list item to grab the 'unknown' filename. Some lines will have only 1 filename whereas others will have 2 or more, so I need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
csv - The literal characters csv
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO
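Since the question is about scanning thousands of lines, here is a minimal sketch of applying the same pattern once per line with a single compiled regex instead of nested split() loops. The name CSV_BASENAME and the dictionary layout are only illustrative choices, and the second sample line is made up to show a line without any .csv entry.

```python
import re

# Compile once, outside the loop, and reuse for every line.
CSV_BASENAME = re.compile(r'[^/]*\.csv')

lines = [
    "4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv",
    # Hypothetical line with no .csv file name in it.
    "4919, fx,fx,weapon/none,1.0,1.0,|sp/none|weapon",
]

# Map each line number to the file names found on that line.
found = {}
for number, line in enumerate(lines, start=1):
    names = CSV_BASENAME.findall(line)
    if names:
        found[number] = names

# found is {1: ['dlc3.csv', 'spongebob.csv']}
```

When reading from a real file, the list can be replaced by iterating over the open file object directly.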
I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the outcome, where the names are on the same line as the dialogs,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I am not sure whether it will work for longer dialogs.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialog, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2.
The subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by another plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two sections remain distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces such as {4,6}, written without a space after the comma) or optional spaces (?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
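To show how the indentation-based idea could feed the two-column dataset, here is a minimal sketch. It assumes a layout in which the speaker name is indented by 7 to 14 spaces, the dialogue by 4 to 6 spaces, and actions are not indented at all; the sample text is hand-made to follow that assumed layout and is not the real downloaded file. The pattern is a slightly tightened variant of the one above (anchored with ^ and re.MULTILINE).

```python
import re

# Hand-made sample following the assumed layout (name: 10 spaces, dialogue: 5).
sample = (
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
    "\n"
    "He leans in.\n"
    "\n"
    "          SOCIAL WORKER\n"
    "     I understand. I just want to make\n"
    "     sure you're keeping up with it.\n"
)

# Group 1: the speaker line; group 2: one or more dialogue lines that follow it.
pattern = re.compile(r"^ {7,14}(\S.*)\n((?: {4,6}\S.*\n)+)", re.MULTILINE)

rows = []
for match in pattern.finditer(sample):
    speaker = match.group(1).strip()
    # Join the wrapped dialogue lines and drop their indentation.
    spoken = " ".join(line.strip() for line in match.group(2).splitlines())
    rows.append((speaker, spoken))

# rows[0] is ('ARTHUR', "No. I just,-- some of it's personal. You know?")
```

The resulting list of (speaker_name, line_spoken) tuples can be passed to whatever builds the dataset; the unindented action line is simply never matched.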
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters. I did this to avoid false positives in special cases, but you might wanna change it. Also check what kind of characters might occur in a speaker name: dots, hyphens, lower case letters and so on.)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (this avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression, ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.
I am trying to extract all the latex commands from a tex file. I have to use Python for this. I tried to extract the latex commands in a list using Re module.
The problem is that this list does not contain the latex commands whose name includes special characters (such as \alpha*, \a', \#, \$, +, :, \; etc). It only contains the latex commands that consist of letters.
I am presently using the re.match Python command. I already know the starting index of '\', which is at self.i.
The example LaTeX code string could be:
\documentclass[envcountsame,envcountchap]{svmono}
match_text = re.match(r"[\w]+", search_string[self.i + 1:])
I am able to extract 'documentclass'. But suppose there is another command like:
"\abstract*[alpha]{beta}"
"\${This is a latex document}"
"\:"
How do I extract only 'abstract*', '$', ':' from these strings?
I am new to Python and tried various approaches, but am not able to extract all these command names. If there is a general python Regex that can handle all these cases, it would be useful.
NOTE: A book called 'The Not So Short introduction to LaTeX' defines that the format of LaTeX commands can be of three types -
FORMATS:
They start with a backslash \ and then have a name consisting of letters only. Command names are terminated by a space, a number or any other ‘non-letter.’
They consist of a backslash and exactly one non-letter.
Many commands exist in a ‘starred variant’ where a star is appended to the command name.
Here's the exact translation of your format specification:
\\(?:[^a-zA-Z]|[a-zA-Z]+)\*?
non-letter: [^a-zA-Z]
or letters: [a-zA-Z]+
starred variant: \*?
If your format description is accurate, this should do it. Unfortunately I don't know LaTeX so I'm not sure it's 100% OK.
From the feedback in the comments, it turns out the star is applicable only to letter commands, and there can be some other terminating characters as well. The final regex is:
\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)
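As a quick check, here is a minimal sketch that applies this final regex to the example strings from the question. The names LATEX_COMMAND and command_name are only illustrative, and the sketch assumes each string starts at the backslash, as in the question's self.i setup.

```python
import re

# The final regex from above: a backslash, then either one non-letter
# or a run of letters optionally followed by *, = or '.
LATEX_COMMAND = re.compile(r"\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)")

def command_name(text):
    """Return the command name at the start of text, without the backslash."""
    match = LATEX_COMMAND.match(text)
    return match.group(0)[1:] if match else None

samples = [
    r"\documentclass[envcountsame,envcountchap]{svmono}",
    r"\abstract*[alpha]{beta}",
    r"\${This is a latex document}",
    r"\:",
]
names = [command_name(s) for s in samples]
# names is ['documentclass', 'abstract*', '$', ':']
```

To collect every command in a whole file rather than at a known index, LATEX_COMMAND.findall(text) can be used instead of match.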
LaTeX is a TeX macro package, and as such, everything that applies to TeX also applies to LaTeX.
The question you ask is a difficult one, as TeX is not a regular language. If you only want to deal with commands, you have to check for the regex \\([A-Za-z]+ *|.|\n), bearing in mind that TeX has active characters, that is, characters whose mere presence acts like a command. If you want to deal with command parameters, you'll have to check the individual command definitions, because TeX is a Polish notation language (operators or commands are prefix, with a variable number of positional parameters). For parameter extraction, TeX uses brace matching, which is context-free and not regular, so you'll need a complete parser for that.
TeX allows you to redefine all character classes, so you can redefine the digits to act as letters and be usable in command names (so, for example, \a23 can be a valid command name). This happens inside package definitions, where @ is used as a letter in order to make commands that are inaccessible to users but available inside the package.
Eliminating LaTeX markup is a difficult thing for this reason, and you can only achieve partial results. There are many different problems to be solved (what to do with \include directives, what to do with valid text in parameters such as those of \chapter or \footnote, whether you want the index included, etc.).
Also, you have to be careful: if you try to eliminate command parameters, you'll also be eliminating part of your text (for example the text in \footnote, \abstract, \title, \chapter{...}, etc.). I don't know the effect you actually want to get, so I cannot give you more info in this respect.
I am doing a directory listing and need to get all directory names that follow the patterns Feb14-2014 and 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. As you can see, I want to match just the directories that follow the patterns %b%d-%Y and %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to parse the rest because my regex skills are poor, sorry. So I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory listing is separated by a new line
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
Will match both cases and will match the text only if it is uninterrupted (by dots or anything else for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except dot you may replace the final \w+ with [^.]+.
You need the multiline modifier (/m, which is re.MULTILINE in Python) for this to work, otherwise the $ will match the end of the string only.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory.
Of course you may expand this regex to include (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this to keep it readable. I would also suggest you store the month names in a Python list and insert them dynamically into your regex for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.
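Putting the suggestions from both answers together, here is a minimal Python sketch that checks each directory name on its own with re.fullmatch, so no anchors or multiline flag are needed. The names MONTHS, DIR_NAME and the sample list are illustrative assumptions; in practice the names might come from something like os.listdir().

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Build the month alternation from the list so it stays easy to maintain.
month_alternatives = "|".join(MONTHS)
# First alternative: %b%d-%Y. Second: %d%m%Y- followed by text without dots.
DIR_NAME = re.compile(
    r"(?:" + month_alternatives + r")\d{2}-\d{4}|\d{8}-[^.]+"
)

# Illustrative directory names.
names = ["Feb14-2014", "14022014-sometext", "14022014-sometext.more", "Foo14-2014"]

# fullmatch only succeeds when the whole name fits the pattern.
matching = [name for name in names if DIR_NAME.fullmatch(name)]
# matching is ['Feb14-2014', '14022014-sometext']
```

This only checks the shape of the name; it does not verify that the digits form a real date, which datetime.strptime could do if that matters.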
I am new to Python and trying to find out how to match a sentence with variable words,
for example 'The file test.bed in successfully uploaded'.
In the above sentence, the file name would change (it could be sample.png) and the rest of the words would stay the same.
Can anybody let me know the best way to match the sentence using a regular expression?
Thanks
If you just want to match anything there:
r'The file (.+?) in successfully uploaded'
The . means any character, and the + means one or more of the preceding.
The ? means to do it non-greedily, so if you have two sentences in a row, like "The file foo.bar in successfully uploaded. The file spam.eggs in successfully uploaded.", it'll match "foo.bar", and then "spam.eggs", rather than just finding one match "foo.bar in successfully uploaded. The file spam.eggs". You may not need it in your application.
Finally, the parentheses are how you mark part of a pattern as a group that you can extract from the match object.
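A minimal sketch of the pattern in use, taking the sentence exactly as quoted in the question (including the word "in"). The name UPLOADED and the sample text are only illustrative.

```python
import re

# The pattern from above; the group captures the file name.
UPLOADED = re.compile(r'The file (.+?) in successfully uploaded')

text = ("The file test.bed in successfully uploaded. "
        "The file sample.png in successfully uploaded.")

# With a single group, findall returns just the captured file names.
file_names = UPLOADED.findall(text)
# file_names is ['test.bed', 'sample.png']

# For one sentence, search() gives a match object to pull the group from.
match = UPLOADED.search("The file test.bed in successfully uploaded")
first_name = match.group(1) if match else None
# first_name is 'test.bed'
```

If the real message reads "is" rather than "in", only the literal text in the pattern needs to change.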
But what if you want to match just valid filenames? Well, you'll need to come up with a rule for valid filenames, which may be different depending on your application. Is it Windows-specific? Is whatever you're parsing quoting filenames with spaces? And so on.