I have a piece of code which extracts a string that lies between two strings. However, this script performs the operation on only a single line. I want to perform it on a complete file and get a list of all the words lying between those two words.
Note: The two words are fixed. For example, if my code is something like
'const int variablename=1'
then I want a list of all the words in the file lying between 'int' and '='.
Here is the current script:
s='const int variablename = 1'
k=s[s.find('int')+4:s.find('=')]
print k
If the file fits comfortably into memory, you can get this with a single regex call:
import re
regex = re.compile(
r"""(?x)
(?<= # Assert that the text before the current location is:
\b # word boundary
int # "int"
\s # whitespace
) # End of lookbehind
[^=]* # Match any number of characters except =
(?<!\s) # Assert that the previous character isn't whitespace.
(?= # Assert that the following text is:
\s* # optional whitespace
= # "="
) # end of lookahead""")
with open(filename) as fn:
    text = fn.read()
matches = regex.findall(text)
If there can be only one word between int and =, then the regex is a bit simpler:
regex = re.compile(
r"""(?x)
(?<= # Assert that the text before the current location is:
\b # word boundary
int # "int"
\s # whitespace
) # End of lookbehind
[^=\s]* # Match any number of characters except = or space
(?= # Assert that the following text is:
\s* # optional whitespace
= # "="
) # end of lookahead""")
with open(filename) as fn:
    for row in fn:
        # do something with the matches found in this row
        matches = regex.findall(row)
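As a quick way to try the single-word variant without a file, here is a minimal sketch that runs a compact form of the same lookaround pattern over an in-memory string. The sample text and the helper name names_between are made up for illustration.

```python
import re

# Compact form of the single-word pattern: preceded by "int" plus one
# whitespace character, followed by optional whitespace and "=".
WORD_BETWEEN = re.compile(r"(?<=\bint\s)[^=\s]+(?=\s*=)")

def names_between(text):
    """Return every word found between 'int' and '=' in text (hypothetical helper)."""
    return WORD_BETWEEN.findall(text)

# Made-up sample standing in for the contents of a file.
sample = "const int alpha = 1\nint beta=2\nprint(x)\nconst int gamma   = 3\n"
print(names_between(sample))
```

For a real file, the string passed in would be the result of fn.read().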
I would use regular expressions over the whole text (you can do it over one line too). This prints the strings between "int " and "=":
import re
text = open('example.txt').read()
print re.findall(r'(?<=int\s).*?(?=\=)', text)
If you want a quick-and-dirty way and you're on a Unix-like system, I would just use grep on the file.
Then I would split each matching line to separate the pattern from the data I want.
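The same filter-then-split idea can also be sketched in plain Python instead of grep. This is only a crude illustration: the helper name names_by_splitting and the sample lines are made up, and it assumes "int " appears before "=" on each matching line.

```python
def names_by_splitting(lines):
    """Crude grep-like filter followed by string splitting (hypothetical helper)."""
    names = []
    for line in lines:
        # Keep only lines that contain both markers, as grep would.
        if "int " in line and "=" in line:
            after_int = line.split("int ", 1)[1]      # text after the first "int "
            names.append(after_int.split("=", 1)[0].strip())  # text before the "="
    return names

# Made-up sample lines standing in for a file.
lines = ["const int alpha = 1", "x = 2", "int beta=2"]
print(names_by_splitting(lines))
```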
I ran the following code in Python 3.8:
import re
a,b,c = 'g(x)g', '(x)g', 'g(x)'
a_re = re.compile(rf"(\b{re.escape(a)}\b)+",re.I)
b_re = re.compile(rf"(\b{re.escape(b)}\b)+",re.I)
c_re = re.compile(rf"(\b{re.escape(c)}\b)+",re.I)
a_re.findall('g(x)g')
b_re.findall('(x)g')
c_re.findall('g(x)')
c_re.findall(' g(x) ')
The result I want is below.
['g(x)g']
['(x)g']
['g(x)']
['g(x)']
But the actual result is below.
['g(x)g']
[]
[]
[]
The following conditions must be observed:
A combination of variables and f-string should be used.
\b must not be removed, because I want to know whether certain characters appear in the sentence.
How can I get the results I want?
Regular characters have no problem using \b, but it won't work for words that start with '(' or end with ')'.
I was wondering if there is an alternative to \b that can be used in these words.
I must use the same function as \b because I want to make sure that the sentence contains a specific word.
\b matches at the boundary between a \w and a \W character (or at the start or end of the string next to a \w character). That is why only your first pattern gives a result: it starts and ends with word characters, while the others start with '(' or end with ')'.
To get the expected result, your patterns should look like these:
a_re = re.compile(rf"(\b{re.escape(a)}\b)+",re.I) # No change
b_re = re.compile(rf"({re.escape(b)}\b)+",re.I) # No '\b' in the beginning
c_re = re.compile(rf"(\b{re.escape(c)})+",re.I) # No '\b' in the end
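To see the boundary rule in action, here is a minimal check using made-up input strings. The same pattern fails when '(' is at the very start of the string, but succeeds when a word character precedes it, because only then is there a \w/\W transition for \b to match.

```python
import re

# \b before "(" needs a word character on the other side of the boundary.
pattern = re.compile(r"\b\(x\)g\b")

at_start = pattern.findall("(x)g")     # nothing before "(", so no boundary there
after_word = pattern.findall("a(x)g")  # "a" followed by "(" is a boundary

print(at_start)
print(after_word)
```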
You can write your own \b by matching the start, the end, or a separator character without capturing it:
(^|[ .\"\']) start or boundary
($|[ .\"\']) end or boundary
(?:) non-capture group
>>> a_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(a)})(?:$|[ .\"\'])", re.I)
>>> b_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(b)})(?:$|[ .\"\'])", re.I)
>>> c_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(c)})(?:$|[ .\"\'])", re.I)
>>> a_re.findall('g(x)g')
['g(x)g']
>>> b_re.findall('(x)g')
['(x)g']
>>> c_re.findall('g(x)')
['g(x)']
>>> c_re.findall(' g(x) ')
['g(x)']
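Another option that behaves like \b but also works for words that start or end with non-word characters is a pair of lookarounds, (?<!\w) and (?!\w). The sketch below keeps the variable plus f-string combination from the question; the helper name word_re is made up for illustration.

```python
import re

def word_re(word):
    """Compile a pattern for word that may not touch a \\w character on either side."""
    # (?<!\w): not preceded by a word character; (?!\w): not followed by one.
    return re.compile(rf"(?<!\w){re.escape(word)}(?!\w)", re.I)

a, b, c = 'g(x)g', '(x)g', 'g(x)'
print(word_re(a).findall('g(x)g'))
print(word_re(b).findall('(x)g'))
print(word_re(c).findall('g(x)'))
print(word_re(c).findall(' g(x) '))
print(word_re(c).findall('g(x)g'))  # 'g(x)' is followed by a word character here
```

Unlike the separator-based version, the lookarounds consume no characters, so adjacent matches separated by a single space are both found.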
I want to use a regular expression to detect and substitute some phrases. These phrases follow the
same pattern but deviate at some points. All the phrases are in the same string.
For instance I have this string:
/this/is//an example of what I want /to///do
I want to catch all the words inside and including the // and substitute them with "".
To solve this, I used the following code:
import re
txt = "/this/is//an example of what i want /to///do"
re.search("/.*/",txt1, re.VERBOSE)
pattern1 = r"/.*?/\w+"
a = re.sub(pattern1,"",txt)
The result is:
' example of what i want '
which is what I want, that is, to substitute the phrases within // with "". But when I run the same pattern on the following sentence
"/this/is//an example of what i want to /do"
I get
' example of what i want to /do'
How can I use one regex and remove all the phrases and //, irrespective of the number of // in a phrase?
In your example code, you can omit the line re.search("/.*/",txt1, re.VERBOSE): it executes the search, but you are not doing anything with the result.
You can match 1 or more / followed by word chars:
/+\w+
Or, for a slightly broader match, match one or more / followed by any chars other than / or whitespace:
/+[^\s/]+
/+ Match 1+ occurrences of /
[^\s/]+ Match 1+ occurrences of any char except a whitespace char or /
import re
strings = [
"/this/is//an example of what I want /to///do",
"/this/is//an example of what i want to /do"
]
for txt in strings:
    pattern1 = r"/+[^\s/]+"
    a = re.sub(pattern1, "", txt)
    print(a)
Output
example of what I want
example of what i want to
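The two patterns, /+\w+ and /+[^\s/]+, only differ when a phrase contains characters that are not word characters, such as a hyphen. A small check with a made-up input string shows the difference:

```python
import re

text = "see /foo-bar now"  # made-up input containing a hyphenated phrase

narrow = re.sub(r"/+\w+", "", text)      # stops at the hyphen
broad = re.sub(r"/+[^\s/]+", "", text)   # runs until whitespace or the next slash

print(repr(narrow))
print(repr(broad))
```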
You can use
/(?:[^/\s]*/)*\w+
Details:
/ - a slash
(?:[^/\s]*/)* - zero or more repetitions of: zero or more chars other than a slash or whitespace, followed by a slash
\w+ - one or more word chars.
In Python:
import re
rx = re.compile(r"/(?:[^/\s]*/)*\w+")
texts = ["/this/is//an example of what I want /to///do", "/this/is//an example of what i want to /do"]
for text in texts:
    print(rx.sub('', text).strip())
# => example of what I want
# example of what i want to
I'm trying to search through a bunch of large text files for specific information.
#!/usr/bin/env python
# python 3.4
import re
sometext = """
lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------
"""
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$ # sentinel
^.*-+.*$ # dividing line
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+ # item details
^.*-+.*$ # dividing line
''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )
Individually, the sentinels and dividing lines match on their own. How do I make this work together? i.e. I would like this to print:
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
Try these regexes:
for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M):
    print(list(re.findall(r'(\w+)\s+(\w+)', m)))
It gives you a list of key,value tuples:
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result')]
Because the text has trailing spaces, change the regex in the for statement for this one:
r'(?:Sentinel starts\s+-*)([^-]*\b)'
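For a single flat list of tuples, the same two-step idea (find each block first, then the pairs inside it) can be sketched as follows. The block pattern here is my own variation and not the one from the answer; it assumes each block sits between two dashed lines directly after a sentinel line, and it uses a shortened, made-up sample.

```python
import re

# Shortened, made-up sample in the same shape as the text in the question.
sample = """\
lots of text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
more text here
Sentinel starts
--------------------
item_three item_three_result
--------------------
"""

# Step 1: capture everything between the two dashed lines after a sentinel.
block_re = re.compile(r"Sentinel\s+starts\s*\n\s*-+\s*\n(.*?)\n\s*-+", re.DOTALL)
# Step 2: inside a block, each item is two whitespace-separated words.
pair_re = re.compile(r"(\w+)\s+(\w+)")

pairs = []
for block in block_re.findall(sample):
    pairs.extend(pair_re.findall(block))
print(pairs)
```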
The re.MULTILINE flag only makes ^ and $ match at the beginning and end of each line, respectively. If you want the pattern to span multiple lines, you need to add a whitespace metacharacter \s after each $ to consume the newline:
.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
Also the string you are using does not have the required string escaping. I would recommend using the r'' type string instead. That way you do not have to escape your backslashes.
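The point about $ and ^ not consuming the newline can be checked with a tiny example. The two-line input is made up. With re.MULTILINE, $ matches just before the newline and ^ just after it, so something in between has to match the newline itself.

```python
import re

text = "alpha\nbeta\n"  # made-up two-line input

# Nothing consumes the newline between "$" and "^", so this cannot match.
without_newline = re.findall(r"^alpha$^beta$", text, re.MULTILINE)

# "\s" consumes the newline, so the pattern can continue on the next line.
with_newline = re.findall(r"^alpha$\s^beta$", text, re.MULTILINE)

print(without_newline)
print(with_newline)
```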
Use two capturing groups in order to get the text you want inside the list. Note that this uses the third-party regex module rather than re, because re does not support \G.
>>> import regex
>>> text = """ lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
\s* matches zero or more whitespace characters and \w+ matches one or more word characters. \G asserts the position at the end of the previous match, or at the start of the string for the first match.
I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatement as separate strings. I'd assume using a regex or string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexStatement). Splitting on parentheses could hit the middle of the regex statement, or go wrong if a parenthesis is part of the fileName, for instance. The assignment doesn't specify whether it should only be able to import from Python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking of something like a regex "%#import " to hit the gap after import, and "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that enclose the regex statement, or how to use all of it to actually split the string correctly so I have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!
If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit of 2 makes it split only on the first two spaces and leave any spaces within the regex intact. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
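Put together, the split-based approach might look like the sketch below. It assumes the filename contains no spaces and that the outer parentheses are not part of the regex; the function name parse_import_line is made up for illustration.

```python
def parse_import_line(line):
    """Split '%#import fileName (regexStatement)' into (fileName, regexStatement)."""
    # maxsplit=2: split only on the first two spaces, keep the rest intact.
    _, filename, regex_part = line.split(' ', 2)
    # Trim the enclosing parentheses (first and last character).
    return filename, regex_part[1:-1]

line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
filename, regex_statement = parse_import_line(line)
print(filename)
print(regex_statement)
```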
That should do it nicely:
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
(          # group start
  (?:      # non-capturing group
    (?!    # negative look-ahead: a position NOT followed by
       \(  # " ("
    )      # end look-ahead
    .      # match any char (this is part of the filename)
  )+       # end non-capturing group, repeat
)          # end group
The other bits of the expression should be self-explanatory.
import re
line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
    print("match.group(1) '" + match.group(1) + "'")
    print("match.group(2) '" + match.group(2) + "'")
else:
    print("No match.")
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
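The variant for filenames with spaces can be tried the same way. This is a small sketch with a made-up line whose filename contains a space; the helper name split_import is hypothetical.

```python
import re

# Filename: any run of characters up to (but not including) the first " (".
IMPORT_RE = re.compile(r'^%#import ((?:(?! \().)+) \((.*)\)')

def split_import(line):
    """Return (filename, regex statement), or None if the line does not match."""
    match = IMPORT_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Made-up example line with a space in the filename.
result = split_import("%#import my script.py ( *out =(.|\\n)*?return out)")
print(result)
```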
For matching something like %#import script_example.py ( *out =(.|\n)*?return out), I suggest:
r'%#impor[\w\W ]+'
Note that:
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character [^a-zA-Z0-9_]
So you can use re.findall() to find all the matches:
import re
re.findall(r'%#impor[\w\W ]+', your_string)
I'd like to extract the two words FIRST and SECOND from the phrase below. I've tried this regex to get the word before the slash, but it doesn't work. By the way, this is in Python:
import re
data = "12341 O:EXAMPLE (FIRST:/xxxxxx) R:SECOND/xxxxx id:1234"
data2 = "12341 O:EXAMPLE:FIRST2:/xxxxxx) R:SECOND2/xxxxx id:1234"
result = re.findall(r'[/]*',data)
result2 = re.findall(r'[/]*',data2)
print result,result2
Try
result = re.findall(r'\w+:?(?=/)',data)
Explanation:
\w+ # Match one or more alphanumeric characters
:? # Match an optional colon
(?=/) # Assert that the next character is a slash
If you don't want the colon to be part of the match (your question is unclear on this), put the optional colon into the lookahead assertion:
result = re.findall(r'\w+(?=:?/)',data)
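Here is a short sketch that runs both variants against the two sample strings from the question, so the effect of moving the colon into the lookahead is visible. The variable names with_colon, without_colon and without_colon2 are made up.

```python
import re

data = "12341 O:EXAMPLE (FIRST:/xxxxxx) R:SECOND/xxxxx id:1234"
data2 = "12341 O:EXAMPLE:FIRST2:/xxxxxx) R:SECOND2/xxxxx id:1234"

# Colon is part of the match when present.
with_colon = re.findall(r'\w+:?(?=/)', data)
# Colon is only checked by the lookahead, so it is never part of the match.
without_colon = re.findall(r'\w+(?=:?/)', data)
without_colon2 = re.findall(r'\w+(?=:?/)', data2)

print(with_colon)
print(without_colon)
print(without_colon2)
```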