Hey, I was wondering if it is possible to collect a set amount of letters or words in between brackets. For example, I want to collect the words in between the brackets in:
I like food (because it tastes) good
How can I collect the word(s), letter(s), or number(s) in between those brackets?
Thank you.
You can do it using regex in Python. I took the liberty of generalizing your example!
import re
re.findall(r"\((.*?)\)", "I like food (because it tastes) good also (hello world)")
Output:
['because it tastes', 'hello world']
Explanation:
The regex says: pick out for me whatever expression appears inside the parentheses.
Why did I use \? Because ( and ) are special characters in regex. If you want to treat them as ordinary parentheses, you need to tell the compiler by putting \ before them.
Why did I use ?
? is used for two purposes:
a. To mark something as optional, i.e. it may or may not be present. For example, apples? would match both apple and apples.
b. To make the match non-greedy, i.e. select as little as possible. Had I not used ? here, the output would have been something like ['because it tastes) good also (hello world']. Look closely and you will find that the longest expression lying between two parentheses is the one selected.
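A quick sketch of that difference, using the same sample sentence as above:
import re
text = "I like food (because it tastes) good also (hello world)"
# Greedy: .* grabs as much as possible, from the first ( to the last )
print(re.findall(r"\((.*)\)", text))   # ['because it tastes) good also (hello world']
# Non-greedy: .*? stops at the first closing parenthesis it can
print(re.findall(r"\((.*?)\)", text))  # ['because it tastes', 'hello world']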
You can do something like:
x = "I like food (because it tastes) good"
print(x.split("(")[-1].split(")")[0])  # because it tastes
I believe there are better ways, but this works.
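If there can be more than one bracketed chunk, a small sketch of the same split() idea (the sample string is just the generalized example from above):
x = "I like food (because it tastes) good also (hello world)"
# Every piece after a "(" that also contains a ")" holds a bracketed chunk
pieces = [chunk.split(")")[0] for chunk in x.split("(")[1:] if ")" in chunk]
print(pieces)  # ['because it tastes', 'hello world']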
Related
I need to recover two results from a regular expression in Python: what is matched and everything else.
For example, in:
"boofums",3,4
I'd like to find what is in the quotes and what isn't:
boofums
,3,4
What I have so far is:
bobbles = '"boofums",3,4'
pickles = re.split(r'\".*\"', bobbles)
morton = re.match(r'\".*\"', bobbles)
print(pickles[1])
print(morton[0])
,3,4
"boofums"
This seems to me insanely inefficient and not Python-esque. Is there a better way to do this? (Sorry for the "is there a better way" construct on StackOverflow, but... I need to do this better! 😂)
...and if you can help me extract just what's in the quotes, something that I'd easily do in Perl or Ruby, all the better!
You're probably best off with regex groupings:
So for your example I'd use something like
regex = re.compile(r'"(.*)"(.*)')
bobble_groups = regex.match(bobbles)
You can then use bobble_groups.group(1) to get just what is inside the quotation marks, and bobble_groups.group(2) for the rest.
See named groups if you don't want to depend on an index number.
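A minimal sketch of the named-group variant (the group names quoted and rest are just placeholders):
import re
bobbles = '"boofums",3,4'
match = re.match(r'"(?P<quoted>.*)"(?P<rest>.*)', bobbles)
print(match.group("quoted"))  # boofums
print(match.group("rest"))    # ,3,4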
a, b = re.match('"(.*)"(.*)', bobbles).groups()
Parentheses determine the groups that are "saved" to the match object.
I am trying to extract the middle words based on my pattern. Below is my code:
text = "東京都田中区9-7−4"
import re
# Sorry, due to an edit problem Stack Overflow doesn't allow me to include long sentences here; please check my comment below for the re.compile() call that defines r.
city = re.findall(r, text)
print("getCity: {}".format(city))
My current output:
getCity: ['都田中区']
My expected output:
getCity: ['田中区']
I do not want to capture the [都道府県], so I used "?!" at the beginning of my pattern, as (?!...??[都道府県]). However, when I run my program, "都" is included as well, as shown in my current output. Could anyone please point me in the right direction?
The problem with your regex is that it is too permissive.
If you visualise your regex (with all the hardcoded city (市) names removed, because they are irrelevant), you can see a lot of "any character" repeated x times, or just "not 市" and "not 町" repeated x times. These are what matches the 都道府県 in your string, so these are the places where you should disallow 都道府県:
The corresponding regex would be:
(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区|[^都道府県]{1,7}?[市町村]
Remember to add the hardcoded cities when you put this in your code!
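As a rough check, here is that regex (minus the hardcoded city names) run against the sample text from the question:
import re
text = "東京都田中区9-7−4"
# Hardcoded city names omitted here; add them back as noted above
pattern = r"(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区|[^都道府県]{1,7}?[市町村]"
print("getCity: {}".format(re.findall(pattern, text)))  # getCity: ['田中区']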
I know this question has been asked almost a hundred times on Stack Overflow, but after a lot of searching I have not found my answer, so I am asking it here.
I am looking to search for an exact word in strings like the ones below.
'svm_midrangedb_nonprod:svm_midrangedb_nonprod_root'
'svm_midrangedb_prod:svm_midrangedb_prod_root'
I want to match only 'prod' but I am getting both 'prod' and 'nonprod' in the output.
Here is the code I am using:
re.search(r"\wprod\w", in_volumes.json()[i]['name'].split(":")[2].lower())
You have to make rules to not match nonprod but match prod.
For example, maybe you can make it so that if there is an n in front of prod, you exclude it, like this: [^n]prod\w.
Or maybe some of your data has an n in front of prod and you want to keep it. Then you want to exclude the match only when non appears in front of prod, like this: \w*(?<!non)prod\w*.
It really depends on the rest of your data and see what kind of rules you can make/apply to them to get your desired data.
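A small sketch of the lookbehind version against the two sample strings (note the [1] index, since these sample strings only contain one colon):
import re
names = [
    "svm_midrangedb_nonprod:svm_midrangedb_nonprod_root",
    "svm_midrangedb_prod:svm_midrangedb_prod_root",
]
for name in names:
    # Only match "prod" when it is not directly preceded by "non"
    match = re.search(r"\w*(?<!non)prod\w*", name.split(":")[1].lower())
    print(name, "->", match.group(0) if match else None)
# Only the second string produces a match.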
That is normal, because your regular expression says that you want a string containing "prod". To solve that very easily, you can do the same thing you did, but as follows:
re.search(r"\w_prod\w", in_volumes.json()[i]['name'].split(":")[2].lower())
I just added a _ character, which exists in your targeted string.
I have sentences like the following:
((wouldyou)) give me something ((please))
and a bunch of keywords, stored in arrays / lists:
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]
I want to replace every occurrence of the variables in parentheses with a suitable set of strings stored in a list, and get every possible combination back. The number of variables and keywords is not fixed.
James helped me with the following code:
from itertools import product

def filler(word, from_char, to_char):
    options = [(c,) if c != from_char else (from_char, to_char) for c in word.split(" ")]
    return (' '.join(o) for o in product(*options))

list(filler('((?please)) tell me something ((?please))', '((?please))', ''))
It works great but only replaces one specific variable with empty strings. Now I want to go through various variables with different set of keywords. The desired result should look something like this:
can you give me something please
would you give me something please
please give me something please
can you give me something ASAP
would you give me something ASAP
please give me something ASAP
I guess it has something to do with to_char, but I have no idea how to iterate through the list items at this point.
The following would work. It uses itertools.product to construct all of the possible pairings (or more) of your keywords.
import re, itertools

text = "((wouldyou)) give me something ((please))"

keywords = {}
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]

# Get a list of bracketed terms
lsources = re.findall(r"\(\((.*?)\)\)", text)

# Build a list of the possible substitutions
ldests = []
for source in lsources:
    ldests.append(keywords[source])

# Generate the various pairings
for lproduct in itertools.product(*ldests):
    output = text
    for src, dest in itertools.izip(lsources, lproduct):
        # Replace each term (you could optimise this using a single re.sub)
        output = output.replace("((%s))" % src, dest)
    print output
You could further improve it by avoiding the need to do multiple replace() and assignment calls with one re.sub() call.
This script gives the following output:
can you give me something please
can you give me something ASAP
would you give me something please
would you give me something ASAP
please give me something please
please give me something ASAP
It was tested using Python 2.7. You will need to think about how to handle the case where the same keyword appears multiple times. Hopefully you find this useful.
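As a rough sketch of the single re.sub() improvement mentioned above (assuming each bracketed keyword appears only once in the text), the inner replace() loop can become a dictionary lookup in a replacement callback:
# Generate the various pairings, substituting all terms in one re.sub() call
for lproduct in itertools.product(*ldests):
    mapping = dict(zip(lsources, lproduct))
    output = re.sub(r"\(\((.*?)\)\)", lambda m: mapping[m.group(1)], text)
    print(output)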
This is a job for Captain Regex!
Partial, pseudo-codey, solution...
One direct, albeit inefficient, way to do this (roughly O(n*m), where n is the number of words to replace and m is the average number of replacements per word) would be to use the regex functionality in the re module to match the words, then use the re.sub() method to swap them out, embedded in nested loops. So (assuming you get your replacements into a dict or something first), it would look something like this:
for key in repldict:
    regexpattern = ...  # construct a pattern on the fly for key
    for item in repldict[key]:
        newstring = re.sub(regexpattern, item)
And so forth. Only, you know, like with correct syntax and stuff. And then just append the newstring to a list, or print it, or whatever.
For creating the regex patterns on the fly, string concatenation should do it: a regex to match the left parens, plus the string to match, plus a regex to match the right parens.
If you do it that way, then you can handle the optional features just by looping over a second version of the regex pattern which appends a question mark to the end of the left parens, then does whatever you want to do with that.
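Filling in the pseudocode above, a runnable sketch might look like the following (the dict mirrors the question's keywords; the nested loops expand one bracketed key at a time and accumulate every combination):
import re

text = "((wouldyou)) give me something ((please))"
repldict = {
    "wouldyou": ["can you", "would you", "please"],
    "please": ["please", "ASAP"],
}

# Start from the original sentence and expand one bracketed key at a time
results = [text]
for key in repldict:
    # Construct the pattern on the fly: the literal ((key)), fully escaped
    regexpattern = re.escape("((%s))" % key)
    expanded = []
    for item in repldict[key]:
        for partial in results:
            expanded.append(re.sub(regexpattern, item, partial))
    results = expanded

for line in results:
    print(line)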
I'm parsing some TV episodes that have been transcribed by different people, meaning I need to search for a variety of formats. For example, new scenes are indicated one of two ways:
[A coffee shop]
or
INT. Coffee shop - NIGHT
Right now, I match this with the following regex in Python:
re.findall("(^\[(.+?)\]$)|(^[INTEXT]{3}\. .+?$)", text)
where "text" is the text of the entire script (hence using findall). This always appears on its own line, hence the ^$
This gives me something like: (None, None, "INT. Coffee Shop - NIGHT") for example.
My question: How do you construct a regex to search for one of two complex patterns, using the | notation, without also creating submatches that you don't really want? Or is there a better way?
Many thanks.
UPDATE: I had overlooked the idea of non-capturing groups. I can accomplish what I want with:
"(?:^\[.+?\]$)|(?:^[INTEX]{3}\. .+?$)"
However, this raises a new question. I don't actually want the brackets or the INT/EXT in the scenes, just the location. I thought that I could use actual groups within the non-capturing groups, but I'm still getting those blank matches for the other expression, like so:
import re
pattern = r"(?:^\[(.+?)\]$)|(?:^[INTEX]{3}\. (.+?)$)"
examples = [
    "[coffee shop]",
    "INT. COFFEE SHOP - DAY",
    "EXT. FIELD - NIGHT",
    "[Hugh's aparment]"
]
for example in examples:
    print re.findall(pattern, example)
'''
[('coffee shop', '')]
[('', 'COFFEE SHOP - DAY')]
[('', 'FIELD - NIGHT')]
[("Hugh's aparment", '')]
'''
I can just join() them, but is there a better way?
Based on the limited examples you've provided, how about using assertions for the brackets:
re.findall(r"((?<=^\[)[^[\]]+(?=\]$)|^[INTEXT]{3}\. .+?$)", text)
You may be better off just using two expressions.
patterns = [r'^\[(.+?)\]$', r'^(?:INT|EXT)\. (.+?)$']
for example in examples:
    print re.findall(patterns[0], example) or re.findall(patterns[1], example)
This seems to do what you want:
(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))(?:\[\1\]|[INTEX]{3}\. \1)$
First the lookahead peeks at the text of the scene marker, capturing it in group #1. Then the rest of the regex goes ahead and consumes the whole line containing the marker. Although, now that I think about it, you don't really have to consume anything. This works, too:
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", subject)
The marker text is still captured in group #1, so it still gets added to the result of findall(). Then again, I don't see why you would want to use findall() here. If you're trying to normalize the scene markers by replacing them in place, you'll have to use the consuming version of the regex.
Also, notice the (?m). In your examples you always apply the regex to the scene markers in isolation. To pluck them out of the whole script, you have to set the MULTILINE flag, turning ^ and $ into line anchors.
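A small sketch of the lookahead-only version run over a multi-line snippet (the snippet itself is made up for illustration):
import re
script = """INT. Coffee shop - NIGHT
JANE sits by the window, nursing a cold latte.
[A city street]
She steps outside."""
# (?m) makes ^ match at the start of every line of the script
pattern = r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))"
print(re.findall(pattern, script))
# ['Coffee shop - NIGHT', 'A city street']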