I wanted to grab an argument from a string in Python...
I wanted to grab the city of this string: weather in <city>
How do I get the city? Into a new variable?
Use Regular Expressions!
If you haven't heard of them, it's quite simple. Simply import the re module, and away you go!
>>> import re
Ok, maybe that wasn't so exciting. But now you can use pattern matching. Simply define your pattern:
>>> pattern = r"^(?P<thing>.*?) in (?P<city>.*?)$"
and away you go!
>>> re.match(pattern, "weather in my city")
<_sre.SRE_Match object; span=(0, 18), match='weather in my city'>
Don't worry! This is actually something useful. Let's store this in a variable so we can use it:
>>> match = re.match(pattern, "weather in my city")
>>> match.group("city")
'my city'
Hooray!
Now, what was that crazy pattern thing about? It worked, but it just seems like magic. Let me explain:
r"" just makes Python treat (most) backslashes as literal backslashes. So, r"\n" will be an actual \ followed by an actual n, as opposed to a new-line character. This is because regular expressions have special meanings for \ characters, and it's awkward to have to write \\ all the time.
^ means "start of the string".
(?P<name>...) is a named group. Normal groups are represented by (...), and can be referenced by their number (e.g. match.group(1); match.group(0) is the whole match). Named groups can also be referenced by number, but they can also be referenced by their name. The P stands for Python, because that's where the syntax originally came from. Neat!
. means "any character".
* means "repeated 0 or more times".
? means a few things, but when it's after a * or + it means "match as little as possible". This means that it will make the thing group have as few "any character"s as possible.
" in " (with a space on each side) means exactly what it looks like. A space followed by an i followed by an n followed by a space.
.*? again means "match as few of any character as possible", but... I'm not really sure why I wrote that, considering that
$ means "end of the string".
And yeah, they never really stop seeming like magic. (Unless you use Perl.) If you want to make your own regular expression or learn some more, have a look at the documentation for the re module.
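For reuse, the pattern above can be wrapped in a small function. This is only a minimal sketch, not part of the original answer: the name extract_city is made up for illustration, and it returns None when the text does not have the "<thing> in <city>" shape.

```python
import re

# Same pattern as above: lazy "thing", literal " in ", then the city up to the end.
PATTERN = re.compile(r"^(?P<thing>.*?) in (?P<city>.*?)$")

def extract_city(text):
    """Return the city part of '<thing> in <city>', or None if there is no match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return match.group("city")

print(extract_city("weather in my city"))
print(extract_city("weather in New York"))
print(extract_city("no city here"))
```

Checking for None before calling group() avoids an AttributeError on input that does not fit the pattern.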
If you have constant spaces in your string and your strings are not going to change, it's relatively easy. Just use split on your string.
x = "weather in <city>"
split_x = x.split(" ")
# will return you
["weather", "in", "<city>"]
city = split_x[2]
Look at split's docs. But suppose your city is something like "New York", then you'll have to look for some alternative because in that case, the list will be -
x = "weather in New York"
# O/P
["weather", "in", "New", "York"]
And then if you do this-
city = split_x[2]
you will get the wrong city name ("New" instead of "New York").
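One possible alternative that stays with split is to limit the number of splits. This is a sketch that assumes the string always has exactly two words ("weather" and "in") before the city:

```python
x = "weather in New York"

# maxsplit=2 splits at the first two spaces only, so the rest stays together.
parts = x.split(" ", 2)
city = parts[2]

print(parts)
print(city)
```

If the number of words before the city can vary, this assumption breaks and a regular expression is probably the safer choice.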
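With str.removeprefix() (Python 3.9+). Note that str.lstrip('weather in ') is not a safe substitute: it strips any leading characters that appear in its argument, not the prefix as a whole, so it can also eat the start of the city name:
s = "weather in Las Vegas"
city_name = s.removeprefix('weather in ')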
print(city_name)
Prints:
Las Vegas
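One caveat worth checking when stripping a fixed prefix: str.lstrip() takes a set of characters, not a prefix. The sketch below uses a made-up lowercase input to show the difference, and uses startswith() plus slicing so that it runs on any Python 3 version:

```python
prefix = "weather in "
s = "weather in athens"

# lstrip() treats its argument as a set of characters to strip, not as a prefix.
stripped = s.lstrip(prefix)

# Slicing after a startswith() check removes exactly the prefix and nothing more.
city_name = s[len(prefix):] if s.startswith(prefix) else s

print(repr(stripped))
print(repr(city_name))
```

"Las Vegas" only survives lstrip() because a capital L is not among the characters of 'weather in '.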
I am new to Python and right now I am trying to extract information from a set of paragraphs containing employee-related statistics.
For example, the paragraph might look like:
Name Rakesh Rao Age 34 Gender Male Marital Status Single
The whole text is not separated by any commas so I am having a hard time separating this information.
Also sometimes there might be a colon after the name of the variable and sometimes there might not be. For example in row 1, it's "Name Rakesh Rao" but in row 2 it's "Name: Ramachandra Deshpande".
There are around 1400 records of this information so it would be really great if I don't have to manually separate the information.
Can anyone help with this? I would be super grateful!
Well, I suppose you could try and do that using a regular expression.
If your text is exactly this:
paragraph = 'Name Rakesh Rao Age 34 Gender Male Marital Status Single'
You could use this regular expression (you would have to import re first):
m = re.fullmatch(
(
r'Name(?:\:)? (?P<name>\D+) ' # pay attention to the space at the end
r'Age(?:\:)? (?P<age>\d+) '
r'Gender(?:\:)? (?P<gender>\D+) '
r'Marital Status(?:\:)? (?P<status>\D+)' # no space here, since the string ends
),
paragraph
)
Then you could use the names of the groups defined within the regular expression, like this:
>>> m.group('name')
'Rakesh Rao'
>>> m.group('age')
'34'
>>> m.group('gender')
'Male'
>>> m.group('status')
'Single'
If the fields are on separate lines rather than on a single line, you just have to replace the single space between the fields with \n within the regular expression.
Note that this will support a single colon immediately after the field name, like this:
Name: Rakesh Rao
but it will not support different order of the data. If you would like that as well, I could try to write a different expression.
Explanation of the expression
Let's take the first "line" of the expression:
r'Name(?:\:)? (?P<name>\D+) '
First, why the r'…' string syntax? This is just to avoid double backslashes. In the "typical" string, we would need to write the expression like this:
'Name(?:\\:)? (?P<name>\\D+) '
Now, to the actual expression. The first part, Name, is pretty obvious.
(?:\:)?
This part creates a non-capturing group ((?:…)) with a colon inside, made optional by the trailing ?. The colon is written as \: here, although a plain : would match the same thing, since a colon has no special meaning on its own in a regex. The group is non-capturing because this colon really doesn't matter to us.
Then, after a single space, we have this:
(?P<name>\D+)
This creates a named group, the syntax is (?P<name_of_the_group>…). I use a named group just to make it easier and nicer to extract the information later, using m.group('name'), where m is a match object.
The \D+ means "at least one non-digit character". This captures all letters, underscores, but also white spaces. That is why the order of the fields is so important to this particular expression. If you were to change the order and put Gender field between Name and Age, it would capture it as well, because the + modifier is greedy.
On the other hand, the \d+ in the next "line" means "at least one digit character", so between 0 and 9.
I hope that explanation is enough, but it might be useful to you to play with that expression here, on this very useful site:
https://regex101.com/r/N5ZJU9/2
I've already entered the regex and the test string for you.
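To apply the expression to many records, something along these lines could work. It is a sketch under assumptions: the second sample record (including the age and status values) is invented for illustration, and the optional colon is written as the shorter :? instead of (?:\:)?.

```python
import re

# Same structure as the expression above, with the optional colon written as ":?".
RECORD = re.compile(
    r'Name:? (?P<name>\D+) '
    r'Age:? (?P<age>\d+) '
    r'Gender:? (?P<gender>\D+) '
    r'Marital Status:? (?P<status>\D+)'
)

# Hypothetical sample records; the second one is invented for illustration.
paragraphs = [
    'Name Rakesh Rao Age 34 Gender Male Marital Status Single',
    'Name: Ramachandra Deshpande Age: 41 Gender: Male Marital Status: Married',
]

records = []
for paragraph in paragraphs:
    m = RECORD.fullmatch(paragraph)
    if m is not None:  # skip paragraphs that do not fit the expected layout
        records.append(m.groupdict())

for record in records:
    print(record)
```

groupdict() returns all named groups at once, which is convenient if the records are going into a CSV file or a table afterwards.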
You can match optional characters, in your case it is : with the following expression [:]?.
According to the provided information, this regex should extract the required information:
^Name[:]?\s([A-Z][-'a-zA-Z]+)\s([A-Z][-'a-zA-Z]+)$
You can check it here.
This regular expression will match two-word names, including names containing - or '.
In Python this may look like that:
regex = r"^Name[:]?\s([A-Z][-'a-zA-Z]+)\s([A-Z][-'a-zA-Z]+)$"
test_str = ("Name Rakesh Rao\n"
"Name: Ramachandra Deshpande")
matches = re.finditer(regex, test_str, re.MULTILINE)
You can also check this example by the link provided above.
Hope this helps.
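To actually read the names out of the matches, you can iterate over the iterator that re.finditer returns. A short sketch using the same regex and test string as above:

```python
import re

regex = r"^Name[:]?\s([A-Z][-'a-zA-Z]+)\s([A-Z][-'a-zA-Z]+)$"
test_str = ("Name Rakesh Rao\n"
            "Name: Ramachandra Deshpande")

# With re.MULTILINE, ^ and $ match at the start and end of every line.
names = [(m.group(1), m.group(2))
         for m in re.finditer(regex, test_str, re.MULTILINE)]

print(names)
```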
If the field names are always in the string, you can split the string on those field names. For example:
str_to_split = "Name Rakesh Rao Age 34 Gender Male Marital Status Single"
splitted = str_to_split.split("Age")
name = splitted[0].replace("Name", "")
If your text still contains other chars, you can remove them with replace(":", "") for instance. Otherwise you can use the NLTK toolkit to remove all kind of special chars from your text. Be careful, because names could also have special chars in them.
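The same idea can be extended to all fields at once with re.split instead of str.split. This is a swapped-in variant, not the code from the answer above: parse_record and FIELDS are hypothetical names, and it assumes that the field labels never occur as whole words inside a value.

```python
import re

# Field labels assumed to appear in the records (taken from the example above).
FIELDS = ["Name", "Age", "Gender", "Marital Status"]
splitter = re.compile(r"\s*\b(" + "|".join(map(re.escape, FIELDS)) + r")\b:?\s*")

def parse_record(text):
    """Split a record on its field labels and pair each label with its value."""
    # The capturing group makes re.split keep the labels in the result list.
    parts = splitter.split(text)
    return dict(zip(parts[1::2], parts[2::2]))

print(parse_record("Name Rakesh Rao Age 34 Gender Male Marital Status Single"))
print(parse_record("Name: Ramachandra Deshpande Age 41"))
```

Because the optional colon is part of the separator, both "Name Rakesh Rao" and "Name: Ramachandra Deshpande" come out without it.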
This is the code:
a = '000.222.tld'
b = re.search('(.*).\d+\.tld', a)
would like to see it print
000
so far..
print b.group(0)
gives me this:
000.222.tld
print b.group(1)
gives me this:
000.2
There are a few problems with your expression:
b = re.match('\(.*)\.\d+\.com', a)
First, that \( means that you're escaping the (—it will only match a literal ( character in the search string. You're not trying to match any parentheses, you're trying to create a capturing group, so don't escape the parens. (Also, you're not escaping the matching ), so you'd get an error about mismatched parens trying to use this…)
Second, you're trying to match .com, but your sample input ends in .tld. Those obviously aren't going to match. Presumably you wanted to match any string of letters, or some other rule?
Finally, you're not using a raw string literal, or escaping your backslashes. Sometimes you get away with this, but do you know the Python backslash-escape rules by heart so well that you can be sure that \d or \. doesn't mean anything? Do you expect anyone who reads your code to also know?
If you fix all of those problems, your regex works:
>>> a = '1.2.tld'
>>> b = re.match(r'(.*)\.\d+\.[A-Za-z]+', a)
>>> b.group(1)
'1'
Now that you've completely changed both the expression and the input, you have completely different problems:
b = re.search('(.*).\d+\.tld', a)
The main problem here, besides again not using a raw string literal, is that you didn't escape the first ., so you're searching for any character there. Since regular expressions are greedy by default, the first .* will capture as much as it can while still leaving room for any character, 1 or more digits, and .tld, so it will match 000.2. But if you escape the ., it will capture as much as it can while still leaving room for a literal ., 1 or more digits, and .tld, which is exactly what you want.
>>> a = '000.222.tld'
>>> b = re.search(r'(.*)\.\d+\.tld', a)
>>> b.group(1)
'000'
Meanwhile, there are some great regular expression debuggers, both downloadable and online. I don't want to recommend one in particular, but Debuggex makes it easy to create a sharable link to a particular test, so here is your first one, and here is your second. Check out the examples and see how much easier it is to find the problems with your pattern that way.
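If you would rather check this locally than in an online debugger, the two patterns can be compared side by side. A small sketch using the input from the question:

```python
import re

a = '000.222.tld'

# Unescaped ".": any character, so the greedy group swallows "000.2".
loose = re.search(r'(.*).\d+\.tld', a)
# Escaped "\.": a literal dot, so the group stops at "000".
strict = re.search(r'(.*)\.\d+\.tld', a)

print(loose.group(1))
print(strict.group(1))
```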
You can do it without regex:
b = a.split('.', 1)[0]
I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
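The difference between the two kinds of string literal is easy to check for yourself. A tiny sketch:

```python
plain = "\""      # ordinary literal: backslash-quote collapses to a single "
raw = r"\""       # raw literal: the backslash is kept, giving two characters

print(len(plain), len(raw))
print(r"\s" == "\\s")
```

The raw literal is exactly what you would otherwise have to write with doubled backslashes.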
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (at least up to Python 3.10; from 3.11 on, ++ is accepted as a possessive quantifier, so the error no longer appears for this particular input). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
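For example, the character-class version might look roughly like this. It is only a sketch: compile_term_pattern is a made-up helper name, and the suffix list is copied from the question rather than checked against real data.

```python
import re

def compile_term_pattern(term):
    """Build a pattern for a literal term with an optional suffix, surrounded by whitespace."""
    # Multi-character suffixes stay as alternatives; single characters go in a class.
    suffix = r"(?:'s|s|!+|\?+|[,.;:()\"])?"
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
p = compile_term_pattern(term)  # compiles without "multiple repeat"

print(bool(compile_term_pattern("google").search("I love google!!! ")))
print(bool(compile_term_pattern("dog").search("I love dogs ")))
print(bool(compile_term_pattern("dog").search("I love doggies ")))
```

Inside a character class the dot is literal, so it no longer needs a backslash there.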
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python simply write:
if term in string:
    pass  # do whatever
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
I have example_str = "i love you c++" and got the "multiple repeat" error when searching for "c++" in it with a regex. The error occurs because "++" is read as regex quantifiers rather than as literal characters. My fix was to use re.escape on the search term; here is my code:
example_str = "i love you c++"
word = "c++"
# \b does not match after "+", so lookarounds mark the word edges instead
regex_word = re.search(rf'(?<!\w){re.escape(word)}(?!\w)', example_str)
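One thing to watch out for when combining re.escape with \b: a word boundary needs a word character on one side, so \b never matches right after a "+" that is followed by a space or by the end of the string. A short sketch of the difference, with variable names chosen for illustration:

```python
import re

example_str = "i love you c++"
word = "c++"

# \b needs a word character on one side; "+" followed by the end of the string has none.
with_b = re.search(rf"\b{re.escape(word)}\b", example_str)

# Lookarounds only require that no word character touches the term on either side.
with_lookaround = re.search(rf"(?<!\w){re.escape(word)}(?!\w)", example_str)

print(with_b)
print(with_lookaround.group(0))
```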
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']
I need to perform a search/replace on text which contains a comma which is NOT followed by a space, to change to a comma+space.
So I can find this using:
,[^\s]
But I am struggling with the replacement; I can't just use:
, (space, comma)
Or
& ,
As the match originally matches two characters.
Is there a way of saying '&' - 1 ? or '&[0]' or something which means; 'The Matched String, but only part of it' in the replacement argument ?
Another way of trying to ask this:
Can I use Regex to IDENTIFY one part of my string.
But REPLACE a (slightly different,but related) part of my string.
I could just probably replace every comma with a comma+space, but this is a little more controlled and less likely to make a change I do not need....
For example:
Original:
Hello,World.
Should become:
Hello, World.
But:
Hello, World.
Should remain as :
Hello, World.
And currently, using my (bad) pattern I have:
Original:
Hello,World
After (wrong):
Hello, orld
I'm actually using Python's (2.6) 're' module for this as it happens.
Using parentheses to capture a part of the string is one way to do it. Another possibility is to use a "lookahead assertion":
,(?=\S)
This pattern matches a comma only if it is followed by a non-whitespace character. The lookahead does not consume the character after the comma; it only uses that information to decide whether or not to match the comma.
For example:
>>> re.sub(r",(?=\S)", ", ", "Hello,World! Hello, World!")
'Hello, World! Hello, World!'
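The fact that the lookahead does not consume anything makes a visible difference when two commas follow each other. The sketch below compares it with a capture-group version; the function names and the "a,,b" input are made up for illustration:

```python
import re

def with_lookahead(s):
    # The lookahead checks the next character without consuming it.
    return re.sub(r",(?=\S)", ", ", s)

def with_capture(s):
    # The capture group consumes the next character and puts it back.
    return re.sub(r",(\S)", r", \1", s)

for text in ["Hello,World", "Hello, World", "a,,b"]:
    print(repr(with_lookahead(text)), repr(with_capture(text)))
```

With the capture group, the second comma in "a,,b" is swallowed as "the character after the first comma", so it never gets a space of its own.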
Yes, use parentheses to "capture" part of the string that matches your expression. I'm not up to speed on Python's implementation, but it should give you some kind of array called match[] whose elements correspond to the captures.
Yes, you could. But why would you, in this simple case?
def insertspaceaftercomma(s):
    """Insert a space after every comma, then remove the doubled space after a comma (if any)."""
    return s.replace(",", ", ").replace(",  ", ", ")
seems to work:
>>> insertspaceaftercomma("Hello, World")
'Hello, World'
>>> insertspaceaftercomma("Hello,World")
'Hello, World'
>>>
You can look for a comma + non-space character and then stick a space in between them:
re.sub(r',([^\s])', r', \1', string)
Try this:
import re
s1 = 'Hello,World.'
re.sub(r',([^\s])', r', \g<1>', s1)
> Hello, World.
s2 = 'Hello, World.'
re.sub(r',([^\s])', r', \g<1>', s2)
> Hello, World.
When you use variables (is that the correct word?) in Python regular expressions like this: "blah (?P<value>\w+)" ("value" would be the variable), how can you make the variable's value be the text after "blah " up to the end of the line or up to a certain character, without paying any attention to the actual content of the variable? For example, this is pseudo-code for what I want:
>>> import re
>>> p = re.compile("say (?P<value>continue_until_text_after_assignment_is_recognized) endsay")
>>> m = p.match("say Hello hi yo endsay")
>>> m.group('value')
'Hello hi yo'
Note: The title is probably not understandable. That is because I didn't know how to say it. Sorry if I caused any confusion.
For that you'd want a regular expression of
"say (?P<value>.+) endsay"
The period matches any character, and the plus sign indicates that that should be repeated one or more times... so .+ means any sequence of one or more characters. When you put endsay at the end, the regular expression engine will make sure that whatever it matches does in fact end with that string.
You need to specify what you want to match if the text is, for example,
say hello there and endsay but some more endsay
If you want to match the whole hello there and endsay but some more substring, David's answer is correct. Otherwise, to match just hello there and, the pattern needs to be:
say (?P<value>.+?) endsay
with a question mark after the plus sign to make it non-greedy (by default it's greedy, gobbling up all it possibly can while allowing an overall match; non-greedy means it gobbles as little as possible, again while allowing an overall match).
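Both variants can be tried on the example text to see the difference. A quick sketch:

```python
import re

text = "say hello there and endsay but some more endsay"

# Greedy: .+ runs to the last " endsay" in the text.
greedy = re.match(r"say (?P<value>.+) endsay", text)
# Non-greedy: .+? stops at the first " endsay".
lazy = re.match(r"say (?P<value>.+?) endsay", text)

print(greedy.group("value"))
print(lazy.group("value"))
```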