How can I select a string in python knowing the start and end points?
If the string is:
Evelin said, "Hi Dude! How are you?" and no one cared!!
Or something like this:
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"
What I want to write is a function that selects the text between the " " characters, not by counting indexes,
but by selecting everything from one character to the next:
"Hi Dude! How are you?" and "Okay!, but not now!!"
So how can I do this? Is there a built-in function?
I know there is a built-in method in Python that gets the index of a given substring,
i.e.,
find("something") returns the index of the given string within the string.
Or do I need to loop through the string?
I'm just starting with Python, sorry for such a small question.
Python 2 or 3 is fine. Thank you so much!
Update:
Thank you everyone for the answers. As a beginner I'm going to stick with the built-in split() function, quotes = string.split('"')[1::2], just because it's simple. Thank you all, so much love :)
txt='''\
Evelin said, "Hi Dude! How are you?" and no one cared!!
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'''
import re
print(re.findall(r'"([^"]+)"', txt))
# ['Hi Dude! How are you?', 'Okay!, but not now!!']
You can use regular expressions if you don't want to use str.index():
import re
quotes = re.findall('"([^"]*)"', string)
You can easily extend this to extract other information from your strings as well.
Alternatively:
quotes = string.split('"')[1::2]
And using str.index():
first = string.index('"')
second = string.index('"', first+1)
quote = string[first+1:second]
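The str.index() version above returns only the first quoted substring. A minimal sketch of the "loop through the string" idea from the question, assuming a hypothetical helper name quoted_parts and using str.find() so that a missing quote gives -1 instead of an exception:

```python
def quoted_parts(text, quote='"'):
    """Collect every substring enclosed by a pair of `quote` characters.

    `quoted_parts` is a hypothetical helper name used for illustration.
    """
    parts = []
    start = text.find(quote)
    while start != -1:
        end = text.find(quote, start + 1)
        if end == -1:
            # Opening quote without a closing one: stop here.
            break
        parts.append(text[start + 1:end])
        start = text.find(quote, end + 1)
    return parts


line = 'Jane said *aww!* John replied, "Okay!, but not now!!" and "Fine."'
print(quoted_parts(line))        # ['Okay!, but not now!!', 'Fine.']
print(line.split('"')[1::2])     # ['Okay!, but not now!!', 'Fine.']
```

For balanced quotes both give the same list. They differ on an unmatched quote: 'a "b" c "d'.split('"')[1::2] also returns the trailing 'd', while the loop stops at the unmatched opening quote.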
To extract a substring by characters, it is much easier to split on those characters; str.partition() and str.rpartition() efficiently split the string on the first or last occurrence of a given string:
extracted = inputstring.partition('"')[-1].rpartition('"')[0]
Partitioning from the start and then from the end gives you the largest possible substring, leaving any embedded quotes in place.
Demo:
>>> inputstring = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
>>> inputstring.partition('"')
('Evelin said, ', '"', 'Hi Dude! How are you?" and no one cared!!')
>>> inputstring.rpartition('"')
('Evelin said, "Hi Dude! How are you?', '"', ' and no one cared!!')
>>> inputstring.partition('"')[-1].rpartition('"')[0]
'Hi Dude! How are you?'
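To see what "largest substring" means in practice, here is a small sketch with an input that contains two quoted parts (the sample sentence is made up for illustration). Using partition() twice instead of partition() followed by rpartition() stops at the first closing quote:

```python
s = 'She said "one" and then "two" quietly'

# First quote to LAST quote: embedded quotes stay in the result.
outer = s.partition('"')[-1].rpartition('"')[0]
print(outer)   # one" and then "two

# First quote to the NEXT quote: only the first quoted part.
first = s.partition('"')[-1].partition('"')[0]
print(first)   # one
```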
str.index(sub) returns the index of the first occurrence of sub in the string, which makes for the simplest approach:
a = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
print(a[1+a.index("\""):1+a.index("\"")+a[a.index("\"")+1:].index("\"")])
or, as Scorpion_God mentioned, you could simply use single quotes to avoid the escaping:
print(a[1+a.index('"'):1+a.index('"')+a[a.index('"')+1:].index('"')])
This will result in:
Hi Dude! How are you?
The quotes themselves won't be included.
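The one-liner above calls index() several times for the same position. As a sketch, the same lookup can be wrapped in a small helper (between is a hypothetical name, not a built-in) that also handles different start and end characters, such as the asterisks in the question's second example:

```python
def between(text, opener, closer):
    """Return the text between the first `opener` and the next `closer`.

    Hypothetical helper; raises ValueError (from str.index) if either
    delimiter is missing.
    """
    start = text.index(opener) + len(opener)
    end = text.index(closer, start)
    return text[start:end]


a = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
b = 'Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'

print(between(a, '"', '"'))   # Hi Dude! How are you?
print(between(b, '*', '*'))   # aww! thats cute, we must try it!
```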
I frequently get strings that are formatted like this:
"This is an exampleI wish it looked different"
when I want it to look like this instead:
"this is an example
I wish it looked different"
Any ideas? Regular expressions maybe? I'm still very noob, many thanks in advance!
Should be pretty simple for your example with re.sub:
import re
old_string = "This is an exampleI wish it looked different"
new_string = re.sub('([a-z])([A-Z])', '\\1\n\\2', old_string)
print(new_string)
# This is an example
# I wish it looked different
It finds all parts where a lowercase letter is followed by an uppercase letter and puts a newline between them.
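A variant of the same idea, sketched with zero-width lookarounds so that no groups need to be copied back into the replacement. Note the assumption baked into both versions: every lowercase-to-uppercase boundary is treated as a missing line break, so a word like "iPhone" is split too.

```python
import re

# Matches the empty position between a lowercase and an uppercase letter.
boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

fixed = boundary.sub('\n', "This is an exampleI wish it looked different")
print(fixed)
# This is an example
# I wish it looked different

# Caveat: mixed-case words are split as well.
print(repr(boundary.sub('\n', "my iPhone")))   # 'my i\nPhone'
```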
I am a total noob, coding for the first time and trying to learn by doing.
I'm using this:
import re
f = open('aaa.txt', 'r')
string=f.read()
c = re.findall(r"Guest last name: (.*)", string)
print "Dear Mr.", c
that returns
Dear Mr. ['XXXX']
I was wondering, is there any way to get the result like
Dear Mr. XXXX
instead?
Thanks in advance.
You need to take the first item in the list
print "Dear Mr.", c[0]
Yes, use re.search if you only expect one match:
re.search(r"Guest last name: (.*)", string).group(1)
findall is for when you expect multiple matches. You probably also want to add ? to your regex, (.*?), for a non-greedy capture. A non-greedy group matches as little as it can, so you will then want to be a little more specific and capture up to the next character that can follow the name or phrase you want.
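A minimal sketch of the re.search approach, assuming the file content looks like the made-up text below. re.search returns None when nothing matches, so the result is checked before calling group():

```python
import re

# Stand-in for the contents of aaa.txt (hypothetical sample data).
text = "Booking 42\nGuest last name: XXXX\nRoom: 7\n"

match = re.search(r"Guest last name: (.*)", text)
if match:
    greeting = "Dear Mr. " + match.group(1)
else:
    greeting = "No guest name found"
print(greeting)   # Dear Mr. XXXX

# A bare non-greedy group at the end of the pattern captures nothing,
# which is why it needs something specific after it.
lazy = re.search(r"Guest last name: (.*?)", text).group(1)
print(repr(lazy))   # ''
```

Because . does not match a newline by default, (.*) stops at the end of the line without any extra work.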
I'm trying to write a regular expression to represent a sentence with the following conditions: starts with a capital letter, ends with a period (and only one period can appear), and is allowed to contain a comma or semi-colon, but when it does, it must appear as (letter)(semicolon)(space) or (letter)(comma)(space).
I've got the capital letter and period down. I have the idea for the code but I think I'm not getting the syntax completely right...
In English, my expression for a sentence looks like this:
(capital letter) ((lowercase letter)(space) ((lowercase letter)(comma)(space))*
((lowercase letter)(semicolon)(space)* )* (period)
I realize this ignores the case where the first letter of the sentence is immediately followed by a comma or semicolon, but it's safe to ignore that case.
Now when I try to code this in Python, I try the following (I've added whitespace to make things easier to read):
sentence = re.compile("^[A-Z] [a-z\\s (^[a-z];\\s$)* (^[a-z],\\s$)*]* \.$")
I feel like it's a syntax issue... I'm not sure if I'm allowed to have the semicolon and comma portions inside of parentheses.
Sample inputs that match the definition:
"This is a sentence."
"Hello, world."
"Hi there; hi there."
Sample inputs that do not match the definition:
"i ate breakfast."
"This is , a sentence."
"What time is it?"
This would match what you said above.
^"[A-Z][a-z]*(\s*|[a-z]*|(?<!\s)[;,](?=\s))*[.]"$? => demo
This would match:
"This is a sentence."
"Hello, world."
"Hi there; hi there."
This won't match:
"i ate breakfast."
"This is , a sentence."
"What time is it?"
"I a ,d am."
"I a,d am."
If you don't need the " just remove it from the regex.
If you need the regex in python, try this
re.compile(r'^[A-Z][a-z]*(\s*|[a-z]*|(?<!\s)[;,](?=\s))*[.]$')
Python demo
import re
tests = ["This is a sentence."
,"Hello, world."
,"Hi there; hi there."
,"i ate breakfast."
,"This is , a sentence."
,"What time is it?"]
rex = re.compile(r'^[A-Z][a-z]*(\s*|[a-z]*|(?<![\s])[;,])*[.]$')
for test in tests:
    print(rex.match(test))
output
<_sre.SRE_Match object at 0x7f31225afb70>
<_sre.SRE_Match object at 0x7f31225afb70>
<_sre.SRE_Match object at 0x7f31225afb70>
None
None
None
^(?!.*[;,]\S)(?!.* [;,])[A-Z][a-z\s,;]+\.$
It's easier to use lookaheads to rule out invalid sentences. See demo:
https://regex101.com/r/vV1wW6/36#python
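As a quick check, the lookahead pattern can be run against the sample inputs from the question. The first negative lookahead rejects a comma or semicolon followed by a non-space character, the second rejects one preceded by a space:

```python
import re

sentence = re.compile(r'^(?!.*[;,]\S)(?!.* [;,])[A-Z][a-z\s,;]+\.$')

samples = [
    "This is a sentence.",
    "Hello, world.",
    "Hi there; hi there.",
    "i ate breakfast.",
    "This is , a sentence.",
    "What time is it?",
]

# Map each sample to whether the pattern accepts it.
results = {s: bool(sentence.match(s)) for s in samples}
for s in samples:
    print(repr(s), results[s])
```

The first three samples are accepted and the last three are rejected.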
I ended up modifying my regular expression to
"^[A-Z][a-z\s (a-z,\s)* (a-z;\s)*]*\.$"
and it ended up working just fine. Thanks for everyone's help!
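One thing worth checking with this final pattern: everything between [ and ] is a single character class, so the parentheses, commas, semicolons and asterisks inside it are literal characters rather than groups and quantifiers. A small sketch suggests it accepts the valid samples but also some inputs the original rules exclude:

```python
import re

# The pattern from the post above, written as a raw string.
final = re.compile(r"^[A-Z][a-z\s (a-z,\s)* (a-z;\s)*]*\.$")

print(bool(final.match("Hello, world.")))           # True
print(bool(final.match("i ate breakfast.")))        # False
# Space before the comma is still accepted, because ' ' and ','
# are simply members of the character class.
print(bool(final.match("This is , a sentence.")))   # True
# Literal parentheses and asterisks are accepted too.
print(bool(final.match("A(b)*.")))                  # True
```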
I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall('\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!
'(\w+)\s*=\s*'
re.findall('(\w+)\s*=\s*', 'I think Python=amazing')  # returns ['Python']
re.findall('(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall('(\w+)\s*=\s*', 'I think Python =amazing')  # returns ['Python']
You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
Just add some optional whitespace before the =:
\w+(?=\s*=)
Use this instead
re.findall('^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»
You need to allow for whitespace between the word and the =:
re.findall('\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a lookahead around the equals sign:
re.findall('(\w+)\s*=', "I think Python = amazing")
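A short sketch running the capturing-group pattern over the three spacings from the question, plus one input with several assignments. The last case illustrates the false-positive risk mentioned in another answer: a comparison with == is matched as well.

```python
import re

word_before_equals = re.compile(r'(\w+)\s*=')

for text in ["I think Python=amazing",
             "I think Python = amazing",
             "Python =amazing"]:
    print(word_before_equals.findall(text))   # ['Python'] each time

print(word_before_equals.findall("a=1, b = 2"))   # ['a', 'b']

# Caveat: the pattern cannot tell assignment from comparison.
print(word_before_equals.findall("if x == y"))    # ['x']
```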
r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.
re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']
I need to perform a search/replace on text which contains a comma which is NOT followed by a space, to change to a comma+space.
So I can find this using:
,[^\s]
But I am struggling with the replacement; I can't just use:
, (comma, space)
Or
& ,
As the match originally matches two characters.
Is there a way of saying '&' - 1, or '&[0]', or something that means 'the matched string, but only part of it' in the replacement argument?
Another way of asking this:
Can I use a regex to IDENTIFY one part of my string,
but REPLACE a (slightly different, but related) part of it?
I could probably just replace every comma with a comma+space, but this is a little more controlled and less likely to make a change I do not need.
For example:
Original:
Hello,World.
Should become:
Hello, World.
But:
Hello, World.
Should remain as :
Hello, World.
And currently, using my (bad) pattern I have:
Original:
Hello,World
After (wrong):
Hello, orld
I'm actually using Python's (2.6) 're' module for this as it happens.
Using parentheses to capture a part of the string is one way to do it. Another possibility is to use a lookahead assertion:
,(?=\S)
This pattern matches a comma only if it is followed by a non-whitespace character. It does not consume the character after the comma; it only uses that character to decide whether or not to match the comma.
For example:
>>> re.sub(r",(?=\S)", ", ", "Hello,World! Hello, World!")
'Hello, World! Hello, World!'
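Because the lookahead consumes nothing, it also behaves differently from a capture-group version when commas are adjacent. A small sketch comparing the two (the "a,,b" input is made up for illustration):

```python
import re

text = "a,,b"

# Lookahead: each comma is examined on its own.
with_lookahead = re.sub(r',(?=\S)', ', ', text)
print(with_lookahead)   # a, , b

# Capture group: the second comma is consumed as the "next character"
# of the first match, so it is never examined as a comma itself.
with_group = re.sub(r',(\S)', r', \1', text)
print(with_group)       # a, ,b

# Both leave already-spaced text alone.
print(re.sub(r',(?=\S)', ', ', "Hello, World."))   # Hello, World.
```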
Yes, use parentheses to "capture" the part of the string that matches your expression. In Python's re module the captured text is available through the match object's group() method, and as \1, \2, ... in the replacement string passed to re.sub.
Yes, you could. But why would you, in this simple case?
def insertspaceaftercomma(s):
    """Insert a space after every comma, then collapse a doubled space after a comma (if any)."""
    return s.replace(",", ", ").replace(",  ", ", ")
seems to work:
>>> insertspaceaftercomma("Hello, World")
'Hello, World'
>>> insertspaceaftercomma("Hello,World")
'Hello, World'
>>>
You can look for a comma + non-space character and then stick a space in between them:
re.sub(r',([^\s])', r', \1', string)
Try this (note the raw string for the replacement, so that \g is not treated as a string escape):
import re
s1 = 'Hello,World.'
re.sub(r',([^\s])', r', \g<1>', s1)
> Hello, World.
s2 = 'Hello, World.'
re.sub(r',([^\s])', r', \g<1>', s2)
> Hello, World.