how to find substring from a single line string - python

Suppose I have a string, s="panpanIpanAMpanJOEpan". I want to find the word pan and replace it with spaces so that the output string is "I AM JOE". How can I do that?
I also don't know how to find a certain substring in a long string without spaces, such as the one above.
It would be great if someone could help me learn about this.

If you don't know the filler word (pan), you can exploit the fact that the letters you want to keep are all upper case.
fillword = min(set("".join(i if i.islower() else ' ' for i in s).split(' '))-set(['']),key=len)
This works by first replacing all upper case letters with space, then splitting on space and finding the minimal nonempty word.
Use replace to replace with space, and then strip to remove excess spacing.
s="panpanIpanAMpanJOEpan"
s.replace(fillword,' ').strip()
gives:
'I AM JOE'
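A related sketch, under the same assumption that everything you want to keep is upper case and everything else is filler: instead of working out the filler word first, you could pull out the runs of upper-case letters directly and join them with spaces. The variable name `kept` is only for illustration.

```python
import re

s = "panpanIpanAMpanJOEpan"

# Each run of consecutive upper-case ASCII letters is treated as one word;
# everything in between is assumed to be filler and is dropped.
kept = " ".join(re.findall(r"[A-Z]+", s))
print(kept)
```

This only holds up while the filler never contains upper-case letters.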

s="panpanIpanAMpanJOEpan"
print(s.replace("pan"," ").strip())
use replace
Output:
I AM JOE

As DarrylG and others mentioned, .replace will do what you asked for, where you define what you want to replace ("pan") and what you want to replace it with (" ").
To find a certain string in a longer string you can use .find(), which takes a string you are looking for and optionally where to start and stop looking for it (as integers) as arguments.
If you want to find all occurrences of a string in a bigger string, there are two options:
Find the string with .find(), then cut the string so that it no longer contains your search term, and repeat this until .find() returns -1 (meaning the search term is no longer found in the string),
or use the re module and its .finditer function to find all occurrences of your string.
Edit: If you don't know what you are searching for, it becomes a bit more tricky, but you can write a regular expression that extracts this data using the same re module. This is easy if you know what the end result is supposed to be (I AM JOE in your case). If you don't, it becomes more complicated and we would need additional information to help with this.
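Both options for finding every occurrence could look roughly like this. It is a minimal sketch: `find_all` is a made-up helper name, and rather than cutting the string it uses the optional start argument of .find() to continue after the previous hit, which gives the same positions without modifying the string.

```python
import re

def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    if not needle:
        raise ValueError("needle must not be empty")
    positions = []
    start = haystack.find(needle)
    while start != -1:  # -1 means there are no more occurrences
        positions.append(start)
        # Resume the search right after the occurrence just found.
        start = haystack.find(needle, start + len(needle))
    return positions

s = "panpanIpanAMpanJOEpan"
positions_find = find_all(s, "pan")

# Same result with re.finditer; re.escape keeps the search term literal.
positions_regex = [m.start() for m in re.finditer(re.escape("pan"), s)]

print(positions_find)
print(positions_regex)
```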

You can use replace to replace all occurrences of a substring at once.
In case you want to find the substrings yourself, you can do it manually:
s = "panpanIpanAMpanJOEpan"
while True:
    panPosition = s.find('pan')  # -1 means 'pan' was not found
    if panPosition == -1:
        s = s.strip()
        break
    # Cut 'pan' out of s and replace it with a blank.
    s = s[:panPosition] + ' ' + s[panPosition + 3:]
print(s)
Out:
I AM JOE

Related

Python regex to check if a substring is at the beginning or at the end of a bigger path to look for

I have a string containing words in the form word1_word2, word3_word4, word5_word1 (so a word can appear at the left or at the right). I want a regex that looks for all the occurrences of a specific word, and returns the "super word" containing it. So if I'm looking for word1, I expect my regex to return word1_word2, word5_word1. Since the word can appear on the left or on the right, I wrote this:
re.findall("( {}_)?[\u0061-\u007a\u00e0-\u00e1\u00e8-\u00e9\u00ec\u00ed\u00f2-\u00f3\u00f9\u00fa]*(_{} )?".format("w1", "w1"), string)
With the optional blocks at the beginning or at the end of the pattern. However, it takes forever to execute and I think something is not correct because I tried removing the optional blocks and writing two separate regex for looking at the beginning and at the end and they are much faster (but I don't want to use two regex). Am I missing something or is it normal?
This would be the regex solution to your problem:
re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
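To try that pattern out, here is a small sketch. The function name `super_words` is invented for the example, the group is made non-capturing so that findall returns the whole match, and re.escape is added on the assumption that the word you search for might contain regex metacharacters.

```python
import re

def super_words(word, text):
    """Return every 'left_right' token in text that has word on either side."""
    w = re.escape(word)  # treat the search word literally
    pattern = rf'\b(?:{w}_\w+?|\w+?_{w})\b'
    return re.findall(pattern, text)

result = super_words("word1", "word1_word2, word3_word4, word5_word1")
print(result)
```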
Python provides some methods to do this
a=['word1_word2', 'word3_word4', 'word5_word1']
b = [x for x in a if x.startswith("word1") or x.endswith('word1')]
print(b) # ['word1_word2', 'word5_word1']
import re
s = 'word1_word2, word3_word4, word5_word1'
matches = re.finditer(r'(\w+_word1)|(word1_\w+)', s)
result = [m.group() for m in matches]
print(result) # ['word1_word2', 'word5_word1']
This is one method, but having seen @Carl's answer, I voted for his. It is a faster and cleaner method. I will just leave this here as one of many regex options.
This regex will do the job for word1:
regex = r'(?:word\d_)*word1(?:_word\d)*'
re.findall(regex, string)
You can also use this:
re.findall(rf'\b(word{number}_\w+?|\w+?_word{number})\b', string)
Try the following regex.
In the following, replace word1 with the word you're looking for. This is assuming that the word you are looking for consists of only alphanumeric characters.
([a-zA-Z0-9]*_word1)|(word1_.[a-zA-Z0-9]*)

Find words containing . in middle or at the end

I need help to find words containing . in middle or at the end with regex in python.
Like N. or N.E. or North.East or N.East.
Not sure if you specifically need to use regex, but here's how you can do it without. Here are a couple of ways of looking at it:
If you're looking anywhere in the word (let's call it MyString) except the first character, you can use '.' in MyString[1:].
If you want to check the exact center of a string, you can use MyString[len(MyString)//2] == '.'; if the string has an even number of characters, the right-hand character of the middle pair will be checked ('d' in 'abcdef', for instance).
If you want to check the very last character without checking anything else, MyString[-1] == '.' is enough.
Assuming that your words are sent as strings, anyway.
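As a rough sketch of the first check applied to the examples from the question: the helper name `has_inner_or_trailing_dot` and the sample words are made up, and the text is assumed to be split on whitespace first.

```python
def has_inner_or_trailing_dot(word):
    """True if word contains '.' anywhere except as its first character."""
    return '.' in word[1:]

words = "N. N.E. North.East N.East North .hidden".split()
matching = [w for w in words if has_inner_or_trailing_dot(w)]
print(matching)
```

Words without a dot, and words whose only dot is the first character, are left out.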
Maybe this is what you are looking for:
/\w+\.\w*\.?/g
https://regex101.com/r/iH9bO6/1
^\w+(?:\.\w+)*\.?$
Try this. See demo.
https://regex101.com/r/sS2dM8/15

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
The idea is to get the index of "E" and check whether it is followed by 2 integers.
Then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
At the moment the result I get is:
this is tEst
because it finds the index of the first "E" in the list and deletes everything after index+3.
I guess my question is how to get the index in the list where a combination of strings exists,
but I can't seem to find out how.
Thanks to everyone for the answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
this is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out Python's re module.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use Python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods str.replace to swap the periods for spaces, and str.title to put the result in Title Case.
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() function of the re module. Regex patterns tend to use a lot of backslashes, so it's common in Python to prefix the regex pattern string with 'r', which tells the Python interpreter not to treat backslashes as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # The '0' refers to group 0, i.e. the entire match
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
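Putting the pieces from this answer together, the whole rename step might be sketched like this. The function name `clean_filename` and its `ignore_case` parameter are invented for the example, and returning None when no "E" plus two digits is found is just one possible choice.

```python
import re

def clean_filename(filename, ignore_case=False):
    """Keep everything up to and including 'Exx', then tidy it up."""
    flags = re.IGNORECASE if ignore_case else 0
    match_object = re.search(r'E\d\d', filename, flags)
    if match_object is None:
        return None  # no 'E' followed by two digits in this name
    truncated = filename[:match_object.end()]
    # Swap the periods for spaces and put the result in Title Case.
    return truncated.replace('.', ' ').title()

cleaned = clean_filename('this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv')
print(cleaned)
```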
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' holds a character from the list 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which raises a TypeError.
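If you would like to stay with a loop instead of a regex, one way to make that idea work is to loop over positions with enumerate and test the two following characters with str.isdigit. This is only a sketch; `index_of_marker` is a made-up name.

```python
def index_of_marker(text):
    """Return the index of the first 'E' followed by two digits, or -1."""
    for i, ch in enumerate(text):  # i is the position, ch the character
        following = text[i + 1:i + 3]
        if ch == "E" and len(following) == 2 and following.isdigit():
            return i
    return -1

string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
p = index_of_marker(string)
# Keep the 'E' and its two digits, drop the rest, then tidy up.
new = string[:p + 3].replace(".", " ").title() if p != -1 else string
print(new)
```

The "E" in tEst3 is skipped because it is not followed by two digits.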

Find words with capital letters not at start of a sentence with regex

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.
The best way I can think of is to check that the word is not preceded by a full stop then a space. I am pretty sure that I need to use negative lookbehind. This is what I have so far, it will run but always returns nothing:
(?<!\.\s)\b[A-Z][a-z]*\b
I think the problem might be with the use of [A-Z][a-z]* inside the word boundary \b but I am really not sure.
Thanks for the help.
Your regex appears to work:
In [6]: import re
In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']
Make sure you're using a raw string (r'...') when specifying the regex.
If you have some specific inputs on which the regex doesn't work, please add them to your question.
Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:
import string
S = "T'was brillig, and the slithy Toves were gyring and gimbling in the " + \
    "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe."
LS = S.split(' ')
words = [x for (pre,x) in zip(['.']+LS, LS+[' '])
         if (x[0] in string.ascii_uppercase) and (pre[-1] != '.')]
Try and loop over your input with:
(?!^)\b([A-Z]\w+)
and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is everything but a beginning of line. A negative lookbehind would have the same effect.
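The lookbehind from the question and the (?!^) idea can also be combined, so that neither the first word of the text nor a word after a full stop and a space is reported. A minimal sketch with a made-up sample sentence:

```python
import re

# (?!^)      : not at the very start of the text
# (?<!\.\s)  : not directly after a full stop plus one whitespace character
pattern = re.compile(r'(?!^)(?<!\.\s)\b[A-Z][a-z]*\b')

text = "This is Bob. He met Alice in Paris."
found = pattern.findall(text)
print(found)
```

This still treats only ". " as a sentence boundary; sentences ending in "!" or "?", or separated by more than one whitespace character, would need a broader check.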

Find two of the same character in a string with regular expressions

This is in reference to a question I asked before.
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search a string for instances of two identical vowels in a row, such as the "oo" in "took" or the "ee" in "bees", and replace each pair with one of the letters followed by a colon (:).
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.
Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi
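In Python, re.sub already replaces every match, and case-insensitivity is a flag. Here is a small sketch run against the examples from the question; the function name `mark_double_vowels` is invented for illustration.

```python
import re

def mark_double_vowels(word):
    """Replace each pair of identical vowels with the vowel followed by ':'."""
    # \1 in the pattern must repeat exactly what group 1 matched.
    return re.sub(r'([aeiou])\1', r'\1:', word, flags=re.IGNORECASE)

results = [mark_double_vowels(w) for w in ["took", "waaeek", "raaag", "TOOK"]]
print(results)
```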
What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took","waaeek","raaag"]
for s in data:
    print(re.sub(r'([aeiou]){2}',r'\1:',s))
This matches exactly two occurrences {2} of any member of the set [aeiou] and replaces them with the vowel, captured with the parens () and placed in the substitution string by the \1, followed by a ':'.
Output:
to:k
wa:e:k
ra:ag
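The three sample words happen to give the expected output with either pattern, so the difference only shows on a word with two different vowels in a row. A quick sketch using "tail" as a made-up test word:

```python
import re

loose = r'([aeiou]){2}'   # any two vowels in a row
strict = r'([aeiou])\1'   # the same vowel twice (backreference)

loose_result = re.sub(loose, r'\1:', "tail")
strict_result = re.sub(strict, r'\1:', "tail")
print(loose_result, strict_result)
```

With the {2} version the group holds the last vowel matched, so "ai" is wrongly rewritten, while the backreference version leaves the word alone.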
You'll need to use a back reference in your search expression. Try something like: ([a-z])+\1 (or ([a-z])\1 for just a double).
