Find the next word after a word in a string - python

I am trying to record the word after a specific word. For example, let's say I have a string:
First Name: John
Last Name: Doe
Email: John.Doe#email.com
I want to search the string for a key word such as "First Name:". Then I want to only capture the next word after it, in this case John.
I started using string.find("First Name:"), but I do not think that is the correct approach.
I could use some help with this. Most examples either split the string or keep everything after "John". My goal is to be able to search strings for specific keywords no matter their location.
SOLUTION:
I used code similar to the following:
search = r"(First Name:)(.)(.+)"
x = re.compile(search)
This gave me "John" with no spaces (the third group of the match).

A regular expression is the way to go:
import re
pattern = r"First Name: (\w+)"
first_names = re.findall(pattern, mystring)
The pattern matches the prefix "First Name: ", and the capturing group (\w+) extracts only the word that follows it, so findall returns just that word. Alternatively, you can split the string and iterate over the resulting list, skipping the chunk that precedes the first keyword:
my_words = [x.split()[0] for x in mystring.split("First Name: ")[1:]]
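The regex approach can be wrapped in a small reusable helper, sketched below. The function name next_word_after and the sample text are hypothetical, and "word" is assumed to mean a run of non-whitespace characters. re.escape is used so that punctuation in the keyword, such as the colon, is matched literally.

```python
import re

def next_word_after(text, keyword):
    """Return the first whitespace-delimited word following keyword, or None."""
    # re.escape makes characters such as ':' or '.' in the keyword literal.
    match = re.search(re.escape(keyword) + r"\s*(\S+)", text)
    return match.group(1) if match else None

sample = "First Name: John\nLast Name: Doe\nEmail: John.Doe#email.com"
first_name = next_word_after(sample, "First Name:")
last_name = next_word_after(sample, "Last Name:")
missing = next_word_after(sample, "Phone:")
print(first_name, last_name, missing)
```

Because \s* also matches newlines, a keyword with an empty value would pick up the first word of the next line; replace \s* with a literal space pattern if that matters for your data.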

The .find approach is a good start.
You can use split on the remaining string to limit the result to a single word.
Without using regex:
s = "abc def opx"
q = 'abc'
res = s[s.find(q)+len(q):].split()[0]
res == 'def'
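One caveat with the one-liner: s.find(q) returns -1 when q is absent, so the slice silently starts at len(q) - 1 instead of failing. A minimal sketch that handles this case with str.partition follows; the helper name word_after is hypothetical.

```python
def word_after(text, keyword):
    """Return the word that follows keyword, or None if there is none."""
    # partition returns (before, separator, after); separator is '' when keyword is absent.
    _, found, rest = text.partition(keyword)
    words = rest.split()
    return words[0] if found and words else None

res_found = word_after("abc def opx", "abc")
res_absent = word_after("abc def opx", "xyz")
res_at_end = word_after("abc def opx", "opx")
print(res_found, res_absent, res_at_end)
```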

Related

Regex : replace url inside string

I have:
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
I need a Python regular expression that identifies xxx-zzzzzzzzz.eeeeeeeeeee.fr so that I can apply a substring operation to it.
Expected output:
string : 'Server:PIPELININGSIZE'
The URL is inside a string. I have tried a lot of regular expressions.
Not sure if this helps, because your question was quite vaguely formulated. :)
import re
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
string_1 = re.search('[a-z.-]+([A-Z]+)', string).group(1)
print(f'string: Server:{string_1}')
Output:
string: Server:PIPELININGSIZE
No regex. A single split on your target substring is enough.
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
last = string.split("fr", 1)[1]
first = string[:string.index(":")]
print(f'{first}:{last}')
Output:
Server:PIPELININGSIZE
The wording of the question suggests that you wish to find the hostname in the string, but the expected output suggests that you want to remove it. The following regular expression will create a tuple and allow you to do either.
import re
s = "Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE"
p = re.compile('^([A-Za-z]+[:])(.*?)([A-Z]+)$')
m = p.search(s)
result = m.groups()
# ('Server:', 'xxx-zzzzzzzzz.eeeeeeeeeee.fr', 'PIPELININGSIZE')
Remove the hostname:
print(f'{result[0]}{result[2]}')
# Output: Server:PIPELININGSIZE
Extract the hostname:
print(result[1])
# Output: xxx-zzzzzzzzz.eeeeeeeeeee.fr
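If removal is all that is needed, re.sub can do it in one step. The sketch below assumes the hostname consists only of lowercase letters, digits, dots, and hyphens and directly follows "Server:"; the variable names are illustrative.

```python
import re

line = "Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE"
# Keep the "Server:" prefix (group 1) and drop the run of hostname characters after it.
cleaned = re.sub(r"^(Server:)[a-z0-9.-]+", r"\1", line)
print(cleaned)
```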

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately, the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which ID value is between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to pass a function as the replacement. The function receives the match object from re.sub and inserts the ID captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=("[^"]*") lyvis="off" toc_visible="off"')
def replacer(m):
    # m.group(1) is the quoted ID, whatever its value
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
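Applied to the XML rows from the question, the back-reference technique might look like the sketch below. The sample rows (including the TOOL_OTHER tag and its IDs) are made up for illustration, and the IDs are assumed to be purely numeric.

```python
import re

xml = (
    '<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>\n'
    '<TOOL_SELECT_LINE RowID="999" id_tool_base="42" use="false"/>\n'
    '<TOOL_OTHER RowID="1" id_tool_base="2" use="false"/>'
)

# Group 1 captures everything up to use=, whatever the numeric IDs are.
pattern = re.compile(
    r'(<(?:TOOL_BUFFER|TOOL_SELECT_LINE) RowID="\d+" id_tool_base="\d+" use=)"false"'
)
# \1 re-inserts the captured prefix unchanged; only the attribute value is swapped.
updated = pattern.sub(r'\1"true"', xml)
print(updated)
```

Only the two listed tool tags are switched to use="true"; the third row does not match the pattern and is left untouched.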

Regular expression to match one from two groups

I want to match whichever one of two alternatives fits,
for example:
gmail.com|gmail.co.in
This gives me a direct result, but how can I write a regular expression in Python that matches email IDs for the above case?
Note: I just want to match either 3 characters after the dot, or at most two dot-separated groups of 2 characters.
I tried writing the regex as:
[\w]+\.?[\w]{3}|[[\w]{2}\.?]{2}
but it doesn't give me the expected results.
If I use (), it returns only the group, e.g. abcd#gmail.com returns gmail.com, but I need to retrieve the whole email address.
You can use the following regex:
\w+#\w+\.((\w{3})|(\w{2}\.\w{2}))
All you need here is to put the first part as \w+#\w+\. and then play with grouping and the pipe. The following pattern:
((\w{3})|(\w{2}\.\w{2}))
will match either a string of 3 word characters, or (\w{2}\.\w{2}), which means 2 word characters, then a dot, then 2 more word characters.
Hope this helps.
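To get the whole address back rather than just the captured group, one option is to make the groups non-capturing, so that findall returns full matches. This is a sketch; the sample text is made up, and '#' is kept as the separator only because the question's examples use it.

```python
import re

# (?:...) groups the alternatives without capturing, so findall returns whole matches.
# The two-part ending is tried first; \b stops a match from ending mid-word.
pattern = re.compile(r"\w+#\w+\.(?:\w{2}\.\w{2}|\w{3})\b")

text = "contacts: abcd#gmail.com, xyz#gmail.co.in"
addresses = pattern.findall(text)
print(addresses)
```

With a capturing pattern, match.group(0) on a re.search result gives the same whole-match text.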
>>> import re
>>> y = re.search(r'(?P<user>\w+)#(?P<domain>[\w.]+)', 'abc#gmail.com')
>>> print(y.group("user"), y.group("domain"))
abc gmail.com
>>> y = re.search(r'(?P<user>\w+)#(?P<domain>[\w.]+)', 'abc#gmail.co.in')
>>> print(y.group("user"), y.group("domain"))
abc gmail.co.in
I hope this can help you.
email = 'abc#gmail.com'
user = email.split('#')[0]
domain = email.split('#')[1]

Match >>number and replace it

I have a string that contains some words in the >>number format.
For example:
this is a sentence >>82384324
I need a way to match those >>numbers and replace it with another string that contains the number.
For example: >>342 becomes
this is a string that contains the number 342
import re
s = "this is a sentence >>82384324"
print(re.sub(r"(.*>>)", "This is a string containing ", s))
This is a string containing 82384324
Assuming you are going to run into multiple number occurrences in a string, I would suggest something a little more robust, such as:
import re
pattern = re.compile(r'>>(\d+)')
text = "sadsaasdsa >>353325233253 Frank >>352523523"
for number in pattern.findall(text):
    print("The string contained the number %s" % number)
Which yields:
The string contained the number 353325233253
The string contained the number 352523523
Using this basic pattern should work:
>>(\d+)
code:
import re
str = "this is a sentence >>82384324"
rep = "which contains the number \\1"
pat = ">>(\\d+)"
res = re.sub(pat, rep, str)
print(res)
example: http://regex101.com/r/kK3tL8
One simple way, assuming the only place you find ">>" is before a number, is to replace just those:
>>> mystr = "this is a sentence >>82384324"
>>> mystr.replace(">>","this is a string that contains the number ")
'this is a sentence this is a string that contains the number 82384324'
If there are other examples of >> in the text that you don't want to replace, you will need to catch the number as well, and it'll be best to use a regular expression.
>>> import re
>>> re.sub(r'>>(\d+)', r'this is a string that contains the number \g<1>', mystr)
'this is a sentence this is a string that contains the number 82384324'
https://docs.python.org/2/library/re.html and https://docs.python.org/2/howto/regex.html can provide more information about regular expressions.
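As a quick illustration of the difference between the two approaches, the sketch below runs re.sub on a made-up string that contains two numbers and one bare ">>". The replacement wording is arbitrary.

```python
import re

text = "first >>353325233253 then >>352523523 and a stray >> arrow"
# \g<1> refers to the digits captured by (\d+); a bare ">>" has no digits and is left alone.
result = re.sub(r">>(\d+)", r"number \g<1>", text)
print(result)
```

re.sub replaces every non-overlapping match by default, so no loop is needed for multiple occurrences.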
You can do this using:
import re
sentence = 'Stringwith>>1221'
print('This is a string that contains the number %s' % re.search(r'>>(\d+)', sentence).group(1))
Result:
This is a string that contains the number 1221
You can also look at the findall option to get all numbers that match the pattern.

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
someDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible string and should cope with spaces and any other odd content that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
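To see this pattern in action, here is a sketch with a made-up long_string that also contains a second http://www. link, as mentioned in the question's update. The other URL and the surrounding words are hypothetical.

```python
import re

# Hypothetical sample: the wanted URL sits among other text and another http://www. link.
long_string = (
    "noise http://www.other.org/page more noise "
    "http://www.someDomainName.com/2134 trailing text"
)
# The dots are escaped so they match literally; \d+ matches the number part.
results = re.findall(r"\bhttp://www\.someDomainName\.com/\d+\b", long_string)
print(results)
```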
>>> import re
>>> pattern = re.compile(r"(http://www\.)(\w*)(\.com/)(\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
With the above pattern, we get five groups:
Group 0 is the complete string that is matched.
The rest follow the order of the parentheses in the pattern, so the one you are looking for is the second one, (\w*).
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the parts of the pattern you don't need and keep just (\w*).
>>> pattern = re.compile(r"http://www\.(\w*)\.com/\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, groups 2, 3 and 4 from the previous example no longer exist, because only one group is captured. Group 0, the complete matched string, is always available.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in someDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])
