Extracting text from a line: Regex in Python


I'm working with regular expressions in Python and I'm struggling with this.
I have data in a file of lines like this one:
|person=[[Old McDonald]]
and I just want to be able to extract Old McDonald from this line.
I have been trying with this regular expression:
matchLine = re.match(r"\|[a-z]+=(\[\[)?[A-Z][a-z]*(\]\])", line)
print matchLine
but it doesn't work; None is the result each time.

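The construct [A-Z][a-z]* does not match Old McDonald: it allows one capital letter followed by lowercase letters only, so it stops at the space. You should probably use something like [A-Z][A-Za-z ]* instead. Here is a code example: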
import re
line = '|person=[[Old McDonald]]'
matchLine = re.match(r'\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print(matchLine.group(1))
The output is Old McDonald for me. If you need to search in the middle of the string, use re.search instead of re.match:
import re
line = 'blahblahblah|person=[[Old McDonald]]blahblahblah'
matchLine = re.search(r'\|[a-z]+=(?:\[\[)?([A-Z][A-Za-z ]*)\]\]', line)
print(matchLine.group(1))
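If the field name is needed as well as the value, named groups can return both at once. The following is only a sketch: the helper name parse_field and the group names key and value are made up for illustration, and it assumes that both [[ and ]] are always present and that the value contains no ] character.

```python
import re

# Hypothetical pattern: |key=[[value]] with both bracket pairs required.
FIELD = re.compile(r"\|(?P<key>[a-z]+)=\[\[(?P<value>[^\]]+)\]\]")

def parse_field(line):
    """Return (key, value) for the first |key=[[value]] field, or None."""
    m = FIELD.search(line)
    if m is None:
        return None
    return m.group("key"), m.group("value")

print(parse_field("|person=[[Old McDonald]]"))  # ('person', 'Old McDonald')
print(parse_field("no field here"))             # None
```

Checking for None before calling group avoids the AttributeError that matchLine.group(1) raises when a line does not match.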

Related

replace before and after a string using re in python

I have a string like this: 'approved:rakeshc#IAD.GOOGLE.COM'
I would like to extract the text after ':' and before '#'.
In this case the text to be extracted is rakeshc.
It can be done using the split method: 'approved:rakeshc#IAD.GOOGLE.COM'.split(':')[1].split('#')[0]
but I want to do this with a regular expression.
This is what I have tried so far:
import re
iptext = 'approved:rakeshc#IAD.GOOGLE.COM'
re.sub('^(.*approved:)',"", iptext)  # gives everything after ':'
re.sub('(#IAD.GOOGLE.COM)$',"", iptext)  # gives everything before '#'
I want to get the result with a single expression. The expression would be used to replace the whole string with only the middle part.

Here is a regex one-liner:
inp = "approved:rakeshc#IAD.GOOGLE.COM"
output = re.sub(r'^.*:|#.*$', '', inp)
print(output) # rakeshc
This approach strips all text from the start of the string up to and including the :, and also strips everything from the # to the end. This leaves behind only the text between the two, here the ID rakeshc.

Use a capture group to copy the part between the matches to the result.
result = re.sub(r'.*approved:(.*)#IAD\.GOOGLE\.COM$', r'\1', iptext)

Hope this works for you:
import re
input_text = "approved:rakeshc#IAD.GOOGLE.COM"
out = re.search(':(.+?)#', input_text)
if out:
    found = out.group(1)
    print(found)

You can use this one-liner:
re.sub(r'^.*:(\w+)#.*$', r'\1', iptext)
Output:
rakeshc
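One difference between the re.sub and re.search approaches above is what happens when the delimiters are missing: re.sub returns the input unchanged, while re.search returns None, which is easier to detect. A minimal sketch follows; the helper name between and its default delimiters are assumptions made for this example.

```python
import re

def between(text, start=":", end="#"):
    """Return the text between the first start and the next end, or None."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else None

iptext = "approved:rakeshc#IAD.GOOGLE.COM"
print(between(iptext))               # rakeshc
print(between("no delimiters"))      # None

# re.sub leaves a non-matching string untouched instead of signalling failure.
print(re.sub(r"^.*:(\w+)#.*$", r"\1", "no delimiters"))  # no delimiters
```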

Replacing when a word is in another word but with special circumstances

My program replaces tokens with values when they appear in a file. It gets stuck when reading in a certain line. Here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.

You can use regex:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub("Token100a", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1data
More about regex here:
https://www.w3schools.com/python/python_regex.asp

You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'
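The lookahead above only guards against a trailing "a". If token names can be followed by other letters or digits, the same idea can be widened to (?!\w), and several tokens can be replaced in one pass. This is a sketch under assumptions: the helper name replace_tokens and the value "other" are invented for illustration.

```python
import re

def replace_tokens(line, values):
    """Replace each token name in `values` unless a word character follows it."""
    # Longest names first, so the alternation tries Token100a before Token100.
    names = sorted(values, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    pattern = re.compile("(?:" + alternation + r")(?!\w)")
    return pattern.sub(lambda m: values[m.group(0)], line)

line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
print(replace_tokens(line, {"Token100": "data"}))
# 1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a
print(replace_tokens(line, {"Token100": "data", "Token100a": "other"}))
# 1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1other
```

re.escape keeps token names that contain regex metacharacters from being interpreted as patterns.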

how to extract portion of a string between two substrings in a multiline string in python

I'm trying to extract the portion of a string between two string identifiers. The technique works if the search is made in the first line, but it does not work for substrings in other lines.
The string is like this:
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
So far I have the following code.
startString='jhfshf'
endString=';'
search_var=mystring[mystring.find(startString)+len(startString):mystring.find(endString)]
print(search_var)
I get the correct output: iztzrtzoi hjge)
But if I search for a string in the second line (for example startString='ldjfsj'), it does not work. Can anybody suggest changes to fix this?

Using Regex.
Demo:
import re
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
m = re.search(r"(?<=jhfshf).*?(?=;)", mystring)
if m:
    print(m.group())
Output:
iztzrtzoi hjge)
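The slicing code in the question fails for later lines because mystring.find(endString) always returns the position of the first ';' in the whole string. For a start marker on the second line that position lies before the start index, so the slice is empty. A non-regex fix is to pass the start index as the second argument of str.find. The sketch below uses a shortened copy of the sample string, and the helper name extract_between is an assumption for illustration.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text after start_marker up to the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]

mystring = """abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
end"""

print(extract_between(mystring, "jhfshf", ";"))  # prints " iztzrtzoi hjge)"
print(extract_between(mystring, "ldjfsj", ";"))  # prints " sjsdgj sodfsd)"
```

The results keep the leading space that follows the start marker; call .strip() on them if that is not wanted.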

How to read only number from a specific line using python script

How can I read only the number from a specific line using a Python script? For example, from
"1009 run test jobs" I should read only the number "1009" instead of the whole line "1009 run test jobs".

If your number always comes first, you can also use int(line.split()[0]).

A simple regexp should do:
import re
match = re.match(r"(\d+)", "1009 run test jobs")
if match:
    number = match.group()
https://docs.python.org/3/library/re.html
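To apply this to one particular line and get an integer rather than a string, the match can be wrapped in a small helper. This is a sketch that assumes the number sits at the start of the line (optionally after whitespace); the helper name leading_number and the sample list of lines are invented for the example.

```python
import re

def leading_number(line):
    """Return the integer at the start of `line`, or None if there is none."""
    m = re.match(r"\s*(\d+)", line)
    return int(m.group(1)) if m else None

# Hypothetical file contents, already split into lines.
lines = ["header", "1009 run test jobs", "done"]

print(leading_number(lines[1]))  # 1009
print(leading_number(lines[0]))  # None
```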

Use a regular expression:
>>> import re
>>> x = "1009 run test jobs"
>>> re.sub("[^0-9]", "", x)
'1009'
>>> re.sub(r"\D", "", x)  # better way
'1009'

Or simply keep the whitespace-separated words that consist only of digits:
[int(s) for s in text.split() if s.isdigit()]
where text is your string of text.

Pretty sure there is a "more pythonic" way, but this works for me:
s = 'teststri3k2k3s21k'
outs = ''
for i in s:
    try:
        numbr = int(i)  # raises ValueError for non-digit characters
        outs += i
    except ValueError:
        pass
print(outs)
If the number is always at the beginning of your string and has a fixed width, you might consider a slice such as outstring = instring[0:4] (for a four-digit number like 1009).

You can do it with regular expression. That's very easy:
import re
regularExpression = r"[^\d-]*(-?[0-9]+).*"
line = "some text -123 some text"
m = re.search(regularExpression, line)
if m:
    print(m.groups()[0])
This regular expression extracts the first number in a text. It treats '-' as part of the number. If you don't want that, change the regular expression to this one: r"[^\d-]*([0-9]+).*"

Regular expression, trimming after a particular sign and neglecting the list terms which do not have that sign

file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited=[]
for line in file.readlines():
    if 'debited with' in line:
        import re
        a= re.findall(r'[INR]\S*', line)
        debited.append(a)
        file2.write(line)
print re.findall(r'^(.*?)(=)?$', (debited)
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.

You can use a simple regex matching INR + whitespace if any + any digits with , as separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts will be output as a list.
If you want INR as part of the output, just remove the capturing round brackets from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)

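Given the code you already have, the simplest change is to make the extracted string end just before the equals sign. Note that [INR] is a character class that matches a single I, N, or R rather than the literal text INR, which is why NFS*Cash also shows up in your output. Just replace this line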
a= re.findall(r'[INR]\S*', line)
with this:
a= re.findall(r'[INR][^\s=]*', line)
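The literal INR pattern from the first answer can also be applied to each line directly, instead of to the string form of the collected list. The sketch below works on an in-memory list; the sample messages are invented for illustration and are not taken from SMSm.txt, and the helper name debited_amounts is likewise an assumption.

```python
import re

# Literal "INR", optional whitespace, then digits with "," as group separator.
AMOUNT = re.compile(r"INR\s*\d+(?:,\d+)*")

def debited_amounts(lines):
    """Collect 'INR <amount>' strings from lines that mention a debit."""
    amounts = []
    for line in lines:
        if "debited with" in line:
            amounts.extend(AMOUNT.findall(line))
    return amounts

# Hypothetical sample messages.
sample = [
    "Your a/c is debited with INR 2,000=2E00 on 01-Jan",
    "Balance enquiry via NFS*Cash",
    "Your a/c is debited with INR 12,000=2E400 at ATM",
]
print(debited_amounts(sample))  # ['INR 2,000', 'INR 12,000']
```

Using extend instead of append produces one flat list rather than a list of lists.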
