how to find value in (txt) python string - python

I'm new to Python and I'm trying to extract a value from text. I can find the keyword with re.search('keyword'), but I want to get the value that follows the keyword.
text = "word1:1434, word2:4446, word3:7171"
I just want to get the value of word1.
I tried:
keyword = 'word1'
before_keyword, keyword, after_keyword = text.partition(keyword)
print(after_keyword)
Output:
:1434, word2:4446, word3:7171
I just want to get the value of word1 (1434).

Here is how you can search the text using regular expressions:
import re
keyword_regex = r'word1:(\d+)'
text = "word1:1434, word2:4446, word3:7171"
keyword_value = re.search(keyword_regex, text)
print(keyword_value.group(1))
The RegEx word1:(\d+) searches for the string word1: followed by one or more digits. It stops matching when the next character is not a digit. The parentheses around (\d+) make this part a capturing group which is what enables you to access it later using keyword_value.group(1).
You can read more about regular expressions in general, and about Python's re module, in the official Python documentation.
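If the keyword is not fixed, the same pattern can be built at run time. This is a minimal sketch, assuming the value is always a run of digits right after `keyword:` and that the keyword starts with a letter, digit or underscore; `find_value` is a hypothetical helper name, not something from the answer above.

```python
import re

def find_value(text, keyword):
    """Return the digits following 'keyword:' in text, or None if absent."""
    # re.escape guards against keywords that contain regex metacharacters.
    # \b keeps 'word1' from matching inside a longer name such as 'myword1'.
    pattern = r'\b' + re.escape(keyword) + r':(\d+)'
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1) if match else None

text = "word1:1434, word2:4446, word3:7171"
print(find_value(text, "word1"))  # 1434
print(find_value(text, "word9"))  # None
```

Checking for None matters here: calling .group(1) on a failed search raises AttributeError.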

Assuming the input text is a string and not a dict, you can use split:
text = "word1:1434, word2:4446, word3:7171"
keyword = 'word1'
print(text.split(keyword+":")[1].split(",")[0])
Hope this helps...
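If several values are needed, it may be simpler to turn the string into a dict once and look keys up afterwards. A sketch, assuming every item has the form key:value and items are separated by commas; the name `values` is illustrative only.

```python
text = "word1:1434, word2:4446, word3:7171"

# Split into 'key:value' items, then split each item once on ':'.
values = {}
for item in text.split(","):
    key, _, value = item.strip().partition(":")
    values[key] = value

print(values["word1"])  # 1434
```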

Related

Updating a string using regular expressions in Python

I'm pretty sure that my question is very straightforward but I cannot find the answer to it. Let's say we have an input string like:
input = "This is an example"
Now, I want to simply replace every word --generally speaking, every substring using a regular expression, "word" here is just an example-- in the input with another string which includes the original string too. For instance, I want to add an # to the left and right of every word in input. And, the output would be:
output = "#This# #is# #an# #example#"
What is the solution? I know how to use re.sub or replace, but I do not know how I can use them in a way that I can update the original matched strings and not completely replace them with something else.
You can use capture groups for that.
import re
input = "This is an example"
output = re.sub(r"(\w+)", r"#\1#", input)
A capture group is something that you can later reference, for example in the substitution string. In this case, I'm matching a word, putting it into a capture group and then replacing it with the same word, but with # added as a prefix and a suffix.
You can read more about regexps in Python in the docs.
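When the replacement needs the whole match rather than a numbered group, `\g<0>` refers to the entire match, and re.sub also accepts a function as the replacement. A short sketch of both; the variable names are illustrative only.

```python
import re

text = "This is an example"

# \g<0> stands for the whole match, so the pattern needs no parentheses.
wrapped = re.sub(r"\w+", r"#\g<0>#", text)
print(wrapped)  # #This# #is# #an# #example#

# A function receives the match object and returns the replacement string.
upper_wrapped = re.sub(r"\w+", lambda m: "#" + m.group(0).upper() + "#", text)
print(upper_wrapped)  # #THIS# #IS# #AN# #EXAMPLE#
```

The function form is the more flexible one, since the replacement can depend on the matched text in any way.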
Here is an option using re.sub with lookarounds:
import re
input = "This is an example"
output = re.sub(r'(?<!\w)(?=\w)|(?<=\w)(?!\w)', '#', input)
print(output)
#This# #is# #an# #example#
This is a solution without the re library:
a = "This is an example"
l = []
for i in a.split(" "):
    l.append('#' + i + '#')
print(" ".join(l))
You can match only word boundaries with \b:
import re
input = "This is an example"
output = re.sub(r'\b', '#', input)
print(output)
#This# #is# #an# #example#

Python: strip function definition using regex

I am a complete beginner at programming and am reading the book "Automate the Boring Stuff with Python". In Chapter 7 there is a practice project: a regex version of strip(). My code below does not work (I use Python 3.6.1). Could anyone help?
import re
string = input("Enter a string to strip: ")
strip_chars = input("Enter the characters you want to be stripped: ")
def strip_fn(string, strip_chars):
    if strip_chars == '':
        blank_start_end_regex = re.compile(r'^(\s)+|(\s)+$')
        stripped_string = blank_start_end_regex.sub('', string)
        print(stripped_string)
    else:
        strip_chars_start_end_regex = re.compile(r'^(strip_chars)*|(strip_chars)*$')
        stripped_string = strip_chars_start_end_regex.sub('', string)
        print(stripped_string)
You can also use re.sub to remove the characters at the start or the end.
Say the character is 'x':
re.sub(r'^x+', "", string)
re.sub(r'x+$', "", string)
The first line acts like lstrip and the second like rstrip.
This just looks simpler.
When you use the r'^(strip_chars)*|(strip_chars)*$' string literal, strip_chars is not interpolated, i.e. it is treated as literal text inside the pattern. You need to pass it to the regex as a variable. However, just passing it in its current form would result in a "corrupt" regex, because (...) in a regex is a grouping construct, while you want to match a single char from the defined set of chars stored in the strip_chars variable.
You could just wrap the string with a pair of [ and ] to create a character class, but if the variable contains, say z-a, it would make the resulting pattern invalid. You also need to escape each char to play it safe.
Replace
r'^(strip_chars)*|(strip_chars)*$'
with
r'^[{0}]+|[{0}]+$'.format("".join([re.escape(x) for x in strip_chars]))
I advise replacing the * (zero or more occurrences) quantifier with + (one or more occurrences), because in most cases, when we want to remove something, we need to match at least 1 occurrence of the unnecessary string(s).
Also, you may replace r'^(\s)+|(\s)+$' with r'^\s+|\s+$' since the repeated capturing groups will keep on re-writing group values upon each iteration slightly hampering the regex execution.
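Putting these suggestions together, the function might look like the sketch below. It assumes ordinary single-line input and is not an exact replica of str.strip(); the name `regex_strip` is made up for illustration.

```python
import re

def regex_strip(string, strip_chars=""):
    """Regex-based take on str.strip(); an empty strip_chars means whitespace."""
    if strip_chars == "":
        pattern = r"^\s+|\s+$"
    else:
        # Escape every character so '-', '^' or ']' are taken literally
        # inside the character class.
        char_class = "".join(re.escape(c) for c in strip_chars)
        pattern = r"^[{0}]+|[{0}]+$".format(char_class)
    return re.sub(pattern, "", string)

print(repr(regex_strip("  hello world  ")))  # 'hello world'
print(regex_strip("xxhixyx", "xy"))          # hi
print(regex_strip("--a-b--", "-"))           # a-b
```

Returning the result instead of printing it inside the function keeps it usable from other code.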
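#! python
# Regex version of strip()
import re
def RegexStrip(mainString, charsToBeRemoved=None):
    if charsToBeRemoved:
        # Escape the characters and remove them from both ends only,
        # as str.strip() does; a bare [chars] would also hit the middle.
        chars = re.escape(charsToBeRemoved)
        regex = re.compile(r'^[%s]+|[%s]+$' % (chars, chars))
        return regex.sub('', mainString)
    else:
        regex = re.compile(r'^\s+')   # leading whitespace
        regex1 = re.compile(r'\s+$')  # trailing whitespace; r'$\s+' would not match trailing spaces
        newString = regex1.sub('', mainString)
        newString = regex.sub('', newString)
        return newString
Str = ' hello3123my43name is antony '
print(RegexStrip(Str))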
Maybe this could help, it can be further simplified of course.

How to parse values appear after the same string in python?

I have an input text like this (the actual text file also contains tons of garbage characters surrounding these 2 strings):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print(content[value1_start:value1_end])
print(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
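Since the question wants the bare values (xxx and yyy) rather than value=xxx, a capturing group can be added around the part after the equals sign. A sketch using the same sample string; unpacking into `value1` and `value2` assumes exactly two matches, as in the question.

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# With one capturing group, findall returns only the captured text.
found = re.findall(r"\bvalue=([^;]*)", s)
print(found)  # ['xxx', 'yyy']

# Unpacking assumes exactly two matches were found.
value1, value2 = found
```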
Use a regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx38u952035983049', '3value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx38u952035983049
3value=yyy
>>>
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
Note that alphanumeric junk directly next to the key or the value is captured too, as the output shows, so this pattern gives clean results only when the surrounding junk is non-alphanumeric.
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is indeed possible. But you must search for the second string starting past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
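The same idea extends to any number of occurrences by repeating the search from the end of the previous hit. A sketch using only str.find; the function name `find_all_values` and the shortened sample string are assumptions made for illustration.

```python
def find_all_values(content, key="value", terminator=";"):
    """Collect every 'key...terminator' span using only str.find."""
    results = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:
            # No terminator left: take the rest of the string.
            end = len(content)
        results.append(content[start:end])
        # Resume the search after the span just consumed.
        start = content.find(key, end)
    return results

content = "@@value=xxx;##value=yyy;%%"
print(find_all_values(content))  # ['value=xxx', 'value=yyy']
```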

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately, the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which ID value is between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search:
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to have a function for replacing text. The function would get the match object from re.sub and insert id captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
# ("[^"]*") captures the quoted id, whatever its value is
pat = re.compile(r'ref_layerid_mapping=("[^"]*") lyvis="off" toc_visible="off"')
def replacer(m):
    # Re-insert the captured id and keep the space before lyvis
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
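Applied to one of the lines from the question, the back-reference approach might look like this. It is a sketch, assuming the IDs are always digits and the attributes always appear in this order.

```python
import re

line = '<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>'

# Group 1 captures everything up to the attribute to change, whatever the IDs are.
pattern = re.compile(r'(<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" )use="false"/>')
result = pattern.sub(r'\1use="true"/>', line)
print(result)
# <TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>
```

Because the IDs are matched with \d+ and copied back through \1, the replacement keeps working when they change.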

replace wildcard numbers in pattern with additional text + same numbers

I need to find all parts of a large text string in this particular pattern:
"\t\t" + number (between 1-999) + "\t\t"
and then replace each occurrence with:
TEXT+"\t\t"+same number+"\t\t"
So, the end result is:
'TEXT\t\t24\t\tblah blah blahTEXT\t\t56\t\t'... and so on...
The various numbers are between 1-999 so it needs some kind of wildcard.
Please can somebody show me how to do it? Thanks!
You'll want to use Python's re library, and in particular the re.sub function:
import re # re is Python's regex library
SAMPLE_TEXT = "\t\t45\t\tbsadfd\t\t839\t\tds532\t\t0\t\t" # Test text to run the regex on
# Run the regex using re.sub (for substitute)
# re.sub takes three arguments: the regex expression,
# a function to return the substituted text,
# and the text you're running the regex on.
# The regex looks for substrings of the form:
# Two tabs ("\t\t"), followed by one to three digits 0-9 ("[0-9]{1,3}"),
# followed by two more tabs.
# The lambda function takes in a match object x,
# and returns the full text of that object (x.group(0))
# with "TEXT" prepended.
output = re.sub("\t\t[0-9]{1,3}\t\t",
                lambda x: "TEXT" + x.group(0),
                SAMPLE_TEXT)
print(output) # Print the resulting string.
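A variant without the lambda is sketched below. It uses `\g<0>` in the replacement string to re-insert the whole match, and it swaps in the pattern [1-9][0-9]{0,2} so that only 1 to 999 is accepted, as the question states; that stricter pattern is an assumption about the intent and, unlike [0-9]{1,3}, it skips 0 and numbers with leading zeros.

```python
import re

sample_text = "\t\t45\t\tbsadfd\t\t839\t\tds532\t\t0\t\t"

# \g<0> inserts the whole match, so no lambda is needed.
# [1-9][0-9]{0,2} accepts 1 to 999 and rejects 0.
output = re.sub(r"\t\t[1-9][0-9]{0,2}\t\t", r"TEXT\g<0>", sample_text)
print(repr(output))
```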
