Match >>number and replace it - python

I have a string that contains some words in the >>number format.
For example:
this is a sentence >>82384324
I need a way to match those >>numbers and replace each one with another string that contains the number.
For example: >>342 becomes
this is a string that contains the number 342

s= "this is a sentence >>82384324"
print re.sub("(.*\>\>)","This is a string containing " ,s)
This is a string containing 82384324

Assuming you are going to run into multiple number occurrences in a string I would suggest something a little more robust such as:
import re
pattern = re.compile(r'>>(\d+)')
text = "sadsaasdsa >>353325233253 Frank >>352523523"
search = pattern.findall(text)
for each in search:
    print("The string contained the number %s" % each)
Which yields:
The string contained the number 353325233253
The string contained the number 352523523

Using this basic pattern should work:
>>(\d+)
code:
import re
str = "this is a sentence >>82384324"
rep = "which contains the number \\1"
pat = ">>(\\d+)"
res = re.sub(pat, rep, str)
print(res)
example: http://regex101.com/r/kK3tL8

One simple way, assuming the only place you find ">>" is before a number, is to replace just those:
>>> mystr = "this is a sentence >>82384324"
>>> mystr.replace(">>","this is a string that contains the number ")
'this is a sentence this is a string that contains the number 82384324'
If there are other examples of >> in the text that you don't want to replace, you will need to catch the number as well, and it'll be best to use a regular expression.
>>> import re
>>> re.sub(r'>>(\d+)', r'this is a string that contains the number \g<1>', mystr)
'this is a sentence this is a string that contains the number 82384324'
https://docs.python.org/2/library/re.html and https://docs.python.org/2/howto/regex.html can provide more information about regular expressions.
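As a rough sketch of how the substitution behaves when a string holds several >>number tokens: re.sub replaces every match, and the replacement can also be a function that receives the match object. The helper name describe and the sample text below are made up for illustration.

```python
import re

def describe(match):
    # match.group(1) holds the digits captured after ">>"
    return "the number {}".format(match.group(1))

# Hypothetical sample text: two >>number tokens and one bare ">>"
text = "first >>12 then >>345 and a stray >> here"

# Every ">>digits" token is replaced; the bare ">>" is left alone
result = re.sub(r">>(\d+)", describe, text)
print(result)
```

A function replacement is only needed when the new text has to be computed; for a fixed template, the \g<1> form shown above is enough.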

You can do this using:
import re
sentence = 'Stringwith>>1221'
print('This is a string that contains the number %s' % re.search(r'>>(\d+)', sentence).group(1))
Result :
This is a string that contains the number 1221
You can use re.findall to get all the numbers that match the pattern.

Related

Match all words of list in string using regex or any other way

I'm writing a Python script in which I have to match all the words given in a list against a string. The list can be arbitrarily long. I found operations that match any of the characters, but couldn't find one that matches all the words in a list. For example:
s = "This is a sample string"
list = ["is", "sample"]
# any operation such that re.search(r'', s) returns the correct result
# I want a regular expression or another approach that does this.
Something like this?
import re
string = "This is a sample string"
lst = ["is", "sample"]
for item in lst:
    rx = re.compile(r"\b{}\b".format(item))
    if rx.search(string):
        print("'{}' is in the string".format(item))
This yields
'is' is in the string
'sample' is in the string
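If the goal is a single yes/no answer for whether every word in the list is present, the loop above can be folded into all(). This is only a sketch: the function name contains_all_words is hypothetical, and re.escape is added on the assumption that list entries may contain regex metacharacters.

```python
import re

def contains_all_words(text, words):
    """Return True only if every word occurs in text as a whole word."""
    return all(
        re.search(r"\b{}\b".format(re.escape(word)), text) is not None
        for word in words
    )

sample = "This is a sample string"
print(contains_all_words(sample, ["is", "sample"]))   # both words present
print(contains_all_words(sample, ["is", "missing"]))  # one word absent
```

Because of the \b word boundaries, a fragment such as "Thi" does not count as a match for "This".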

Find the next word after a word in a string

I am trying to record the word after a specific word. For example, let's say I have a string:
First Name: John
Last Name: Doe
Email: John.Doe#email.com
I want to search the string for a key word such as "First Name:". Then I want to only capture the next word after it, in this case John.
I started using string.find("First Name:"), but I do not think that is the correct approach.
I could use some help with this. Most examples either split the string or keep everything else after "John". My goal is to be able to search strings for specific keywords no matter their location.
SOLUTION:
I used a similar set of code as below:
search = r"(First Name:)(.)(.+)"
x = re.compile(search)
This gave me the "John" with no spaces
A regular expression is the way to go:
import re
pattern = r"First Name: (\w+)"
first_names = re.findall(pattern, mystring)
This matches the prefix "First Name: " and, because findall returns only the captured group, extracts just (\w+), which denotes the word that follows. Alternatively, you can split the string and iterate over the resulting list, skipping the text before the first keyword:
my_words = [x.split()[0] for x in mystring.split("First Name: ")[1:]]
The .find approach is a good start.
You can use split on the remaining string to limit results to the single word.
Without using regex
s = "abc def opx"
q = 'abc'
res = s[s.find(q)+len(q):].split()[0]
res == 'def'
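The find-and-split idea can be wrapped in a small helper that also copes with a missing keyword. This is a minimal sketch; the function name word_after and the sample record are made up for illustration.

```python
def word_after(text, keyword):
    """Return the first whitespace-delimited word after keyword, or None."""
    start = text.find(keyword)
    if start == -1:
        return None  # keyword does not occur in the text
    rest = text[start + len(keyword):].split()
    return rest[0] if rest else None  # None if nothing follows the keyword

record = "First Name: John\nLast Name: Doe"
print(word_after(record, "First Name:"))
print(word_after(record, "Last Name:"))
print(word_after(record, "Email:"))
```

Returning None instead of raising keeps the caller's check simple when some records lack a field.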

How to parse values that appear after the same string in Python?

I have an input text like this (the actual text file contains tons of garbage characters surrounding these 2 strings too):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx;3^&^*(^%$value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
Note that this pattern assumes the junk characters directly next to the key and the value are not alphanumeric; otherwise they become part of the match.
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is possible too. But you must start searching for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
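Generalising that idea to any number of occurrences, a loop can keep restarting find from the end of the previous hit. This is a sketch under assumptions: the function name find_values, its parameters, and the sample content are hypothetical, and the key is assumed to be a non-empty string.

```python
def find_values(content, key="value", terminator=";"):
    """Collect every key...terminator span using only str.find."""
    results = []
    pos = 0
    while True:
        start = content.find(key, pos)
        if start == -1:
            break  # no further occurrence of the key
        end = content.find(terminator, start)
        if end == -1:
            end = len(content)  # no terminator: take the rest of the text
        results.append(content[start:end])
        # Resume past this hit so the same occurrence is not found again
        pos = max(end, start + len(key))
    return results

content = "$%&value=xxx;^&*value=yyy;%$#"
print(find_values(content))
```

The second argument to find is what the original attempt was missing: without it, every search starts again from the beginning of the text.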

Best way to convert string to integer in Python

I have a spreadsheet with text values like A067, A002, A104. What is the most efficient way to convert them to integers? Right now I am doing the following:
str = 'A067'
str = str.replace('A','')
n = int(str)
print n
Depending on your data, the following might be suitable:
import string
print int('A067'.strip(string.ascii_letters))
Python's strip() method takes a string of characters to be removed from the start and end of a string. By passing string.ascii_letters, it removes any leading and trailing letters from the string.
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print n
If you believe that you will find more than one number per string though, the regex answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
numbers = re.findall(r'\d+', s)  # find all runs of digits in the string
n = int(numbers[0])  # take the first number; raises IndexError if none was found, so check the list first
print(n)
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
You can use a named group with the search method from the re module.
import re
line = 'A067'
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"])  # the digits from the captured group, as an int
import string
str = 'A067'
print (int(str.strip(string.ascii_letters)))
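For a whole column of such values, the approaches above can be wrapped in a helper that tolerates cells without digits. This is only a sketch: the name to_int and the sample cells (including the "N/A" entry) are assumptions for illustration, not part of the original spreadsheet.

```python
import re

def to_int(label):
    """Return the first run of digits in label as an int, or None if there is none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else None

# Hypothetical column of spreadsheet values
cells = ["A067", "A002", "A104", "N/A"]
numbers = [to_int(cell) for cell in cells]
print(numbers)
```

int() discards the leading zeros, so 'A067' becomes 67.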

How to use regex to parse a number from HTML?

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b> fdjsk")
if m:
    print(m.groups()[0])
Given s = "Your number is <b>123</b>" then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is a match object and not None, because calling m.group() on None raises an AttributeError.
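A minimal sketch of that check, assuming a hypothetical helper named extract_number:

```python
import re

def extract_number(html):
    """Return the first run of digits in html as a string, or None if absent."""
    m = re.search(r"\d+", html)
    if m is None:
        return None  # re.search found nothing, so there is no group to read
    return m.group()

print(extract_number("Your number is <b>123</b>"))
print(extract_number("Your number is <b>unknown</b>"))
```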
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
import re
x = 'Your number is <b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
This searches for the number that follows the 'Your number is ' string; group(1) returns just the captured digits, whereas group(0) would include the <b> tags.
import re
print re.search(r'(\d+)', 'Your number is <b>123</b>').group(0)
The simplest way is to just extract the digits (the number):
re.search(r"\d+",text)
val="Your number is <b>123</b>"
Option : 1
m=re.search(r'(<.*?>)(\d+)(<.*?>)',val)
m.group(2)
Option : 2
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)',r'\3',val)
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
    print(found.groups()[0])
Here (\d+) is the capturing group; since there is only one group, groups()[0] is used. When there are several groups, index groups() with the position of the group you want.
To extract as python list you can use findall
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = '\d+'
>>> re.findall(pattern,string)
['123']
>>>
You can use the following example to solve your problem:
import re
text = "Your number is <b>123</b>"
search = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched Number:", search.group(0))
print("Starting Index Of Digit:", search.start())
print("Ending Index Of Digit:", search.end())
import re
x = 'Your number is <b>123</b>'
output = re.search('(?<=Your number is )<b>(\d+)</b>',x).group(1)
print(output)
