I have a spreadsheet with text values like A067,A002,A104. What is most efficient way to do this? Right now I am doing the following:
str = 'A067'
str = str.replace('A','')
n = int(str)
print n
Depending on your data, the following might be suitable:
import string
print int('A067'.strip(string.ascii_letters))
Python's strip() command takes a list of characters to be removed from the start and end of a string. By passing string.ascii_letters, it removes any preceding and trailing letters from the string.
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print n
If you believe that you will find more than one number per string though, the above regex answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s) # This will find all numbers in the string
n = int(s[0]) # This will get the first number. Note: If no numbers will throw exception. A simple check can avoid this
print n
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
You can use the replace method of regex from re module.
import re
regex = re.compile("(?P<numbers>.*?\d+")
matcher = regex.search(line)
if matcher:
numbers = int(matcher.groupdict()["numbers"] #this will give you the numbers from the captured group
import string
str = 'A067'
print (int(str.strip(string.ascii_letters)))
Related
I want to pick up a substring from <personne01166+30-90>, which the output should look like: +30 and -90.
The strings can be like: 'personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75', etc.
I tried use
<string.split('+')[len(string.split('+')) -1 ].split('+')[0]>
but the output must be two correspondent numbers.
Here is how you can use a list comprehension and re.findall:
import re
s = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
print([re.findall('[+-]\d+', i) for i in s])
Output:
[['+0', '-30'], ['+0', '+0'], ['+60', '-75']]
re.findall('[+-]\d+', i) finds all the patterns of '[+-]\d+' in the string i.
[+-] means any either + or -. \d+ means all numbers in a row.
If you know the interesting part always comes after + then you can simply split twice.
numbers = string.split('+', 1)[1]
if '+' in numbers:
this, that = numbers.split('+')
elif '-' in numbers:
this, that = numbers.split('-')
that = -that
else:
raise ValueError('Could not parse %s', string)
Perhaps a regex-based approach makes more sense, though;
import re
m = re.search(r'([-+]\d+)([-+]\d+)$', string)
if m:
this, that = m.groups()
I have a input text like this (actual text file contains tons of garbage characters surrounding these 2 string too.)
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how can I parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
print(match)
value=xxx
value=yyy
>>>
Summary or the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (perfs?) you prefere to use str methods, it is indeed possible. But you must search the second string past the end of the first one :
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
I have a list of strings and i would like to extract : "000000_5.612230" of :
A = '/calibration/test_min000000_5.612230.jpeg'
As the size of the strings could evolve, I try with monitoring the position of "n" of "min". I try to get the good index with :
print sorted(A, key=len).index('n')
But i got "11" which corresponds to the "n" of "calibration". I would like to know how to get the maximum index value of the string?
it is difficult to answer since you don't specify what part of the filename remains constant and what is subject to change. is it always a jpeg? is the number always the last part? is it always preceded with '_min' ?
in any case, i would suggest using a regex instead:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
p = re.compile('.*min([_\d\.]*)\.jpeg')
value = p.search(A).group(1)
print value
output :
000000_5.612230
note that this snippet assumes that a match is always found, if the filename doesn't contain the pattern then p.search(...) will return None and an exception will be raised, you'll check for that case.
You can use re module and the regex to do that, for example:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall('\d.*\d', A)
At now, text is a list. If you print it the output will be like this: ['000000_5.612230']
So if you want to extract it, just do this or use for:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall('\d.*\d', A)
print text[0]
String slicing seems like a good solution for this
>>> A = '/calibration/test_min000000_5.612230.jpeg'
>>> start = A.index('min') + len('min')
>>> end = A.index('.jpeg')
>>> A[start:end]
'000000_5.612230'
Avoids having to import re
Try this (if extension is always '.jpeg'):
A.split('test_min')[1][:-5]
If your string is regular at the end, you can use negative indices to slice the string:
>>> a = '/calibration/test_min000000_5.612230.jpeg'
>>> a[-20:-5]
'000000_5.612230'
I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
import re
m = re.search("Your number is <b>(\d+)</b>",
"xxx Your number is <b>123</b> fdjsk")
if m:
print m.groups()[0]
Given s = "Your number is <b>123</b>" then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew that there would be a numeric sequence, otherwise you would have to test the return value of re.search() to make sure that m contained a valid reference, otherwise m.group() would result in a AttributeError: exception.
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
import re
x = 'Your number is <b>123</b>'
re.search('(?<=Your number is )<b>(\d+)</b>',x).group(0)
this searches for the number that follows the 'Your number is' string
import re
print re.search(r'(\d+)', 'Your number is <b>123</b>').group(0)
The simplest way is just extract digit(number)
re.search(r"\d+",text)
val="Your number is <b>123</b>"
Option : 1
m=re.search(r'(<.*?>)(\d+)(<.*?>)',val)
m.group(2)
Option : 2
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)',r'\3',val)
import re
found = re.search("your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
print found.group()[0]
Here (\d+) is the grouping, since there is only one group [0] is used. When there are several groupings [grouping index] should be used.
To extract as python list you can use findall
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = '\d+'
>>> re.findall(pattern,string)
['123']
>>>
You can use the following example to solve your problem:
import re
search = re.search(r"\d+",text).group(0) #returns the number that is matched in the text
print("Starting Index Of Digit", search.start())
print("Ending Index Of Digit:", search.end())
import re
x = 'Your number is <b>123</b>'
output = re.search('(?<=Your number is )<b>(\d+)</b>',x).group(1)
print(output)
I'm using the finditer function in the re module to match some things and everything is working.
Now I need to find out how many matches I've got. Is it possible without looping through the iterator twice? (one to find out the count and then the real iteration)
Some code:
imageMatches = re.finditer("<img src\=\"(?P<path>[-/\w\.]+)\"", response[2])
# <Here I need to get the number of matches>
for imageMatch in imageMatches:
doStuff
Everything works, I just need to get the number of matches before the loop.
If you know you will want all the matches, you could use the re.findall function. It will return a list of all the matches. Then you can just do len(result) for the number of matches.
If you always need to know the length, and you just need the content of the match rather than the other info, you might as well use re.findall. Otherwise, if you only need the length sometimes, you can use e.g.
matches = re.finditer(...)
...
matches = tuple(matches)
to store the iteration of the matches in a reusable tuple. Then just do len(matches).
Another option, if you just need to know the total count after doing whatever with the match objects, is to use
matches = enumerate(re.finditer(...))
which will return an (index, match) pair for each of the original matches. So then you can just store the first element of each tuple in some variable.
But if you need the length first of all, and you need match objects as opposed to just the strings, you should just do
matches = tuple(re.finditer(...))
#An example for counting matched groups
import re
pattern = re.compile(r'(\w+).(\d+).(\w+).(\w+)', re.IGNORECASE)
search_str = "My 11 Char String"
res = re.match(pattern, search_str)
print(len(res.groups())) # len = 4
print (res.group(1) ) #My
print (res.group(2) ) #11
print (res.group(3) ) #Char
print (res.group(4) ) #String
If you find you need to stick with finditer(), you can simply use a counter while you iterate through the iterator.
Example:
>>> from re import *
>>> pattern = compile(r'.ython')
>>> string = 'i like python jython and dython (whatever that is)'
>>> iterator = finditer(pattern, string)
>>> count = 0
>>> for match in iterator:
count +=1
>>> count
3
If you need the features of finditer() (not matching to overlapping instances), use this method.
I know this is a little old, but this but here is a concise function for counting regex patterns.
def regex_cnt(string, pattern):
return len(re.findall(pattern, string))
string = 'abc123'
regex_cnt(string, '[0-9]')
For those moments when you really want to avoid building lists:
import re
import operator
from functools import reduce
count = reduce(operator.add, (1 for _ in re.finditer(my_pattern, my_string)))
Sometimes you might need to operate on huge strings. This might help.
if you are using finditer method best way you can count the matches is to initialize a counter and increment it with each match