Extract Digits from String [duplicate] - python

This question already has an answer here:
How can I find all matches to a regular expression in Python?
(1 answer)
Closed 4 years ago.
I'm trying to extract digits from a unicode string. The string looks like raised by 64 backers and raised by 2062 backers. I tried many different things, but the following code is the only one that actually worked.
backers = browser.find_element_by_xpath('//span[#gogo-test="backers"]').text
match = re.search(r'(\d+)', backers)
print(match.group(0))
Since I'm not sure how often I'll need to extract substrings from strings, and I don't want to be creating tons of extra variables and lines of code, I'm wondering if there's a shorter way to accomplish this?
I know I could do something like this.
def extract_digits(string):
return re.search(r'(\d+)', string)
But I was hoping for a one liner, so that I could structure the script without using an additional function like so.
backers = ...
title = ...
description = ...
...
Even though it obviously doesn't work, I'd like to do something similar to the following, but it doesn't work as intended.
backers = re.search(r'(\d+)', browser.find_element_by_xpath('//span[#gogo-test="backers"]').text)
And the output looks like this.
<_sre.SRE_Match object at 0x000000000542FD50>
Any way to deal with this?!

As an option you can skip using regex and use built-in Python isdigit() (no additional imports needed):
digit = [sub for sub in browser.find_element_by_xpath('//span[#gogo-test="backers"]').text.split() if sub.isdigit()][0]

You can try this:
number = backers.findall(r'\b\d+\b', 'raised by 64 backers')
output:
64
So the method could be like this:
def extract_digits(string):
return re.findall(r'\b\d+\b', string)
DEMO here
EDIT: since you want everything in one line, try this:
import re
backers = re.findall(r'\b\d+\b', browser.find_element_by_xpath('//span[#gogo-test="backers"]').text)[0]
PS:
search ⇒ find something anywhere in the string and return a match object
findall ⇒ find something anywhere in the string and return a list.
Documentation:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
Documentation link: docs.python.org/2/library/re.html
So to do the same with search use this:
backers = re.search(r'(\d+)', browser.find_element_by_xpath('//span[#gogo-test="backers"]').text).group(0)

Related

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
get in index where "E" and check if after "E" there are 2 integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
if someone could see why it would be awesome. It is much better to learn by doing then just coping what others write :)
this is what I came up with:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
anyone can tell me why this isn't working. I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match('(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
# The '0' means we want the first match found
index_of_Exx = match_object.end(0)
truncated_filename = filename[:index_of_Exx]
# Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
...is because 'i' contains a character from the string 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which errors out.

A more powerful method than Python's find? A regex issue?

I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.
>>> def findString(inputStr, targetStr):
... if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
... return True
... return False
...
>>> def convertToStringSoup(testStr):
... testStr = testStr.lower()
... testStr = testStr.replace(" ", "")
... testStr = testStr.replace("(", "")
... testStr = testStr.replace(")", "")
... return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
should work according to your specification. Obviously, could be optimized. You're asking about regex, which I'm thinking hard about, and will hopefully edit this question soon with something good. If this isn't too slow, though, regexps can be miserable, and readable is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex you do would either need to have many clauses or the text would have to be modified pre-regex like I did in this answer. I haven't benchmarked string.find() vs re.find(), but I imagine the former would be faster in this case.
I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').upper().find(needle.upper())
Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
Using a regex, a case-insensitive search matches upper/lower case invariants, '[]' matches any contained characters and '|' lets you do multiple compares at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
regex = '|'.join(r'%s[ -\)]%s' % tuple(p.split('-')) for p in pairs)
print regex
results = re.findall(regex, your_text_here, re.IGNORECASE)

Referencing a RegEx Variable

I'm using python to loop through a large list of self reported locations to try to match them to their home states. The RegEx expression I'm using is:
/^"[^\s]+,\s*([a-zA-Z]{2})"$/
Basically, I'm trying to find a pattern that looks like XXXCITYXXX, [Statecode], where statecode is only two letters.
My issue is that I don't know how to reference the varying state code once I find a matching string. I know in Perl that I could use:
$state = uc($1)
However, I don't know the equivalent Python syntax. Anyone know?
You can do it with re.search, which returns a match object (if the regex matches at all) with a groups property containing the captured groups:
import re
match = re.search('^[^\s]+,\s*([a-zA-Z]{2})$', my_string)
if match:
print match.groups()[0]

Python regex confused by brackets ([])? [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
Is python confused, or is the programmer?
I've got a lot of lines of this:
some_dict[0x2a] = blah
some_dict[0xab] = blah, blah
What I'd like to do is to convert the hex codes into all uppercase to look like this:
some_dict[0x2A] = blah
some_dict[0xAB] = blah, blah
So I decided to call in the regular expressions. Normally, I'd just do this using my editor's regexps (xemacs), but the need to convert to uppercase pushes one into Lisp. ....ok... how about Python?
So I whip together a short script which doesn't work. I've condensed the code into this example, which doesn't work either. It looks to me like Python's regexps are getting confused by the brackets in the code. Is it me or Python?
import fileinput
import sys
import re
this = "0x2a"
that = "[0x2b]"
for line in [this, that]:
found = re.match("0x([0-9,a-f]{2})", line)
if found:
print("Found: %s" % found.group(0))
(I'm using the () grouping constructs so I don't capitalize the 'x' in '0x'.)
This example only prints the 0x2a value, not the 0x2b. Is this correct behavior?
I can easily work around this by changing the match expression to:
found = re.match("\[0x([0-9,a-f]{2}\])", line)
but I'm just wondering if someone can give me some insight into what's going on here.
Running Python 2.6.2 on Linux.
re.match matches from the start of the string. Use re.search instead to "match the first occurrence anywhere in the string". The key bit about this in the docs is here.
I don't think you need the comma within the brackets. i.e.:
found = re.match("0x([0-9,a-f]{2})", line)
tells python to look for commas which it might be mistakenly matching. I think you want
found = re.match("0x([0-9a-f]{2})", line)
You're using a partial pattern, so you can't use re.match, which expects to match the entire input string. You need to use re.search, which can perform partial matches.
>>> that = "[0x2b]"
>>> m = re.search("0x([0-9,a-f]{2})", that)
>>> m.group()
'0x2b'
You'll want to change
found = re.match("0x([0-9,a-f]{2})", line)
to
found = re.search("0x([0-9,a-f]{2})", line)
re.match will match only from the beginning of the string, which fails in the "[0x2b]" case.
re.search will match anywhere in the string, and thus ignore the leading "[" in the "[0x2b]" case.
See search() vs. match() for details.
You want to use re.search. This explains why.
If you use re.sub, and pass a callable as the replacement string, it will also do the uppercasing for you:
>>> that = 'some_dict[0x2a] = blah'
>>> m = re.sub("0x([0-9,a-f]{2})", lambda x: "0x"+x.group(1).upper(), that)
>>> m
'some_dict[0x2A] = blah'

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In python, I simply wrote the above regex as'\w+#(tickets\.)?company\.com' expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-grouping matches, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
regex = re.compile("(\w+#(?:tickets\.)?company\.com)");
a = [
"foo#company.com",
"foo#tickets.company.com",
"foo#ticketsacompany.com",
"foo#compant.org"
];
for string in a:
print regex.findall(string)

Categories

Resources