Regular expression for version number (vX.X.X) not working - python

I am trying to check that an input string contains a version number in the correct format:
vX.X.X
where X can be any number of numerical digits, e.g.:
v1.32.12 or v0.2.2 or v1232.321.23
I have the following regular expression:
v([\d.][\d.])([\d])
This does not work.
Where is my error?
EDIT: I also require the string to have a maximum length of 20 characters. Is there a way to do this with a regex, or is it best to just use Python's len()?

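Note that [\d.] matches exactly one character, either a digit or a dot, so your pattern only matches v followed by two such characters and a single digit. Use this instead:
v(\d+)\.(\d+)\.\d+
Here \d+ matches one or more digit characters and \. matches a literal dot.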
Example:
>>> import re
>>> s = ['v1.32.12', 'v0.2.2' , 'v1232.321.23', 'v1.2.434312543898765']
>>> [i for i in s if re.match(r'^(?!.{20})v(\d+)\.(\d+)\.\d+$', i)]
['v1.32.12', 'v0.2.2', 'v1232.321.23']
>>>
(?!.{20}) is a negative lookahead at the start that checks the string length before matching. If the string is at least 20 characters long, the match fails immediately without trying the rest of the pattern on that string.

@Avinash Raj: your answer is perfect except for one detail.
(?!.{20}) allows at most 19 characters. To allow up to 20, use (?!.{21}):
>>> import re
>>> s = ['v1.32.12', 'v0.2.2' , 'v1232.321.23', 'v1.2.434312543898765']
>>> [i for i in s if re.match(r'^(?!.{21})v(\d+)\.(\d+)\.\d+$', i)]
['v1.32.12', 'v0.2.2', 'v1232.321.23', 'v1.2.434312543898765']
>>>
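For the maximum-length part of the question, a plain len() check next to the regex is arguably easier to read than a lookahead. The following is only a minimal sketch, assuming Python 3.4+ for re.fullmatch; the names VERSION_RE, MAX_LEN and is_valid_version are illustrative and do not come from the answers above.

```python
import re

# Illustrative names, not part of the original answers.
VERSION_RE = re.compile(r'v\d+\.\d+\.\d+')
MAX_LEN = 20


def is_valid_version(s):
    """Return True if s looks like vX.X.X and is at most MAX_LEN characters."""
    # fullmatch anchors the pattern at both ends, so no ^ or $ is needed.
    return len(s) <= MAX_LEN and VERSION_RE.fullmatch(s) is not None


candidates = [
    'v1.32.12',
    'v0.2.2',
    'v1232.321.23',
    'v1.2.434312543898765',    # exactly 20 characters
    'v1.2.4343125438987650',   # 21 characters, rejected
]
valid = [s for s in candidates if is_valid_version(s)]
print(valid)
```

Keeping the length limit out of the pattern means it can be changed without touching the regex.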

Related

Regex to get all occurrences of a pattern followed by a value in a comma separate string

This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE_,FU=FARE,FU=#-_MAP.com_'
Here 'FU=' is the occurrence we are looking for, together with the value that follows it.
Return all occurrences of FU= (with their associated values) as a comma-separated string. They can occur anywhere within the string, and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
This successfully gives me all the occurrences of FU= followed by their values, irrespective of order and number of special characters.
It is a bit unclean, though, as I am manually adding FU= for the first occurrence.
Please suggest a cleaner way of doing it if there is one, but it gets the work done.
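One possibly cleaner variant is to keep FU= inside the match and stop at the next comma, so nothing has to be re-added by hand and an FU= entry at the very end of the string is still found. This is a sketch only; the names with_regex and with_split are made up for illustration. The regex form would also match FU= inside a longer key such as XFU=, which the split form avoids.

```python
import re

str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'

# Regex: 'FU=' followed by any run of characters that are not a comma.
with_regex = ','.join(re.findall(r'FU=[^,]*', str_))

# No regex: split on commas and keep the fields that start with 'FU='.
with_split = ','.join(
    field for field in str_.split(',') if field.startswith('FU=')
)

print(with_regex)
print(with_split)
```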

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in the tweets, I see a lot of expressions like the following in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried the solutions from some similar questions, but nothing worked for me; they ended up replacing characters like "xt" in "extra".
I am looking for something that will replace \x?? with nothing, where each ? can be a-f or 0-9; the sequence must be exactly four characters long and start with \x.
I would also like to remove anything other than letters. For example:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing are Unicode characters that can't be printed directly, expressed as pairs of hexadecimal digits. For a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters, '\x61', evaluates to a single character, 'a'. Therefore:
?? can't "be anything": each ? can only be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex engine would see: each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.
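The distinction between a real non-ASCII character and a literal backslash-x sequence can be sketched with two small helpers. This assumes Python 3 str objects, and the function names strip_non_ascii and strip_literal_hex_escapes are hypothetical, chosen only for this illustration.

```python
import re


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return re.sub(r'[^\x00-\x7f]', '', text)


def strip_literal_hex_escapes(text):
    """Drop literal four-character sequences: backslash, x, two hex digits."""
    return re.sub(r'\\x[0-9a-fA-F]{2}', '', text)


# Four real non-ASCII characters followed by 'Extra'.
real_chars = '\xe3\x81\x86\xe3Extra'
# The same text with literal backslashes, e.g. copied from a repr().
escaped_text = r'\xe3\x81\x86\xe3Extra'

print(strip_non_ascii(real_chars))
print(strip_literal_hex_escapes(escaped_text))
```

Which helper applies depends on whether the data holds the characters themselves or their printed escape form.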

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any such string and should cope with spaces and any other odd content that might appear in the long string. This should be possible with a regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there is another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
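To see the pattern above run end to end, here is a small sketch with a made-up long_string that also contains a second, unrelated http://www. URL, as described in the update to the question. The sample text and the name URL_RE are assumptions for illustration only.

```python
import re

# Made-up sample input; only the someDomainName URL should be extracted.
long_string = (
    'noise http://www.example.com/about more noise '
    'see http://www.someDomainName.com/2134 for details'
)

# Escaped dots match literal dots; \d+ matches the numeric part.
URL_RE = re.compile(r'\bhttp://www\.someDomainName\.com/\d+\b')

results = URL_RE.findall(long_string)
print(results)
```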
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
The above pattern gives five groups:
Group 0 is the complete string that is matched.
The rest follow the order of the parentheses in the pattern, so the one you are looking for is group 2, (\\w*).
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the parts of the pattern you don't need and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In this example there are no groups 2, 3 and 4 as in the previous one, because we captured only one group. Group 0 is always available: it is the complete string that matched.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
# Escape the dots so they match literal dots rather than any character.
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids a regex, which is harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])

Strip all characters after the final dash in a string in Python, and test if numeric?

If I have a string:
string = 'this-is-a-string-125'
How can I grab the last set of characters after the dash and check if they are digits?
If you want to verify that they are actually digits, you can do
x.rsplit('-', 1)[1].isdigit()
"Numeric" is a more general criterion that could be interpreted in different ways. For instance, "12.87" is numeric in some sense, but not all of its characters are digits.
You can do int(x.rsplit('-', 1)[1]) to see if the string can be interpreted as an integer, or float(x.rsplit('-', 1)[1]) to see if it can be interpreted as a float. (These will raise a ValueError if the string isn't numeric in the appropriate sense, so you can catch that exception and do whatever you need to do if it's not numeric.)
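Catching the exception could look roughly like the sketch below. The helper name trailing_int is hypothetical, and it uses rpartition rather than rsplit so that a string without any dash does not raise an IndexError.

```python
def trailing_int(s):
    """Return the text after the last dash as an int, or None if it isn't one."""
    # rpartition returns (head, sep, tail); tail is the whole string
    # when there is no dash at all.
    tail = s.rpartition('-')[2]
    try:
        return int(tail)
    except ValueError:
        # Covers empty tails, words, and values such as '12.87'.
        return None


print(trailing_int('this-is-a-string-125'))
print(trailing_int('this-is-a-string'))
```

Swapping int for float in the helper would accept values such as '12.87' as well.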
s = 'this-is-a-string-125'.split('-')[-1].isdigit()
We split the string by dash ('-') which gives a list of substrings (see split()). We then take the last one ([-1]) and we verify that that string contains only digits (isdigit()):
>>> 'this-is-a-string-125'.split('-')
['this', 'is', 'a', 'string', '125']
>>> 'this-is-a-string-125'.split('-')[-1]
'125'
>>> 'this-is-a-string-125'.split('-')[-1].isdigit()
True
Nobody knows about partition or rpartition:
text.rpartition("-")[-1].isdigit()
How about:
string.split('-')[-1].isdigit()
Seems like a simple regex can do both the stripping and checking:
>>> import re
>>> s = 'this-is-a-string-125'
>>> m = re.search(r'-(\d+)$', s)
>>> m.group(1)
'125'
>>> s[:m.start()] # gives you what was stripped away.
'this-is-a-string'
Match object m will be None if the string lacks a dash character followed by one or more digits at the end.

Convert string into integer

How can I convert a string into an integer, removing every non-digit character in the process?
Example:
S = "--r10-", and I want to have this: S = 10
This does not work:
S = "--10-"
int(S)
You can use filter(str.isdigit, s) to keep only those characters of s that are digits. In Python 3, filter returns an iterator, so join the result back into a string before calling int:
>>> s = "--10-"
>>> int(''.join(filter(str.isdigit, s)))
10
Note that this might lead to unexpected results for strings that contain multiple numbers
>>> int(''.join(filter(str.isdigit, "12 abc 34")))
1234
or negative numbers
>>> int(''.join(filter(str.isdigit, "-10")))
10
Edit: In Python 2, to make this work for unicode objects instead of str objects, use
int(filter(unicode.isdigit, u"--10-"))
Remove all non-digits first, like this:
int(''.join(c for c in "abc123def456" if c.isdigit()))
You could just strip off - and r:
int("--r10-".strip('-r'))
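Use a regex substitution with \D to replace every non-digit character with the empty string, then cast the result to int. (\W would not be enough here, because letters such as r are word characters.)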
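The substitution idea can be sketched as follows, using \D (any non-digit) as the class to remove. The function name digits_to_int is made up for this example, and returning None when no digits remain is an assumption about the desired behaviour.

```python
import re


def digits_to_int(s):
    """Remove every non-digit character from s and convert what is left."""
    digits = re.sub(r'\D', '', s)
    # int('') raises ValueError, so handle the no-digits case explicitly.
    return int(digits) if digits else None


print(digits_to_int('--r10-'))
print(digits_to_int('12 abc 34'))
```

As with the filter approach, separate numbers in the input are concatenated and a leading minus sign is discarded.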
I prefer Sven Marnach's answer using filter and isdigit, but if you want you can use regular expressions:
>>> import re
>>> pat = re.compile(r'\d+') # '\d' means digit, '+' means one or more
>>> int(pat.search('--r10-').group(0))
10
If there are multiple integers in the string, it pulls the first one:
>>> int(pat.search('12 abc 34').group(0))
12
If you need to deal with negative numbers use this regex:
>>> pat = re.compile(r'\-{0,1}\d+') # '\-{0,1}' means zero or one dashes
>>> int(pat.search('negative: -8').group(0))
-8
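This is simple and does not require you to import any packages. It builds the number from the ASCII digits and skips every other character:
def _atoi(string):
    i = 0
    for c in string:
        if '0' <= c <= '9':
            # Shift the digits seen so far and append the new one.
            i = i * 10 + (ord(c) - ord('0'))
    return i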
