regex python pattern error - python

I am using regular expressions to search a text file for two kinds of pattern:
hexadecimal numbers (for example 1a2bc3d4e5, 2369.235.26.158963 or Aaa4)
strings made up only of the letter "a" and spaces. There may be "a", spaces, or a mixture of the two, but nothing else.
Below is my regex for hexadecimal numbers:
matches = re.compile(' 0[xX][0-9a-fA-F]+ ')
Below is my regex for the second pattern:
matches = re.compile(r'^[a| ]*$')
Unfortunately, it does not work.
Thanks in advance for your help.

Honestly, when asking a question it's best to include some of the actual input (or something close to it) and the desired output. For your hex numbers, I'm wondering whether you want to capture the 0x that precedes the value or leave it out. Also, a variable-length hex pattern like yours (slightly corrected) would capture ordinary words such as 'def' and 'bad'. Having the input and desired output helps everyone understand the problem, including the people who answer.
With that said, for your first regex (I couldn't understand what you wanted for the second), I prefer findall because it's more direct and returns the captured groups. In the example below I build a string in place of calling file.read(), and I make the regex capture only strings of 4 or more characters:
Code
import re
input = '''This is a hex number 0xAF67E49
This is NOT a hex number tgey736zde
This hex number 0xb34df49a appears in the middle of a sentence
This could be a hex number but has no letters 3689320'''
matches1 = re.findall('([0-9a-fA-F]{4,})', input)
matches2 = re.findall('0x([0-9a-fA-F]{4,})', input)
matches3 = re.findall('(0x[0-9a-fA-F]{4,})', input)
print('matches1: %s' % (str(matches1)))
print('matches2: %s' % (str(matches2)))
print('matches3: %s' % (str(matches3)))
Output
matches1: ['AF67E49', 'b34df49a', '3689320']
matches2: ['AF67E49', 'b34df49a']
matches3: ['0xAF67E49', '0xb34df49a']
Explanation
matches1 indiscriminately matches any run of 4 or more characters that fall within the hex range. To see this, change "tgey736zde" in the input to "tgey736de": the run "736de" is then captured as well.
matches2 captures any hex string of 4 or more characters preceded by 0x, leaving out the 0x.
matches3 captures any hex string of 4 or more characters preceded by 0x, including the 0x.
Extra Information
To make this more effective, you might want to research how to use lookaheads as well
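As a rough sketch of the lookaround idea, the pattern below requires that the 0x value is not glued to other word characters, without demanding literal spaces on both sides (which is why the original ' 0[xX][0-9a-fA-F]+ ' misses a value at the end of a line). The second helper is one possible reading of the "only a or spaces" requirement; note that inside a character class the | is a literal character, not an alternation. The names hex_pattern and is_only_a_or_space are made up for this illustration.

```python
import re

text = """This is a hex number 0xAF67E49
This is NOT a hex number tgey736zde
This hex number 0xb34df49a appears in the middle of a sentence
This could be a hex number but has no letters 3689320"""

# (?<!\w) : the match must not be preceded by a word character
# (?!\w)  : the match must not be followed by a word character
hex_pattern = re.compile(r"(?<!\w)0[xX]([0-9a-fA-F]+)(?!\w)")
hex_values = hex_pattern.findall(text)
print(hex_values)  # ['AF67E49', 'b34df49a']

# Assumed reading of the second requirement: the whole line consists
# only of the letter "a" and spaces (an empty line also passes).
only_a_or_space = re.compile(r"[a ]*")

def is_only_a_or_space(line):
    return only_a_or_space.fullmatch(line) is not None

print(is_only_a_or_space("a aa  a"))  # True
print(is_only_a_or_space("a|a"))      # False
```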

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb"
for i in input:
    input = re.sub("(.)\\1+", "", input)
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb"
k = 3
for i in input:
    input = re.sub("(.)\\1+{"+(k-1)+"}", "", input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
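If the doubled braces in the f-string are hard to read, the same pattern can be built with %-formatting instead. This is only a sketch: the function name remove_runs and the check on k are assumptions added for illustration, not part of the original answer.

```python
import re

def remove_runs(text, k):
    """Remove each block of k identical consecutive characters in one pass."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # For k=3 this builds the pattern (.)\1{2}
    pattern = re.compile(r"(.)\1{%d}" % (k - 1))
    return pattern.sub("", text)

print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

Note that the substitution runs once from left to right, so a run longer than k leaves its remainder behind, and characters that become adjacent after a removal are not re-examined.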
If I were you, I would do it as suggested above. But since I've already spent time answering this question, here is my handmade solution.
The pattern below defines a named group called "letter" that captures a single lowercase letter, then uses a lookahead to check that the same letter follows immediately. Each match is therefore one letter that is followed by a copy of itself.
Replacing those matches with an empty string drops every repeated letter except the last one in each run, so each run collapses to a single letter.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only consecutive duplicate characters? Is there a reason you need to use regex? You could do this with a simple loop. This is a quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
With a method like this, I'd assume you could make it easier to read and faster with a bit of modification.
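One such modification, sketched here with the standard library's itertools.groupby, groups consecutive equal characters and keeps one character per group. The function name collapse_runs is made up for this example.

```python
from itertools import groupby

def collapse_runs(text):
    # groupby yields one (key, group) pair per run of equal characters;
    # keeping only the key leaves a single character from each run.
    return "".join(char for char, _run in groupby(text))

print(collapse_runs("abccbcbbb"))  # abcbcb
```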

Python: Use Regex to Match Phone Number And Print Tuple (w/Formatting Constraints)

I want to write code that can parse American phone numbers (ie. "(664)298-4397") . Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print (f' groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should output an 'NoneType' object has no attribute 'group' error because the input phone number string violates the constraints. But instead, I get outputs for all three.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')
Expected output: should be an error but I get this instead for all three:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuses the asterisk *. Details as follows:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You put an asterisk on every single item, including the opening and closing parentheses. In fact, almost everything in your regex is quantified with an asterisk, so each capturing group can also match the empty string. That's why your first and second capturing groups come back empty. The only item without an asterisk is the hyphen - just before the third capturing group, which is why the third group does capture the digits after the hyphen, as in 4397 and 2121.
To solve your problem, use the asterisk only where it is needed.
Your regex also has plenty of room for further improvement. For example, it currently matches runs of digits of any length (instead of chunks of 3 or 4 digits). It also accepts an area code that is not enclosed in parentheses, because of the asterisks on the parenthesis characters.
For a common regex like this, you don't need to reinvent the wheel. Ready-made patterns are easy to find on the Internet. For example, you can refer to this post. Although the post uses JavaScript instead of Python, the regex is much the same.
Try:
regex_parse4 = re.match(r'\s*[(]([0-9]{3})[)]\s*([0-9]{3})-([0-9]{4})\s*$', number)
This assumes a 3-digit area code in parentheses, followed by XXX-XXXX.
re.match returns None when there is no match, so calling .groups() on the result raises the 'NoneType' error.
If the above does not work, here is a helpful regex tool:
https://regex101.com
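To check a pattern like this against the inputs from the question, a small wrapper can help. This is a sketch under the constraints stated in the question (parenthesised 3-digit area code, optional whitespace around the number and after the area code, no whitespace inside XXX-XXXX); the names PHONE_RE and parse_phone are invented for the example, and it returns None for invalid input instead of raising.

```python
import re

# Whitespace is allowed only at the ends and between ")" and the local number.
PHONE_RE = re.compile(r"\s*\(([0-9]{3})\)\s*([0-9]{3})-([0-9]{4})\s*")

def parse_phone(text):
    """Return (area_code, first_three, last_four), or None if text is invalid."""
    match = PHONE_RE.fullmatch(text)
    return match.groups() if match else None

candidates = [
    "(664) 298-4397",
    "(664)298-4397",
    " (664) 298-4397",
    "(404)555 -1212",   # space before the hyphen: rejected
    " ( 404)121-2121",  # space inside the area code: rejected
]
for candidate in candidates:
    print(repr(candidate), "->", parse_phone(candidate))
```

Using fullmatch means the whole string has to fit the pattern, so no ^ or $ anchors are needed.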
Edit:
Another suggestion is to clean the data before applying a regex. This takes care of abnormal spacing and gets rid of the parentheses and the '-'. Note that it also accepts numbers whose spacing violates your constraints, because all spacing is removed first.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse.groups()}')
>>> ('xxx', 'xxx', 'xxxx')

Match characters and digits of fixed length and one occurrence in Python

I have a list in Python with values
['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
I want to match only strings where length is 8 and there are 3 characters before underscore and 4 digits after underscore so I eliminate values not required. I am interested only in the MMM_YYYY values from above list.
Tried below and I am not able to filter values like YTD_TY_1 which has multiple underscores.
for c in col_headers:
    d= (re.match('^(?=.*\d)(?=.*[A-Z0-9])[A-Z_0-9\d]{8}$',c))
    if d:
        data_period.append(d[0])
Update: revised following @WiktorStribiżew's observation that re.match does not require a full string match in Python.
The regex I am using is based on the one that @dvo provided in a comment:
import re
REGEX = '^[A-Z]{3}_[0-9]{4}$'
col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
regex = re.compile(REGEX)
data_period = list(filter(regex.search, col_headers))
Once again, based on a comment by @WiktorStribiżew: if you do not want to match something like "SXX_0012" or "XYZ_0000", you should use the regex he provided in a comment:
REGEX = r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
Rather than use regex for this, you should just try to parse it as a date in the first place:
from datetime import datetime
date_fmt = "%b_%Y"
data_period = []
for c in col_headers:
    try:
        d = datetime.strptime(c, date_fmt)
        data_period.append(c) # Or just save the datetime object directly
    except ValueError:
        pass
The part of your pattern that actually does the matching is this:
[A-Z_0-9\d]{8}
The problem is that you're asking for exactly 8 characters, each of which may be A-Z, _, 0-9, or \d. Now, \d is equivalent to 0-9, so you can drop it, but that doesn't solve the whole problem. The real issue is that you've put every allowed character into a single character class []. Your pattern will match anything that is 8 characters long and made up of those characters, e.g. A_19_KJ9
What you need to do is specify that you want exactly 3 A-Z characters, then a single _, then 4 \d; see below:
[A-Z]{3}_\d{4}
This matches exactly 3 A-Z characters, then a single _, then 4 \d (any numeric digit).
For a better understanding of regex, I'd encourage you to use an online tool such as regex101.
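A short sketch of how that pattern might be applied to the list from the question. Because the pattern itself has no anchors, re.match would also accept a longer string that merely starts with a valid value, so the sketch uses fullmatch; the variable name pattern is an arbitrary choice.

```python
import re

col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018',
               'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019',
               'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1',
               'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']

pattern = re.compile(r"[A-Z]{3}_\d{4}")

# fullmatch requires the entire string to fit the pattern
data_period = [c for c in col_headers if pattern.fullmatch(c)]
print(data_period)

# match only anchors at the start, so a trailing extra digit slips through
print(bool(pattern.match("JUL_20189")))      # True
print(bool(pattern.fullmatch("JUL_20189")))  # False
```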

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in tweets I can see a lot of following expressions in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried using solution in some similar questions but nothing worked for me. They are replacing characters like "xt" from "extra".
I am looking for something that will replace \x?? with nothing, where each ? can be a-f or 0-9; the sequence must be exactly 4 characters long and start with \x.
Also, I would like to remove anything other than letters. For example:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing are characters that can't be printed directly, so they are shown as escape sequences: \x followed by a pair of hexadecimal digits. Here is a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the unprintable characters by keeping only those in string.printable (ASCII letters, digits, punctuation and whitespace):
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.
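Putting the pieces together, here is a minimal Python 3 sketch of three separate clean-up steps. Which one applies depends on whether the text holds real non-ASCII characters or literal backslash-x sequences (for example because a bytes repr was written to a file). All three function names are invented for this illustration, and the letters-only step assumes that plain ASCII letters and single spaces are what should survive.

```python
import re

def drop_non_ascii(text):
    """Drop real non-ASCII characters such as '\xe3'."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def drop_literal_hex_escapes(text):
    """Drop literal 4-character sequences: backslash, x, two hex digits."""
    return re.sub(r"\\x[0-9a-fA-F]{2}", "", text)

def letters_only(text):
    """Keep ASCII letters and spaces, then tidy up the leftover spacing."""
    kept = re.sub(r"[^A-Za-z ]+", "", text)
    return re.sub(r" {2,}", " ", kept).strip()

print(drop_non_ascii("\xe3\x81\x86\xe3Extra"))             # Extra
print(drop_literal_hex_escapes(r"\xe3\x81\x86\xe3Extra"))  # Extra
print(letters_only("Hi!! my number is (7097868709809)."))  # Hi my number is
```

The last call also drops the final full stop, since a full stop is not a letter.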

Python and Regex - extracting a number from a string

I'm new to regex, and I'm starting to sort of get the hang of things. I have a string that looks like this:
This is a generated number #123 which is an integer.
The text that I've shown here around the 123 will always stay exactly the same, but it may have further text on either side. The number may be 123, 597392, really any run of one or more digits. I believe I can match the number and the following text using \d+(?= which is an integer.), but how do I write the look-behind part?
When I try (?<=This is a generated number #)\d+(?= which is an integer.), it does not match using regexpal.com as a tester.
Also, how would I use python to get this into a variable (stored as an int)?
NOTE: I only want to find the numbers that are sandwiched in between the text I've shown. The string might be much longer with many more numbers.
You don't really need a fancy regex. Just use a group on what you want.
re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.').group(1)
If you want to match a number in the middle of some known text, follow the same rule:
r'some text you know (\d+) other text you also know'
res = re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.')
if res is not None:
    integer = int(res.group(1))
You can just use findall() from the re module.
string = "This is a string that contains #134534 and other things"
match = re.findall(r'#\d+ .+', string)
print(match)
The output would be ['#134534 and other things']
This matches a number of any length (#123 or #123235345), then a space, then the rest of the line up to the newline character.
If you want to get the numbers only when they follow the text "This is a generated number #" AND are followed by " which is an integer.", you don't have to use look-behind and lookahead. You can simply match the whole string, like:
"This is a generated number #(\d+) which is an integer."
I am not sure if I understood what you really want though. :)
Update:
In [16]: a='This is a generated number #123 which is an integer.'
In [17]: b='This should be a generated number #123 which could be an integer.'
In [18]: exp=r"This is a generated number #(\d+) which is an integer."
In [19]: result =re.search(exp, a)
In [20]: int(result.group(1))
Out[20]: 123
In [21]: result = re.search(exp,b)
In [22]: result == None
Out[22]: True
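For completeness, the look-behind form from the question does work with Python's re module, which supports look-behind as long as the text inside it has a fixed width; a tester built on a different regex engine may behave differently. The sketch below collects every sandwiched number in a longer string as an int. PREFIX, SUFFIX and extract_numbers are names chosen for this example.

```python
import re

PREFIX = "This is a generated number #"
SUFFIX = " which is an integer."

# re.escape makes the literal text safe to embed (e.g. the final ".")
pattern = re.compile(
    r"(?<=" + re.escape(PREFIX) + r")\d+(?=" + re.escape(SUFFIX) + r")"
)

def extract_numbers(text):
    """Return every number found between PREFIX and SUFFIX, as ints."""
    return [int(digits) for digits in pattern.findall(text)]

sample = ("Ticket #9 is unrelated. This is a generated number #123 which is "
          "an integer. This is a generated number #597392 which is an integer.")
print(extract_numbers(sample))  # [123, 597392]
```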
