Unable to extract GSTIN using Python regex

Can anyone help fix the issue here? I am trying to extract GSTIN/UIN numbers from text.
import re

# None of these work:
# GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
# GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
# GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
# GSTIN_REG = re.compile(r'19AISPJ4698P1ZX')  # This works
# GSTIN_REG = re.compile(r'06AACCE2308Q1ZK')  # This works

def extract_gstin(text):
    return re.findall(GSTIN_REG, text)

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))

Your second pattern in the commented out part works, and you can omit {1} as it is the default.
What you might do to make it a bit more specific is add word boundaries \b to the left and right to prevent a partial word match.
If it should be after GSTIN : you can use a capture group as well.
Example with the commented pattern:
import re
import re

GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')

def extract_gstin(s):
    return re.findall(GSTIN_REG, s)

s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))
Output
['06AACCE2308Q1ZK']
A bit more specific pattern (which gives the same output, as re.findall returns the value of the capture group):
\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
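For example, a minimal sketch using that more specific pattern with the sample string from the question; findall returns just the captured group:

import re

GSTIN_REG = re.compile(r'\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b')
s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(GSTIN_REG.findall(s))  # ['06AACCE2308Q1ZK']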

Related

Pandas REGEX not returning expected results using "extract"

I am attempting to use REGEX to extract connection strings from blocks of text in a pandas dataframe.
My REGEX works on REGEX101.com (see Screenshot below). Link to my saved test here: https://regex101.com/r/ILnpS0/1
When I try to run the REGEX in a Pandas dataframe, I don't get any REGEX matches/extracts (but no error either), despite getting matches on REGEX101. Link to my code in a Google Colab notebook: https://colab.research.google.com/drive/1WAMlGkHAOqe38Lzo_K0KHwD_ynVJyIq1?usp=sharing
Therefore the issue appears to be how pandas is interpreting my REGEX.
Can anyone identify why I am not getting any REGEX matches using pandas?
REGEX Logic
My REGEX consists of 3 parts:
(?=Source = DB2.Database)(.*?)(?=\]\))
Part 1: (?=Source = DB2.Database) is a lookahead that looks for the text "Source = DB2.Database", i.e. the start of my connection string.
Part 2: (.*?) lazily matches any characters and acts as a span between the 1st and 3rd parts.
Part 3: (?=\]\)) is a lookahead assertion that aims to identify the end of the connection string.
Additional tests:
When I run a simplified version of the REGEX (DB2.Database) I get the match, as expected. This example is also in the notebook linked above.
My code (same as in linked Colab Notebook)
import pandas as pd
myDF = pd.DataFrame({'conn_str':['''{'expression': 'let\n Source = Snowflake.Databases("whitehouse.australia-east.azure.snowflakecomputing.com","USER"),\n WH_DW_Database = Source{[Name="WHOUSE_DW",Kind="Database"]}[Data],\n DWH_Schema = SPARK_DW_Database{[Name="DWH",Kind="Schema"]}[Data],\n D_ACCOUNT_CURR_View = DWH_Schema{[Name="D_ACCOUNT_CURR",Kind="View"]}[Data],\n #"Filtered Rows" = Table.SelectRows(D_ACCOUNT_CURR_View, each ([PAYMENT_TYPE] = "POSTPAID") and ([ACCOUNT_SEGMENT] <> "Consumer") ),\n #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"DESCRIPTION", "ACCOUNT_NUMBER"})\nin\n #"Removed Other Columns"'}''','''{'expression': 'let\n Source = DB2.Database("69.699.69.69", "WHUDB", [HierarchicalNavigation=true, Implementation="Microsoft", Query="SELECT\n base.HEAD_PARTY_NO,\n base.HEAD_PARTY_NAME,\n usg.BILL_MONTH,\n base.CUSTOMER_NUMBER,\n base.ACCOUNT_NUMBER,\n base.CHARGE_ARRANGEMENT_NUMBER,\n usg.DATA_MB,\n usg.DATA_MB/1024 as Data_GB,\n base.PRODUCT_DESCRIPTION,\nbase.LINE_DESCRIPTION\n\nFROM PRODUCT.MOBILE_ACTIVE_BASE base\nLEFT JOIN PRODUCT.MOBILE_USAGE_SUMMARY usg\n\nON\n base.CHARGE_ARRANGEMENT_NUMBER = usg.CHARGE_ARRANGEMENT_NUMBER\n\nand \nbase.CHARGE_ARRANGEMENT_ID = usg.CHARGE_ARRANGEMENT_ID\n\nWHERE base.PRODUCT_DESCRIPTION LIKE \'%Share%\' \n--AND (base.HEAD_PARTY_NO = 71474425 or base.HEAD_PARTY_NO = 73314303)\nAND usg.BILL_MONTH BETWEEN (current_date - 5 MONTHS) and CURRENT_DATE \nOrder by base.ACCOUNT_NUMBER,Data_MB desc with ur"]),\n #"Added Custom1" = Table.AddColumn(Source, "Line Number", each Text.Middle([CHARGE_ARRANGEMENT_NUMBER],1,14)),\n #"Renamed Columns" = Table.RenameColumns(#"Added Custom1",{{"LINE_DESCRIPTION", "Line Description"}, {"BILL_MONTH", "Bill Month"}}),\n #"Filtered Rows" = Table.SelectRows(#"Renamed Columns", each ([PRODUCT_DESCRIPTION] <> "Sharer Unlimited NZ & Aus mins + Unlimited NZ & Aus texts" and [PRODUCT_DESCRIPTION] <> "Sharer with Data Stretch"))\nin\n #"Filtered Rows"'}''']})
myDF
# Why isn't this working?
# This regex works on REGEX101: https://regex101.com/r/ILnpS0/1
regex_db = r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF['SQLDB connection2'] = myDF['conn_str'].str.extract(regex_db, expand=True)
myDF
# This is a simplified version of the above REGEX; it works and extracts the text "DB2.Database"
regex_db2 = r'(DB2.Database)'
myDF['SQLDB connection1'] = myDF['conn_str'].str.extract(regex_db2, expand=True)
myDF
Any suggestions on what I am doing wrong?
Try running your regex in DOTALL mode, so that .* can match across newlines:
import re

regex_db = r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF["SQLDB connection2"] = myDF["conn_str"].str.extract(regex_db, expand=True, flags=re.S)
myDF
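An equivalent sketch, assuming the same myDF built above, uses the inline (?s) flag so no flags argument (or re import) is needed:

# (?s) switches on DOTALL inside the pattern itself
regex_db = r'(?s)(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF['SQLDB connection2'] = myDF['conn_str'].str.extract(regex_db, expand=True)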

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted.
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Is there any way to make my title regex more specific so that it only picks up book titles and authors, and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re

titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]

string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString

finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This will pull out each line of the text, and parse it, knowing that the book title is the first line after the equals, and the quote itself is below that.
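A minimal sketch of how this could replace the substitution loop in the original program, reusing quoteInfoRegex from above and the file paths from the question:

import re

# Read the whole clippings file, then write only the highlighted text,
# one quote per line.
data = open('/Users/devinnagami/myclippings.txt').read()
with open('booknotes.txt', 'w') as f:
    for m in quoteInfoRegex.finditer(data):
        f.write(m.group('quote') + '\n')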

Filter built with Regex cannot target more than one word at a time (Django 2.1.5)

I have a custom filter that highlights the keywords that the user had put into the search bar (like on Google search). However, as of now, it only highlights the last word of the keywords. For example, if the keywords are "American film industry", only "industry" will be highlighted. But I want all three words to be highlighted whenever and wherever they are present on the webpage (even if they aren't next to each other). To treat the keywords string as individual words, I have split the keywords:
def highlight(value, search_term, autoescape=True):
    search_term_list = search_term.split()
    search_term_word = ''
    for search_term_word in search_term_list:
        pattern = re.compile(re.escape(search_term_word), re.IGNORECASE)
        new_value = pattern.sub('<span class="highlight">\g<0></span>', value)
    return mark_safe(new_value)
Any idea why the filter only highlights the last word and how to make the code work?
Here is an alternative to the solution proposed by @WiktorStribiżew:
import re

def highlight(value, search_term):
    pattern = r'{}'.format(search_term.replace(' ', '|'))
    return re.sub(pattern, r'<span class="highlight">\g<0></span>', value)

highlight('Hello one world', 'one world')
# 'Hello <span class="highlight">one</span> <span class="highlight">world</span>'
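For reference, the original filter only highlights the last word because each pass substitutes into the untouched value rather than into the result of the previous pass, so earlier highlights are thrown away. A minimal sketch of the original filter with that one-line fix (mark_safe imported as in a normal Django template tag module):

import re
from django.utils.safestring import mark_safe

def highlight(value, search_term, autoescape=True):
    for search_term_word in search_term.split():
        pattern = re.compile(re.escape(search_term_word), re.IGNORECASE)
        # re-assign value so the previous words' highlighting is kept
        value = pattern.sub(r'<span class="highlight">\g<0></span>', value)
    return mark_safe(value)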

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all of the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"", and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]":"" — the [a-z] part gets escaped to \[a\-z\] by re.escape, so the pattern no longer matches.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}

pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping, this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
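A rough sketch of that route, assuming the package is installed (pip install unidecode); it transliterates the curly quotes to plain ASCII rather than deleting them:

from unidecode import unidecode

text_1 = u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it.'
print(unidecode(text_1))
# I'm 'winning', I've enjoyed none of it.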
The simplest way is this regex:
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)

How to get a value for a key in a string, when followed by another specific key=value set

my code is like:
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
pattern = r'title=(.*?) color=red'
print re.compile(pattern).search(string).group(0)
and I got:
"title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
But I want to find the contents of each "title" that is immediately followed by "color=red".
You want what immediately precedes color=red? Then use
.*title=(.*?) color=red
Demo: https://regex101.com/r/sR4kN2/1
The leading .* greedily consumes as much as possible and backtracks only to the last title= whose value is followed by color=red, so only the desired title is captured.
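A quick check with the string from the question:

import re

string = 'title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red'
print(re.search(r'.*title=(.*?) color=red', string).group(1))  # xxxyyy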
Alternatively, if you know there is a character that doesn't appear in the title, you can simplify by just using a character class exclusion. For example, if you know = won't appear:
title=([^=]*?) color=red
Or, if you know whitespace won't appear:
title=([^\s]*?) color=red
A third option, using a bit of code to find all red titles (assuming that the input always alternates title, color):
for title, color in re.findall(r'title=(.*?) color=(.*?)(?: |$)', string):
    if color == 'red':
        print(title)
If you want to get the last match of a sub-regexp before a certain regexp the solution is to use a greedy skipper. For example:
>>> pattern = '.*title="([^"]*)".*color="#123"'
>>> text = 'title="123" color="#456" title="789" color="#123"'
>>> print(re.match(pattern, text).group(1))
789
The first .* is greedy, so it skips as much as possible (thus skipping the first title) and backs up only as far as needed for the desired color to match.
As a simpler example consider that
a(.*)b(.*)c
processed on
a1111b2222b3333c
will match 1111b2222 in the first group and 3333 in the second.
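A quick check of that example:

import re

m = re.match(r'a(.*)b(.*)c', 'a1111b2222b3333c')
print(m.groups())  # ('1111b2222', '3333')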
Why don't you skip the regexes, and use some split functionality instead:
search_title = False
found = None
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
parts = string.split()
for part in parts:
    key, value = part.split('=', 1)
    if search_title:
        if key == 'title':
            found = value
            search_title = False
    if key == 'color' and value == 'red':
        search_title = True
print(found)
results in
xxxy
Regexes are nice, but can cause headaches at times.
Try this using the re module:
>>> import re
>>> string = 'title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red'
>>> re.search('(.*title=?)(.*) color=red', string).group(2)
'whatIwaht'
