Pandas Python Regular Expression Assistance

I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find sentences that contain the word "where". Once I have identified those, I am trying to find cases where the word "SqlCommand" is also located within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
1 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty: #if one of the search criteria is found
    if not sql_command_regex_stuff.empty: #check to see if the phrase "sqlcommand" is found anywhere as well
        (insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
It doesn't work yet, but I feel like this is a step in the right direction.

You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: rows that do not contain both 'sqlcommand' and 'where'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...

First of all, you have to have proper XML and SQL content, so you should
make the following corrections:
As the opening tag is <properties>, the closing tag must also be
</properties>, not </property>.
version, description and type are attributes (after them
there is > closing the opening tag), so after properties there
should be a space, not >.
Remove , after version="2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering
expression.
Below you have an example program:
import pandas as pd
import re
df4 = pd.DataFrame({
'Q' : 'file.sql',
'R' : [
'<s>dave</s><properties type="SqlCommand">select id, name '
'from table where criteria=\'5\'</properties><s>dave</s>',
'<s>dave</s><properties>select id, name from table '
'where criteria=\'6\'</properties><s>dave</s>',
'<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
'</properties><s>mike</s>' ]})
df5 = df4[df4.R.str.contains(
'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
flags=re.IGNORECASE)]
print(df5)
Note that the regex takes care of the proper sequence of
strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?),
so we are still within the just opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and >
([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and >
([^<>]+?).
And finally where (also ignoring case).
An attempt to check for sqlcommand and where in two separate
regexes is wrong, as these words can be at other locations,
which do not meet your requirement.
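To see why two independent checks can give a false positive, here is a small sketch. The two sample rows and the variable names are made up for illustration; the combined pattern is the one from the example program above.

```python
import re
import pandas as pd

# Hypothetical rows: in the second one, "where" sits in a different tag
# than the SqlCommand properties element.
df = pd.DataFrame({"R": [
    '<properties type="SqlCommand">select id from t where x = 1</properties>',
    '<s>where is dave</s><properties type="SqlCommand">drop table t</properties>',
]})

# Two separate checks: both words appear somewhere in the string.
both_words = (
    df["R"].str.contains("where", flags=re.IGNORECASE)
    & df["R"].str.contains("sqlcommand", flags=re.IGNORECASE)
)

# One combined pattern: "where" must follow the SqlCommand opening tag,
# with no other tag in between.
same_tag = df["R"].str.contains(
    r"<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where",
    flags=re.IGNORECASE,
)

print(both_words.tolist())
print(same_tag.tolist())
```

The separate checks accept both rows, while the combined pattern only accepts the row where the words appear in the required order inside the same element.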

Related

Pandas REGEX not returning expected results using "extract"

I am attempting to use REGEX to extract connection strings from blocks of text in a pandas dataframe.
My REGEX works on REGEX101.com (see Screenshot below). Link to my saved test here: https://regex101.com/r/ILnpS0/1
When I try to run the REGEX in a Pandas dataframe, I don’t get any REGEX matches/extracts (but no error either), despite getting matches on REGEX101. Link to my code in a Google Colab notebook: https://colab.research.google.com/drive/1WAMlGkHAOqe38Lzo_K0KHwD_ynVJyIq1?usp=sharing
Therefore the issue appears to be how pandas is interpreting my REGEX.
Can anyone identify why I am not getting any REGEX matches using pandas?
REGEX Logic
My REGEX consists of 3 groups
(?=Source = DB2.Database)(.*?)(?=\]\))
Group 1: (?=Source = DB2.Database) is a “Lookbehind” that looks for the text “Source = DB2.Database” i.e the start of my connection string.
Group 2: (.*?) looks for any characters and acts as a span between the 1st and 3rd group.
Group 3: (?=\]\)) is a look behind assertion that aims to identify the end of the connection string.
Additional tests:
When I run a simplified version of the REGEX (DB2.Database) I get the match, as expected. This example is also in the notebook linked above.
My code (same as in linked Colab Notebook)
import pandas as pd
myDF = pd.DataFrame({'conn_str':['''{'expression': 'let\n Source = Snowflake.Databases("whitehouse.australia-east.azure.snowflakecomputing.com","USER"),\n WH_DW_Database = Source{[Name="WHOUSE_DW",Kind="Database"]}[Data],\n DWH_Schema = SPARK_DW_Database{[Name="DWH",Kind="Schema"]}[Data],\n D_ACCOUNT_CURR_View = DWH_Schema{[Name="D_ACCOUNT_CURR",Kind="View"]}[Data],\n #"Filtered Rows" = Table.SelectRows(D_ACCOUNT_CURR_View, each ([PAYMENT_TYPE] = "POSTPAID") and ([ACCOUNT_SEGMENT] <> "Consumer") ),\n #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"DESCRIPTION", "ACCOUNT_NUMBER"})\nin\n #"Removed Other Columns"'}''','''{'expression': 'let\n Source = DB2.Database("69.699.69.69", "WHUDB", [HierarchicalNavigation=true, Implementation="Microsoft", Query="SELECT\n base.HEAD_PARTY_NO,\n base.HEAD_PARTY_NAME,\n usg.BILL_MONTH,\n base.CUSTOMER_NUMBER,\n base.ACCOUNT_NUMBER,\n base.CHARGE_ARRANGEMENT_NUMBER,\n usg.DATA_MB,\n usg.DATA_MB/1024 as Data_GB,\n base.PRODUCT_DESCRIPTION,\nbase.LINE_DESCRIPTION\n\nFROM PRODUCT.MOBILE_ACTIVE_BASE base\nLEFT JOIN PRODUCT.MOBILE_USAGE_SUMMARY usg\n\nON\n base.CHARGE_ARRANGEMENT_NUMBER = usg.CHARGE_ARRANGEMENT_NUMBER\n\nand \nbase.CHARGE_ARRANGEMENT_ID = usg.CHARGE_ARRANGEMENT_ID\n\nWHERE base.PRODUCT_DESCRIPTION LIKE \'%Share%\' \n--AND (base.HEAD_PARTY_NO = 71474425 or base.HEAD_PARTY_NO = 73314303)\nAND usg.BILL_MONTH BETWEEN (current_date - 5 MONTHS) and CURRENT_DATE \nOrder by base.ACCOUNT_NUMBER,Data_MB desc with ur"]),\n #"Added Custom1" = Table.AddColumn(Source, "Line Number", each Text.Middle([CHARGE_ARRANGEMENT_NUMBER],1,14)),\n #"Renamed Columns" = Table.RenameColumns(#"Added Custom1",{{"LINE_DESCRIPTION", "Line Description"}, {"BILL_MONTH", "Bill Month"}}),\n #"Filtered Rows" = Table.SelectRows(#"Renamed Columns", each ([PRODUCT_DESCRIPTION] <> "Sharer Unlimited NZ & Aus mins + Unlimited NZ & Aus texts" and [PRODUCT_DESCRIPTION] <> "Sharer with Data Stretch"))\nin\n #"Filtered Rows"'}''']})
myDF
#why isn't this working?
#this regex works on REGEX 101 : https://regex101.com/r/ILnpS0/1
regex_db =r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF['SQLDB connection2'] = myDF['conn_str'].str.extract(regex_db ,expand=True)
myDF
#This is a simplified version of the above REGEX, and works to extracts the text "DB2.Database"
#This works fine
regex_db2 =r'(DB2.Database)'
myDF['SQLDB connection1'] = myDF['conn_str'].str.extract(regex_db2 ,expand=True)
myDF
Any suggestions on what I am doing wrong?

Try running your regex in dot all mode, so that .* will match across newlines:
import re
regex_db = r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF["SQLDB connection2"] = myDF["conn_str"].str.extract(regex_db, expand=True, flags=re.S)
myDF
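A minimal, self-contained sketch of the difference the flag makes. The sample string is a shortened, made-up stand-in for the real connection string, and the dot in DB2.Database is escaped here, which the original pattern does not do.

```python
import re
import pandas as pd

# Hypothetical, shortened connection string containing a newline
# between the start marker and the closing "])".
s = pd.Series(
    ['let\n Source = DB2.Database("host", "DB", [Query="SELECT\n 1"]),\n in Source']
)

pattern = r'(?=Source = DB2\.Database)(.*?)(?=\]\))'

# Without DOTALL, "." stops at the newline, so the closing "])" is never reached.
without_flag = s.str.extract(pattern, expand=False)

# With DOTALL (same as re.S), "." also matches newlines.
with_flag = s.str.extract(pattern, flags=re.DOTALL, expand=False)

print(without_flag.isna().all())
print(repr(with_flag.iloc[0]))
```

regex101 may have matched only because of the options selected there, so it is worth checking that the same flags are passed to pandas.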

Split String based on multiple Regex matches

First of all, I checked these previous posts (1, 2, 3), and they did not help me.
I have this string (or a similar one) that needs to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is:
1. Detect the word Table and the word(s) before it, if any.
2. Detect the numbers that follow; they can be in one of these formats: 6, 6-2, 66-22 or 66-2.
3. Finally, capture the rest of the string (in this case: Management of children study and actions).
After doing so, items 1 and 2 must be returned as one string and the rest as another string,
e.g. the returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
import re
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
    print(parts_of_title)
    print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement is matched as it should be, but the second is not, so I changed the code and used compile. That changed the regex behavior. The code now looks like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
    print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the matching?

The matching changes because:
In the first part, you call .group().split() where .group() returns the full match which is a string.
In the second part, you call re.compile("...").split() where re.compile returns a regular expression object.
In the pattern, the part [a-zA-Z0-9]+[ ] matches only a single word. Also, in [0-9]([-][0-9]+)? the first (single) digit is not part of the capture group, so if the whole number should be captured, that digit has to move inside the group.
You could write the pattern using 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want point 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')
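The two-group pattern could be wrapped in a small helper along these lines. The function name split_title and the extra sample inputs are hypothetical, added only to show the pattern on the other number formats mentioned in the question.

```python
import re

# Two capture groups: (optional leading words + keyword + number) and (rest).
pattern = re.compile(
    r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
)

def split_title(text):
    """Return (label, rest), or None when the text does not look like a caption."""
    m = pattern.match(text)
    return m.groups() if m else None

label, rest = split_title("Text Table 6-2: Management of children study and actions")
joined = f"{label}, {rest}"
print(joined)

# Other formats from the question, using made-up captions.
print(split_title("Figure 66-22: Overview"))
print(split_title("No caption here"))
```

Returning None for non-matching input keeps the caller from having to catch an AttributeError on a failed match.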

You already have answers, but I wanted to try your problem as practice, so here is what I came up with, in case you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the book's title and author:
titleRegex = re.compile('(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well). *
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
# Apply each regex in turn to strip the matched metadata.
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)

There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)
# data is assumed to hold the full text of the clippings file
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This parses each clipping entry as a unit, relying on the fact that the book title is on the first line after the row of equals signs and the quote itself comes after the info line.
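Here is a runnable sketch of that idea on a made-up clipping. The sample text (book, author, date) is invented, and the layout is assumed to match what the pattern expects; a real clippings file may differ.

```python
import re

quote_info = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>\d+) \| Location (?P<location>[\d-]+)"
    r" \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n",
    flags=re.MULTILINE,
)

# Hypothetical entry whose highlighted sentence itself ends in brackets.
sample = (
    "==========\n"
    "Example Book (Jane Doe)\n"
    "- Your Highlight on page 12 | Location 180-182 | Added on Monday, January 1, 2018 10:00:00 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
)

entries = [m.groupdict() for m in quote_info.finditer(sample)]
quotes = [entry["quote"] for entry in entries]
print(quotes)
```

Because the title is only looked for on the line right after the separator, the brackets inside the highlighted sentence are left alone.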

Find values using regex (includes brackets)

It's my first time with regex and I have some issues, which I hope you can help me with. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last bracket, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description: with the following pattern: [\n\r].*description:\s*([^\n\r]*) The output includes the quotes ("9710"), but that is fine and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex which would allow me to construct a pattern that retrieves the data in brackets after the specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the second part?
Thanks in advance.

You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
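Put together as a self-contained sketch, with the sample data from the question assigned to text. This is a slight variation on the expression above: the dot in newDate.setFullYear is escaped and the quantifiers are lazy, which is my own tweak, not part of the original answer.

```python
import re

text = '''chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );'''

# DOTALL lets ".*?" run across the line breaks between the two captures.
res = re.search(
    r'description: "([^"]+)".*?newDate\.setFullYear\((.*?)\);',
    text,
    re.DOTALL,
)
description, raw_date = res.groups()

# int() ignores surrounding whitespace, including the newlines.
year, month, day = (int(part) for part in raw_date.split(","))
print(description, year, month, day)
```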

import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)

The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

How to get a value for a key in a string, when followed by another specific key=value set

my code is like:
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
pattern = r'title=(.*?) color=red'
print(re.compile(pattern).search(string).group(0))
and I got
"title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
But I want to find all the contents of "title"s immediately followed by "color=red"

You want what immediately precedes color=red? Then use
.*title=(.*?) color=red
Demo: https://regex101.com/r/sR4kN2/1
This greedily matches everything that comes before color=red, so that only the desired title appears.
Alternatively, if you know there is a character that doesn't appear in the title, you can simplify by just using a character class exclusion. For example, if you know = won't appear:
title=([^=]*?) color=red
Or, if you know whitespace won't appear:
title=([^\s]*?) color=red
A third option, using a bit of code to find all red titles (assuming that the input always alternates title, color):
for title, color in re.findall(r'title=(.*?) color=(.*?)(?: |$)', string):
    if color == 'red':
        print(title)
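Building on the whitespace-exclusion variant, re.findall can collect every title that is immediately followed by color=red in one pass. This sketch assumes titles never contain whitespace; the trailing \b is my addition so that a value such as "reddish" would not count.

```python
import re

string = ("title=abcd color=green title=efgh color=blue title=xyxyx color=yellow "
          "title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red")

# \S+ cannot cross a space, so each match stays within one title=... token.
red_titles = re.findall(r'title=(\S+) color=red\b', string)
print(red_titles)
```

Unlike the greedy .* approach, which yields only the last matching title, this returns all of them.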

If you want to get the last match of a sub-regexp before a certain regexp the solution is to use a greedy skipper. For example:
>>> import re
>>> pattern = '.*title="([^"]*)".*color="#123"'
>>> text = 'title="123" color="#456" title="789" color="#123"'
>>> print(re.match(pattern, text).group(1))
789
The first .* is greedy: it skips as much as possible (thus skipping the first title), then backs up to the last title that still allows the desired color to match.
As a simpler example consider that
a(.*)b(.*)c
processed on
a1111b2222b3333c
will match 1111b2222 in the first group and 3333 in the second.
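The same example can be checked directly, next to a lazy variant for contrast. The lazy pattern is added here only for comparison and is not part of the answer above.

```python
import re

sample = "a1111b2222b3333c"

# Greedy first group: runs to the last "b" that still lets the rest match.
greedy = re.match(r"a(.*)b(.*)c", sample).groups()

# Lazy first group: stops at the first "b" instead.
lazy = re.match(r"a(.*?)b(.*)c", sample).groups()

print(greedy)
print(lazy)
```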

Why don't you skip the regexes, and use some split functionality instead:
search_title = False
found = None
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
parts = string.split()
for part in parts:
    key, value = part.split('=', 1)
    if search_title:
        if key == 'title':
            found = value
            search_title = False
    if key == 'color' and value == 'red':
        search_title = True
print(found)
results in
xxxy
Regexes are nice, but can cause headaches at times.
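Note that the loop above reports the title that comes after a color=red pair. If the goal is the titles that come immediately before color=red, as the question asks, a split-based sketch could remember the previous pair instead. The variable names here are my own.

```python
string = ("title=abcd color=green title=efgh color=blue title=xyxyx color=yellow "
          "title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red")

titles = []
previous_key, previous_value = None, None
for part in string.split():
    key, value = part.split("=", 1)
    # Keep the title only when color=red directly follows it.
    if key == "color" and value == "red" and previous_key == "title":
        titles.append(previous_value)
    previous_key, previous_value = key, value

print(titles)
```

Like the original loop, this assumes values contain neither spaces nor a missing "=".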

Try this using re module
>>> string = 'title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red'
>>> import re
>>> re.search('(.*title=?)(.*) color=red', string).group(2)
'whatIwaht'
>>> re.search('(.*title=?)(.*) color=yellow', string).group(2)
'xyxyx'
