Filtering values from string with regex - python

I am new to regex expressions.
I am trying to filter out values from 2 columns.
The first column look like this:
_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE
The second column look like this:
_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad&gtmon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139
My goal is to extract glid values from both columns so that I would have 2 new columns Glid from cookie and Gclid from URL
What I have so far:
def get_glid_from_source(pattern, data):
result = re.search(pattern, str(data))
if result is not None:
return result.group(1)
return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)
df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)
I would need to edit the expression so that:
1. Gclid can only start with a letter a-z or A-Z
2. Gclid ends with one of the following - ;, %, & or the end of string
Now after filtering I get values wfrom the second column that have click_id=ZFGe... because the value I need can be either gclid=Value I need or gclid&click_id= Value I need
EDIT
I have a third column which looks like this:
_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...
I am making new column in the pandas dataframe where I add TRUE or FALSE values depending if these conditions are met in the above column:
If the link has:
.com/fire-start.php or .com/go/ or .com/api/
New column would have value as FALSE if the patter is not found in the string 'TRUE' value is passed.
What I tried:
df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')
But that does not seem to work.
Thank you for your help, appreciate it.

You may use
df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')
See the regex demo
The Gclid_from_cookie can be populated using
df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')
See this regex demo
Note that [gclid|click_id] matches any 1 char defined in the character set, a g, c, l, i, d, |, k or _, not a sequence of chars, hence the non-capturing group (?:...) in my pattern.
The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than &, #, ;, %.
As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not, [^not] matches any char but n, o and t.
Add
import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))

Related

Python - select string based on length from second dataframe

I have a dataframe with website as one of the columns. Trying to create a clean string column that excludes everything after .com/ .net / .org / .edu , etc., my approach is to find the location of them and exclude anything after .com/.net by adding appropriate characters
**string**
https:/amazon.com
google.com/
http:/onlinelearning.edu/home
walmart.net/
https:/target.onlinesales.org/home/goods
https:/target.onlinesales.de/home/goods
**new string**
https:/amazon.com
google.com
http:/onlinelearning.edu
walmart.net
https:/target.onlinesales.org
https:/target.onlinesales.de
for the ones that contains .com
df['length'] = np.where(df['string'].str.contains('.com'), df['string'].str.find('.com') + 4, df['string'].str.len())
df['new_string'] = [y[:x] for (x, y) in zip(df['length'], account_dt['string'])]
This is a job for regex. You can use pd.Series.str.replace with negative lookbehind:
print (df["col"].str.replace("(?<!:)/.*", ""))
Or alternatively list out all your req domain by positive lookbehind:
print (df["col"].str.replace("(?:(?<=com)|(?<=edu)|(?<=org)|(?<=de)|(?<=net))/.*", ""))
0 -https:/amazon.com
1 -google.com
2 -http:/onlinelearning.edu
3 -walmart.net
4 -https:/target.onlinesales.org
5 -https:/target.onlinesales.de
Name: col, dtype: object
You can further refine the pattern to suit more cases.

Find values using regex (includes brackets)

it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

Select strings based on numeric range between words

I am trying to write a regex that matches columns in my dataframe. All the columns in the dataframe are
cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
'after_11_missing', 'after_12_missing', 'after_13_missing',
'after_14_missing', 'after_15_missing', 'after_16_missing',
'after_17_missing', 'after_18_missing', 'after_19_missing',
'after_1_missing', 'after_20_missing', 'after_21_missing',
'after_22_missing', 'after_2_missing', 'after_3_missing',
'after_4_missing', 'after_5_missing', 'after_6_missing',
'after_7_missing', 'after_8_missing', 'after_9_missing']
I want to select all the columns that have values in the strings that range from 1-14.
This code works
df.filter(regex = '^after_[1-9]$|after_([1-9]\D|1[0-4])').columns
but I'm wondering how to make it in one line instead of splititng it in two. The first part selects all strings that end in a number between 1 and 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |), selects any string that begins with 'after' and is between 1 and 9 and is followed by a word character, or begins with 1 and is followed by 0-4.
Is there a better way to write this?
I already tried
df.filter(regex = 'after_([1-9]|1[0-4])').columns
But that picks up strings that begin with a 1 or a 2 (i.e. 'after_20')
Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b
import re
regexp = '''(after_)([1-9]|1[0-4])(_missing)*\\b'''
cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']
for i in cols:
print(i , re.findall(regexp, i))

Pandas Python Regular Expression Assistance

I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find sentences that contain the word "where" in them. Once I have identified that, I am trying to find cases where the word "SQL" command is also located within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty: #if one of the search criteria is found
if not sql_command_regex_stuff.empty: #check to see if the phrase "sqlcommand" is found anywhere as well
(insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
I feel like this is the right direction, doesn't work, but I feel like this is the right step.
You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: strings that do not contain 'sqlcommand' or 'where'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...
First of all, you have to have proper XML and SQL content, so you should
make the following corrections:
As the opening tag is <properties>, the closing tag must also be
</properties>, not </property>.
version, description and type are attributes (after them
there is > closing the opening tag, so after properties there
should be a space, not >.
Remove , after version="2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering
expression.
Below you have an example program:
import pandas as pd
import re
df4 = pd.DataFrame({
'Q' : 'file.sql',
'R' : [
'<s>dave</s><properties type="SqlCommand">select id, name '
'from table where criteria=\'5\'</properties><s>dave</s>',
'<s>dave</s><properties>select id, name from table '
'where criteria=\'6\'</properties><s>dave</s>',
'<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
'</properties><s>mike</s>' ]})
df5 = df4[df4.R.str.contains(
'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
flags=re.IGNORECASE)]
print(df5)
Note that the regex takes care about the proper sequence of
strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?).
so we are still within the just opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and >
([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and >
([^<>]+?).
And finally where (also ignoring case).
An attempt to check for sqlcommand and where in two separate
regexes is wrong, as these words can be at other locations,
which do not meet your requirement.

Repeated regex groups of arbitrary number

I have this example text snippet
headline:
Status[apphmi]: blubb, 'Statustext1'
Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'
Popup[apphmi]: blaaa, 'Popuptext1'
and I want to extract the words within '', but sorted with the context (status, main, popup).
My current regex is (example at pythex.org):
headline:(?:\n +Status\[apphmi\]:.* '(.*)')*(?:\n +Main\[apphmi\]:.* '(.*)')*(?:\n +Popup\[apphmi\]:.* '(.*)')*
but with this I only get 'Maintext2' and not both. I don't know how to repeat the groups to an arbitrary number.
You can try with this:
r"(.*?]):(?:[^']*)'([^']*)'"g
Look here
Group1 and Group 2 for each match contains your key value pair
You can not merge the second match as one by using regex, once you get all the pairs... you can apply some programming here to merge duplicate keys as one.
Here I have used dictionary of list, if a key already exists in the dictionary then you should append the value to the list , otherwise insert a new key with a new list having the value.
This is how it should be done (tested in python 3+)
import re
d = dict()
regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
"Status[apphmi]: blubb, 'Statustext1'\n"
"Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
"Popup[apphmi]: blaaa, 'Popuptext1'")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
if match.group(1) in d:
d[match.group(1)].append(match.group(2))
else:
d[match.group(1)] = [match.group(2),]
print(d)
Output:
{
'Popup[apphmi]': ['Popuptext1'],
'Main[apphmi]': ['Maintext1', 'Maintext2'],
'Status[apphmi]': ['Statustext1']
}

Categories

Resources