I am trying to write a regex that matches columns in my dataframe. All the columns in the dataframe are
cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
'after_11_missing', 'after_12_missing', 'after_13_missing',
'after_14_missing', 'after_15_missing', 'after_16_missing',
'after_17_missing', 'after_18_missing', 'after_19_missing',
'after_1_missing', 'after_20_missing', 'after_21_missing',
'after_22_missing', 'after_2_missing', 'after_3_missing',
'after_4_missing', 'after_5_missing', 'after_6_missing',
'after_7_missing', 'after_8_missing', 'after_9_missing']
I want to select all the columns whose number ranges from 1-14, including their '_missing' counterparts.
This code works
df.filter(regex=r'^after_[1-9]$|after_([1-9]\D|1[0-4])').columns
but I'm wondering how to write it as a single pattern instead of splitting it into two alternatives. The first part selects all strings that end in a number between 1 and 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |) selects any string containing 'after_' followed by a digit 1-9 and a non-digit, or by a 1 followed by 0-4.
Is there a better way to write this?
I already tried
df.filter(regex = 'after_([1-9]|1[0-4])').columns
But that also picks up strings whose number merely starts with a matching digit (e.g. 'after_20', where 'after_2' already matches).
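For reference, I believe this happens because df.filter matches with re.search, so the unanchored pattern is found inside the longer name:

import re
# 'after_2' is found inside 'after_20', so the column is (wrongly) kept
print(re.search(r'after_([1-9]|1[0-4])', 'after_20'))  # matches 'after_2'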
Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b
import re
regexp = r'(after_)([1-9]|1[0-4])(_missing)*\b'
cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']
for i in cols:
    print(i, re.findall(regexp, i))
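Plugged back into df.filter, a minimal check (an empty frame built from just a sample of the question's column names, for illustration):

import pandas as pd

df = pd.DataFrame(columns=['after_1', 'after_9_missing', 'after_14_missing',
                           'after_15', 'after_20', 'after_22_missing'])
print(df.filter(regex=r'after_([1-9]|1[0-4])[a-zA-Z_]*\b').columns.tolist())
# ['after_1', 'after_9_missing', 'after_14_missing']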
I have a dataframe with website as one of the columns. I am trying to create a clean string column that excludes everything after .com / .net / .org / .edu, etc. My approach is to find their location and exclude anything after .com / .net by adding the appropriate number of characters.
**string**
https:/amazon.com
google.com/
http:/onlinelearning.edu/home
walmart.net/
https:/target.onlinesales.org/home/goods
https:/target.onlinesales.de/home/goods
**new string**
https:/amazon.com
google.com
http:/onlinelearning.edu
walmart.net
https:/target.onlinesales.org
https:/target.onlinesales.de
For the ones that contain .com:
df['length'] = np.where(df['string'].str.contains('.com'), df['string'].str.find('.com') + 4, df['string'].str.len())
df['new_string'] = [y[:x] for (x, y) in zip(df['length'], df['string'])]
This is a job for regex. You can use pd.Series.str.replace with negative lookbehind:
print (df["col"].str.replace("(?<!:)/.*", ""))
Or alternatively list out all your req domain by positive lookbehind:
print (df["col"].str.replace("(?:(?<=com)|(?<=edu)|(?<=org)|(?<=de)|(?<=net))/.*", ""))
0                https:/amazon.com
1                       google.com
2         http:/onlinelearning.edu
3                      walmart.net
4    https:/target.onlinesales.org
5     https:/target.onlinesales.de
Name: col, dtype: object
You can further refine the pattern to suit more cases.
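A self-contained sketch of the lookbehind approach (assuming a column named col; newer pandas versions need regex=True for pattern replacement):

import pandas as pd

df = pd.DataFrame({"col": [
    "https:/amazon.com", "google.com/", "http:/onlinelearning.edu/home",
    "walmart.net/", "https:/target.onlinesales.org/home/goods",
]})

# Remove everything from the first "/" that is not preceded by ":"
df["new_string"] = df["col"].str.replace(r"(?<!:)/.*", "", regex=True)
print(df["new_string"].tolist())
# ['https:/amazon.com', 'google.com', 'http:/onlinelearning.edu',
#  'walmart.net', 'https:/target.onlinesales.org']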
I am passing about 11,000 texts, extracted from a csv as a dataframe, to the remove_unique function. In the function I find the unique words and save them as a list named "unique". The list is built from all the unique words found in the entire column.
Using regex I'm trying to remove the unique words from each row (single column) of the pandas dataframe, but the unique words do not get removed as expected; instead all the words are removed and an empty "text" is returned.
def remove_unique(text):
    # Gets all the unique words in the entire corpus
    unique = list(set(text.str.findall(r"\w+").sum()))
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(unique)))
    # Ideally should remove the unique words from the corpus.
    text = text.apply(lambda x: re.sub(pattern, '', x))
    return text
Can somebody point out what the issue is?
before
0 card layout broken window resiz unabl error ex...
1 chart lower rang border patch merg recheck...
2 left align text team close c...
3 descript sma...
4 list disappear navig make contain...
Name: description_plus, dtype: object
after
0
1 ...
2
3
4 ...
Name: description_plus, dtype: object
Not sure I totally understand. Are you trying to see if a word appears multiple times throughout the entire column?
Maybe
import re
a_list = list(df["column"].values) #column to list
string = " ".join(a_list) # list of rows to string
words = re.findall(r"(\w+)", string) # split to single list of words
print([item for item in words if words.count(item) > 1]) #list of words that appear multiple times
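If the goal was instead to remove words that occur only once in the whole column (note that, as written in the question, "unique" contains every word in the corpus, which is why everything gets removed), a hedged sketch along those lines (function name invented):

import re
from collections import Counter
import pandas as pd

def remove_hapaxes(text: pd.Series) -> pd.Series:
    # Count every word across the whole column
    counts = Counter(text.str.findall(r"\w+").sum())
    # Words that appear exactly once in the corpus
    once = [w for w, n in counts.items() if n == 1]
    if not once:
        return text
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, once))))
    return text.apply(lambda x: pattern.sub('', x))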
I am new to regex expressions.
I am trying to filter out values from 2 columns.
The first column look like this:
_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE
The second column look like this:
_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad>mon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139
My goal is to extract the gclid values from both columns, so that I would have 2 new columns: Gclid from cookie and Gclid from URL.
What I have so far:
def get_glid_from_source(pattern, data):
    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)
df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)
I would need to edit the expression so that:
1. Gclid can only start with a letter a-z or A-Z
2. Gclid ends with one of the following - ;, %, & or the end of string
Now after filtering I get values from the second column that have click_id=ZFGe... because the value I need can come after either gclid= or gclid&click_id=.
EDIT
I have a third column which looks like this:
_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...
I am making a new column in the pandas dataframe where I add TRUE or FALSE values depending on whether these conditions are met in the above column:
If the link has:
.com/fire-start.php or .com/go/ or .com/api/
The new column has the value FALSE if the pattern is not found in the string; if it is found, the value TRUE is passed.
What I tried:
df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')
But that does not seem to work.
Thank you for your help, appreciate it.
You may use
df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')
See the regex demo
The Gclid_from_cookie can be populated using
df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')
See this regex demo
Note that [gclid|click_id] matches any 1 char defined in the character set, a g, c, l, i, d, |, k or _, not a sequence of chars, hence the non-capturing group (?:...) in my pattern.
The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than &, #, ;, %.
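For instance, on an abridged cookie value from the question:

import pandas as pd

df = pd.DataFrame({'_source.cookie': [
    '__cfduid=d118f2; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; country_code=OM']})
print(df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')[0].tolist())
# ['EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE']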
As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not, [^not] matches any char but n, o and t.
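A quick illustration of that difference:

import re
# [^not]+ matches runs of characters other than n, o and t;
# it does not mean "anything except the word 'not'"
print(re.findall(r'[^not]+', 'not a word'))  # [' a w', 'rd']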
Add
import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))
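A hedged usage sketch (URL values shortened from the question's examples):

import re
import pandas as pd

df = pd.DataFrame({'_source.request_url': [
    'www.google.com/api/test', 'www.google.com/go/test',
    'www.google.com/fire-start.php/test', 'www.google.com/test']})
filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df['validate'] = df['_source.request_url'].str.contains('|'.join(map(re.escape, filters)))
print(df['validate'].tolist())  # [True, True, True, False]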
I am working on text analytics and I am stuck on one problem.
I am trying to find the surrounding words (5 or more) for each word in a string column in a pandas dataframe. A dummy dataframe is shown in a screenshot. I have an id column and a text column. I am trying to create a new dataframe which has four columns (ID, Before, Word, After) as shown in the second screenshot (result dataframe) attached.
For example:
[screenshots: dummy dataframe and result dataframe]
Initially I thought about using df.Text.extractall(...),
with 3 capturing groups (Before, Word and After), but the downside
was that e.g. the After group in one match could consume the content
that in the next match could be either the Word or at least the Before
group.
So I decided to do it another way:
1. Apply to each row a function, returning a "partial" result for this row.
2. Gather the results in a list of DataFrames.
3. Concatenate them.
Setup
Source DataFrame:
ID Text
0 ID1 The Company sells its products worldwide through its wide network of
1 ID2 Provides one of most often used search engines for HTTP sites
2 ID3 The most known of its products is the greatest airliner of the world
3 ID4 Xyz nothing
Note that I added a "no match" row (ID4).
Words to match:
words = ['products', 'most', 'for']
No of words before / after:
wNo = 3
In your code change it to whatever number you want.
The solution
The function finding matches in the current row:
def find(row, wanted, wNo):
    wList = re.split(r'\W+', row.Text)
    wListLC = list(map(lambda x: x.lower(), wList))
    res = []
    for wd in wanted:  # Check each "wanted" word
        for indW in [i for i, x in enumerate(wListLC) if x == wd]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])
Parameters are:
row - the source row,
wanted - the list of "wanted" words (lower case),
wNo - number of words before / after the wanted word.
For each match found, the result contains a row with:
ID - from the current row,
Before, Word, After - respective parts of the current match.
Of course, the actual number of words in the Before / After group can be
smaller, if there are not enough such words in the current row.
Note that this function splits the source row into two lists:
wList - "original" words, to return later,
wListLC - words converted to lower case, to match (remember that the
"wanted" list should also be in lower case).
The result is a "partial" DataFrame (for this row, if no match then empty),
to be later concatenated with other partial results.
And now, how to use this function: To gather partial results, as a list
of DataFrames run:
tbl = df.apply(find, axis=1, wanted=words, wNo=wNo).tolist()
And to generate the final result, run:
pd.concat(tbl, ignore_index=True)
For my source data, the result is:
ID Before Word After
0 ID1 Company sells its products worldwide through its
1 ID2 Provides one of most often used search
2 ID2 used search engines for HTTP sites
3 ID3 known of its products is the greatest
4 ID3 The most known of its
Note that Before / After group can be an empty string, but only
in cases when the Word was either the first or the last in the current row.
How to speed up this solution
Some increase in speed can be achieved with the following steps:
Compile the regex in advance (pat = re.compile(r'\W+')) and use
it in the function finding matches.
Drop additional parameters and use global variables instead.
So the function can be:
def find2(row):
    wList = re.split(pat, row.Text)
    wListLC = list(map(lambda x: x.lower(), wList))
    res = []
    for wd in words:  # Check each "wanted" word
        for indW in [i for i, x in enumerate(wListLC) if x == wd]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])
And to call it, run:
tbl = df.apply(find2, axis=1).tolist()
pd.concat(tbl, ignore_index=True)
I compared both variants using %timeit (for my test data) and
the average execution time dropped from 46 to 39 ms (16 % shorter).
For larger dataset the difference should be more significant.
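(For reference, a comparison along these lines can be reproduced in IPython/Jupyter with df, words and wNo defined as above:)

# IPython magic, not plain Python
%timeit pd.concat(df.apply(find, axis=1, wanted=words, wNo=wNo).tolist(), ignore_index=True)
%timeit pd.concat(df.apply(find2, axis=1).tolist(), ignore_index=True)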
I have a dataframe that looks like this:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
I want to be able to strip ,60 ,R-12,HT using regex and also delete the moreinfo and GoCats rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings
to_remove = ['hello', 'moreinfo']  # 'del' is a reserved word in Python
for i in to_remove:
    df = df[df['value'] != i]
Can somebody suggest a way to use regex to match the A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern and delete all rows that don't match, so I don't have to create a list of all possible cases?
I was able to strip a single line like this, but I want to be able to strip all matching cases in the dataframe:
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
    line = line.replace(i, '')
>>> MRF2-050A-TFC
I tried adjusting my code but it prints out the same output for each row
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
    for i in re.findall(pattern, d):
        d = d.replace(i, '')
Any suggestions will be greatly appreciated. Thanks
You may try this
(?:\w+-){2,}[^,\n]*
Demo
A Python script may be as follows:
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re
regx = re.compile(r'(?:\w+-){2,}[^,\n]*')
m = regx.findall(ss)
for i in range(len(m)):
    print("%d %s" % (i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
Here's a simpler approach you can try using pandas' built-in string methods (note that str.replace here still takes a regex pattern).
# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '', regex=True)
# drop na
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
from io import StringIO
df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''),header=1)
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w where possible (like ^[A-Z]\w{3}) if you know the four-character chunk at the beginning should start with a letter.
edit
Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*
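A hedged usage sketch with the first pattern, applied per row (data rebuilt from the question, index digits assumed not to be part of the values):

import pandas as pd

df = pd.DataFrame({'value': ['A067-M4FL-CAA-020', 'MRF2-050A-TFC,60 ,R-12,HT',
                             'moreinfo', 'MZF8-050Z-AAB', 'GoCats',
                             'MZA2-0580-TFD,60 ,R-669,LT']})
df['value'] = df['value'].str.extract(r'(^\w{4}-\w{4}-\w{3}(?:-\d{3})?)', expand=False)
df = df.dropna().reset_index(drop=True)
print(df)
#                value
# 0  A067-M4FL-CAA-020
# 1      MRF2-050A-TFC
# 2      MZF8-050Z-AAB
# 3      MZA2-0580-TFD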