Finding a regex in PySpark? - python

I have a column in a PySpark DataFrame that contains values separated by ;
+----------------------------------------------------------------------------------+
|name |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+
This column can contain any number of key-value pairs. If I use this:
df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1,False)
I can extract tppid, but when tppid is the last key-value pair in a row, nothing is extracted. I want a regex that can extract the value of a key wherever it appears in the row.

You may use a negated character class [^;] to match any char but ;:
tppid=([^;]+)
See the regex demo
Since the third argument to regexp_extract is 1 (accessing Group 1 contents), you may discard the lookbehind construct and use tppid= as part of the consuming pattern.
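As a quick check outside Spark, here is a minimal sketch using Python's re module on the sample value plus a second, hypothetical row where tppid comes last. The row contents and variable names are assumptions for illustration only.

```python
import re

# First row is the sample from the question; the second is a hypothetical
# row where tppid is the last key-value pair (no trailing ';').
rows = [
    "tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no",
    "xmaslist=no;tppid=abc123",
]

pattern = re.compile(r"tppid=([^;]+)")

# Group 1 holds everything after 'tppid=' up to the next ';' or the end of the string.
values = [pattern.search(row).group(1) for row in rows]
print(values)
```

In PySpark the same pattern string would go in as the second argument of regexp_extract, with 1 as the group index, as in the question's call. For a pattern this simple the behaviour should be the same there.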

In addition to Wiktor Stribiżew's answer, you can use anchors; $ denotes the end of the string.
tppid=\w+(?=;|\s|$)
This regex extracts only the value, without the tppid= part:
(?<=tppid=)\w+(?=;|\s|$)
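The two lookaround patterns can be compared with a small Python sketch. The input strings are hypothetical, and Python's re is used here only as a stand-in for checking the patterns.

```python
import re

text_last = "xmaslist=no;tppid=abc123"   # hypothetical: key is the last pair
text_first = "tppid=abc123;xmaslist=no"  # hypothetical: key is the first pair

with_key = re.compile(r"tppid=\w+(?=;|\s|$)")
value_only = re.compile(r"(?<=tppid=)\w+(?=;|\s|$)")

# Neither pattern has a capturing group, so the whole match (group 0) is used.
key_matches = [with_key.search(t).group(0) for t in (text_last, text_first)]
value_matches = [value_only.search(t).group(0) for t in (text_last, text_first)]
print(key_matches, value_matches)
```

Since these patterns contain no capturing group, regexp_extract would need group index 0 (the whole match) rather than 1.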

Related

Pandas: extractall - writing a capture group with or condition [duplicate]

I have columns in my dataframe (~2 milion rows) that look like this:
column
1/20/1"ADAF"
1/4/551BSSS
1/2/1AAAA
1/565/1 "AAA="
And I want to extract only:
1/20/1
1/4/551
1/2/1
1/565/1
I have tried with:
df['wanted_column'] = df['column'].str.extract(r'((\d+)/(\d+)/(\d+))', expand=True)
But I got an error:
ValueError: Wrong number of items passed 4, placement implies 1
Anyone knows where I am wrong? And if there is a better and faster solution for this, I would be thankful for a suggestion.
Thanks in advance.
If you want to extract a single part of a string into a single column, make sure your regex only contains a single capturing group. Remove all other capturing groups (if they are redundant) or convert them into non-capturing ones (if they are used as simple groupings for pattern sequences, e.g. (\W+\w+){0,3} -> (?:\W+\w+){0,3}).
Here, you can use
df['wanted_column'] = df['column'].str.extract(r'(\d+/\d+/\d+)', expand=True)
The point is to only use a single capturing group in the regex when you use it with str.extract to extract a value into a single column.
Mind that r'((\d+)/(\d+)/(\d+))' could be also re-written as r'((?:\d+)/(?:\d+)/(?:\d+))' for this use case, but these non-capturing groups would be redundant as they only group a single \d+ pattern in each of them, which makes no sense.
If you need to extract values into several columns, mind that the number of columns should equal the number of capturing groups in the pattern, e.g.
df[['Val1', 'Val2', 'Val3']] = df['column'].str.extract(r'(\d+)/(\d+)/(\d+)', expand=True)
# Val1, Val2 and Val3 receive capturing groups 1, 2 and 3 respectively
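The group count behind the ValueError can be checked without pandas. This is a small sketch using only the standard re module; the dictionary keys are made-up labels.

```python
import re

# Labels are hypothetical; the patterns are the ones discussed above.
patterns = {
    "question": r"((\d+)/(\d+)/(\d+))",  # outer group plus three inner groups
    "single": r"(\d+/\d+/\d+)",          # one group -> one column
    "three": r"(\d+)/(\d+)/(\d+)",       # three groups -> three columns
}

# A compiled pattern exposes its number of capturing groups as .groups
group_counts = {name: re.compile(p).groups for name, p in patterns.items()}
print(group_counts)

# The single-group pattern applied to the first sample value
first = re.search(patterns["single"], '1/20/1"ADAF"').group(1)
print(first)
```

The pattern from the question reports four groups, which matches the "4 items passed, placement implies 1" in the error message.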

Need help splitting a column in my DataFrame (Python)

I have a DataFrame "dt" in Python. One of its columns, "betName", holds strings that sometimes have +/- numbers after the names. I'm trying to figure out how to split "betName" into two columns, "betName" and "line", where "betName" is just the name and "line" holds the +/- number or plain number.
Please see screenshots, thank you for helping!
[Screenshots: example of the problem and desired result; dt["betName"]]
Try this (updated) code:
df2=df['betName'].str.split(r' (?=[+-]\d{1,}\.?\d{,}?)', expand=True).astype('str')
Explanation. You can use str.split to split a text in the rows into 2 or more columns by regular expression:
(?=[+-]\d{1,}\.?\d{,}?)
' ' - Space char is the first.
() - Indicates the start and end of a group.
?= - Lookahead assertion. Matches if ... matches next, but doesn’t consume any of the string.
[+-] - a set of characters. It will match + or -.
\d{1,} - \d is a digit from 0 to 9 and {min,max} bounds the number of repetitions. Here it means one or more digits, e.g. 1, 200, 4000.
\.? - \. for a dot and ? - 0 or 1 repetitions of the preceding expression group or symbol.
\d{,}? - zero or more digits, matched lazily.
str.split(pattern=None, n=- 1, expand=False)
pattern - string or regular expression to split on. If not specified, split on whitespace
n - number of splits in output. None, 0 and -1 will be interpreted as return all splits.
expand - expand the split strings into separate columns.
True places the split groups into different columns.
False returns a Series/Index of lists of strings.
.astype('str') converts the DataFrame to string type.
The output.
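As a rough check of the split pattern outside pandas, here is a minimal sketch with re.split. The bet names are hypothetical, since the original data was only shown in screenshots.

```python
import re

# Same pattern as in the str.split call above
split_pattern = re.compile(r" (?=[+-]\d{1,}\.?\d{,}?)")

# Hypothetical bet names, not taken from the question's data
bet_names = ["Lakers +3.5", "Celtics -7", "Over 210.5"]

# Split on a space only when it is followed by a signed number
parts = [split_pattern.split(name) for name in bet_names]
print(parts)
```

Note that a value such as "Over 210.5" is left unsplit, because the lookahead requires a + or - sign.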
EDIT: Added a split before doing the regex. This applies the regex only to the cell information that comes after the last white space.
I think you need to extract the bet information with a regular expression.
df["line"] = df["betName"].apply(lambda x: x.split()[-1]).str.extract('([0-9.+-]+)')
Here's how the regex works - the () sets up a capture group, i.e. specifies what information you want to extract.
The part inside the square brackets is a character class; here it matches any digit from 0 to 9, the + and - signs, and a full stop.
The plus sign after the square brackets means one or more repetitions of anything in the character class.

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to know how to capture all the groups that match the pattern, not only the last one.
With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
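The three options described above can be sketched in Python (the question targets Swift, so this is only an illustration of the regex behaviour, not of a Swift API):

```python
import re

text = "HELLO,THERE,WORLD"

# 1. Repeated capture group: only the last iteration is kept.
last_only = re.match(r"^(?:([A-Z]+),?)+$", text).groups()

# 2. Find all matches of the bare element pattern (no anchors, no outer quantifier).
all_words = re.findall(r"[A-Z]+", text)

# 3. One capturing group per expected element.
fixed = re.match(r"^([A-Z]+),([A-Z]+),([A-Z]+)$", text).groups()

print(last_only, all_words, fixed)
```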
The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101
I think you need something like this....
import re
b = "HELLO,THERE,WORLD"
re.findall(r'[\w]+', b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']
After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that matches up to n words, as long as n is predetermined. For instance, to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the following strings, with one, two or three groups captured.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
  regexp_str = "^(#{group_regexp})"
  (count - 1).times.each do
    regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
  end
  regexp_str += "$"
  return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)
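For comparison, a Python sketch of the same generator idea. The function name and keyword arguments simply mirror the Ruby helper above; like that helper, it does not escape the delimiter, which is an assumption that holds for a plain comma.

```python
import re

def make_regexp(group_regexp, count=3, delimiter=","):
    """Build an anchored pattern with one required and count - 1 optional groups."""
    # Each optional part is a non-capturing wrapper around delimiter + capture group.
    optional = f"(?:{delimiter}({group_regexp}))?" * (count - 1)
    return f"^({group_regexp}){optional}$"

pattern = make_regexp("[A-Z]+")
print(pattern)

# Groups that did not take part in the match come back as None.
two_words = re.match(pattern, "HELLO,WORLD").groups()
one_word = re.match(pattern, "HELLO").groups()
print(two_words, one_word)
```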
Just to provide an additional example for paragraph 2 of the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in Groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
    println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD
The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, which can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern needs to be more restrictive, adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my @matches = $string =~ /([^,]+)/g;
say "@matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors, which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]) with the pattern \w+. Or, as in the question, get only the substrings of upper-case ASCII letters with [A-Z]+.
I know my answer comes late, but I ran into this today and solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
The first group, (([A-Z]+),)+, matches all the repeated items except the final one, and ([A-Z]+) matches the final one. The match itself works no matter how many repeated items the string contains, although the repeated group still keeps only its last iteration.
You actually have one capture group that will match multiple times. Not multiple capture groups.
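What that pattern actually captures can be seen with a short Python sketch (Python is used only to inspect the groups; the variable names are illustrative):

```python
import re

m = re.match(r"^(([A-Z]+),)+([A-Z]+)$", "HELLO,THERE,WORLD")

# Groups 1 and 2 sit inside the repeated part, so they hold only the last
# iteration; group 3 is the final word. "HELLO" is matched but not retained.
groups = m.groups()
print(groups)
```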
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
  console.log(match[1]); // do whatever you want with each match
  match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.
Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
  let regex = /^((,?([A-Z]+))+)$/gm
  let m
  while ((m = regex.exec(str)) !== null) {
    matches.unshift(m[3])
    return str.replace(m[2], '')
  }
  return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.
Design a regex that matches each particular element of the list rather than the list as a whole. Apply it with /g.
Iterate through the matches, cleaning them of any garbage such as list separators that got mixed in. You may need another regex, or you can get by with a simple substring replace.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);
Repeat the A-Z pattern in the group of the regular expression.
import re
data="HELLO,THERE,WORLD"
pattern=r"([a-zA-Z]+)"
matches=re.findall(pattern,data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

Regex to python regex

I have a lot of file names with the pattern SURENAME__notalwaysmiddlename_firstnames_1230123Abc123-16x_notalways.pdf, e.g.:
SMITH_John_001322Cde444-16v_HA.pdf
FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf
ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf
My old regex was ([\w]*)_([\w-\w]+)\.\w+ but after switching to Python and getting the first double-barrelled surnames (and even in the first names) I'm unable to get it running.
With the old regex I got two groups:
SMITH_James
001322Cde444-16v_HA
But now I have no clue how to achieve this with re and even include the occasional double-barrelled names in group 1 and the ID in group 2.
([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.
This pattern will return the surname, potential middle name, first name, and final string.
([A-Z-]+) returns an upper-case word that can also contain -
(?:_([A-Za-z-]+))? returns 0 or 1 matches of a word preceded by an _. The (?: makes the _ non-capturing
([A-Za-z-]+) returns a word that can also contain - (A-Za-z is used rather than A-z, because the range A-z also covers [, \, ], ^, _ and `)
(\d.*) returns a string that starts with a digit
\. finds the escaped period right before the file type
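A minimal sketch of applying such a pattern with Python's re to the three sample file names. It uses the class [A-Za-z-] rather than [A-z-], since the range A-z would also cover the underscore used as the separator; the variable names are illustrative.

```python
import re

name_pattern = re.compile(r"([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.")

filenames = [
    "SMITH_John_001322Cde444-16v_HA.pdf",
    "FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf",
    "ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf",
]

# Each tuple is (surname, optional extra name or None, name, id string)
parsed = [name_pattern.match(f).groups() for f in filenames]
for row in parsed:
    print(row)
```

When only one given name is present, the optional second group is None and the name lands in the third group.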

Python Regex behaviour with Square Brackets []

This the text file abc.txt
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
path = r"C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches the last character of the second group? I think it's matching 0 from s0, 1 from s1 and 2 from s2.
But why?
Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
Also, group(0) is the entire match by default:
If a groupN argument is zero, the corresponding return value is the
entire matching string.
So you should skip it, and check group(3) if you want the last one.
Also, you should compile the regexp before the for loop; it improves your parser's performance.
And you can replace (\w)* with (\w*) if you want to match all the characters between the colons.
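Putting these suggestions together, a minimal sketch might look like the following. It reads from an in-memory list instead of the file, and the third group is simplified to (\S*); both are assumptions made for illustration.

```python
import re

# Compiled once, before the loop; the quantifier sits inside each group.
line_pattern = re.compile(r"(\w*):(\w*):(\S*)")

# Stand-in for the lines of abc.txt
lines = [
    "aa:s0:education.gov.in",
    "bb:s1:defence.gov.in",
    "cc:s2:finance.gov.in",
]

parsed = []
for line in lines:
    match = line_pattern.search(line)
    if match:  # skip lines that do not fit the expected shape
        parsed.append(match.groups())

print(parsed)
```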
