Extracting Sub-string Between Two Characters in String in Pandas Dataframe - python

I have a column containing strings that are made up of different words but always have a similar structure. E.g.:
2cm off ORDER AGAIN (191 1141)
I want to extract the sub-string that starts after the second space and ends at the space before the opening bracket/parenthesis. So in this example I want to extract ORDER AGAIN.
Is this possible?

You could use str.extract here:
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
Note that this answer is robust even if the string doesn't have a (...) term at the end.
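To check the robustness claim without a DataFrame, the same pattern can be run through Python's re module. This is only a sketch: the helper name extract_middle and the second and third sample strings are made up for illustration.

```python
import re

# Same pattern as in the answer above: skip two space-separated words,
# then lazily capture up to " (" or the end of the string.
PATTERN = re.compile(r'^\w+ \w+ (.*?)(?: \(|$)')

def extract_middle(text):
    """Return the captured part, or None when the pattern does not match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    "2cm off ORDER AGAIN (191 1141)",  # string from the question
    "2cm off ORDER AGAIN",             # hypothetical: no "(...)" term
    "oneword",                         # hypothetical: too short to match
]
results = [extract_middle(s) for s in samples]
print(results)  # ['ORDER AGAIN', 'ORDER AGAIN', None]
```

In pandas, rows that do not match come back as NaN from str.extract rather than None.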

You can try the following:
r"2cm off ORDER AGAIN (191 1141)".split(r"(")[0].split(" ", maxsplit=2)[-1].strip()
#Out[3]: 'ORDER AGAIN'

If the pattern of data is similar to what you have posted then I think the below code snippet should work for you:
import re
data = "2cm off ORDER AGAIN (191 1141)"
extr = re.match(r".*?\s.*?\s(.*)\s\(.*", data)
if extr:
    print(extr.group(1))  # ORDER AGAIN

You can try the following code
s = '2cm off ORDER AGAIN (191 1141)'
second_space = s.find(' ', s.find(' ') + 1)
openparenthesis = s.find('(')
substring = s[second_space + 1 : openparenthesis - 1]  # skip the space on each side
print(substring) #ORDER AGAIN
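One caveat with this approach is that str.find returns -1 when nothing is found, so slicing on its result can silently produce wrong output. A guarded version might look like the sketch below; the function name and the choice to return None for malformed input are assumptions, not part of the answer above.

```python
def between_second_space_and_paren(s):
    """Return the text after the second space and before '(', or None."""
    first_space = s.find(' ')
    if first_space == -1:
        return None
    second_space = s.find(' ', first_space + 1)
    if second_space == -1:
        return None
    open_paren = s.find('(', second_space + 1)
    if open_paren == -1:
        return None
    # +1 skips the second space itself; strip() drops the space before '('.
    return s[second_space + 1:open_paren].strip()

print(between_second_space_and_paren('2cm off ORDER AGAIN (191 1141)'))  # ORDER AGAIN
print(between_second_space_and_paren('2cm off'))  # None
```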

Related

How can I split a string with no delimiter?

I need to import a CSV file in which all values are in one column, although they should be in 3 different columns.
The value I want to split looks like this: "2020-12-30 13:17:00Mojito5.5". I want it to look like this: "2020-12-30 13:17:00 Mojito 5.5".
I tried different approaches to splitting it, but I get the error "Dataframe object has no attribute 'split'" or something similar.
Any ideas how I can split this?
Assuming you always want to add spaces around a word made up only of letters (no special characters or numbers), you can use this regex:
import re
def add_spaces(m):
    return f' {m.group(0)} '
s = "2020-12-30 13:17:00Mojito5.5"
re.sub('[a-zA-Z]+', add_spaces, s)  # '2020-12-30 13:17:00 Mojito 5.5'
We could use a regex approach here:
import re
inp = "2020-12-30 13:17:00Mojito5.5"
m = re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(\w+?)(\d+(?:\.\d+)?)', inp)
print(m) # [('2020-12-30 13:17:00', 'Mojito', '5.5')]
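To apply the same idea to many rows, one option is a compiled pattern with named groups and re.fullmatch, so that malformed rows are rejected instead of partially matched. This is a sketch under assumptions: the function name split_row, the extra sample rows, and the use of \D+? for the name (which allows spaces but assumes names contain no digits) are not from the answers above.

```python
import re

ROW = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
    r'(?P<name>\D+?)'              # assumption: the name contains no digits
    r'(?P<price>\d+(?:\.\d+)?)'
)

def split_row(raw):
    """Split one fused value into (timestamp, name, price), or None if malformed."""
    m = ROW.fullmatch(raw)
    if m is None:
        return None
    return m.group('timestamp'), m.group('name').strip(), float(m.group('price'))

rows = [
    "2020-12-30 13:17:00Mojito5.5",     # value from the question
    "2020-12-30 13:20:00Pina Colada7",  # hypothetical row
    "garbage",                          # hypothetical malformed row
]
parsed = [split_row(r) for r in rows]
print(parsed)
```

With a DataFrame, a pattern using named groups like this one can also be passed to Series.str.extract, which returns one column per named group.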

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with dots.
So I want to delete '35', '7,000', '10,000', 'mr.', 'rev.'.
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it matches any string that contains a letter or a dot anywhere, not only strings that end with a dot.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also incomplete, as it lacks a "+" and a ",", which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers, the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
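As a quick illustration, this pattern can be used with re.search in a list comprehension to filter the list from the question. The variable names drop and kept are chosen here for the example.

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Matches a trailing dot OR any digit anywhere in the string.
drop = re.compile(r'\.$|\d')

kept = [word for word in a if not drop.search(word)]
print(kept)
# ['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination',
#  'national', 'goal', 'one', 'church', 'every', 'persons']
```

re.search is used rather than re.match because the digit or the dot may appear anywhere in the element, not only at the start.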
You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

Python Regex to extract multiple complex groups

I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
Here A through I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
That is, an arbitrary number of comma-separated pairs, separated by semicolons, with one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re
s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(p.group(1), p.group(2)) for p in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
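As background on why the pattern in the question "fails to group the values correctly": in Python's re module, a capturing group inside a repetition keeps only the text from its last iteration. That is why a second pass with re.finditer is used above. A small sketch using the question's own pattern, with the trailing "&other_text" left off the sample for brevity:

```python
import re

s = "Sample=A,B;C,D;E,F;G,H;I"

# Pattern copied from the question.
m = re.match(r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)', s)
groups = m.groups()
print(groups)
# ('A', 'B', 'C', 'D', 'G,H;', 'G', 'H', 'I')
```

The pair E,F is matched by the repeated group but then overwritten by G,H, so it never appears in the result.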
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re
data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>                 # either a comma-separated pair ending in a semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                         # OR
    (?P<end_part>             # the ending token, which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)
results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without regex by using str.split.
An example:
words = [part.split(',') for part in 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')]
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
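The split-based approach does not by itself validate the input. A possible extension that also enforces the rules stated in the question (at least two pairs, then a single final value) is sketched below. The function name parse_sample and the decision to raise ValueError are assumptions made for this example.

```python
def parse_sample(text):
    """Split 'a,b;c,d;...;final' into ([(a, b), (c, d), ...], final)."""
    *pair_parts, final_val = text.split(';')
    pairs = [tuple(part.split(',')) for part in pair_parts]
    if len(pairs) < 2:
        raise ValueError("expected at least two pairs before the final value")
    if any(len(pair) != 2 or '' in pair for pair in pairs):
        raise ValueError("every pair must have exactly two non-empty values")
    if not final_val or ',' in final_val:
        raise ValueError("the final value must be a single non-empty token")
    return pairs, final_val

pairs, final_val = parse_sample('val11,val12;val21,val22;valn1,valn2;final_val')
print(pairs)      # [('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2')]
print(final_val)  # final_val
```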

Regular expression help to find space after a long string

My code is as follows:
list = re.findall(("PROGRAM S\d\d"), contents)
If I print the list I only get the text up to S51, but I want to capture everything.
I want to find everything like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything until the next space, because there is usually a space after the last character.
Thanks in advance.
You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'
OK, thanks. I found another solution:
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
This matches any run of non-whitespace characters after the digits.
You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This would match PROGRAM S followed by two digits, then followed by any number of non-space characters. If you wanted to stop at any whitespace character rather than only at spaces, then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
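The difference between [^ ]* and \S* shows up when names are separated by newlines or tabs instead of spaces. A short sketch with a made-up contents string:

```python
import re

# Hypothetical input: two program names on separate lines.
contents = "PROGRAM S51_Mix_Station\nPROGRAM S52_Pump end"

# [^ ] excludes only the space character, so the match runs across the newline.
non_space = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
# \S excludes every whitespace character, so each match stops at the newline.
non_whitespace = re.findall(r"PROGRAM S\d\d\S*", contents)

print(non_space)       # ['PROGRAM S51_Mix_Station\nPROGRAM']
print(non_whitespace)  # ['PROGRAM S51_Mix_Station', 'PROGRAM S52_Pump']
```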

How to get sub string from a string in python using split or regex

I have a string in Python like the one below. I want to extract a substring from it.
table='abc_test_01'
number=table.split("_")[1]
I am getting test as a result.
What I want is everything after the first _.
The result I want is test_01. How can I achieve that?
Here is the code already given in other answers:
table='abc_test_01'
number=table.split("_",1)[1]
But this may fail when the separator does not occur in the string; then you'll get IndexError: list index out of range.
For example:
table='abctest01'
number=table.split("_",1)[1]
This raises IndexError, because the separator is not in the string.
So a safer version is:
table.split("_",1)[-1]
Using -1 does no harm, because the number of splits is already limited to one; if there is no underscore, it simply returns the whole string.
Hope it helps :)
To get the substring (all characters after the first occurrence of underscore):
number = table[table.index('_')+1:]
# Output: test_01
You could do it like:
import re
string = "abc_test_01"
rx = re.compile(r'[^_]*_(.+)')
match = rx.match(string).group(1)
print(match)
Or with normal string functions:
string = "abc_test_01"
match = '_'.join(string.split('_')[1:])
print(match)
Nobody mentions that the split() function can take a maxsplit argument:
str.split(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
So the solution is simply:
table.split('_', 1)[1]
You can try this:
Edit: Thanks to @valtah's comment:
table = 'abc_test_01'
#final = "_".join(table.split("_")[1:])
final = table.split("_", 1)[1]
print(final)
Output:
test_01
The answer from @valtah in the comment is also correct:
final = table.partition("_")[2]
print(final)
This outputs the same result.
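The approaches in this thread differ mainly in what happens when the string contains no underscore. The sketch below compares them side by side. The helper after_first_underscore and its choice to return None are assumptions for illustration, not something proposed in the answers.

```python
def after_first_underscore(table):
    """Return the text after the first '_' or None when there is no '_'."""
    _head, sep, tail = table.partition('_')
    return tail if sep else None

for table in ('abc_test_01', 'abctest01'):
    print(
        repr(table.split('_', 1)[-1]),        # whole string if '_' is missing
        repr(table.partition('_')[2]),        # empty string if '_' is missing
        repr(after_first_underscore(table)),  # None if '_' is missing
    )
# 'test_01' 'test_01' 'test_01'
# 'abctest01' '' None
```

By contrast, table.split('_', 1)[1] raises IndexError and table[table.index('_')+1:] raises ValueError when the underscore is missing, which may be preferable if a missing separator should count as an error.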
