Splitting based on particular pattern and editing string - python

I am trying to split a string based on a particular pattern in an effort to rejoin it later after adding a few characters.
Here's a sample of my string: "123\babc\b:123" which I need to convert to "123\babc\\"b\":123". I need to do it several times in a long string. I have tried variations of the following:
regex = r"(\\b[a-zA-Z]+)\\b:"
test_str = "123\\babc\\b:123"
x = re.split(regex, test_str)
but it doesn't split at the right positions for me to join. Is there another way of doing this/another way of splitting and joining?

You're right, you can do it with re.split as suggested. Split on \b, then rebuild the output with a specific separator (keeping the original \b wherever you want to).
Here is an example:
# Import module
import re
string = "123\\babc\\b:123"
# Split on the literal "\b"
list_sliced = re.split(r'\\b', string)
print(list_sliced)
# ['123', 'abc', ':123']
# Define your custom separator
custom_sep = '\\\\"b\\"'
# Build your new output
output = list_sliced[0]
# Iterate over each word
for i, word in enumerate(list_sliced[1:]):
    # Choose the separator according to the parity (since we don't want to change the first "\b")
    sep = "\\\\b"
    if i % 2 == 1:
        sep = custom_sep
    # Update output
    output += sep + word
print(output)
# 123\\babc\\"b\":123

Maybe, the following expression,
^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$
and a replacement of,
\1\2\3\4\5\\"\6\\":\7
with a re.sub might return our desired output.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
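The same re.sub idea can be sketched with a single capture group. This is a minimal sketch, assuming the goal is the literal text 123\babc\\"b\":123 (one backslash before the first b, two before the opening quote). The helper name quote_second_b is made up for illustration. A function is used as the replacement so that the backslashes in the inserted text are not reinterpreted as a replacement template.

```python
import re

def quote_second_b(text):
    # Hypothetical helper: keep "\b<letters>" (group 1) unchanged and
    # rewrite the "\b:" that follows it as \\"b\":
    pattern = re.compile(r'(\\b[a-zA-Z]+)\\b:')
    return pattern.sub(lambda m: m.group(1) + '\\\\"b\\":', text)

test_str = "123\\babc\\b:123"
result = quote_second_b(test_str)
print(result)
```

Because the pattern is applied with sub rather than split, every occurrence in a long string is handled in one pass and nothing has to be rejoined afterwards.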

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example, given
Arun,Mishra,108,23,34,45,56,Mumbai
the output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.'), but that replaces every delimiter comma with a dot.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)
# Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and the count of 1 tells re.sub to replace only the first occurrence (thus leaving 34,45 and the rest untouched).
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for commas. Once the index of a comma has been found, examine the characters on either side (checking for digits). If both are digits, modify the string accordingly:
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split(','), join the 3rd and 4th elements with a dot, and then join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
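Since the question mentions a text file, the split-and-join steps above can be wrapped in a small function and applied line by line. This is only a sketch: merge_columns and its index parameter are hypothetical names, the list below stands in for lines read from a file, and it assumes the two fields to merge sit at the same position on every line.

```python
def merge_columns(line, index=2, sep=","):
    # Hypothetical helper: join field `index` and the one after it with a dot.
    fields = line.rstrip("\n").split(sep)
    if len(fields) > index + 1:
        # Slice assignment replaces the two fields with one merged field
        fields[index:index + 2] = [fields[index] + "." + fields[index + 1]]
    return sep.join(fields)

# Stand-in for the lines of the text file
lines = [
    "Arun,Mishra,108,23,34,45,56,Mumbai",
]
fixed = [merge_columns(line) for line in lines]
print(fixed)
```

Lines with too few fields are returned unchanged rather than raising an error.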
This changes every comma to a dot if the comma has a digit immediately before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Replace only when the comma is not at either end and has a digit on both sides
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)
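For comparison, the same "comma with a digit on both sides" rule can be written as one re.sub call with lookarounds. This is a sketch using the same sample text. The lookbehind and lookahead only check for the neighbouring digits and do not consume them, so no capture groups are needed.

```python
import re

txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
# (?<=\d) and (?=\d) require a digit on each side of the comma without consuming it
fixed = re.sub(r"(?<=\d),(?=\d)", ".", txt)
print(fixed)
```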

Extract a portion of string from another string using regex

Let's assume I have a string as follows:
s = '23092020_indent.xlsx'
I want to extract only indent from the above string. Now there are many approaches:
# Via re.split()
s_f = re.split('_ |. ',s)  # <--- This returns only the whole string s. Not the desired output
# Via re.findall()
s_f = re.findall(r'[A-Za-z]',s,re.I)
s_f
['i','n','d','e','n','t','x','l','s','x']
s_f = ''.join(s_f)  # <--- This returns 'indentxlsx'. Not the desired output
Am I missing anything? Or do I need to use regex at all?
P.S. In s, only the '.' delimiter is constant. All the other delimiters may change.
Use os.path.splitext and then str.split:
import os
name, ext = os.path.splitext(s)
name.split("_")[1] # If the position is always fixed
Output:
"indent"
I LOVE regexes, so that's definitely the way I'd go.
The exactly right answer requires more information about all possible input strings and what should be extracted from each of them. Here's a solution that assumes:
1. one or more digits, then
2. a single underscore, then
3. a group of chars not containing a '.', then
4. a '.', then
5. anything besides a '.', but at least one char
The #3 part is captured.
import re
s = '23092020_indent.xlsx'
exp = re.compile(r"^\d+_(.*?)\.[^.]+$")
m = exp.match(s)
if m:
    print(m.group(1))
Result:
indent
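As for why the re.split attempt in the question returns the string unchanged: in the pattern '_ |. ' the spaces are literal and the dot is an unescaped wildcard, so it only matches "underscore then space" or "any character then space", neither of which occurs in s. A minimal sketch of a split that does work, assuming the only delimiters are _ and .:

```python
import re

s = '23092020_indent.xlsx'
# A character class matches either delimiter; inside [...] the dot is literal
parts = re.split(r'[_.]', s)
print(parts)
print(parts[1])
```

This relies on the wanted part always being the second field, the same assumption the os.path.splitext answer makes.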

How to partial split and take the first portion of string in Python?

I have a scenario where I want to split a string partially and pick up one portion of it.
The string could be like aloha_maui_d0_b0 or new_york_d9_b10. Note: what follows d is numeric and can be any length.
I want to strip everything before _d*, i.e. I want only _d0_b0 or _d9_b10.
I tried the code below, but obviously it removes the split term as well.
print(("aloha_maui_d0_b0").split("_d"))
#Output is : ['aloha_maui', '0_b0']
#But Wanted : _d0_b0
Is there any other way to get the partial portion? Do I need to try out in regexp?
How about just
stArr = "aloha_maui_d0_b0".split("_d")
st2 = '_d' + stArr[1]
This should do the trick if the string always has a '_d' in it
You can use index() to split in 2 parts:
s = 'aloha_maui_d0_b0'
idx = s.index('_d')
l = [s[:idx], s[idx:]]
# l = ['aloha_maui', '_d0_b0']
Edit: You can also use this if you have multiple _d in your string:
s = 'aloha_maui_d0_b0_d1_b1_d2_b2'
idxs = [n for n in range(len(s)) if n == 0 or s.find('_d', n) == n]
parts = [s[i:j] for i,j in zip(idxs, idxs[1:]+[None])]
# parts = ['aloha_maui', '_d0_b0', '_d1_b1', '_d2_b2']
I have two suggestions.
partition()
Use the method partition() to get a tuple containing the delimiter as one of the elements and use the + operator to get the String you want:
teste1 = 'aloha_maui_d0_b0'
partitiontest = teste1.partition('_d')
print(partitiontest)
print(partitiontest[1] + partitiontest[2])
Output:
('aloha_maui', '_d', '0_b0')
_d0_b0
The partition() method returns a tuple with the first element being what is before the delimiter, the second being the delimiter itself and the third being what is after the delimiter.
The method applies only to the first occurrence of the delimiter it finds in the String, so you can't use it to split into more than 3 parts without extra work in the code. For that, my second suggestion would be better.
replace()
Use the method replace() to insert an extra character (or characters) right before your delimiter (_d) and use these as the delimiter on the split() method.
teste2 = 'new_york_d9_b10'
replacetest = teste2.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
new_york|_d9_b10
['new_york', '_d9_b10']
Since it replaces every occurrence of _d in the String with |_d, there is no problem using it to split into more than 2 parts.
Problem?
One situation where you need to be careful is unwanted splits caused by _d being present in more places than anticipated.
Following the apparent logic of your examples with city names and numericals, you might have something like this:
teste3 = 'rio_de_janeiro_d3_b32'
replacetest = teste3.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
rio|_de_janeiro|_d3_b32
['rio', '_de_janeiro', '_d3_b32']
Assuming you always have the numerical on the end of the String and _d won't happen inside the numerical, rpartition() could be a solution:
rpartitiontest = teste3.rpartition('_d')
print(rpartitiontest)
print(rpartitiontest[1] + rpartitiontest[2])
Output:
('rio_de_janeiro', '_d', '3_b32')
_d3_b32
Since rpartition() starts the search on the String's end and only takes the first match to separate the terms into a tuple, you won't have to worry about the first term (city's name?) causing unexpected splits.
Use regex's split and its ability to keep delimiters:
import re
# note the parentheses around the pattern: the capture group is what makes split() keep the delimiter
patre = re.compile(r"(_d\d)")
for line in """aloha_maui_d0_b0 new_york_d9_b10""".split():
    parts = patre.split(line)
    print("\n", line)
    print(parts)
    p1, p2 = parts[0], "".join(parts[1:])
    print(p1, p2)
output:
aloha_maui_d0_b0
['aloha_maui', '_d0', '_b0']
aloha_maui _d0_b0
new_york_d9_b10
['new_york', '_d9', '_b10']
new_york _d9_b10
credit due: https://stackoverflow.com/a/15668433
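If only the trailing part is needed, a single re.search can return it directly. This is a minimal sketch, assuming the marker is always _d followed by at least one digit. The helper name suffix_from_d is made up for illustration. Requiring a digit after _d also avoids the unwanted match inside names such as rio_de_janeiro.

```python
import re

def suffix_from_d(text):
    # Hypothetical helper: return everything from the first "_d<digits>" onward,
    # or None when the string has no such marker.
    m = re.search(r'_d\d+.*', text)
    return m.group(0) if m else None

for name in ["aloha_maui_d0_b0", "new_york_d9_b10", "rio_de_janeiro_d3_b32"]:
    print(suffix_from_d(name))
```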

Python Regex to extract multiple complex groups

I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
where A through I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
arbitrary number of comma separated pairs which are separated by semicolon, and one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This clearly needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re
s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(m.group(1), m.group(2)) for m in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re
data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>           # either comma separated ending in semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                   # OR
    (?P<end_part>       # the ending token which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)
results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using str.split.
An example:
words = list(map(lambda x : x.split(','), 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')))
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
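The split-based approach can also cover the validation the question asks for (at least two pairs, then one final value). This is a rough sketch under stated assumptions: parse_sample is a hypothetical helper, values are taken to be any non-empty text without , or ;, and the input is assumed to end right after the final value, with any trailing text such as &other_text already removed.

```python
def parse_sample(text, prefix="Sample="):
    # Hypothetical helper: return (pairs, final_val), or None if the text is invalid.
    if not text.startswith(prefix):
        return None
    # Everything before the last ";" is a pair chunk, the rest is the final value
    *pair_chunks, final_val = text[len(prefix):].split(";")
    pairs = [chunk.split(",") for chunk in pair_chunks]
    # Need at least two pairs, each made of exactly two non-empty values
    if len(pairs) < 2 or any(len(p) != 2 or not all(p) for p in pairs):
        return None
    # The final value must be non-empty and must not itself be a pair
    if not final_val or "," in final_val:
        return None
    return [tuple(p) for p in pairs], final_val

print(parse_sample("Sample=A,B;C,D;E,F;G,H;I"))
print(parse_sample("Sample=A,B;I"))
```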

Pandas to match column contents to keywords (with spaces and brackets )

A column in my data frame contains text that I want to match against a list of keywords.
I want to check whether each row contains any of the keywords. If yes, print them.
Tried below:
import pandas as pd
import re
Keywords = [
    "Caden(S, A)",
    "Caden(a",
    "Caden(.A))",
    "Caden.Q",
    "Caden.K",
    "Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print(df["found"])
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the ideal output below? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q
Since you do not need to find every keyword, only the longest one where keywords overlap, you can use a regex with findall.
The point here is that you first need to sort the keywords by length in descending order (so that longer alternatives are tried before their shorter prefixes), then escape them since they contain special characters, and finally replace \b with the unambiguous word boundaries (?<!\w) and (?!\w) (note that \b is context dependent).
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See an online Python test:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
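To see why \b is context dependent here, consider a keyword that ends in ")". \b only matches between a word character and a non-word character, and the position between ")" and a following space has non-word characters on both sides. A small sketch using one keyword and the first sample string (the variable names are only for illustration):

```python
import re

text = "Caden(S, A) Charlotte.A, Caden.K;"
kw = re.escape("Caden(S, A)")

# \b after ")" fails: ")" and the following space are both non-word characters
with_b = re.findall(r"\b" + kw + r"\b", text)
# (?<!\w) and (?!\w) only require that no word character is adjacent
with_lookarounds = re.findall(r"(?<!\w)" + kw + r"(?!\w)", text)

print(with_b)
print(with_lookarounds)
```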
Hey, I don't know if this solution is optimal, but it works. I just replaced '.' with 8, '(' with 6 and ')' with 9. I don't know why those characters are ignored by str.findall.
It's a kind of bijection between {8,6,9} and {'.','(',')'}:
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('(','6').replace(')','9').replace('.','8')
Then you apply your function:
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
Final step: get back the {'.','(',')'}:
for i in range(len(df['found'])):
    df.loc[i, 'found'] = df.loc[i, 'found'].replace('6','(').replace('9',')').replace('8','.')
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('6','(').replace('9',')').replace('8','.')
Voilà
