Extracting multiple substrings from one string

I have the following string which I am parsing from another file :
"CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"
What I want to do is split them up into individual values, which I achieve using the .split() function. So I get them as an array:
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
Now I want to further split them into 3 segments, and store them in 3 other variables so I can write them to excel as such :
a = CHEM1
b = 5
c = GL
for the first element, then I will loop back for the second element:
a = CH3M2
b = 55
c = LB
and finally :
a = CHEM3954114
b = 50
c = KG
I am unsure how to go about that as I am still new to Python. To the best of my knowledge, I would have to iterate multiple times with the split function, but I believe there has to be a better way to do it than that.
Thank you.

You should use the re package:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
# raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")
for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
        print(a, b, c)
FOLLOW UP:
The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case.
Essentially, there are 3 groups of characters you want to extract:
All the characters (letters and numbers) up to the ( - not included
The digits after the (
The letters after the digits extracted in the previous step - up to the ) - not included.
A group is anything enclosed in parentheses (). This can be confusing in this specific case because, as stressed above, the string itself contains parentheses, and those need to be escaped with a \ to distinguish them from the ones that delimit groups in the regular expression.
The first group is ([^\(]+), which means: match one or more characters that are not ( (the ^ negates the character class; the ( is escaped here for consistency, although inside a character class the escape is optional). Note that these characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more restrictive if you need to (for example, allowing only letters, digits and underscores with \w+).
The second group is (\d+), which is essentially matching 1 or more (expressed with +) digits (expressed with \d).
The last group is (.+) - match any remaining characters, with the final \) making sure that you match any remaining characters up to the closing bracket.
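The three groups described above can also be given names, which some readers find easier to follow than numbered groups. This is only a sketch: the group names name, qty and unit and the helper parse_item are made up for illustration, and it assumes the unit consists of letters only, which is stricter than (.+).

```python
import re

# Named groups mirror the three pieces described above.
# Assumption: the unit is letters only ([A-Za-z]+), stricter than (.+).
ITEM = re.compile(r"(?P<name>[^(]+)\((?P<qty>\d+)(?P<unit>[A-Za-z]+)\)")

def parse_item(text):
    """Return (name, qty, unit) for one item, or None if it does not match."""
    m = ITEM.fullmatch(text)
    if m is None:
        return None
    return m.group("name"), int(m.group("qty")), m.group("unit")

for item in ["CHEM1(5GL)", "CH3M2(55LB)", "CHEM3954114(50KG)"]:
    print(parse_item(item))
```

Using fullmatch rather than search means trailing junk after the closing bracket is rejected instead of silently ignored; whether that is desirable depends on how clean the input file is.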

Using re.findall we can try:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)
# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
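Since re.findall scans its whole input, the same pattern can probably be run on the original unsplit line as well. The sketch below assumes the line looks exactly like the one in the question; the variable names line and rows are illustrative only.

```python
import re

line = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# findall returns one (name, qty, unit) tuple per item in the line,
# so the earlier split() step is not needed; qty is converted to int.
rows = [(name, int(qty), unit)
        for name, qty, unit in re.findall(r"(\w+)\((\d+)(\w+)\)", line)]

for a, b, c in rows:
    print(a, b, c)
```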

Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.
Here is the function I wrote.
def deconstruct(chem):
    name = chem[:chem.index('(')]
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # Use int(mag) instead of float(mag) if the magnitude is always a whole number.
    return {"name": name, "mag": float(mag), "unit": unit}
Usage:
x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for chem in x:
    d = deconstruct(chem)
    print(d["name"], d["mag"], d["unit"])

Use re and create a list of dictionaries
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
keys = ['a', 'b', 'c']
y = []
for s in x:
    vals = re.sub(r'(.*?)\((\d*)(.*?)\)', r'\1 \2 \3', s).split()
    y.append(dict(zip(keys, vals)))
for i in y:
    print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c']))
gives
a: CHEM1
b: 5
c: GL
a: CH3M2
b: 55
c: LB
a: CHEM3954114
b: 50
c: KG

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example:
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is:
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.'), but that replaces every comma, including the delimiters, with a dot.

You can use regex for these kinds of tasks:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)
# Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34, 45 and the rest).
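A possible variation, not part of the answer above, is to use lookarounds so that the digits are only checked and never consumed. The replacement then does not need any group references. This is a sketch under the same assumption that only the first digit-comma-digit occurrence should change.

```python
import re

old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'

# (?<=\d) and (?=\d) assert a digit on each side without consuming it,
# so the replacement is just the dot; count=1 stops after the first hit.
new_str = re.sub(r'(?<=\d),(?=\d)', '.', old_str, count=1)
print(new_str)
```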

It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai

Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. Having a generic regex to replace any comma between two subsequent sets of digits might give false matches on other data however so I think explicitly matching the data based on the exact columns you have will be the safest way to do it.
You can take the above regex and use it in your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(
    r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)',
    r'\1,\2,\3.\4,\5,\6,\7,\8',
    inLine,
)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.

First split the string using s.split(','), then join the third and fourth elements (indices 2 and 3) with a dot, remove the leftover element, and join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
print(s)
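For a whole file the same split-and-join idea could be wrapped in a small function and applied line by line. The helper name merge_columns and its parameters are hypothetical, and the sketch assumes every line uses a plain comma as delimiter with no quoted fields.

```python
def merge_columns(line, index, sep=",", joiner="."):
    """Join field `index` and the next field with `joiner` (hypothetical helper)."""
    fields = line.strip().split(sep)
    if index + 1 >= len(fields):
        return line.strip()  # nothing to merge; leave the line unchanged
    # Replace the two neighbouring fields with their joined form.
    fields[index:index + 2] = [fields[index] + joiner + fields[index + 1]]
    return sep.join(fields)

lines = ["Arun,Mishra,108,23,34,45,56,Mumbai"]
for line in lines:
    print(merge_columns(line, 2))
```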

This changes every comma to a dot if the comma has a digit immediately before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Replace only when the comma is not at either end and has a digit on both sides.
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isdigit() and txt[commaindex+1].isdigit():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

How to remove substring from string in pandas column that contain both numbers and chars

I know that if we have strings like this
May21
James
Adi22
Hello
Girl90
zt411
We can use regex with \d+ to remove all the numbers. But how would I remove the entire string if it contains both letters and numbers? The only values kept from the list above would then be James and Hello.
I can do this for just one string:
c = 'xterm has been replaced new mac 008064c79202'
c = ' '.join(w for w in c.split() if not any(x.isdigit() for x in w))
c
How would I apply this across an entire dataframe?

You can apply your function to a column as follows:
import pandas as pd
df = pd.DataFrame(['May21', 'James', 'Adi22', 'Hello', 'Girl90', 'zt411'], columns=['word'])
def remove_semi_nums(c):
    return ' '.join(w for w in c.split() if not any(x.isdigit() for x in w))
# option A: list comprehension (I like this better)
df['word'] = [remove_semi_nums(x) for x in df.word]
# option B: use `apply`, which I don't recommend for big data sets because it's slow. (It is also cumbersome for functions that take multiple columns as args.)
df['word'] = df['word'].apply(remove_semi_nums)

Use a regular expression like
(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$
with Series.str.match. Details:
^ (implicit in .match): start of string
(?:[A-Za-z]+\d|\d+[A-Za-z]) - either one or more letters and then a digit or one or more digits and then a letter
[A-Za-z\d]+ - one or more letters or digits
$ - end of string.
See the Pandas test:
import pandas as pd
df = pd.DataFrame(['May21', 'James', 'Adi22', 'Hello', 'Girl90', 'zt411'], columns=['word'])
df[df['word'].str.match(r'(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$')] = ""
>>> df
    word
0
1  James
2
3  Hello
4
5
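The pattern can be tried out without pandas first. The sketch below uses plain re.match, which anchors at the start of the string just as Series.str.match does; the names MIXED, words and kept are illustrative. One thing it makes visible: as written, the pattern requires at least three characters, so a two-character value such as a1 is not treated as mixed.

```python
import re

# Same pattern as above; re.match anchors at the start of each string.
MIXED = re.compile(r'(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$')

words = ['May21', 'James', 'Adi22', 'Hello', 'Girl90', 'zt411']
kept = [w for w in words if not MIXED.match(w)]
print(kept)

# The trailing [A-Za-z\d]+ needs at least one more character,
# so very short mixed strings slip through.
short_is_mixed = bool(MIXED.match('a1'))
print(short_is_mixed)
```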

Pandas to match column contents to keywords (with spaces and brackets )

A column in my data frame contains text that I want to match against a list of keywords.
I want to check whether each row contains any of the keywords. If yes, print them.
I tried the following:
import pandas as pd
import re
Keywords = [
    "Caden(S, A)",
    "Caden(a",
    "Caden(.A))",
    "Caden.Q",
    "Caden.K",
    "Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print(df["found"])
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the ideal output shown below? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q

Since you do not need every keyword but only the longest one where keywords overlap, you can use a regex with findall.
The point here is that you first need to sort the keywords by length in descending order (so that longer alternatives are tried before the shorter keywords they start with), then escape them because they contain special characters, and finally replace \b with the unambiguous word boundaries (?<!\w) and (?!\w) (note that \b is context dependent).
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See an online Python test:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
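To see why \b is called context dependent above, it may help to compare the two boundary styles on a keyword that ends in a bracket. This is a small sketch with illustrative variable names; \b only matches between a word character and a non-word character, so it cannot match after a closing bracket at the end of the string.

```python
import re

keyword = re.escape("Caden(.A))")
text = "Grayson.Q, Lily; Caden(.A))"

# \b after ")" fails here: ")" is a non-word character and nothing follows it.
with_b = re.findall(r"\b" + keyword + r"\b", text)

# The lookarounds only require that no word character touches the keyword.
with_lookarounds = re.findall(r"(?<!\w)" + keyword + r"(?!\w)", text)

print(with_b)
print(with_lookarounds)
```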

I don't know if this solution is optimal, but it works. I replaced '.' with 8, '(' with 6 and ')' with 9. Those characters have a special meaning in a regular expression, which is why the unescaped keywords do not match literally in str.findall.
It is a kind of bijection between {8,6,9} and {'.','(',')'}.
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('(','6').replace(')','9').replace('.','8')
And then you apply your function
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
The final step is to restore the {'.','(',')'}
for i in range(len(df['found'])):
    df.loc[i, 'found'] = df.loc[i, 'found'].replace('6','(').replace('9',')').replace('8','.')
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('6','(').replace('9',')').replace('8','.')
Voilà

How can I split a string at the first occurrence of a letter in Python?

I have a series of strings in the following format. Demonstration examples would look like this:
71 1 * abwhf
8 askg
*14 snbsb
00ab
I am attempting to write a Python 3 program that will use a for loop to cycle through each string and split it once at the first occurrence of a letter into a list with two elements.
The output for the strings above would become lists with the following elements:
71 1 * and abwhf
8 and askg
*14 and snbsb
00 and ab
There is supposed to be a space at the end of the first part in the first three examples, but it only shows in the editor.
How can I split the string in this way?
Two posts look of relevance here:
Splitting on first occurrence
Python: Split a string at uppercase letters
The first answer for the first question allows me to split a string at the first occurrence of a single character but not multiple characters (like all the letters of the alphabet).
The second allows me to split at the first letter, but not just one time. Using this would result in an array with many elements.

Using re.search:
import re
strs = ["71 1 * abwhf", "8 askg", "*14 snbsb", "00ab"]
first_letter = re.compile(r"[^\W\d]")
def split_on_letter(s):
    match = first_letter.search(s)
    return [s[:match.start()], s[match.start():]]
for s in strs:
    print(split_on_letter(s))
The regex [^\W\d] matches alphabetical characters (and, strictly speaking, the underscore, since \w includes it).
\W matches all non-word characters and \d matches all digits. The ^ at the beginning of the set inverts the selection to match everything that is neither a non-word character nor a digit, which leaves the letters.
search scans the string for the first occurrence of the pattern. You can slice the original string at the start of the match to get the two parts.
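Because \w also covers the underscore, [^\W\d] would treat _ as a letter. A variant that excludes it and also copes with strings that contain no letter at all might look like the sketch below. The function name split_on_first_letter and the choice to return a one-element list when nothing matches are assumptions, not part of the answer above.

```python
import re

# [^\W\d_] = word characters minus digits and underscore, i.e. letters
# (including non-ASCII letters, since str patterns are Unicode-aware).
FIRST_LETTER = re.compile(r"[^\W\d_]")

def split_on_first_letter(s):
    """Split s once at its first letter; return [s] if it has no letter."""
    m = FIRST_LETTER.search(s)
    if m is None:
        return [s]
    return [s[:m.start()], s[m.start():]]

for s in ["71 1 * abwhf", "8 askg", "*14 snbsb", "00ab", "1234"]:
    print(split_on_first_letter(s))
```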

The only way I can think of is to write the function yourself:
import string
def split_letters(old_string):
    index = -1
    for i, char in enumerate(old_string):
        # string.letters only exists in Python 2; Python 3 uses string.ascii_letters
        if char in string.ascii_letters:
            index = i
            break
    else:
        raise ValueError("No letters found")  # or return old_string
    return [old_string[:index], old_string[index:]]

Use re.split()
import re
strings = [
    "71 1 * abwhf",
    "8 askg",
    "*14 snbsb",
    "00ab",
]
for string in strings:
    a, b, c = re.split(r"([a-z])", string, maxsplit=1, flags=re.I)
    print(repr(a), repr(b + c))
Produces:
'71 1 * ' 'abwhf'
'8 ' 'askg'
'*14 ' 'snbsb'
'00' 'ab'
The trick here is we're splitting on any letter but only asking for a single split. By putting the pattern in parentheses, we save the split character which would normally be lost. We then add the split character back onto the front of the second string.

sample1 = '71 1 * abwhf'
sample2 = '8 askg'
sample3 = '*14 snbsb'
sample4 = '00ab'
sample5 = '1234'
def split_at_first_letter(txt):
    for value in txt:
        if value.isalpha():
            result = txt.split(value, 1)
            return [result[0], '{}{}'.format(value, result[1])]
    return [txt]
print(split_at_first_letter(sample1))
print(split_at_first_letter(sample2))
print(split_at_first_letter(sample3))
print(split_at_first_letter(sample4))
print(split_at_first_letter(sample5))
Result
['71 1 * ', 'abwhf']
['8 ', 'askg']
['*14 ', 'snbsb']
['00', 'ab']
['1234']

Python - Parse strings with variable repeating substring

I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt

Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> pattern = r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b'
>>> for m in re.finditer(pattern, s):
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
Or using a list comprehension (with a fresh iterator, since the one used by the loop above is exhausted)
>>> [a+x+c for a, b, c in (m.groups() for m in re.finditer(pattern, s)) for x in b.split(',')]
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
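One detail worth checking when trying this pattern: re.findall returns the contents of the capturing group rather than the whole match whenever the pattern contains a group, so (,\d) hides the document numbers. The sketch below, with illustrative variable names, uses re.finditer and group(0) to get the full matches instead; making the group non-capturing with (?:,\d) would be another option.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
pattern = r'\d{6}-\d(,\d)*-\d{3}'

# With a capturing group, findall yields only that group's last repetition.
groups_only = re.findall(pattern, s)

# group(0) is always the complete match, whatever groups the pattern has.
full = [m.group(0) for m in re.finditer(pattern, s)]

print(groups_only)
print(full)
```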

I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list will allow you to write a single function (myfunction below, which you would write yourself) to parse and fix each element; you can then push the result onto a new list and display your string.
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item.strip())
    newList.append(newItem)
newstring = ','.join(newList)

(\d{6}-)((?:\d,?)+)(-\d{3})
We take 3 capturing groups. We match the first part and last part the easy way. The center part is a digit with an optional ',', repeated one or more times. A repeated capturing group would only keep its last repetition, so the inner group is made non-capturing with ?: and the surrounding group captures the whole run. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split it up and join the pieces, but a list comprehension should be able to do that.
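The list comprehension mentioned above could look roughly like this. It is a sketch rather than the answerer's own code; the names head, digits, tail and expanded are chosen for illustration.

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# For each (head, digits, tail) tuple, emit one document number per digit.
expanded = [head + digit + tail
            for head, digits, tail in p.findall(s)
            for digit in digits.split(',')]
print(expanded)
```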
