Parsing Korean text into a list using regex - python

I have some data stored as a pandas DataFrame, and one of the columns contains text strings in Korean. I would like to process each of these text strings as follows:
my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
into a comma-separated string like this:
parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'
So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word or several words separated by commas) and replace them with all the words (before and inside the parentheses) separated by commas, for later processing. If a word is followed by parentheses containing numbers (as with 7/22 here), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of the words as they appeared in the original string.
I can extract text in parentheses by using regex as follows:
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)
which yields this:
[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]
But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.

You can use re.findall with a pattern that matches each item and, optionally, a trailing parenthesized group that contains a digit:
corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
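The matches from this pattern keep the space that followed each comma, so a small clean-up step is still needed to reach the target string. A minimal sketch, assuming the goal is a single comma-separated string (the names items and parsed_text are only for illustration):

```python
import re

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'

# Each match is one item, optionally followed by a parenthesized group that contains a digit.
items = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)

# Strip the leading spaces left over from ", " and drop anything empty after stripping.
items = [item.strip() for item in items if item.strip()]

# Join the cleaned items back into one string, preserving their original order.
parsed_text = ', '.join(items)
print(parsed_text)
```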

It's a little bit clumsy, but you can try:
my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# You can then join the list back into a string.
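The split-based approach can be wrapped in a small function so it is easy to apply to many strings. This is only a sketch: flatten_parentheses is a hypothetical helper name, and the sample input is shortened from the question.

```python
import re

def flatten_parentheses(text):
    """Split on commas and on parentheses that do not wrap a number (hypothetical helper)."""
    # Split on "(" not followed by a digit, ")" not preceded by a digit, or a comma.
    parts = re.split(r"\((?!\d)|(?<!\d)\)|,", text)
    # Strip whitespace and discard the empty strings left between adjacent delimiters.
    cleaned = [p.strip() for p in parts if p.strip()]
    return ", ".join(cleaned)

print(flatten_parentheses('코로나음성(활력저하), 눈곱심함(7/22)'))
```

Note that the lookarounds only inspect the character right next to each parenthesis, so a group that starts with a digit but ends with text (or the reverse) would be split on one side only. For a pandas column, a function like this could be passed to Series.apply.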

Related

How to retain delimiter within list item python

I'm writing a program which jumbles clauses within a text using punctuation marks as delimiters for when to split the text.
At the moment my code has a large list where each item is a group of clauses.
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
    clause_split = re.split('[,;:".?!]', i)
    clause_split.remove(clause_split[len(clause_split)-1])
    for x in range(0, len(clause_split)):
        clause_split_content.append(clause_split[x])
shuffle(clause_split_content)
print(*clause_split_content, sep='')
At the moment the result jumbles the text without retaining the punctuation that is used as the delimiter to split it.
The output would be something like this:
a test this also this is a test is
I want to retain the punctuation within the final output so it would look something like this:
a test! this, also. this: is. a test? is;
I think you are simply using the wrong re function for your purpose. split() discards the separators, but you can use another function, e.g. findall(), to select each clause together with its separator. For example, the following code creates your desired output:
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
    words_with_separator = re.findall(r'([^,;:".?!]*[,;:".?!])\s?', i)
    clause_split_content.extend(words_with_separator)
shuffle(clause_split_content)
print(*clause_split_content, sep=' ')
Output:
this, this: is. also. a test! a test? is;
The pattern ([^,;:".?!]*[,;:".?!])\s? simply takes all characters that are not a separator until a separator is seen. These characters are all in the matching group, which creates your result. The \s? is only to get rid of the space characters in between the words.
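To see what the pattern returns before any shuffling, it can be run on a single input string. A small sketch (the variable names pattern and clauses are only for illustration); because the pattern has exactly one capturing group, findall() returns the group contents rather than the whole match, which is why the trailing space is not included:

```python
import re

pattern = r'([^,;:".?!]*[,;:".?!])\s?'

# The group captures the clause plus its separator; \s? consumes the following space.
clauses = re.findall(pattern, "this, is. a test?")
print(clauses)  # ['this,', 'is.', 'a test?']
```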
Here's a way to do what you've asked:
import re
from random import shuffle
text = ["this, is. a test?", "this: is; also. a test!"]
content = [y for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]
shuffle(content)
print(*content, sep=' ')
Output:
is; is. also. a test? this, a test! this:
Explanation:
the regex pattern r'([^,;:".?!]*[,;:".?!])' matches 0 or more non-separator characters followed by a separator character, and findall() returns a list of all such non-overlapping matches
the list comprehension iterates over the input strings in list text and has an inner loop that iterates over the findall results for each input string, so that we create a single list of every matched pattern within every string.
shuffle and print are as in your original code.
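Because this pattern has no trailing \s?, every match after the first in a string keeps the space that preceded it, so printing with sep=' ' can produce doubled spaces. One possible variation, sketched under the assumption that surrounding whitespace is never wanted, strips each match inside the comprehension (ordered is just a copy kept to show the pre-shuffle result):

```python
import re
from random import shuffle

text = ["this, is. a test?", "this: is; also. a test!"]
pattern = r'[^,;:".?!]*[,;:".?!]'

# Collect every clause with its separator, stripping the leading space from each match.
content = [clause.strip() for line in text for clause in re.findall(pattern, line)]

ordered = list(content)  # copy of the clauses in their original order
shuffle(content)         # shuffles in place
print(*content, sep=' ')
```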

Split string with defined and undefined character

I have a text file which looks like
text='\n> lefortoff\n> donna_marta\n> agizatullina\n> shshifter\n< bagira\n< recoder'
and I would like to split it by every \n but also skipping > and < and spaces after them.
I'm doing it via this code
names = text.split('\n> ')
last_names = names[-1].split('\n< ')
names = names[1:-1]
names.extend(last_names)
but I wonder if there is a simpler way of doing this, with pseudocode like:
text.split('\n%s1%s2', %s1 = undefined, %s2 = undefined)
so that s1 can be either > or <, and s2 would be a space.
You can split on the regex \n[<>]\s.
Here [<>] plays the role of your s1, and \s the role of your s2.
Explanation:
\n to match \n
[<>] character class to match either < or >
\s to match also the trailing space so that the extracted strings have no extra spaces
import re
re.split(r'\n[<>]\s', text)
#output
['',
'lefortoff',
'donna_marta',
'agizatullina',
'shshifter',
'bagira',
'recoder']
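The empty string at the start of that result appears because the text begins with a delimiter. If it is not wanted, one option is to filter it out; a minimal sketch using the same text and pattern:

```python
import re

text = '\n> lefortoff\n> donna_marta\n> agizatullina\n> shshifter\n< bagira\n< recoder'

# Split on newline + "<" or ">" + one whitespace character, then drop empty strings.
names = [name for name in re.split(r'\n[<>]\s', text) if name]
print(names)
```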
@SeaBean's answer is the most general and would be my first choice. But if you really only have two expressions to split on, you could use a nested loop with str.split(), like this:
text='\n> lefortoff\n> donna_marta\n> agizatullina\n> shshifter\n< bagira\n< recoder'
last_names = [
    n
    for part in text.split('\n> ')
    for n in part.split('\n< ')
]
print(last_names)
# ['', 'lefortoff', 'donna_marta', 'agizatullina', 'shshifter', 'bagira', 'recoder']
This uses the fact that if you split on one of the delimiters, the resulting parts will sometimes be lists of names with the other delimiter between them. So we split first on one delimiter, then take each part and split on the other delimiter (if needed).
Conceptually this is a nested structure (a list of parts, each split into its own list of names), but because the comprehension yields every name from the inner loop directly, the result is a flat list.
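The intermediate steps can be made visible by doing the two splits separately. This is only an illustration of the same idea; parts, nested and flat are names introduced here:

```python
text = '\n> lefortoff\n> donna_marta\n> agizatullina\n> shshifter\n< bagira\n< recoder'

# First split: the last part still contains the names separated by '\n< '.
parts = text.split('\n> ')

# Second split: every part becomes its own list of names.
nested = [part.split('\n< ') for part in parts]

# Flattening the nested lists gives the same result as the single comprehension.
flat = [n for sub in nested for n in sub]
print(nested)
print(flat)
```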

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to replace all of these instances with an empty string, "". Is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
Assuming that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
It should print:
xyz
Without regex. Keep in mind that this solution removes all digits, not only the ones at the end of the string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
It should print:
xyz
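If the markers really are a literal caret followed by a number, as the question describes, a single re.sub call can remove them wherever they occur. A minimal sketch; the sample string is made up, and \d+ assumes the number may have more than one digit (use \d for exactly one):

```python
import re

text = 'ab^0cd^1ef^2'  # made-up sample containing "^0", "^1" and "^2"

# \^ matches a literal caret; \d+ matches the digits that follow it.
cleaned = re.sub(r'\^\d+', '', text)
print(cleaned)  # abcdef
```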

Python Regex to extract multiple complex groups

I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
Here A through I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
That is, an arbitrary number of comma-separated pairs, separated from each other by semicolons, followed by one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This clearly needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re
s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(m.group(1), m.group(2)) for m in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
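As for why the pattern in the question matches the whole text but groups the values incorrectly: in Python's re module, a capturing group inside a repetition keeps only its last iteration. The sketch below runs the question's own pattern on its own sample to show which pair is lost:

```python
import re

sample = 'Sample=A,B;C,D;E,F;G,H;I&other_text'
m = re.match(r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)', sample)

# The repeated group ran twice ("E,F;" then "G,H;") but only reports the last run,
# so E and F never appear in the captured groups.
print(m.groups())
# ('A', 'B', 'C', 'D', 'G,H;', 'G', 'H', 'I')
```

This is why the approach above validates with one pattern and then collects the pairs with a separate finditer pass.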
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re
data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>  # either a comma-separated pair ending in a semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |  # OR
    (?P<end_part>  # the ending token, which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)
results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using str.split.
An example:
words = list(map(lambda x: x.split(','), 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')))
This will result in the following list (in Python 3, map alone returns a lazy iterator, hence the list() call):
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
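The split-based approach does not validate the input by itself. A rough sketch of how the question's rules (at least two pairs, then one final value) might be checked on top of it; parse_sample is a hypothetical helper, and it assumes the "Sample=" prefix has already been removed:

```python
def parse_sample(value):
    """Split 'a,b;c,d;...;final' into ([(a, b), (c, d), ...], final). Hypothetical helper."""
    # Everything before the last semicolon is a pair; the last piece is the final value.
    *raw_pairs, final_val = value.split(';')
    pairs = [tuple(p.split(',')) for p in raw_pairs]
    # Require at least two pairs, exactly two values per pair, and no comma in the final value.
    if len(pairs) < 2 or any(len(p) != 2 for p in pairs) or ',' in final_val:
        raise ValueError(f'malformed sample: {value!r}')
    return pairs, final_val

pairs, final_val = parse_sample('val11,val12;val21,val22;valn1,valn2;final_val')
print(pairs)
print(final_val)
```

This sketch does not reject empty values (for example 'a,;b,c;d'); add such checks if your data requires them.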

splitting a string by specific letters while preserving them in the string

I'm trying to split a string by specific letters (in this case 'r', 'g' and 'b') so that I can later append the pieces to a list. The catch here is that I want the letters to be copied over to the list as well.
string = '1b24g55r44r'
What I want:
[[1b], [24g], [55r], [44r]]
You can use findall:
import re
print([match for match in re.findall('[^rgb]+?[rgb]', '1b24g55r44r')])
Output
['1b', '24g', '55r', '44r']
The regex match:
[^rgb]+? everything that is not rgb one or more times
followed by one of [rgb].
If you need the result to be singleton lists you can do it like this:
print([[match] for match in re.findall('[^rgb]+?[rgb]', '1b24g55r44r')])
Output
[['1b'], ['24g'], ['55r'], ['44r']]
Also if the string is only composed of digits and rgb you can do it like this:
import re
print([[match] for match in re.findall(r'\d+?[rgb]', '1b24g55r44r')])
The only change in the above regex is \d+?, that means match one or more digits.
Output
[['1b'], ['24g'], ['55r'], ['44r']]
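One thing to be aware of with these patterns: anything after the last r, g or b is silently dropped, because every match must end in one of those letters. If a trailing remainder should be kept, an extra alternative can pick it up. This is a sketch with a made-up input, and the second pattern is an assumption about what would be wanted, not part of the answer above:

```python
import re

string = '1b24g55r44r7'  # made-up input with a trailing '7' after the last letter

# The original pattern: the trailing '7' has no letter after it, so it is not matched.
without_rest = re.findall(r'[^rgb]+?[rgb]', string)

# Alternative: also accept a run of non-rgb characters that reaches the end of the string.
with_rest = re.findall(r'[^rgb]*[rgb]|[^rgb]+$', string)

print(without_rest)
print(with_rest)
```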
