I'm writing a program which jumbles clauses within a text using punctuation marks as delimiters for when to split the text.
At the moment my code has a large list where each item is a group of clauses.
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
clause_split = re.split('[,;:".?!]', i)
clause_split.remove(clause_split[len(clause_split)-1])
for x in range(0, len(clause_split)):
clause_split_content.append(clause_split[x])
shuffle(clause_split_content)
print(*content, sep='')
at the moment the result jumbles the text without retaining the punctuation which is used as the delimiter to split it.
The output would be something like this:
a test this also this is a test is
I want to retain the punctuation within the final output so it would look something like this:
a test! this, also. this: is. a test? is;
I think you are simply using the wrong function of re for your purpose. split() excludes your separator, but you can use another function e.g. findall() to manually select all words you want. For example with the following code I can create your desired output:
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
words_with_seperator = re.findall(r'([^,;:".?!]*[,;:".?!])\s?', i)
clause_split_content.extend(words_with_seperator)
shuffle(clause_split_content)
print(*clause_split_content, sep=' ')
Output:
this, this: is. also. a test! a test? is;
The pattern ([^,;:".?!]*[,;:".?!])\s? simply takes all characters that are not a separator until a separator is seen. These characters are all in the matching group, which creates your result. The \s? is only to get rid of the space characters in between the words.
Here's a way to do what you've asked:
import re
from random import shuffle
text = ["this, is. a test?", "this: is; also. a test!"]
content = [y for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]
shuffle(content)
print(*content, sep=' ')
Output:
is; is. also. a test? this, a test! this:
Explanation:
the regex pattern r'([^,;:".?!]*[,;:".?!])' matches 0 or more non-separator characters followed by a separator character, and findall() returns a list of all such non-overlapping matches
the list comprehension iterates over the input strings in list text and has an inner loop that iterates over the findall results for each input string, so that we create a single list of every matched pattern within every string.
shuffle and print are as in your original code.
Related
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
I understand that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub("\d+$",'', text)
print(s)
it should print:
'xyz'
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
'xyz'
I have a String from which I want to take the values within the parenthesis. Then, get the values that are separated from a comma.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?)\)", string)
secondResult = re.search("(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters not in (,,,) and – to prevent the first x being picked up – may not be followed by a ( anywhere further in the string. This makes it also work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?)\)", string)
for e in firstResult.group(1).split(','):
print(e)
A little wonky looking, and also assuming there's always going to be a grouping of 3 values in the parenthesis - but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search("\((.*?),(.*?),(.*?)\)", string).groups()
You can then call the firstResult object like a list
>> print(firstResult[2])
23ERWA31
I am trying to split a string based on a particular pattern in an effort to rejoin it later after adding a few characters.
Here's a sample of my string: "123\babc\b:123" which I need to convert to "123\babc\\"b\":123". I need to do it several times in a long string. I have tried variations of the following:
regex = r"(\\b[a-zA-Z]+)\\b:"
test_str = "123\\babc\\b:123"
x = re.split(regex, test_str)
but it doesn't split at the right positions for me to join. Is there another way of doing this/another way of splitting and joining?
You're right, you can do it with re.split as suggested. You can split by \b and then rebuild your output with a specific separator (and keep the \b when you want too).
Here an example:
# Import module
import re
string = "123\\babc\\b:123"
# Split by "\n"
list_sliced = re.split(r'\\b', "123\\babc\\b:123")
print(list_sliced)
# ['123', 'abc', ':123']
# Define your custom separator
custom_sep = '\\\\"b\\"'
# Build your new output
output = list_sliced[0]
# Iterate over each word
for i, word in enumerate(list_sliced[1:]):
# Chose the separator according the parity (since we don't want to change the first "\b")
sep = "\\\\b"
if i % 2 == 1:
sep = custom_sep
# Update output
output += sep + word
print(output)
# 123\\babc\\"b\":123
Maybe, the following expression,
^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$
and a replacement of,
\1\2\3\4\5\\"\6\\":\7
with a re.sub might return our desired output.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
I'm trying to solve this problem were they give me a set of strings where to count how many times a certain word appears within a string like 'code' but the program also counts any variant where the 'd' changes like 'coze' but something like 'coz' doesn't count this is what I made:
def count(word):
count=0
for i in range(len(word)):
lo=word[i:i+4]
if lo=='co': # this is what gives me trouble
count+=1
return count
Test if the first two characters match co and the 4th character matches e.
def count(word):
count=0
for i in range(len(word)-3):
if word[i:i+1] == 'co' and word[i+3] == 'e'
count+=1
return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
from there you could write a function such as:
def count(string, word):
return len(re.findall(word, string))
Regex is the answer to your question as mentioned above but what you need is a more refined regex pattern. since you are looking for certain word appears you need to search for boundary words. So your pattern should be sth. like this:
pattern = r'\bco.e\b'
this way your search will not match with the words like testcodetest or cozetest but only match with code coze coke but not leading or following characters
if you gonna test for multiple times, then it's better to use a compiled pattern, that way it'd be more memory efficient.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.
I have some data stored as pandas data frame and one of the columns contains text strings in Korean. I would like to process each of these text strings as follows:
my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
Into a list like this:
parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'
So the problem is to identify cases where a word (or several words) are followed by parentheses with text only (can be one words or several words separated by commas) and replace them by all the words (before and inside parentheses) separated by comma (for later processing). If a word is followed by parentheses containing numbers (as in this case 7/22), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of words (as they appeared in the original string).
I can extract text in parentheses by using regex as follows:
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)
which yields this:
[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]
But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.
You can use re.findall with a pattern that optionally matches a number enclosed in parentheses:
corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
It's little bit clumsy but you can try:
my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# you can make string out of list then.