I have strings in a text file with more than 2000 lines, like:
cool.add.come.ADD_COPY
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
cool.add.go.MINUS_COPY
I have a list of more than 200 matching words, like:
store=['ADD_COPY','add.cool.warm.ADD_IN', 'warm.cool.warm.MINUS', 'MINUS_COPY']
I'm using a regular expression in the code:
import re

def all(store, file):
    lst = []
    # Collect every run of word characters and dots that is listed in store
    for match in re.finditer(r'[\w.]+', file):
        words = match.group()
        if words in store:
            lst.append(words)
    return lst
Then I check in a loop for requirement.
Output I'm getting:
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
If I change the pattern to \w+, I get only:
ADD_COPY
MINUS_COPY
Required output:
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
ADD_COPY
MINUS_COPY
It appears you want to get the results using a mere list comprehension:
results = set([item for item in store if item in text])
If you need a regex (in case you plan to match whole words only, or match your store items only in specific contexts), you may get the matches using
import re
text="""cool.add.come.ADD_COPY
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
cool.add.go.MINUS_COPY"""
store=['ADD_COPY','add.cool.warm.ADD_IN', 'warm.cool.warm.MINUS', 'MINUS_COPY']
rx="|".join(sorted(map(re.escape, store), key=len, reverse=True))
print(re.findall(rx, text))
The regex will look like
add\.cool\.warm\.ADD_IN|warm\.cool\.warm\.MINUS|MINUS_COPY|ADD_COPY
Basically, the pattern is all your store items with special characters escaped, joined with | and sorted by length in descending order, so that longer items are tried before shorter ones they may contain.
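To illustrate why the sort by length matters, here is a minimal sketch. The store below is hypothetical (it is not the one from the question) and is chosen so that one item is a prefix of another; the regex engine takes the first alternative that matches, so the order of the alternatives changes the result.

```python
import re

text = "cool.add.come.ADD_COPY"
store = ["ADD", "ADD_COPY"]  # hypothetical store with overlapping items

# Alternation in the original order: "ADD" wins before "ADD_COPY" is tried.
unsorted_rx = "|".join(map(re.escape, store))
# Longest first: the more specific item gets the first chance to match.
sorted_rx = "|".join(sorted(map(re.escape, store), key=len, reverse=True))

unsorted_matches = re.findall(unsorted_rx, text)
sorted_matches = re.findall(sorted_rx, text)
print(unsorted_matches)  # ['ADD']
print(sorted_matches)    # ['ADD_COPY']
```

With the store from the question no item is a prefix of another, so the sort is a safeguard there rather than a requirement.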
I'm writing a program which jumbles clauses within a text using punctuation marks as delimiters for when to split the text.
At the moment my code has a large list where each item is a group of clauses.
import re
from random import shuffle

clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
    # Split on punctuation; the delimiters themselves are discarded
    clause_split = re.split('[,;:".?!]', i)
    # Drop the empty string left after the final punctuation mark
    clause_split.remove(clause_split[len(clause_split)-1])
    for x in range(0, len(clause_split)):
        clause_split_content.append(clause_split[x])
shuffle(clause_split_content)
print(*clause_split_content, sep='')
At the moment the result jumbles the text without retaining the punctuation that is used as the delimiter to split it.
The output would be something like this:
a test this also this is a test is
I want to retain the punctuation within the final output so it would look something like this:
a test! this, also. this: is. a test? is;
I think you are simply using the wrong function of re for your purpose. split() excludes your separator, but you can use another function e.g. findall() to manually select all words you want. For example with the following code I can create your desired output:
import re
from random import shuffle

clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
    # Each match is a clause together with the separator that ends it
    words_with_separator = re.findall(r'([^,;:".?!]*[,;:".?!])\s?', i)
    clause_split_content.extend(words_with_separator)
shuffle(clause_split_content)
print(*clause_split_content, sep=' ')
Output:
this, this: is. also. a test! a test? is;
The pattern ([^,;:".?!]*[,;:".?!])\s? simply takes all characters that are not a separator until a separator is seen. These characters are all in the matching group, which creates your result. The \s? is only to get rid of the space characters in between the words.
Here's a way to do what you've asked:
import re
from random import shuffle
text = ["this, is. a test?", "this: is; also. a test!"]
content = [y for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]
shuffle(content)
print(*content, sep=' ')
Output:
is; is. also. a test? this, a test! this:
Explanation:
The regex pattern r'([^,;:".?!]*[,;:".?!])' matches zero or more non-separator characters followed by a separator character, and findall() returns a list of all such non-overlapping matches.
The list comprehension iterates over the input strings in the list text and has an inner loop that iterates over the findall results for each input string, so we get a single list of every match from every string.
shuffle and print are used as in your original code.
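Because shuffle makes the final output different on every run, it can help to look at what findall returns before shuffling. The sketch below is only an illustration: it uses the same character class without the capturing group (findall then returns the whole match) and strips the leading spaces. The second input is a hypothetical one with no closing punctuation, to show a caveat of this pattern.

```python
import re

# Zero or more non-separator characters followed by one separator
pattern = r'[^,;:".?!]*[,;:".?!]'

clauses = [c.strip() for c in re.findall(pattern, "this, is. a test?")]
print(clauses)  # ['this,', 'is.', 'a test?']

# Text after the last separator is never matched, so it is silently dropped.
truncated = [c.strip() for c in re.findall(pattern, "this, is. a test")]
print(truncated)  # ['this,', 'is.']
```

If the input may end without punctuation, the trailing clause would need to be handled separately.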
Using a regular expression, I want to find all the matching words in a sentence and, at the same time, extract only the part I want from each match.
I use findall from the re module to find the matching words, plus parentheses to extract the parts I want.
For example, I have the string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the two characters that follow "0xQQ" or "0xWW", which should give the list ["1A", "2B", "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile(r"0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print(MyList)
So my expected result is ["1A", "2B", "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile(r"0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
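As a quick check of the difference, this sketch runs both patterns on the string from the question. The variable names are only for illustration.

```python
import re

my_string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"

# Two capturing groups: findall returns one tuple per match,
# with an empty string for the group that did not take part.
two_groups = re.findall(r"0xQQ(\w{2})|0xWW(\w{2})", my_string)

# One capturing group behind a non-capturing alternation:
# findall returns a flat list of strings.
one_group = re.findall(r"0x(?:QQ|WW)(\w{2})", my_string)

print(two_groups)  # [('1A', ''), ('', '2B'), ('4C', '')]
print(one_group)   # ['1A', '2B', '4C']
```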
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']
I have some data stored as pandas data frame and one of the columns contains text strings in Korean. I would like to process each of these text strings as follows:
my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
into a string like this:
parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'
So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word or several words separated by commas), and to replace them with all the words (before and inside the parentheses) separated by commas, for later processing. If a word is followed by parentheses containing numbers (as with 7/22 here), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of the words as they appeared in the original string.
I can extract text in parentheses by using regex as follows:
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)
which yields this:
[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]
But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.
You can use re.findall with a pattern that optionally matches a number enclosed in parentheses:
corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
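The matches returned by this pattern still carry the spaces that followed each comma, so a possible follow-up, sketched here with illustrative variable names, is to strip each piece and join the pieces back into one string:

```python
import re

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'

# A run of non-comma, non-parenthesis characters, optionally followed by
# a parenthesized part that contains at least one digit.
pieces = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)

# Strip surrounding whitespace and skip anything that is empty afterwards.
cleaned = [piece.strip() for piece in pieces if piece.strip()]
parsed_text = ', '.join(cleaned)
print(parsed_text)
```

For a pandas column, the same steps could be wrapped in a function and applied to each value.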
It's a little bit clumsy, but you can try:
my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# You can then build a string from the list, e.g. with ", ".join(my_string_list).
In Python, I would like to search through a dictionary such as the Scrabble official list and identify all the words with x number of characters in a particular order. For example, I have "mmt" and would like the output to generate a list of words such as what you see below.
"mmt":
AMALGAMATED
AMMONIATED
CIRCUMAMBULATED
COMMENTATED
Thank you!!
You can generate a dynamic regular expression pattern and filter your list based on that:
import re
words = ["AMALGAMATED", "AMMONIATED", "CIRCUMAMBULATED", "COMMENTATED",
"TAMTAM", "BLUB", "HOUSE", "SOMETHING"]
filter = "mmt"
regex = re.compile(".*".join(filter), re.IGNORECASE)
filtered_words = [word for word in words if regex.search(word)]
print(*filtered_words, sep="\n")
See this code running on ideone.com
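The same idea can be wrapped in a small helper. This is only a sketch: the function name is made up for illustration, and re.escape is an addition that is not in the answer above, included in case the letters ever contain regex metacharacters such as "." or "?".

```python
import re


def words_with_letters_in_order(letters, words):
    """Return the words that contain `letters` in this order, gaps allowed."""
    # "mmt" becomes the pattern "m.*m.*t"; each letter is escaped first.
    pattern = ".*".join(re.escape(ch) for ch in letters)
    regex = re.compile(pattern, re.IGNORECASE)
    return [word for word in words if regex.search(word)]


words = ["AMALGAMATED", "AMMONIATED", "CIRCUMAMBULATED", "COMMENTATED",
         "TAMTAM", "BLUB", "HOUSE", "SOMETHING"]
matches = words_with_letters_in_order("mmt", words)
print(matches)
```

TAMTAM is rejected because no "t" follows its second "m", which shows that the order of the letters is enforced.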
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
If you run the script, the output will be something similar to this:
['aaaaaaaaaaaa', 'bbbbbbbbbbbb', 'cccccccccccc']
How do I make the script print this output (note the line breaks):
aaaaaaaaaaaa
bbbbbbbbbbbb
cccccccccccc
any help?
Just print regex like this:
print "\n".join(regex)
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
print "\n".join(regex)
re.findall() returns a list. So you can either iterate over the list and print out each element separately like so:
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
for match in re.findall(r"([a-f\d]{12})", html):
    print match
Or you can do as @bigOTHER suggests and join the list together into one long string and print that string. It's essentially doing the same thing.
Source: https://docs.python.org/2/library/re.html#re.findall
re.findall(pattern, string, flags=0) Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
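To make the behaviour described in the documentation concrete, here is a small sketch that needs no network access. The html string is a hypothetical stand-in for the downloaded page, and the code uses the print() function so that it also runs on Python 3, unlike the Python 2 snippets above.

```python
import re

# Hypothetical stand-in for the page content fetched in the question
html = "<p>aaaaaaaaaaaa</p><p>bbbbbbbbbbbb</p><p>cccccccccccc</p>"

# No groups in the pattern, so findall returns a list of whole matches.
matches = re.findall(r"[a-f\d]{12}", html)
print(matches)  # ['aaaaaaaaaaaa', 'bbbbbbbbbbbb', 'cccccccccccc']

# Joining with a newline gives one match per line.
joined = "\n".join(matches)
print(joined)
```

With the single capturing group used in the question, findall returns the same list, because one group yields a list of strings and not tuples.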
Use join on the result:
"".join("{0}\n".format(x) for x in re.findall(r"([a-f\d]{12})", html))