How to place comma accurately with regular expression? - python

I am new to Python regular expressions. Here is my text:
'Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: Ships from our Warehouse in 2-4 Weeks'
I want to add commas using a Python regular expression, so the result looks like this:
'Condition: Remanufactured ,Grade: Commercial ,Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled ,Processing Time: Ships from our Warehouse in 2-4 Weeks'
Basically I want to target the words followed by a colon and add a comma before each of them, starting from the second one.

Honestly I wouldn't do this with a regular expression, in large part based on your "Processing Time" example, which makes it look like you've got a problem that can only be solved by knowing the specific expected field names.
Code can't magically know that "Processing " is more tightly bound to "Time" than to "Fully Assembled".
So I see basically three solution shapes, and I'm just going to focus on the first one because I think it's the best one, but I'll briefly summarize all three:
Use a list of known field names which make the comma insertions harder, and replace those strings just for the duration of your comma-insertion logic. This frees your comma-insertion logic to be simpler and regular.
Get a list of all known field names, and look for them specifically to insert commas in front of them. This is probably worse but if the list of names doesn't change and isn't expected to change, and most names are tricky, then this could be cleaner.
Throw a modern language modeling text prediction AI at the problem: given an ambiguous string like "...: Fully Assembled Processing Time: ..." you could basically prompt your AI with "Assembled" and see how much confidence it gives to the next tokens being "Processing Time", and then prompt it with "Processing" and see how much confidence it gives to the next tokens being "Time", and pick the one it has more confidence for as your field name. I think this is overkill unless you really have so few guarantees about your input that you have to treat it like a natural language processing problem.
So I would do option 1, and the general idea looks something like this:
tricky_fields = {
    "Processing Time": "ProcessingTime",
    # add others here as needed
}

for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f"{proper_name}: ", f"{easier_name}: ")

# do the actual comma insertions here

for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f"{easier_name}: ", f"{proper_name}: ")
Notice that I put the colon and a trailing space right after the field names in the replacements (I deliberately didn't require a leading space, because the comma insertion below ends up putting a comma right in front of the field name, and the restore pass still has to match). If you know that your fields always end with a colon and a space like that, this is better practice because it's less likely to automatically replace something you didn't mean to replace, and thus less likely to be a source of bugs later.
Then the comma insertion itself becomes an easy regex if none of your replacement names contain spaces or colons, because your target is just [^ :]+:. But regex is a cryptic micro-language which is not optimized for human readability, and it doesn't need to be a regex: you can split on ":", then for each piece of that split, split off the last word (which is the next field name) and rejoin it with " ,", and finally rejoin the whole thing with ":":
def insert_commas(text):
    parts = text.split(":")
    new_parts = []
    # Every part except the last ends with the next field's name,
    # so split off that last word and put " ," in front of it.
    for part in parts[:-1]:
        if " " in part:
            most, last = part.rsplit(" ", 1)
            part = " ,".join((most, last))
        new_parts.append(part)
    new_parts.append(parts[-1])  # the last part has no field name after it
    return ":".join(new_parts)
But if you really wanted to use a regex, here's a simple one that does what you want:
import re

def insert_commas(text):
    return re.sub(' ([^ :]+: )', r' ,\1', text)
Although in real production code I'd improve the tricky field replacements by factoring the two replacement loops out into one separate, testable function and using something like bidict instead of a regular dictionary, like this:
from bidict import bidict

tricky_fields = bidict({
    "Processing Time": "ProcessingTime",
    # add others here as needed
})

def replace_fields(names, text):
    for old_name, new_name in names.items():
        text = text.replace(f"{old_name}: ", f"{new_name}: ")
    return text
Using a bidict and a dedicated function is clearer, more self-descriptive, more maintainable, less code to keep consistent, and easier to test/verify, and even gets some runtime safety against accidentally mapping two tricky field names to the same replacement field.
So composing those two previous code blocks together:
text = replace_fields(tricky_fields, text)
text = insert_commas(text)
text = replace_fields(tricky_fields.inverse, text)
Of course, if you don't need the second replacement to undo the initial one, you can just leave the text as-is after the comma insertion is done. Either way, this decouples the comma problem from the problem of tricky names which make it harder.
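For reference, here's a minimal end-to-end sketch of that pipeline run against your sample string, using the regex version of insert_commas. Note that it also puts a comma before "Ships:", which your expected output doesn't show; if "Ships" there is really part of the Warranty value rather than a field of its own, it would need the same tricky-field treatment as "Processing Time".

import re
from bidict import bidict

tricky_fields = bidict({"Processing Time": "ProcessingTime"})

def replace_fields(names, text):
    for old_name, new_name in names.items():
        text = text.replace(f"{old_name}: ", f"{new_name}: ")
    return text

def insert_commas(text):
    return re.sub(' ([^ :]+: )', r' ,\1', text)

text = ("Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & "
        "On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: "
        "Ships from our Warehouse in 2-4 Weeks")

text = replace_fields(tricky_fields, text)           # hide the tricky field name
text = insert_commas(text)                           # comma before every "word:" field
text = replace_fields(tricky_fields.inverse, text)   # restore the original name
print(text)
# Condition: Remanufactured ,Grade: Commercial ,Warranty: 1 Year Parts & On-Site Labor w/Ext. ,Ships: Fully Assembled ,Processing Time: Ships from our Warehouse in 2-4 Weeks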

Related

To Split text based on words using python code

I have a long text like the one below. I need to split it based on some words, say ("In", "On", "These").
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code, as I have 1000 rows in a CSV file?
As per my comment, I think a good option would be to use a regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
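As a quick sanity check, here is roughly what that pattern produces on a shortened version of the sample text (the (?<!^) lookbehind keeps it from emitting an empty first chunk):

import re

text = "On the other hand, we denounce pleasure. These cases are simple. In a free hour, every pleasure is welcome."
parts = re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', text)
# parts == ['On the other hand, we denounce pleasure. ',
#           'These cases are simple. ',
#           'In a free hour, every pleasure is welcome.']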
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method on the string. For example:
with open(filename, 'r') as file:
    lines = file.read()

lines = lines.split('These')
# lines is now a list of strings, split whenever the 'These' string was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))

How to prevent Python strings being split() if the delimiter is surrounded by brackets

I am working on a bit of code that loops through a txt file and creates a list containing individual lines. I need specific content from each line, where a comma is used as a delimiter. However, I run into an issue when there is a comma in one of the list items. The list comprehension line separates this single item into two items. The item, author, is enclosed in brackets. Can I have the list comprehension overlook items contained in brackets perhaps?
inventory = open("inventory.txt").readlines()
seperated_inventory = [x.split(",") for x in inventory]
isbn_list = [item[0] for item in seperated_inventory]
author_list = [item[1] for item in seperated_inventory]
title_list = [item[2] for item in seperated_inventory]
category_list = [item[3] for item in seperated_inventory]
active_list = [item[4] for item in seperated_inventory]
Example of a line with two authors:
0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law'],False
I think using a single-character split is not a great strategy when you have sub-lists that can contain the character you're splitting on.
There are three main ways you can approach this (that I've thought of) . . . well, two ways and an alternative:
Option 1: stick with split(',') and re-join sub-arrays.
This is reasonably brittle, lengthy, and inferior to the second approach. I'm putting it first because it directly answers the question, not because it's what you should do:
line="0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False"
# Index of the left hand side of any found sub-arrays.
left = 0
# Iterator position, also used as the index of the right hand side of any found sub-arrays.
right = 0
array = line.split(',')
while right < len(array):
if array[right].startswith('['):
array[right] = array[right][1:] # Remove the bracket
left = right
if array[right].endswith(']'):
array[right] = array[right][:-1] # Remove the bracket
# Pull the stuff between brackets out into a sub-array, and then
# replace that segment of the original array with a single element
# which is the sub-array.
array[left:right+1] = [array[left:right+1]]
# Preserve the "leading search position", since we just changed
# the size of the array.
right = left
right += 1
print(array)
As you can see, that code is much less legible than a comprehension. It's also complex; it probably has bugs and edge cases I did not test.
This will only work with a single level of nested sub-arrays.
Option 2: Regex
Despite what xkcd says about regex, in this case it is a much clearer and simpler solution to extracting sub-arrays. More information on how to use regex can be found in the documentation for the re module. Online regex testers are also readily available, and are a great help when debugging regular expressions.
import re

line = "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law'],False"

r = re.compile(r'(?:\[(?P<nested>.*?)\]|(?P<flat>[^,]+?)),')

array = []
# For each matched part of the line, figure out if we matched a
# sub-array (in which case, split it on comma and add the resulting
# array to the final list) or a normal item (just add it to the final
# list).
# We append a comma to the string we search so our regex always matches
# the last element.
for match in r.finditer(line + ","):
    if match.group('nested'):  # It's a sub-array
        array.append(match.group('nested').split(","))
    else:  # It's a normal top-level element
        array.append(match.group('flat'))

print(array)
The regex says, roughly:
Start a non-capturing group (?:) that wraps the two sub-patterns. Just like parentheses forcing the order of operations in a math formula, this makes it explicit that the trailing comma at the end of this regex is not part of either capturing group. It's not strictly necessary, but makes things clearer.
Match one of two groups. The first group is some characters between a pair of square brackets; commas inside the brackets are just part of the match, not separators. The match should be done lazily (stop as soon as a closing bracket is seen; that's the ?), and anything in the match should be made available to the regex API with the name "nested". The name is totally optional; array indexes on the match object could be used just as well, but this is more explicit for code readers.
The second group that could be matched is some characters that do not contain a comma ([^,]). Depending on the eagerness of the regex engine, you could potentially replace this with "any character" and trust that the comma outside of the outer non-capturing ?: group would prevent these matches from running away, but saying "not comma" is more explicit for readers. Anything that matches this group should be stored with the name "flat".
Lastly, look for a comma following occurrences of either of those patterns. Since the last element in the array isn't followed by a comma, I just kludge and match against the line plus one additional comma rather than further complicate the regex.
Once the regex is understood, the rest is simple: loop through each match, see if it was "flat" or "nested", and if it was nested, split it based on comma and add that as a sub-array to the result.
This will not work with more than a single level of nested sub-arrays, and will break/do unexpected things if commas end up adjacent to each other or if a sub-array isn't "closed" (malformed input, basically), which brings me to . . .
Option 3: Use a structured data format
Both of those parsers are prone to errors. Elements in your arrays could contain special characters (e.g. what if a title like this had a square bracket as part of its name?), multiple commas could appear around fields that are "empty", you could need multiple-levels of nested sub-arrays (you can make either of the first two options recursive, but the code will just get that much harder to read), or, perhaps most commonly, you could be handed input that's slightly broken/not compliant with what you expect, and have to parse it anyway.
Dealing with all of those issues can be accomplished with more code, but that code typically makes the parsing system less reliable, not more.
Instead, consider switching your data interchange format to something like JSON. The line you supplied is already nearly valid JSON, so you might be able to just use the json Python module directly and have things "just work" without needing to write a single line of parsing code. There are many other options for structured data parsing, including YAML and TOML. Anything you choose in that area will likely be more robust than rolling parsing logic by hand.
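As a rough sketch of what that could look like, assuming the inventory were re-exported with JSON syntax (double quotes, one JSON array per line; the inventory.json filename is hypothetical):

import json

# A hypothetical JSON-formatted line would look like:
# ["0520085477", ["Richard L. Abel", "Philip Simon Coleman Lewis"], "Lawyers in Society", ["Law"], false]
with open("inventory.json") as f:
    for line in f:
        isbn, authors, title, categories, active = json.loads(line)
        print(isbn, authors, title, categories, active)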
Of course, if this is for fun/education and you want to make something from scratch, code away! Parsers are an excellent educational project, since there are a lot of corner cases, but each corner case tends to be discrete/interact only minimally with other weird cases.

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string  # refind string and get surrounding += 4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
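str.partition() on its own doesn't limit the result to x words on each side, but combining it with split() gets there. A minimal sketch (the choice of 4 words is just an example):

text = "Hello, world. Regular expressions are not always the answer."
before, sep, after = text.partition("Regular expressions")

x = 4
words_before = before.split()[-x:]   # up to x words before the match
words_after = after.split()[:x]      # up to x words after the match
print(words_before, sep, words_after)
# ['Hello,', 'world.'] Regular expressions ['are', 'not', 'always', 'the']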
The main problem with your pattern is that it begins with optional things, which causes a lot of attempts at each position in the string until a match is found. The number of attempts increases with the text size and with the value of n (the number of words before and after). This is why only a few lines of text suffice to crash your code.
One approach is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after:
keyword (?= words after ) (?<= words before - keyword)
Starting a pattern with the searched word (a literal string) makes it very fast, and the words around it are then quickly found from this position in the string. Unfortunately the re module has some limitations and doesn't allow variable-length lookbehinds (like many other regex flavors).
The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex

text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''

word = 'harpooneers'
n = 4

pattern = r'''
\m (?<target> %s ) \M          # target word
(?<=                           # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    %s
)
(?=                            # content after
    (?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)

rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)

class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')

results = [Result(m) for m in rgx.finditer(text)]

print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you
do an excessive amount of needless work every time
cannot really know for sure how many repetitions you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the solutions below: the 1st solution is the most effective one for large data; the 2nd one is the closest to your current approach, but scales much worse.
strip your entities to exactly what you are interested in at each moment:
find the substring (e.g. str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back.
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. use re.finditer over a reversed view of the substring in place (see the questions "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards"). This will only become better than slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. This doesn't work, though: re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function in Python to find a character from a character class, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100ms on a P3 933MHz for a full pass through a 100k string).
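Here is a minimal sketch of that first option in modern Python: find the match, then take up to N whitespace-separated words on each side. The function name and the slicing are mine, and the slices do copy parts of the text (which the point above warns about for very large inputs); it just shows the shape of the approach.

import re

def words_around(text, word, n=10):
    m = re.search(r'\b%s\b' % re.escape(word), text)
    if m is None:
        return None
    before = text[:m.start()].split()[-n:]   # up to n words before the match
    after = text[m.end():].split()[:n]       # up to n words after the match
    return before, m.group(), after

print(words_around("the quick brown fox jumps over the lazy dog", "jumps", n=2))
# (['brown', 'fox'], 'jumps', ['over', 'the'])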
Alternatively:
Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string efficiently is harder:
with (\w+\W+){,10} or equivalent, the matcher will be finding every 10 words before discovering that your string doesn't follow them, then trying 9,8, etc. To ease it up on the matcher somewhat, \b before the pattern will make it only perform all this work at the beginning of each word
lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it was - a lookbehind is tried before every character - i.e. it's even more of a CPU hog
As you can see, regexes aren't quite cut out for matching things backwards
To eliminate code duplication, either
use the aforementioned {,10}. This will not save individual words but should be noticeably faster for large text (see the above on how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it. Or
autogenerate the repetitive part
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w=(\w+\W+) here for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).
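As a rough illustration of that first route (the \w+\W+ fix plus explicit {0,10} repetitions and a leading \b), here is a sketch; it is my own composition of the pieces described above, not code from the answer:

import re

def around(text, word, n=10):
    # Explicit repetition counts, non-capturing groups, and a leading \b so the
    # attempt only starts at word boundaries; the whole match is the retrieved chunk.
    pattern = r'\b(?:\w+\W+){0,%d}\b%s\b(?:\W+\w+){0,%d}' % (n, re.escape(word), n)
    return [m.group() for m in re.finditer(pattern, text)]

print(around("the quick brown fox jumps over the lazy dog", "jumps", n=2))
# ['brown fox jumps over the']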
I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

python string replacement, all possible combinations #2

I have sentences like the following:
((wouldyou)) give me something ((please))
and a bunch of keywords, stored in arrays / lists:
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]
I want to replace every occurrence of variables in parentheses with a suitable set of strings stored in an array and get every possible combination back. The amount of variables and keywords is undefined.
James helped me with the following code:
from itertools import product

def filler(word, from_char, to_char):
    options = [(c,) if c != from_char else (from_char, to_char) for c in word.split(" ")]
    return (' '.join(o) for o in product(*options))

list(filler('((?please)) tell me something ((?please))', '((?please))', ''))
It works great but only replaces one specific variable with empty strings. Now I want to go through various variables with different set of keywords. The desired result should look something like this:
can you give me something please
would you give me something please
please give me something please
can you give me something ASAP
would you give me something ASAP
please give me something ASAP
I guess it has something to do with to_char, but I have no idea how to work through the list items at this point.
The following would work. It uses itertools.product to construct all of the possible pairings (or more) of your keywords.
import re, itertools

text = "((wouldyou)) give me something ((please))"

keywords = {}
keywords["wouldyou"] = ["can you", "would you", "please"]
keywords["please"] = ["please", "ASAP"]

# Get a list of bracketed terms
lsources = re.findall(r"\(\((.*?)\)\)", text)

# Build a list of the possible substitutions
ldests = []
for source in lsources:
    ldests.append(keywords[source])

# Generate the various pairings
for lproduct in itertools.product(*ldests):
    output = text
    for src, dest in itertools.izip(lsources, lproduct):
        # Replace each term (you could optimise this using a single re.sub)
        output = output.replace("((%s))" % src, dest)
    print output
You could further improve it by avoiding the need to do multiple replace() and assignment calls with one re.sub() call.
This script gives the following output:
can you give me something please
can you give me something ASAP
would you give me something please
would you give me something ASAP
please give me something please
please give me something ASAP
It was tested using Python 2.7. You will need to think about how to handle it if multiple identical keywords are used. Hopefully you find this useful.
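Here is a rough sketch of the single re.sub() variant mentioned above, using a replacement callback instead of repeated replace() calls (written for Python 3, unlike the code above, and collapsing duplicate keywords in the way the caveat describes):

import re
import itertools

text = "((wouldyou)) give me something ((please))"
keywords = {
    "wouldyou": ["can you", "would you", "please"],
    "please": ["please", "ASAP"],
}

lsources = re.findall(r"\(\((.*?)\)\)", text)
for lproduct in itertools.product(*(keywords[s] for s in lsources)):
    # Map each bracketed term to its replacement for this combination,
    # then substitute them all in one pass.
    mapping = dict(zip(lsources, lproduct))
    print(re.sub(r"\(\((.*?)\)\)", lambda m: mapping[m.group(1)], text))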
This is a job for Captain Regex!
Partial, pseudo-codey, solution...
One direct, albeit inefficient (like O(n*m) where n is number of words to replace and m is average number of replacements per word), way to do this would be to use the regex functionality in the re module to match the words, then use the re.sub() method to swap them out. Then you could just embed that in nested loops. So (assuming you get your replacements into a dict or something first), it would look something like this:
import re

for key in repldict:
    # construct a pattern on the fly for key, matching ((key))
    regexpattern = r'\(\(' + re.escape(key) + r'\)\)'
    for item in repldict[key]:
        newstring = re.sub(regexpattern, item, sentence)
And so forth (here repldict is your dict of replacements and sentence is the input string). Then just append each newstring to a list, or print it, or whatever.
For creating the regexpatterns on the fly, string concatenation just should do it. Like a regex to match left parens, plus the string to match, plus a regex to match right parens.
If you do it that way, then you can handle the optional features just by looping over a second version of the regex pattern which appends a question mark to the end of the left parens, then does whatever you want to do with that.

Split string with caret character in python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") contains whitespace, in the second part (a subtitle) each word is separated by the "_" character, and at the end there is a number (a page number). I want to split each line into 3 (obvious) parts, because I want to create some sort of directory in Python.
I was trying with the re module, but as the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help me?
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want three pieces you can accomplish this through a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters will produce an empty string in the result. You could protect against this by either collapsing consecutive carets down into a single one, or detecting empty strings in the resulting list.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
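A minimal sketch of that csv-module route, assuming every line looks like your example (four caret-separated fields, the third one empty, no quoting; menu.txt is a hypothetical filename):

import csv

with open("menu.txt", newline="") as f:
    for menu, title, dummy, page in csv.reader(f, delimiter="^"):
        print(menu, title, page)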
Also, for the re module, any special character can be escaped with \, like r'\^'. Before jumping to re, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a simpler solution to your problem - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»
