Replacing a word in a psycopg2.sql.Composed object - python

I have a rather complex psycopg2.sql.Composed object in which I simply need to replace a word for a backward-compatibility issue.
Before I had such an object, I had an f-string on which this snippet worked like a charm:
if v4:
    sql_update_query = re.sub(
        'word_to_replace,',
        'new_replacement_word',
        sql_update_query
    )
I naively tried the same on the psycopg2.sql.Composed object:
if v4:
    sql_update_query = re.sub(
        'word_to_replace,',
        'new_replacement_word',
        sql_update_query.as_string(conn)  # conversion to a string for re.sub() to work
    )
That works, but how do I then get back to a true psycopg2.sql.Composed object?

Never mind: I noticed that the replacement was needed on column identifiers only. Therefore, I extracted them into a list and did the replacement within the list itself, like this:
columns_list = [
    re.sub(
        '^word_to_replace$',
        'new_replacement_word',
        col
    ) for col in columns
]
Note that the word to replace had a trailing comma in the original post; that was a trick, because some columns start with the same name but carry _suffixes. Here, anchoring the first re.sub() argument (with ^ and $) restricts the replacement to this exact word, leaving all suffixed versions untouched, as the sketch below illustrates.
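For illustration, a quick check with made-up column names shows that only the exact match is renamed:

import re

columns = ['word_to_replace', 'word_to_replace_v2', 'other_col']  # hypothetical names
columns_list = [re.sub('^word_to_replace$', 'new_replacement_word', col) for col in columns]
print(columns_list)  # ['new_replacement_word', 'word_to_replace_v2', 'other_col']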
Then I added the columns as sql.Identifier:
sql.SQL(', ').join(map(sql.Identifier, columns_list)),
and voila.
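To make the ending concrete, here is a minimal sketch of rebuilding a true Composed query from the renamed identifiers. The table name and the placeholder layout are hypothetical, and conn is assumed to be an open psycopg2 connection:

from psycopg2 import sql

columns_list = ['new_replacement_word', 'word_to_replace_v2', 'other_col']

# Rebuild a genuine psycopg2.sql.Composed object from plain strings.
sql_update_query = sql.SQL("UPDATE {} SET ({}) = ({})").format(
    sql.Identifier('my_table'),  # hypothetical table name
    sql.SQL(', ').join(map(sql.Identifier, columns_list)),
    sql.SQL(', ').join(sql.Placeholder() * len(columns_list)),
)
print(sql_update_query.as_string(conn))
# UPDATE "my_table" SET ("new_replacement_word", "word_to_replace_v2", "other_col") = (%s, %s, %s)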

Related

Python re.sub always returns the original string value and ignores given pattern

My code is below:
old = """
B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""
duplicate = """
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""
final = re.sub(r"\b{}\b".format(duplicate),"",old)
print(final)
The final print always shows the old values. I want the duplicate values to be removed from the old variable.
The block string should not start or end with a newline, since that introduces a \n character. Try:
old = """B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91 # <-
B09T8TJ65B # <-
B09K55Z433""" # <-
duplicate = """B0B78CWZ91
B09T8TJ65B
B09K55Z433"""
and the result will no longer equal old.
Output
B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
Alternatively, write the block string like this:
"""\
B0B78CWZ91
B09T8TJ65B
B09K55Z433\
"""
It seems you can use
final = re.sub(r"(?!\B\w){}(?<!\w\B)".format(re.escape(duplicate.strip())),"",old)
Note several things here:
duplicate.strip() - the whitespace on both ends may prevent the pattern from matching, so strip() removes it from duplicate
re.escape(...) - if there are special characters, they are properly escaped with re.escape
(?!\B\w) and (?<!\w\B) are dynamic, adaptive word boundaries: they assert a word boundary only when the match starts or ends with a word character, so matching works properly at word boundaries where required.
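Putting it all together with the question's data (note the triple-quoted strings no longer start or end with a newline):

import re

old = """B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91
B09T8TJ65B
B09K55Z433"""
duplicate = """B0B78CWZ91
B09T8TJ65B
B09K55Z433"""

final = re.sub(r"(?!\B\w){}(?<!\w\B)".format(re.escape(duplicate.strip())), "", old)
print(final.strip())
# B07K6VMVL5
# B071XQ6H38
# B0B7F6Q9BH
# B082KTHRBT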

Extract a match in a row and take everything until a comma, or take it if it is the end of the string, in pandas

I have a dataset. In the column 'Tags' I want to extract from each row all the content that contains the word player. It can be repeated or appear alone in the same cell. Something like this:
'view_snapshot_hi:hab,like_hi:hab,view_snapshot_foinbra,completed_profile,view_page_investors_landing,view_foinbra_inv_step1,view_foinbra_inv_step2,view_foinbra_inv_step3,view_snapshot_acium,player,view_acium_inv_step1,view_acium_inv_step2,view_acium_inv_step3,player_acium-ronda-2_r1,view_foinbra_rinv_step1,view_page_makers_landing'
expected output:
'player,player_acium-ronda-2_r1'
And I need both.
df["Tags"] = df["Tags"].str.ectract(r'*player'*,?\s*')
I tried this but it's not working.
You need to use Series.str.extract keeping in mind that the pattern should contain a capturing group embracing the part you need to extract.
The pattern you need is player[^,]*:
df["Tags"] = df["Tags"].str.extract(r'(player[^,]*)', expand=False)
The expand=False returns a Series/Index rather than a dataframe.
Note that Series.str.extract finds and fetches the first match only. To get all matches, use either of the two solutions below with Series.str.findall (which, unlike extract, takes no expand argument):
df["Tags"] = df["Tags"].str.findall(r'player[^,]*')
df["Tags"] = df["Tags"].str.findall(r'player[^,]*').str.join(", ")
This simple list comprehension also gives what you want:
words_with_players = [item for item in your_str.split(',') if 'player' in item]
players = ','.join(words_with_players)
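For example, with an abbreviated version of the question's tag string in a one-row frame:

import pandas as pd

df = pd.DataFrame({"Tags": ["view_snapshot_acium,player,view_acium_inv_step1,player_acium-ronda-2_r1"]})
print(df["Tags"].str.findall(r'player[^,]*').str.join(","))
# 0    player,player_acium-ronda-2_r1
# Name: Tags, dtype: object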

Python: re.findall() does not work for overlapping substrings

I want to match a string with a list of values. These can be overlapping, so for example string = "test1 test2" and values = ["test1", "test1 test2"].
EDIT: Below is my entire code for a simple example
import regex
string = "This is a test string"
values = ["test", "word", "string", "test string"]
pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))
output = ([x.upper() for x in values if x.lower() in matches])
print(output) # ['TEST', 'STRING']
# Expected output: ['TEST', 'STRING', 'TEST STRING']
As Wiktor commented, if you want to find all matches, you cannot use alternatives, because the regex processor tries consecutive alternatives and returns only the first alternative found.
So your program has to use a separate pattern for each value to test, but for performance reasons you can compile all of them in advance.
Another difference I spotted is import regex: regex is a third-party module installed separately, while the standard library module in both Python 2 and 3 is re. The script below uses re.
The script can look like below:
import re

def mtch(pat, text):
    s = pat.search(text)
    return s.group().upper() if s else None

# Strings to look for
values = ["test", "word", "string", "test string"]

# Compile patterns
patterns = [re.compile(r'\b({})\b'.format(re.escape(v)), re.IGNORECASE)
            for v in values]

# The string to check
string = "This is a test string"

# What has been found
print(list(filter(None, [mtch(pat, string) for pat in patterns])))
The mtch function returns the text found by pat (the compiled pattern) in text (the source string), or None if the match failed.
patterns contains a list of compiled patterns.
Then [mtch(pat, string) for pat in patterns] is a list comprehension generating the list of match results (with None values where the match attempt failed).
To filter out the None values I used the filter function, and finally list gathers all filtered strings, so the script prints:
['TEST', 'STRING', 'TEST STRING']
If you want to perform this search for multiple source strings, run only the last statement for each source string, probably adding the result (and some indication of which string was searched) to a result list.
If your source list is very long, do not attempt to read it all at once. Instead, read the strings one by one in a loop and run the check only for the current input string.
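For instance, a sketch of that streaming variant, reusing mtch and patterns from above (the input file name is hypothetical):

results = []
with open("sources.txt") as f:
    for line in f:
        string = line.strip()
        if string:  # skip empty lines
            found = list(filter(None, [mtch(pat, string) for pat in patterns]))
            results.append((string, found))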
Edit concerning comment as of 2019-02-18 10:00Z
As I read from your comment, the code reading strings is as follows:
with open("Test_data.csv") as f:
    for entry in f:
        entry = entry.split(',')
        string = entry[2] + " " + entry[3] + " " + entry[6]
Note that you overwrite string on every iteration, so after the loop completes you have the result from the last row only. Or do you run the pattern search for the current string right after reading it?
Some more hints for changing the code:
Avoid combinations where, e.g., the entry variable initially holds the whole string and then a list produced by splitting it. A more readable variant is:
for row in f:
    entry = row.split(',')
After you read a row, and before doing anything else, check whether the row just read is non-empty; if it is empty, skip it. A quick way to test this is to use the string itself in an if (an empty string evaluates to False):
for row in f:
    if row:
        entry = row.split(',')
        ...
Before string = entry[2] + " " + entry[3] + " " + entry[6], check whether the entry list has at least 7 items (indexing starts at 0). Maybe some of your input rows contain fewer fields, so your program attempts to read a non-existing element of the list.
To be sure which strings you are checking, write a short program that only splits the input and prints the resulting strings. Then look at them; maybe you will find something wrong. A sketch combining these checks follows.
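A defensive version of the reading loop, still assuming the Test_data.csv layout from the comment, might look like this:

with open("Test_data.csv") as f:
    for row in f:
        if not row.strip():
            continue  # skip empty rows
        entry = row.split(',')
        if len(entry) < 7:
            print("short row:", row)  # inspect malformed rows
            continue
        string = entry[2] + " " + entry[3] + " " + entry[6]
        # run the per-string pattern check here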
If you determine that foobar is in the text, you don't need to search the text separately for foo and bar: you know the answer already.
First group your searches:
searches = ['test', 'word', 'string', 'test string', 'wo', 'wordy']
unique = set(searches)
ordered = sorted(unique, key=len)
grouped = {}
while unique:
    s1 = ordered.pop()
    if s1 in unique:
        unique.remove(s1)
        grouped[s1] = [s1]
        redundant = [s2 for s2 in unique if s2 in s1]
        for s2 in redundant:
            unique.remove(s2)
            grouped[s1].append(s2)

for s, dups in grouped.items():
    print(s, dups)

# Output:
# test string ['test string', 'string', 'test']
# wordy ['wordy', 'word', 'wo']
Once you have things grouped, you can confine the searching to just the top-level searches (the keys of grouped).
Also, if scale and performance are concerns, do you really need regular expressions? Your current examples could be handled with ordinary in tests, which are faster. If you do indeed need regular expressions, the idea of grouping the searches is harder -- but perhaps not impossible under some conditions.
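For example, a sketch of the in-based check using the grouped dict built above: any hit on a top-level search implies hits on everything grouped under it.

text = "This is a test string"
found = [s for s in grouped if s.lower() in text.lower()]
hits = [dup for s in found for dup in grouped[s]]
print(hits)  # ['test string', 'string', 'test']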

pandas read_table with regex header definition

For the data file formated like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it does not appear to be possible to do this inside the pandas.read_table() function itself. Below is the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break

    # Parse headerstring
    reglist = re.split(regexstring, headerstring)

    # Filter out blank strings
    filteredlist = list(filter(None, reglist))

    # Filter out items in the exclude list
    # (looping directly also handles an empty exclude list correctly)
    headerslist = []
    for entry in filteredlist:
        if entry not in exclude:
            headerslist.append(entry)
    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a path object for the file that contains the header (its .open() method is used). headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern fed into re.split(); it is highly recommended to prefix it with r. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split pattern (which is " ") from the other characters that need to be removed (namely the quotes and parentheses).
Starting with the first alternative, (?:" "): we use (...) since we want to match those characters in order, and " " is what we want to split around. The ?: says not to capture the contents of the group. This is important, as otherwise re.split() keeps any captured group as a separate item; see re.split() in the documentation.
The second alternative is simply a character class of the other characters. Without it, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate list entry, which is why we use the exclude argument to clean that up after the fact.
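To see the pattern in action on the sample header line:

import re

headerstring = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
print(re.split(r'(?:" ")|["\)\(]', headerstring))
# ['', '', 'Time Step', 'courantnumber_max', 'courantnumber_avg', 'flow-time', '', '\n']
# filtering blanks and '\n' then leaves the four header names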

The difference between ([^,]*) and (.*), in a regular expression? Using python

When I tried to transform the string into a dict-like form, I ran into this problem.
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was changing (.*), into ([^,]*).
The first one, I thought, means: find anything until a comma is matched.
The second one, I thought, means: find anything but a comma.
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bound by a comma-- it's like saying "match everything until we hit a comma". This will cause the matching behavior you were expecting in the first place.
As has been said, the .* will be greedy and try to match as much as possible; to make it non-greedy, use the question mark (?), as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
Another, maybe easier, solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]
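And to finish the conversion into an actual dict, stripping the stray spaces:

s = '&a: 12, &b:13, &c:14, &d: 15'
d = {k.strip(): v.strip() for k, v in (item.split(':') for item in s.split(','))}
print(d)  # {'&a': '12', '&b': '13', '&c': '14', '&d': '15'}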
