splicing between sequence of two defined boundaries

I have written a function that takes four strings. The first two are long strings that can contain anything; the last two are the boundaries. I want to take everything in string 1 between the boundaries (boundaries included) and use it to replace everything in string 2 between the same boundaries. The part taken from string 1 is removed from string 1, and the part it replaces in string 2 is discarded. For example:
bound('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA') --> returns ('DMA', 'ROOGYOMAD', 'OG', 'MA')
This is the function I have written to do that:
def bound(st, sz, a, b):
    s1 = ''.join(st)
    s2 = ''.join(sz)
    if a in s1 and b in s1 and a in s2 and b in s2:
        f1 = s1.find(a)
        l1 = s1.find(b)
        f2 = s2.find(a)
        l2 = s2.find(b)
        blen1 = len(b)
        blen2 = len(b)
        s1_n = s1[:f1] + s1[l1 + blen1:]
        # Slice to the end of s2; indexing a single character would drop the rest.
        s2_n = s2[:f2] + s1[f1:l1 + blen1] + s2[l2 + blen2:]
        return s1_n, s2_n, a, b
print(bound('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA'))
My problem is that this also needs to work in reverse: given ('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA'), it should also look at ('AMAMOYGOD', 'DAMEMGOOR', 'GO', 'AM'). In addition, if the strings can be spliced both ways, it should use only the splice that starts at the lowest index.

Try this. If you need to return several results, don't return as soon as you have the first one; instead, collect the results in a list and return that list at the end, as I did here:
def bound(st,sz,a,b):
    result=[]
    string_s = [''.join(st), ''.join(sz), ''.join(st)[::-1], ''.join(sz)[::-1]]
    boundaries = [a, b, a[::-1], b[::-1]]
    for chunk in range(0, len(string_s), 2):
        word = string_s[chunk:chunk + 2]
        bound = boundaries[chunk:chunk + 2]
        if bound[0] in word[0] and bound[1] in word[0] and bound[0] in word[1] and bound[1] in word[1]:
            f1 = word[0].find(bound[0])
            l1 = word[0].find(bound[1])
            f2 = word[1].find(bound[0])
            l2 = word[1].find(bound[1])
            blen1 = len(bound[1])
            blen2 = len(bound[1])
            s1_n = word[0][:f1] + word[0][l1 + blen1:]
            s2_n = word[1][:f2] + word[0][f1:l1 + blen1] + word[1][l2 + blen2]
            result.append([s1_n, s2_n, bound[0], bound[1]])
    return result
print(bound('DOGYOMAMA','ROOGMEMAD', 'OG','MA'))
output:
[['DMA', 'ROOGYOMAD', 'OG', 'MA'], ['AMAMOYAMOYGOD', 'DAMEME', 'GO', 'AM']]
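The "lowest index" requirement from the question could be handled along the following lines. This is only a sketch under assumptions the question leaves open: the closing boundary must occur after the opening one, the reverse attempt uses the reversed strings and reversed boundaries exactly as listed in the question, and "lowest index" means the position of the opening boundary in the first string. The names splice and bound_both are hypothetical.

```python
def splice(s1, s2, start, end):
    """Return (index, result) for one direction, or None if it cannot be spliced."""
    f1, f2 = s1.find(start), s2.find(start)
    if f1 < 0 or f2 < 0:
        return None
    # Assumption: the closing boundary must come after the opening one.
    l1 = s1.find(end, f1 + len(start))
    l2 = s2.find(end, f2 + len(start))
    if l1 < 0 or l2 < 0:
        return None
    stop1, stop2 = l1 + len(end), l2 + len(end)
    new_s1 = s1[:f1] + s1[stop1:]
    new_s2 = s2[:f2] + s1[f1:stop1] + s2[stop2:]
    return f1, (new_s1, new_s2, start, end)


def bound_both(s1, s2, a, b):
    """Try the forward and the reversed inputs; keep the lowest start index."""
    candidates = [
        splice(s1, s2, a, b),
        splice(s1[::-1], s2[::-1], a[::-1], b[::-1]),
    ]
    found = [c for c in candidates if c is not None]
    if not found:
        return None
    # min() keeps the first candidate on ties, so the forward direction wins a tie.
    return min(found, key=lambda c: c[0])[1]


print(bound_both('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA'))
print(bound_both('XMAYYOGZ', 'QMAOGR', 'OG', 'MA'))
```

With these rules the first call is spliced in the forward direction, while the second call can only be spliced on the reversed strings.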

Related

Why won't len(temp_list) update its value in a loop where values of temp_list changes every iteration?

I am trying to create a dataframe, data, with two columns: 'word' and 'misspelling'. My attempt has five parts: one function, three dataframes, and one loop.
A function that generates misspellings (taken from Peter Norvig):
def generate(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) +1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
A dataframe with words to generate the misspelling:
import pandas as pd
wl = ['a', 'is', 'the']
word_list = pd.DataFrame(wl, columns = ['word'])
An empty dataframe meant to be filled up in the loop:
data = pd.DataFrame(columns = ['word', 'misspelling'])
An empty dataframe meant to temporarily hold the values from the function 'generate' in the loop:
temp_list = pd.DataFrame(columns = ['misspelling'])
A loop that will fill up the dataframe data:
y = 0
for a in range(len(word_list)):
    temp_list['misspelling'] = pd.DataFrame(generate(word_list.at[a,'word']))
    data = pd.concat([data,temp_list], ignore_index = True)
    print(len(temp_list)) #to check the length of 'temp_list' in each loop
    for x in range(len(temp_list)):
        data.at[y,'word'] = word_list.at[a,'word']
        y = y + 1
    y = data.index[-1] + 1
    temp_list.drop(columns = ['misspelling'])
When I check data after the loop, I expect it to have 390 rows in total, which is len(generate('is')) + len(generate('a')) + len(generate('the')).
Instead, data has only 234 rows. When I checked which variable was off, it turned out to be len(temp_list), which I expected to change on every iteration since new values replace the old ones.
len(temp_list) stays the same, so temp_list['misspelling'] = pd.DataFrame(generate(word_list.at[a,'word'])) never holds more than len(generate('a')) values ('a' being the first value in word_list), although the generated misspellings in temp_list are different on each iteration.
I thought adding temp_list.drop(columns = ['misspelling']) at the end of the outer loop would reset temp_list, but it does not seem to reset len(temp_list).

temp_list.drop() with inplace=False (which is the default) does not modify the existing dataframe, but returns a new one. However, even if you fix that, it still won’t work, because you would also need to drop the index, and I’m not sure that’s even possible.
I don’t quite understand what you are trying to do (for example, the for x in ... loop never uses x) but I suspect you might be better off using plain Python lists instead of dataframes.
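The plain-list suggestion could look roughly like this. It is a sketch that assumes the goal is one (word, misspelling) pair per generated misspelling; the generate function is copied from the question, and the name rows is hypothetical.

```python
def generate(word):
    """Single-edit misspellings of word (same logic as in the question)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


words = ['a', 'is', 'the']

# One (word, misspelling) pair per generated misspelling; no temporary dataframe needed.
rows = [(word, miss) for word in words for miss in sorted(generate(word))]

print(len(rows))
```

If a dataframe is needed afterwards, pd.DataFrame(rows, columns=['word', 'misspelling']) builds it in one step.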

Python RegEx: how to replace each match individually

I have a string s, a pattern p and a replacement r. I need to get the list of strings in which only one match of p has been replaced with r.
Example:
s = 'AbcAbAcc'
p = 'A'
r = '_'
# Output:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
I have tried re.finditer(p, s), but I couldn't figure out how to replace each match with r.

You can replace them manually after finding all the matches:
import re
[s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
The result is:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
How does it work?
re.finditer(p,s) finds all matches (each is a re.Match object)
the re.Match objects have start() and end() methods, which return the location of the match
you can replace the part of the string between begin and end using this code: s[:begin] + replacement + s[end:]
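The steps above can be wrapped in a small helper, sketched here with a hypothetical name (replace_each). It inserts the replacement literally, so backreferences such as \1 in the replacement are not expanded.

```python
import re


def replace_each(pattern, repl, text):
    """Return one string per match, with only that match replaced by repl."""
    return [
        text[:m.start()] + repl + text[m.end():]
        for m in re.finditer(pattern, text)
    ]


print(replace_each('A', '_', 'AbcAbAcc'))
# A longer pattern works the same way, because start() and end() span the whole match.
print(replace_each(r'b+', '_', 'abbcab'))
```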

You don't need regex for this; it's as simple as:
[s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
Full code:
s = 'AbcAbAcc'
p = 'A'
r = '_'
x = [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
print(x)
Outputs:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
As mentioned, this only works when p is a single character; for anything longer than one character, or anything requiring a regex, use zvone's answer.
To compare the performance of my answer and zvone's (plus a third non-regex method), test it yourself with the code below:
import timeit,re
s = 'AbcAbAcc'
p = 'A'
r = '_'
def x1():
    return [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
def x2():
    return [s[:i]+r+s[i+1:] for i in range(len(s)) if s[i]==p]
def x3():
    return [s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
print(x1())
print(timeit.timeit(x1, number=100000))
print(x2())
print(timeit.timeit(x2, number=100000))
print(x3())
print(timeit.timeit(x3, number=100000))

Repeating characters results in wrong repetition counts

My function looks like this:
def accum(s):
    a = []
    for i in s:
        b = s.index(i)
        a.append(i * (b+1))
    x = "-".join(a)
    return x.title()
with the expected input of:
'abcd'
the output should be and is:
'A-Bb-Ccc-Dddd'
but if the input has a recurring character:
'abccba'
it returns:
'A-Bb-Ccc-Ccc-Bb-A'
instead of:
'A-Bb-Ccc-Cccc-Bbbbb-Aaaaaa'
How can I fix this?

Don't use str.index(); it returns the position of the first match. Since c, b and a all appear early in the string, you get 2, 1 and 0 back regardless of the position of the current letter.
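A quick way to see this effect, using the input from the question (the variable names are only for illustration):

```python
s = 'abccba'

# str.index() searches from the start of the string every time.
by_index = [s.index(c) for c in s]
# The real positions, for comparison.
by_position = list(range(len(s)))

print(by_index)
print(by_position)
```

The two lists agree only while every character is seeing its first occurrence.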
Use the enumerate() function to give you a position counter instead:
for i, letter in enumerate(s, 1):
    a.append(i * letter)
The second argument is the starting value; setting this to 1 means you can avoid having to + 1 later on. See What does enumerate mean? if you need more details on what enumerate() does.
You can use a list comprehension here rather than use list.append() calls:
def accum(s):
    a = [i * letter for i, letter in enumerate(s, 1)]
    x = "-".join(a)
    return x.title()
which could, at a pinch, be turned into a one-liner:
def accum(s):
    return '-'.join([i * c for i, c in enumerate(s, 1)]).title()

This is because s.index(i) returns the index of the first occurrence of the character. You can use enumerate to pair elements with their indices.
Here is a Pythonic solution:
def accum(s):
    return "-".join(c*(i+1) for i, c in enumerate(s)).title()

A simple version:
def accum(s):
    a = []
    for i in range(len(s)):
        a.append(s[i]*(i+1))
    x = "-".join(a)
    return x.title()

Check intersection between two strings in python

I'm trying to check intersection between two strings using Python.
I defined this function:
def check(s1,s2):
    word_array = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    n_of_words = len(word_array)
    return n_of_words
It works with some sample strings, but in this specific case:
d_word = "BANGKOKThailand"
nlp_word = "Despite Concerns BANGKOK"
print(check(d_word,nlp_word))
I got 0. What am I missing?

I was looking for the longest common part of two strings, wherever it occurs.
def get_intersection(s1, s2):
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        # j runs up to and including len(s1) so substrings ending at the last character are checked too
        for j in range(i + 1, l_s1 + 1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
#get_intersection(s1, s2)
It works for this example as well:
>>> s1 = "BANGKOKThailand"
>>> s2 = "Despite Concerns BANGKOK"
>>> get_intersection('aa' + s1 + 'bb', 'cc' + s2 + 'dd')
'BANGKOK'

The first set contains a single string and the second set contains three strings, and the string "BANGKOKThailand" is not equal to the string "BANGKOK".
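The two sets can be inspected directly, as in the short sketch below. The helper count_contained is hypothetical: it assumes that what is wanted is the number of words from the second string that occur anywhere inside the first, which is a different test from set intersection.

```python
d_word = "BANGKOKThailand"
nlp_word = "Despite Concerns BANGKOK"

words_1 = set(d_word.split(" "))   # a single element, because d_word has no space
words_2 = set(nlp_word.split(" "))
print(words_1)
print(words_1 & words_2)           # empty: no element is equal in both sets


def count_contained(text, phrase):
    """Count distinct words of phrase that occur anywhere inside text (hypothetical helper)."""
    return sum(1 for word in set(phrase.split()) if word in text)


print(count_contained(d_word, nlp_word))
```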

I can see two possible mistakes:
n_of_words = len(array)
should be
n_of_words = len(word_array)
and
d_word = "BANGKOKThailand"
is missing a space; it should be
"BANGKOK Thailand"
Making those two changes gave me a result of 1.

Split list based on first character - Python

I am new to Python and can't quite figure out a solution to my problem. I would like to split a list into two lists, based on what each list item starts with. My list looks like this, where each line represents an item (this is not correct list notation, but I'll leave it like this for a better overview):
***
**
.param
+foo = bar
+foofoo = barbar
+foofoofoo = barbarbar
.model
+spam = eggs
+spamspam = eggseggs
+spamspamspam = eggseggseggs
So I want one list that contains all lines starting with a '+' between .param and .model, and another list that contains all lines starting with a '+' after .model until the end.
I have looked at enumerate() and split(), but since I have a list and not a string and am not trying to match whole items in the list, I'm not sure how to implement them.
What I have is this:
paramList = []
for line in newContent:
    while line.startswith('+'):
        paramList.append(line)
        if line.startswith('.'):
            break
This is just my attempt to create the first list. The problem is that the code reads the second block of '+' lines as well, because break only exits the while loop, not the for loop.
I hope you can understand my question, and thanks in advance for any pointers!

What you want is a simple task that can be accomplished with list slices and a list comprehension:
data = ['**','***','.param','+foo = bar','+foofoo = barbar','+foofoofoo = barbarbar',
        '.model','+spam = eggs','+spamspam = eggseggs','+spamspamspam = eggseggseggs']
# First get the interesting positions.
param_tag_pos = data.index('.param')
model_tag_pos = data.index('.model')
# Get all elements between tags.
params = [param for param in data[param_tag_pos + 1: model_tag_pos] if param.startswith('+')]
# Slice to the end of the list; a stop of -1 would drop the last item.
models = [model for model in data[model_tag_pos + 1:] if model.startswith('+')]
print(params)
print(models)
Output
['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar']
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']
Answer to comment:
Suppose you have a list containing numbers from 0 up to 5.
l = [0, 1, 2, 3, 4, 5]
Then using list slices you can select a subset of l:
another = l[2:5] # another is [2, 3, 4]
That is what we are doing here:
data[param_tag_pos + 1: model_tag_pos]
And for your last question: ...how does python know param are the lines in data it should iterate over and what exactly does the first param in param for param do?
Python doesn't know; you have to tell it.
The first param is a variable name I'm using here; it could be x, list_items, whatever you want.
I will translate the line of code to plain English for you:
# Pythonian
params = [param for param in data[param_tag_pos + 1: model_tag_pos] if param.startswith('+')]
# English
params is a list of "things", for each "thing" we can see in the list `data`
from position `param_tag_pos + 1` to position `model_tag_pos`, just if that "thing" starts with the character '+'.

data = {}
for line in newContent:
    if line.startswith('.'):
        cur_dict = {}
        data[line[1:]] = cur_dict
    elif line.startswith('+'):
        key, value = line[1:].split(' = ', 1)
        cur_dict[key] = value
This creates a dict of dicts:
{'model': {'spam': 'eggs',
           'spamspam': 'eggseggs',
           'spamspamspam': 'eggseggseggs'},
 'param': {'foo': 'bar',
           'foofoo': 'barbar',
           'foofoofoo': 'barbarbar'}}

"I am new to Python"
Whoops. Don't bother with my answer then.
"I want a list that contains all lines starting with a '+' between .param and .model and another list that contains all lines starting with a '+' after model until the end."
import itertools as it
import pprint
data = [
    '***',
    '**',
    '.param',
    '+foo = bar',
    '+foofoo = barbar',
    '+foofoofoo = barbarbar',
    '.model',
    '+spam = eggs',
    '+spamspam = eggseggs',
    '+spamspamspam = eggseggseggs',
]
results = [
    list(group) for key, group in it.groupby(data, lambda s: s.startswith('+'))
    if key
]
pprint.pprint(results)
print('-' * 20)
print(results[0])
print('-' * 20)
pprint.pprint(results[1])
--output:--
[['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar'],
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']]
--------------------
['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar']
--------------------
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']
This thing here:
it.groupby(data, lambda x: x.startswith('+'))
...tells python to create groups from the strings according to their first character. If the first character is a '+', then the string gets put into a True group. If the first character is not a '+', then the string gets put into a False group. However, there are more than two groups because consecutive False strings will form a group, and consecutive True strings will form a group.
Based on your data, the first three strings:
***
**
.param
will create one False group. Then, the next strings:
+foo = bar
+foofoo = barbar
+foofoofoo = barbarbar
will create one True group. Then the next string:
'.model'
will create another False group. Then the next strings:
'+spam = eggs'
'+spamspam = eggseggs'
'+spamspamspam = eggseggseggs'
will create another True group. The result will be something like:
{
    False: [strs here],
    True: [strs here],
    False: [strs here],
    True: [strs here]
}
Then it's just a matter of picking out each True group: if key, and then converting the corresponding group to a list: list(group).
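The alternating groups can be printed directly. This short sketch uses a shortened copy of the data so the output stays small; the variable name groups is only for illustration.

```python
import itertools as it

data = ['***', '**', '.param', '+foo = bar', '+foofoo = barbar', '.model', '+spam = eggs']

# Materialise each group right away; groupby() yields lazy iterators.
groups = [(key, list(group)) for key, group in it.groupby(data, lambda s: s.startswith('+'))]
for key, group in groups:
    print(key, group)
```

Each run of consecutive items with the same key becomes its own group, which is why False and True each appear twice.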
Response to comment:
"where exactly does python go through data, like how does it know s is the data it's iterating over?"
groupby() works like do_stuff() below:
def do_stuff(items, func):
    for item in items:
        print(func(item))
#Create the arguments for do_stuff():
data = [1, 2, 3]
def my_func(x):
    return x + 100
#Call do_stuff() with the proper argument types:
do_stuff(data, my_func) #Just like when calling groupby(), you provide some data
                        #and a function that you want applied to each item in data
--output:--
101
102
103
Which can also be written like this:
do_stuff(data, lambda x: x + 100)
lambda creates an anonymous function, which is convenient for simple functions which you don't need to refer to by name.
This list comprehension:
[
    list(group)
    for key, group in it.groupby(data, lambda s: s.startswith('+'))
    if key
]
is equivalent to this:
results = []
for key, group in it.groupby(data, lambda s: s.startswith('+') ):
    if key:
        results.append(list(group))
It's clearer to write the for loop explicitly; however, list comprehensions usually execute somewhat faster. Here is some detail:
[
    list(group) #The item you want to be in the results list for the current iteration of the loop here:
    for key, group in it.groupby(data, lambda s: s.startswith('+')) #A for loop
    if key #Only include the item for the current loop iteration in the results list if key is True
]

I would suggest doing things step by step.
1) Grab every word from the array separately.
2) Grab the first letter of the word.
3) Look if that is a '+' or '.'
Example code:
import re
class Dark():
    def __init__(self):
        # Array
        x = ['+Hello', '.World', '+Hobbits', '+Dwarves', '.Orcs']
        xPlus = []
        xDot = []
        # Values
        i = 0
        # Look through every word in the array one by one.
        while (i != len(x)):
            # Grab every word (s), and convert to string (y).
            s = x[i:i+1]
            y = '\n'.join(s)
            # Print word
            print(y)
            # Grab the first letter.
            letter = y[:1]
            if (letter == '+'):
                xPlus.append(y)
            elif (letter == '.'):
                xDot.append(y)
            else:
                pass
            # Add +1
            i = i + 1
        # Print lists
        print(xPlus)
        print(xDot)
#Run class
Dark()
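The same three steps can also be sketched with a plain for loop and str.startswith(), without a class or a manual counter. The function name split_by_prefix is hypothetical.

```python
def split_by_prefix(items):
    """Return (plus_items, dot_items) based on each item's first character."""
    plus_items, dot_items = [], []
    for item in items:
        if item.startswith('+'):
            plus_items.append(item)
        elif item.startswith('.'):
            dot_items.append(item)
    return plus_items, dot_items


plus, dot = split_by_prefix(['+Hello', '.World', '+Hobbits', '+Dwarves', '.Orcs'])
print(plus)
print(dot)
```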
