Drop Duplicate Substrings from String with NO Spaces - python

Given a Pandas DF column that looks like this:
...how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be at most 4 characters long, I can't rely on that. I want to be able to take a string like ABCDEFGHIJABCDEFGHIJ and still turn it into ABCDEFGHIJ in a single column calculation, preferably WITHOUT for-looping/iterating through the rows.

You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, as below:
import pandas as pd
df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
            0
0         ZOM
1          ZM
2        SOFI
3  ABCDEFGHIJ
4         NaN  # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures a word
\1 refers back to the text captured by the first group, (\w+)
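Since str.extract yields NaN for values that are not an exact doubling, here is a minimal sketch of the same pattern with the plain re module that keeps such values unchanged instead. The names DUP and dedupe are made up for illustration, and the assumption is that non-duplicated strings should pass through as-is.

```python
import re

# Same pattern as above: a word made of one chunk immediately repeated.
DUP = re.compile(r'\b(\w+)\1\b')

def dedupe(symbol):
    """Return the repeated half if the whole string is a doubling, else the input."""
    m = DUP.fullmatch(symbol)
    return m.group(1) if m else symbol

symbols = ['ZOMZOM', 'ZMZM', 'SOFISOFI', 'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']
cleaned = [dedupe(s) for s in symbols]
print(cleaned)
```

On a DataFrame column, something like df['Symbol'].map(dedupe) should apply the same function, though that is still a per-element call under the hood.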

An alternative approach, which does involve iteration but also uses regular expressions: evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for that substring repeated two or more times. If it finds a match, replace it with a single occurrence of the substring.
This first version does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, breaking the loop. Going with the longest substrings first ensures that things like 'AAPLAAPL' leave the double A intact.
import re
def remove_repeated(str):
    for i in range(len(str)):
        substr = str[i:]
        pattern = re.compile(f"({substr}){{2,}}")
        if pattern.search(str):
            return pattern.sub(substr, str)
    return str
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'
To make this more flexible, we can write a helper function that yields all of the substrings in a string, starting with the longest, as a generator expression so we don't have to generate more than we need.
def substrings(str):
    return (str[i:i+l] for l in range(len(str), 0, -1)
                       for i in range(len(str) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(str):
    return (str[i:i+l] for l in range(len(str)//2, 0, -1)
                       for i in range(len(str) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
Now, a little tweak to the original function:
def remove_repeated(str):
    for s in substrings(str):
        pattern = re.compile(f"({s}){{2,}}")
        if pattern.search(str):
            return pattern.sub(s, str)
    return str
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'
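One caveat with building the pattern straight from the substring: if the input can contain regex metacharacters (such as . or parentheses), the pattern may match more than intended or fail to compile. A hedged sketch of a variant that escapes the candidate with re.escape follows. The name remove_repeated_safe is made up here, and the helper is repeated so the snippet runs on its own.

```python
import re

def substrings(text):
    # Candidate substrings, longest first, at most half the input length.
    return (text[i:i + l]
            for l in range(len(text) // 2, 0, -1)
            for i in range(len(text) - l + 1))

def remove_repeated_safe(text):
    for s in substrings(text):
        # re.escape makes every character in s match literally.
        pattern = re.compile(f"(?:{re.escape(s)}){{2,}}")
        if pattern.search(text):
            # A function replacement avoids backslash processing in s.
            return pattern.sub(lambda m: s, text)
    return text

print(remove_repeated_safe('fooAAPLAAPLbar'))
print(remove_repeated_safe('(x(x'))
```

For purely alphanumeric ticker symbols the escaping makes no difference, so this only matters if the column may hold other characters.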

Related

A translator that replaces vowels with a string

For those that don't know, replacing vowels with 'ooba' has become a popular trend on https://reddit.com/r/prequelmemes . I would like to automate this process by making a program with Python 2.7 that replaces vowels with 'ooba'. I have no idea where to get started.
You could use a simple regular expression:
import re
my_string = 'Hello!'
my_other_string = re.sub(r'[aeiou]', 'ooba', my_string)
print(my_other_string) # Hooballooba!
The following method is fine if the line is short; otherwise I would prefer using a regex. It assumes that your text is in s.
s = ''.join(['ooba' if i in ['a', 'e', 'i', 'o', 'u'] else i for i in s])
Regex approach:
import re
s = re.sub(r'a|e|i|o|u', "ooba", s)
For a quick and simple answer, you could feed the string meme into this:
result = ''
for c in meme:
    if c in 'aeiou':
        result += 'ooba'
    else:
        result += c
It goes over each character in the string and checks whether it is a vowel. If it is, 'ooba' is appended to the result in its place; otherwise the character is copied unchanged. (Strings are immutable, so a slice of meme cannot be assigned to in place.)
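The answers above only replace lowercase vowels. If uppercase vowels should be replaced too (an assumption, since the question doesn't say), a minimal sketch using the re.IGNORECASE flag could look like this. The function name ooba is made up for illustration.

```python
import re

def ooba(text):
    # Replace every vowel, regardless of case, with 'ooba'.
    return re.sub(r'[aeiou]', 'ooba', text, flags=re.IGNORECASE)

print(ooba('Hello!'))
print(ooba('Apple'))
```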

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz)
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?
To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']
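For comparison, here is a rough sketch of a different technique that avoids regex altogether: a single pass with a stack of character buffers, one per open parenthesis. The name texts_in_parens is hypothetical. Like the function above, it only reports groups that are actually closed, though the order of results can differ when there are sibling groups.

```python
def texts_in_parens(text):
    found = []
    stack = []  # one character buffer per currently open '('
    for ch in text:
        if ch == '(':
            stack.append([])
        elif ch == ')':
            if stack:  # ignore a stray ')'
                found.append(''.join(stack.pop()))
        elif stack:
            # Ordinary character: belongs to the innermost open group.
            stack[-1].append(ch)
    return found

print(texts_in_parens("uuv(wh(x(yz)))"))
print(texts_in_parens("a(b(c)de)"))
```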
Would a string split work instead of a regex?
>>> s = 'uv(wh(x(yz))'
>>> match = [''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
'uv'
You could pop off the first element because if it was contained in a parenthesis, the first position would be blank, which you wouldn't want and if it wasn't blank that means it wasn't in the parenthesis so again, you wouldn't want it.
Since that wasn't flexible enough, something like this would work:
import re
def match(string):
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']
Using regex, a pattern such as this might work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
Your current pattern escapes the ( ) and doesn't treat them as a capture group.
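Well, if you know how to convert a PHP (PCRE) regex to Python, you can use this recursive pattern. Note that Python's built-in re module does not support the recursion construct (?R), so it will not work there as written:
\(((?>[^()]+)|(?R))*\)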

Eliminating last element in array

So I am working on a small hangman text based game.
The problem I am currently dealing with is calling random words from my text file. Each word has one additional character for a new line (\n).
For instance, running through my function that separates a string's letters into individual elements I get something to the effect of:
from text file: guess
answer = arrange_word(guess)
>>>>> ['g', 'u', 'e', 's', 's', '\n']
However, when joining the array back together, the following is shown:
print ''.join(answer)
>>>>> guess
As you can see, it is a bit difficult to guess an element that does not show up.
For clarity here is my function for arrange_word:
def arrange_word(word):
    ##########
    # This collects the mystery word and breaks it into an array of
    # individual letters.
    ##########
    word_length = len(word)
    break_up = ["" for x in range(word_length)]
    for i in range(0, word_length):
        break_up[i] = word[i]
    return break_up
What I am stuck on is that when trying to guess letters, the \n is impossible to guess. The win condition of my game is based on the guess being identical to the answer word. However the \n keeps that from working because they are of different length.
These answer arrays are of different length as well, since I am just pulling random lines from a text file of ~1000 words. After hours of searching I cannot seem to find out how to drop the last element of an array.
For this line here:
word_length = len(word)
Before you take the length, what you can do is this first:
word = word.strip()
Explanation:
strip removes leading and trailing whitespace.
>>> s = "bob\n"
>>> s
'bob\n'
>>> s.strip()
'bob'
With all this in mind, you don't need the rest of this code anymore:
word_length = len(word)
break_up = ["" for x in range(word_length)]
for i in range(0, word_length):
break_up[i] = word[i]
Applying strip gives you the word without the trailing whitespace character. After that, all you need to do to get a list of characters is:
>>> s = "bob"
>>> list(s)
['b', 'o', 'b']
So your method can now simply be:
def arrange_word(word):
    return list(word.strip())
Demo:
arrange_word("guess")
Output:
['g', 'u', 'e', 's', 's']
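Since the words come from a text file, the stripping can also happen once, when the file is read. A minimal sketch, assuming one word per line: load_words is a made-up name, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io
import random

def load_words(handle):
    # Strip the trailing newline from each line and skip blank lines.
    return [line.strip() for line in handle if line.strip()]

# Stand-in for open('words.txt'); the real file name is not given in the question.
fake_file = io.StringIO("guess\nhangman\n\npython\n")
words = load_words(fake_file)

answer = list(random.choice(words))
print(words)
print(answer)
```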
All these answers are fine for specifically stripping whitespace characters from a string, but more generally, Python lists implement standard stack/queue operations, and you can make your word into a list just by calling the list() constructor without needing to write your own function:
In [38]: letters = list('guess\n')
    ...: letters.pop()
    ...: letters
Out[38]: ['g', 'u', 'e', 's', 's']
Use list slicing:
arr = [1,2,3,4]
print(arr[:-1:])
Slicing syntax is [start:stop:step] (a step of 2, for example, takes every second element). arr[:-1:] therefore starts at the beginning of the list and stops before the last element, taking every element. So in your case you could write:
return break_up[:-1:]
You can access the last element with index -1:
guess[-1]
and you can delete it with:
del guess[-1]
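The list-based answers differ in whether they modify the original list. A small sketch of the difference, with variable names chosen just for illustration:

```python
letters = list('guess\n')

# Slicing builds a new list and leaves the original untouched.
trimmed = letters[:-1]
print(trimmed, len(letters))

# pop() removes the last element in place and returns it.
removed = letters.pop()
print(repr(removed), letters)
```

del letters[-1] behaves like pop() here, except that it does not return the removed element.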
Just strip the word:
word = 'guess\n'
word = word.strip() ## 'guess' without new line character, or spaces
Maybe first line of your arrange_word function should be
word = word.strip()
to remove all leading/trailing whitespace characters.

Longest repeating substring using for-loops and if-statements

I'm in an introductory level programming class that teaches python. I was introduced to a longest repeating substring problem for a project and I can't seem to crack it. I've looked on here for a solution, but I haven't learned suffix trees yet so I wouldn't be able to use them. So far, I've gotten here:
msg = "kalhfdlakdhfklajdf" (anything)
for i in range(len(msg)):
    if msg[i] == msg[i + 1]:
        reps.append(msg[i])
What this does is scan my string, msg, and check to see if the counter matches the next character in sequence. If the characters match, it appends msg[i] to the list "reps". My problem is that:
a) The function I created always appends one less than repetition amount, and
b) my function program always crashes due to msg[i+1] going out of bounds once it reaches the last spot on the list.
In essence, I want my program to find repeats, append them to a list where the highest repeating character is counted and returned to the user.
You need to use len(msg)-1 as your range, but your condition will still omit one character. To get rid of that, you can add another condition that checks the preceding character too.
With your condition you'll have 8 h's in reps even though there are 9 in msg:
>>> msg = "kalhfdlakdhhhhhhhhhfklajdf"
>>> reps = []
>>> for i in range(len(msg)-1):
...     if msg[i] == msg[i + 1]:
...         reps.append(msg[i])
...
>>> reps
['h', 'h', 'h', 'h', 'h', 'h', 'h', 'h']
And with another condition :
>>> reps=[]
>>> for i in range(len(msg)-1):
...     if msg[i] == msg[i + 1] or msg[i] == msg[i - 1]:
...         reps.append(msg[i])
...
>>> reps
['h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h']
For the groupby answer I alluded to on @Kasra's excellent response:
from itertools import groupby
msg = "kalhfdlakdhhhhhhhhhfklajdf"
maxcount = 0
for substring in groupby(msg):
    lett, count = substring[0], len(list(substring[1]))
    if count > maxcount:
        maxcountlett = lett
        maxcount = count
result = [maxcountlett] * maxcount
But note that this only works for substrings of length 1. msg = 'hahahaha' should give ['ha', 'ha', 'ha', 'ha'] by my understanding.
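For repeated units longer than one character, a brute-force sketch using only for-loops, a while-loop and if-statements might look like the following. The name longest_repeated_run is made up, and "longest" is assumed to mean the consecutive repetition that covers the most characters, with ties going to the shorter unit.

```python
def longest_repeated_run(msg):
    """Return (unit, count) for the consecutive repetition covering the most characters."""
    best_unit, best_count = '', 0
    n = len(msg)
    for size in range(1, n // 2 + 1):           # length of the repeated unit
        for start in range(n - 2 * size + 1):   # where the first copy begins
            unit = msg[start:start + size]
            count = 1
            pos = start + size
            # Count how many copies of unit follow back to back.
            while msg[pos:pos + size] == unit:
                count += 1
                pos += size
            if count > 1 and size * count > len(best_unit) * best_count:
                best_unit, best_count = unit, count
    return best_unit, best_count

unit, count = longest_repeated_run('hahahaha')
print([unit] * count)
```

This tries every start position and unit length, so it is slow on long strings, but it avoids suffix trees.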
a) Think about what is happening when it makes the first match.
For example, given abcdeeef it sees that msg[4] matches msg[5]. It then goes and appends msg[4] to reps. Then msg[5] matches msg[6] and it appends msg[5] to reps. However, msg[6] does not match msg[7] so it does not append msg[6]. You are one short.
In order to fix this you need to append one extra for each string of matches. A good way to do this is to check whether the character you're currently matching already exists in reps. If it does, append only the current one. If it does not, append it twice.
if msg[i] == msg[i+1]:
    if msg[i] in reps:
        reps.append(msg[i])
    else:
        reps.append(msg[i])
        reps.append(msg[i])
b) You need to ensure that you do not exceed your boundaries. This can be accomplished by taking 1 off of your range.
for i in range(len(msg)-1):
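Putting both points together, one possible sketch tracks the length of the current run instead of appending to reps as it goes, and only builds the list at the end. The function name longest_char_run is hypothetical, and it handles single-character repeats only.

```python
def longest_char_run(msg):
    if not msg:
        return []
    best_char, best_len = msg[0], 1
    run_len = 1
    # Stop one short of the end so msg[i + 1] is always valid.
    for i in range(len(msg) - 1):
        if msg[i] == msg[i + 1]:
            run_len += 1
        else:
            run_len = 1
        if run_len > best_len:
            best_char, best_len = msg[i + 1], run_len
    return [best_char] * best_len

print(longest_char_run('abcdeeef'))
```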

Python, splitting strings on middle characters with overlapping matches using regex

In Python, I am using regular expressions to retrieve strings from a dictionary which show a specific pattern, such as having some repetitions of characters, then a specific character, and then another repetitive part (e.g. ^(\w{0,2})o(\w{0,2})$).
This works as expected, but now I'd like to split the string into two substrings (one of which might be empty) using the central character as delimiter. The issue I am having stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).
Neither re.search().groups() nor re.findall() solved this issue, and the docs on the re module seem to point out that overlapping matches would not be returned by these methods.
Here is a snippet showing the undesired behaviour:
import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile(r'^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
    matches = regex.findall(word)
    if matches:
        halves.append(matches)
I am posting this as an answer mainly so as not to leave the question unanswered in case someone stumbles here in the future. I've managed to reach the desired behaviour, albeit probably not in a very pythonic way, so this might be useful as a starting point for someone else. Notes on how to improve this answer (i.e. making it more "pythonic" or simply more efficient) would be very welcome.
The only way I found of getting all the possible splits of words whose length is in a certain range and which have a given character in a certain range of positions, using the characters in the "legal" positions as delimiters, with either the re or the new regex module, involves using multiple regexes. This snippet builds an appropriate regex at runtime, given the length range of the word, the character to be sought and the range of possible positions of that character.
import re
dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
char = 'o'
word_len = (3, 6)
char_pos = (2, 3)
# Parentheses let the expression continue across lines.
regex_str = (r'(?=^\w{' + str(word_len[0]) + ',' + str(word_len[1]) + r'}$)(?=\w{'
             + str(char_pos[0]-1) + ',' + str(char_pos[1]-1) + '}' + char + ')')
halves = []
for word in dictionary:
    matches = re.match(regex_str, word)
    if matches:
        matched_halves = []
        for pos in range(char_pos[0]-1, char_pos[1]):
            split_regex_str = r'(?<=^\w{' + str(pos) + '})' + char
            split_word = re.split(split_regex_str, word)
            if len(split_word) == 2:
                matched_halves.append(split_word)
        halves.append(matched_halves)
The output is:
[[['r', 'om'], ['ro', 'm']], [['r', 'am']], [['fl', 'w']], [['d', 'or'], ['do', 'r']], [['f', 'r']], [['f', 'o'], ['fo', '']], [['s', 'rrow']]]
At this point I might start considering using a regex just to find the words to be split and then doing the splitting in a 'dumb way', just checking whether the characters in the range of positions are equal to char. Anyhow, any remark is extremely appreciated.
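The 'dumb' splitting mentioned above could look roughly like this. It is only a sketch: split_on_char is a made-up name, the defaults mirror the values used in the snippet, and it skips the \w check that the regex performs, so it assumes the words contain only word characters.

```python
def split_on_char(word, char='o', word_len=(3, 6), char_pos=(2, 3)):
    # Reject words whose length is outside the allowed range.
    if not word_len[0] <= len(word) <= word_len[1]:
        return []
    halves = []
    # char_pos is 1-based and inclusive; convert to 0-based indices.
    for pos in range(char_pos[0] - 1, char_pos[1]):
        if pos < len(word) and word[pos] == char:
            halves.append([word[:pos], word[pos + 1:]])
    return halves

for w in ('room', 'foo', 'window', 'sorrow'):
    print(w, split_on_char(w))
```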
EDIT: Fixed.
Does a simple while loop work?
What you want is re.search in a loop, shifting the start position by 1 each time:
https://docs.python.org/2/library/re.html
>>> dictionary = ('room', 'door', 'window', 'desk', 'for')
>>> regex = re.compile('(\w{0,2})o(\w{0,2})')
>>> halves = []
>>> for word in dictionary:
...     start = 0
...     while start < len(word):
...         match = regex.search(word, start)
...         if match:
...             start = match.start() + 1
...             halves.append([match.group(1), match.group(2)])
...         else:
...             # no matches left
...             break
...
>>> print halves
[['ro', 'm'], ['o', 'm'], ['', 'm'], ['do', 'r'], ['o', 'r'], ['', 'r'], ['nd', 'w'], ['d', 'w'], ['', 'w'], ['f', 'r'], ['', 'r']]
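Another common way to collect overlapping matches is to wrap the whole pattern in a lookahead, so each match consumes no characters and the scan moves on one position at a time. A minimal sketch with the same unanchored pattern as the loop above; overlapping_halves is a made-up name.

```python
import re

# The lookahead matches at a position without consuming any text,
# so matches starting at adjacent positions are all reported.
OVERLAP = re.compile(r'(?=(\w{0,2})o(\w{0,2}))')

def overlapping_halves(word):
    return [[m.group(1), m.group(2)] for m in OVERLAP.finditer(word)]

print(overlapping_halves('room'))
print(overlapping_halves('desk'))
```

As with the while loop, this pattern is not anchored to the whole word, so the results include partial splits such as ['o', 'm'].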
