Remove repeated characters in a word - python

I am trying to clean text for a data set, and many of the words are misspelled. For example, I often see the word "hellllo". I wish to remove repeated characters where a character is repeated more than twice in a row. Obviously this will not work with words such as "nooooo", because that will be converted to "noo", but I already have functions written to handle this. All I want to do is convert words such as "hellllo" to "hello".

Here is a generic function that takes the maximum number of repetitions allowed:
def remove_multiple(s, n=2):
    '''
    param s: string
    param n: maximum number of consecutive repetitions allowed in the string
    '''
    if n < 0:
        return
    elif n == 1:
        # note: this drops every duplicate in the string, not only adjacent ones
        return ''.join(sorted(set(s), key=s.index))
    elif n > 1:
        temp = []
        temps = s + ' '*n
        for i, c in enumerate(s):
            # keep c unless it and the next n characters are all identical
            if len(set(temps[i:n+1+i])) > 1:
                temp.append(c)
        return "".join(temp)
>>> remove_multiple('helllllllllllllllooooooooooooooo', 2)
'helloo'
>>> remove_multiple('helllllllllllllllooooooooooooooo', 5)
'helllllooooo'
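For comparison, the same capping of consecutive runs can be sketched with itertools.groupby, which groups adjacent equal characters. This is an alternative technique, not the function above; the name cap_runs is made up for illustration, and n is assumed to be a non-negative integer.

```python
from itertools import groupby, islice

def cap_runs(s, n=2):
    """Keep at most n characters from each run of identical adjacent characters."""
    # groupby yields (character, iterator over that run) for each consecutive run;
    # islice keeps only the first n characters of each run.
    return "".join("".join(islice(run, n)) for _, run in groupby(s))

print(cap_runs("hellllo"))                              # hello
print(cap_runs("helllllllllllllllooooooooooooooo", 5))  # helllllooooo
```

Because only adjacent characters are grouped, a string such as "abab" is left unchanged for any n of 1 or more.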

Related

How to combine all functions together

I want to make a function that shows the first character, last character, character count, and vowel count of a string, so I wrote this:
> def text(a):
>     result = a[0]
>     return result;
> def text (a):
>     result = a[-1]
>     return result
> def text(a):
>     result= text.len(original_str)
>     return result
> vowels = {'a','e','i','o','u'}
> text = "a"
> for a in my_string:
>     if a in vowels:
>         len(a)
> Text("American")
Expected output:
first_char = A
Last_char = n
num_char = 8
num_vowels = 4
How can I make a single function do all of this when I pass in the text?
I hope you can help.
You don't want to write a different function each time. Just include all of the things you want to find inside one function.
def Text(a):
    first_char = a[0]
    last_char = a[-1]
    num_char = len(a)
    num_vowels = sum([l in "aeiou" for l in a.lower()])
    print("first_char:", first_char)
    print("last_char:", last_char)
    print("num_char:", num_char)
    print("num_vowels:", num_vowels)
Text("American")
Output:
first_char: A
last_char: n
num_char: 8
num_vowels: 4
It looks like you're quite new to python, so I've made the following longer than necessary to allow for comments and clarity. The key point is to identify all the things you want your function to return and then make sure you get them, then return everything at once.
def text(a):
    first = a[0]                            # Store first element in string
    last = a[-1]                            # Store last element in string
    length = len(a)                         # Store length of string
    nVowel = 0                              # Variable to count vowels
    vowels = {'a','e','i','o','u'}          # Set of vowels
    for s in a.lower():                     # Loop over all characters in lower case
        if s in vowels:                     # Check if character is a vowel
            nVowel += 1                     # Add to count if vowel
    return (first, last, length, nVowel)    # Return results
print(text("American"))
This can be made significantly shorter, but I'll leave that to you as an opportunity to practice and understand what's going on.
You can execute them all in one function and simply return all the values together. The main problem with your original code is that, besides not working well to begin with, it keeps overwriting your function: each new def text replaces the previous one. Note that the braces {} around your vowels create a set rather than a list; that is valid Python and works fine for membership tests, so it is not the cause of the problem.
def Text(string):
    # Calculate
    first = string[0]
    last = string[-1]
    length = len(string)
    vowels = sum([string.lower().count(v) for v in ["a", "e", "i", "o", "u"]])
    # Return
    return first, last, length, vowels
first_char, last_char, num_char, num_vowels = Text("American") # Or for any other string
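If labelled results are more convenient than a bare tuple, one possible variation is to return a dictionary keyed by the names used in the expected output. This is only a sketch: the function name describe is hypothetical, and it assumes the input string is non-empty.

```python
def describe(text):
    """Return four facts about text in one dictionary (assumes text is non-empty)."""
    return {
        "first_char": text[0],
        "last_char": text[-1],
        "num_char": len(text),
        # count characters whose lowercase form is one of a, e, i, o, u
        "num_vowels": sum(ch in "aeiou" for ch in text.lower()),
    }

for name, value in describe("American").items():
    print(name, "=", value)
```

The loop prints one "name = value" line per entry, in the order the keys were written.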

Remove string character after run of n characters in string

Suppose you have a given string and an integer, n. Every time a character appears in the string more than n times in a row, you want to remove some of the characters so that it only appears n times in a row. For example, for the case n = 2, we would want the string 'aaabccdddd' to become 'aabccdd'. I have written this crude function that compiles without errors but doesn't quite get me what I want:
def strcut(string, n):
    for i in range(len(string)):
        for j in range(n):
            if i + j < len(string)-(n-1):
                if string[i] == string[i+j]:
                    beg = string[:i]
                    ends = string[i+1:]
                    string = beg + ends
    print(string)
These are the outputs for strcut('aaabccdddd', n):
| n | output | expected |
|---|--------|----------|
| 1 | 'abcdd' | 'abcd' |
| 2 | 'acdd' | 'aabccdd' |
| 3 | 'acddd' | 'aaabccddd' |
I am new to python but I am pretty sure that my error is in line 3, 4 or 5 of my function. Does anyone have any suggestions or know of any methods that would make this easier?
This may not answer why your code does not work, but here's an alternate solution using regex:
import re
def strcut(string, n):
    return re.sub(fr"(.)\1{{{n-1},}}", r"\1"*n, string)
How it works: First, the pattern formatted is "(.)\1{n-1,}". If n=3 then the pattern becomes "(.)\1{2,}"
(.) is a capture group that matches any single character
\1 matches the first capture group
{2,} matches the previous token 2 or more times
The replacement string is the first capture group repeated n times
For example: str = "aaaab" and n = 3. The first "a" is the capture group (.). The next 3 "aaa" matches \1{2,} - in this example a{2,}. So the whole thing matches "a" + "aaa" = "aaaa". That is replaced with "aaa".
regex101 can explain it better than me.
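To see which pattern the f-string actually builds for each n, here is a small runnable check. It repeats the same substitution as the answer above and only adds a loop that prints the pattern next to the result.

```python
import re

def strcut(string, n):
    # A character followed by at least n-1 copies of itself
    # is replaced by exactly n copies.
    return re.sub(fr"(.)\1{{{n-1},}}", r"\1" * n, string)

for n in (1, 2, 3):
    pattern = fr"(.)\1{{{n-1},}}"  # doubled braces become literal { and }
    print(n, pattern, strcut("aaabccdddd", n))
# 1 (.)\1{0,} abcd
# 2 (.)\1{1,} aabccdd
# 3 (.)\1{2,} aaabccddd
```

One caveat: by default "." does not match a newline, so runs of newline characters are left untouched unless re.DOTALL is passed.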
You can implement this with a stack.
The idea is to push characters onto the stack while keeping a counter for the current run. If the new character is the same as the one on top of the stack, push it and increase the counter only if the counter would stay within the limit n; otherwise skip it. If the new character differs from the one on top, push it and reset the counter to 1.
def func(string, n):
    stack = []
    counter = None
    for i in string:
        if not stack:
            counter = 1
            stack.append(i)
        elif stack[-1]==i:
            if counter+1<=n:
                stack.append(i)
                counter+=1
        elif stack[-1]!=i:
            stack.append(i)
            counter = 1
    return ''.join(stack)
print(func('aaabbcdaaacccdsdsccddssse', 2)=='aabbcdaaccdsdsccddsse')
print(func('aaabccdddd',1 )=='abcd')
print(func('aaabccdddd',2 )=='aabccdd')
print(func('aaabccdddd',3 )=='aaabccddd')
Output:
True
True
True
True
The method I would use is to start with an empty output string and, whenever a run exceeds n characters in the input, simply not append the extra characters to the output. This is efficient because it makes a single pass over the input, so the work is linear in the length of the string:
def strcut(string, n):
    new_string = ""
    first_c, s = string[:1], 0  # current run character and run length; string[:1] is '' for empty input
    for c in string:
        if c != first_c:
            first_c, s = c, 0
        s += 1
        if s > n:
            continue
        else:
            new_string += c
    return new_string
print(strcut("aabcaaabbba", 2))  # output: aabcaabba
Simply, to answer the question about a character that
appears in the string more than n times in a row
the following code is small and simple, and will work fine :-)
def strcut(string: str, n: int) -> str:
    tmp = "*" * (n+1)
    for char in string:
        # append char unless the last n characters are already n copies of it
        if tmp[len(tmp) - n:] != char * n:
            tmp += char
    return tmp[n+1:]
print(strcut("aaabccdddd", 1))
print(strcut("aaabccdddd", 2))
print(strcut("aaabccdddd", 3))
Output:
abcd
aabccdd
aaabccddd
Notes:
The character "*" in the line tmp = "*" * (n+1) can be any character that is not in the string; it's just a placeholder to handle the start case when there are no characters yet.
The return tmp[n+1:] line simply removes the "*" characters added at the beginning.
You don't need nested loops. Keep track of the current character and its count. Include a character when the count is less than or equal to n, and reset the current character and count when the character changes.
def strcut(s, n):
    result = ''                  # resulting string
    char, count = '', 0          # initial character and count
    for c in s:                  # only loop once on the characters
        if c == char:
            count += 1           # increase count
        else:
            char, count = c, 1   # reset character/count
        if count <= n:
            result += c          # include character if count is ok
    return result
Just to give some ideas, here is a different approach. I didn't like how the inner loop over n restarts each time: even at i=3 with n=2, the outer loop still moves on to i=4 although that character was already checked while going through n. And since you are checking the next n characters in the string, your method doesn't fit well with keeping the string in order. Here is a rough method that I find easier to read. Note that it uses string.count, which counts all occurrences of a character rather than consecutive ones, so it only gives the expected result when each character's repeats are contiguous, as in the example.
def strcut(string, n):
    for i in range(len(string)-1, 0, -1):  # I go backwards assuming you want to keep the front characters
        if string.count(string[i]) > n:
            string = remove(string, i)
    print(string)
def remove(string, i):
    if i > len(string):
        return string[:i]
    return string[:i] + string[i+1:]
strcut('aaabccdddd',2)

find first not repeating character

Problem:
Given a string s consisting of small English letters, find and return the first instance of a non-repeating character in it. If there is no such character, return '_'.
Example
For s = "abacabad", the output should be
firstNotRepeatingCharacter(s) = 'c'.
This is what I have so far, but it's too slow. How can I make the run time faster? Thanks.
def firstNotRepeatingCharacter(s):
    char = set(s)
    for i in range(len(s)):
        if s.count(s[i]) == 1:
            return s[i]
    return "_"
You could use collections.Counter to count the characters in linear time, and then filter the result in conjunction with next, like this:
from collections import Counter
def firstNotRepeatingCharacter(s):
    counts = Counter(s)
    return next((ch for ch in s if counts[ch] < 2), "_")
print(firstNotRepeatingCharacter("abacabad"))
Output
c
Or simply use a dictionary (no imports needed):
def firstNotRepeatingCharacter(s):
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return next((ch for ch in s if counts[ch] < 2), "_")
Both approaches are linear in the length of the input string. Your current approach calls s.count once per position, and each call scans the whole string, so it is quadratic in the worst case (for example, when no character is unique).

Looking for matching strings of length > 4 between two text files

I'm trying to read in two text files, and then search each for strings that are common between the two, of minimum length 5.
The code I've written:
db = open("list_of_2","r").read()
lp = open("lastpass","r").read()
word = ''
length = 0
for dbchar in db:
    for lpchar in lp:
        if dbchar == lpchar:
            word += str(dbchar)
            length += 1
        else:
            length = 0
            word = ''
        if length > 4:
            print(word)
The code currently prints strings like '-----' and '55555', over and over and doesn't seem to break the loop (these particular strings only appear in lp once). I also don't believe it's finding strings that aren't just the same character repeated.
How do I alter the code to:
Only make it run through and print each occurrence once, and
Not just find strings of the same character repeated?
Edit: Here are some mock text files. In these, the string 'ghtyty' appears in file1 three times, and in file2 4 times. The code should print 'ghtyty' to console once.
file1
file2
I would suggest a different approach. Split the files into words and keep only words 5 characters or greater. Use sets to find the intersection--this will be faster.
db_words = set([x for x in db.split() if len(x) > 4])
lp_words = set([x for x in lp.split() if len(x) > 4])
matches = db_words & lp_words
If you want to exclude words of all same character, you can define the list comprehension like this:
[x for x in db.split() if len(x) > 4 and x != x[0]*len(x)]
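Here is a minimal runnable sketch of the word-set approach. The two strings are hypothetical stand-ins for the file contents (the real mock files are not shown here), and the helper name long_words is made up for illustration.

```python
# Hypothetical stand-ins for the contents of the two files.
db = "alpha ghtyty 55555 beta ghtyty ghtyty"
lp = "ghtyty gamma 55555 ghtyty delta ghtyty ghtyty"

def long_words(text, min_len=5):
    """Set of whitespace-separated words with at least min_len characters,
    skipping words made of a single repeated character."""
    return {w for w in text.split() if len(w) >= min_len and w != w[0] * len(w)}

matches = long_words(db) & long_words(lp)
print(matches)  # {'ghtyty'}
```

Because sets hold each word once, 'ghtyty' is reported a single time no matter how often it occurs in either text. This only finds whole whitespace-delimited words, not arbitrary substrings.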
If you are looking for any consecutive sequence of characters that match, this might work better:
i_skip = set() # characters to skip if they are already in a printed word
j_skip = set()
for i in range(len(db)-4):
    if i in i_skip: continue
    for j in range(len(lp)-4):
        if j in j_skip: continue
        if db[i] == lp[j]:
            word_len = 5
            while db[i:i+word_len] == lp[j:j+word_len]:
                # extend only while both texts have a next character and it matches;
                # without the bounds check, a match reaching the end of both texts never terminates
                if (i+word_len < len(db) and j+word_len < len(lp)
                        and db[i+word_len] == lp[j+word_len]):
                    word_len += 1
                else:
                    print(db[i:i+word_len])
                    i_skip.update(range(i, i+word_len))
                    j_skip.update(range(j, j+word_len))
                    break

Most common character in a string

Write a function that takes a string consisting of alphabetic
characters as input argument and returns the most common character.
Ignore white spaces i.e. Do not count any white space as a character.
Note that capitalization does not matter here i.e. that a lower case
character is equal to an upper case character. In case of a tie between
certain characters return the last character that has the most count.
This is the updated code:
def most_common_character (input_str):
    input_str = input_str.lower()
    new_string = "".join(input_str.split())
    print(new_string)
    length = len(new_string)
    print(length)
    count = 1
    j = 0
    higher_count = 0
    return_character = ""
    for i in range(0, len(new_string)):
        character = new_string[i]
        while (length - 1):
            if (character == new_string[j + 1]):
                count += 1
            j += 1
            length -= 1
            if (higher_count < count):
                higher_count = count
    return (character)
#Main Program
input_str = input("Enter a string: ")
result = most_common_character(input_str)
print(result)
The above is my code. I am getting a "string index out of bounds" error and I can't understand why. Also, the code only checks the occurrences of the first character; I am confused about how to proceed to the next character and keep the maximum count.
The error I get when I run my code:
> Your answer is NOT CORRECT Your code was tested with different inputs.
> For example when your function is called as shown below:
>
> most_common_character ('The cosmos is infinite')
>
> ############# Your function returns #############
> e
> The returned variable type is: type 'str'
>
> ######### Correct return value should be ########
> i
> The returned variable type is: type 'str'
>
> ####### Output of student print statements ######
> thecosmosisinfinite
> 19
You can use a regex pattern to search for all characters. \w matches any alphanumeric character and the underscore; for ASCII text this is equivalent to the set [a-zA-Z0-9_]. The + after [\w] means to match one or more repetitions.
Finally, you use Counter to total them and most_common(1) to get the top value. See below for the case of a tie.
from collections import Counter
import re
s = "Write a function that takes a string consisting of alphabetic characters as input argument and returns the most common character. Ignore white spaces i.e. Do not count any white space as a character. Note that capitalization does not matter here i.e. that a lower case character is equal to a upper case character. In case of a tie between certain characters return the last character that has the most count"
>>> Counter(c.lower() for c in re.findall(r"\w", s)).most_common(1)
[('t', 46)]
In the case of a tie, it is a little more tricky.
def top_character(some_string):
    # r"\w" (without +) yields single characters rather than whole words
    joined_characters = re.findall(r"\w", some_string.lower())
    d = Counter(joined_characters)
    top_count = max(d.values())
    top_characters = [c for c, n in d.most_common() if n == top_count]
    if len(top_characters) == 1:
        return top_characters[0]
    # on a tie, return the tied character that appears last in the string
    reversed_characters = joined_characters[::-1]
    for c in reversed_characters:
        if c in top_characters:
            return c
>>> top_character(s)
't'
>>> top_character('the the')
'e'
In the case of your code above and your sentence "The cosmos is infinite", you can see that 'i' occurs more frequently than 'e' (the output of your function):
>>> Counter(c.lower() for c in "".join(re.findall(r"[\w]+", 'The cosmos is infinite'))).most_common(3)
[('i', 4), ('s', 3), ('e', 2)]
You can see the issue in your code block:
for i in range(0, len(new_string)):
    character = new_string[i]
    ...
return (character)
You are iterating through a sentence and assign that letter to the variable character, which is never reassigned elsewhere. The variable character will thus always return the last character in your string.
Actually, your code is almost correct. You need to move count, j, and length inside your for i in range(0, len(new_string)) loop, because they have to start over on each iteration. Also, when count is greater than higher_count, you need to save that character as return_character and return it instead of character, which is always the last character of your string because of character = new_string[i].
I don't see why you used j+1 and while length-1. After correcting them, it now covers tie situations as well.
def most_common_character (input_str):
    input_str = input_str.lower()
    new_string = "".join(input_str.split())
    higher_count = 0
    return_character = ""
    for i in range(0, len(new_string)):
        count = 0
        length = len(new_string)
        j = 0
        character = new_string[i]
        while length > 0:
            if (character == new_string[j]):
                count += 1
            j += 1
            length -= 1
        if (higher_count <= count):
            higher_count = count
            return_character = character
    return (return_character)
If we ignore the "tie" requirement, collections.Counter() works:
from collections import Counter
from itertools import chain
def most_common_character(input_str):
    return Counter(chain(*input_str.casefold().split())).most_common(1)[0][0]
Example:
>>> most_common_character('The cosmos is infinite')
'i'
>>> most_common_character('ab' * 3)
'a'
To return the last character that has the most count, we could use collections.OrderedDict:
from collections import Counter, OrderedDict
from itertools import chain
from operator import itemgetter
class OrderedCounter(Counter, OrderedDict):
    pass
def most_common_character(input_str):
    counter = OrderedCounter(chain(*input_str.casefold().split()))
    return max(reversed(counter.items()), key=itemgetter(1))[0]
Example:
>>> most_common_character('ab' * 3)
'b'
Note: this solution assumes that max() returns the first character that has the most count (and therefore there is a reversed() call, to get the last one) and all characters are single Unicode codepoints. In general, you might want to use \X regular expression (supported by regex module), to extract "user-perceived characters" (eXtended grapheme cluster) from the Unicode string.
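As a sketch of another reading of the tie rule, the function below counts characters with Counter and then walks the text backwards, returning the first character that reaches the top count. The name most_common_last is hypothetical, and returning an empty string for empty input is an assumption, since the task does not say what should happen then.

```python
from collections import Counter

def most_common_last(text):
    """Most frequent non-whitespace character, case-insensitive.
    On a tie, return the tied character that occurs latest in the text."""
    chars = "".join(text.casefold().split())
    if not chars:
        return ""  # assumption: empty result for empty or all-whitespace input
    counts = Counter(chars)
    best = max(counts.values())
    # Walk backwards and return the first character that has the top count.
    return next(ch for ch in reversed(chars) if counts[ch] == best)

print(most_common_last("The cosmos is infinite"))  # i
print(most_common_last("ab" * 3))                  # b
```

The two readings can disagree: for 'abba' this sketch returns 'a' (the tied character that occurs last), while the OrderedCounter version above returns 'b' (the tied character whose first appearance is latest).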
