How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' out of it, but not just once: I want it repeated as many times as it actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word itself didn't actually occur three times; all of its characters did. So I want to arrange these characters into 'april' as many times as they appear in the string.
My idea is basically to extract words from some random strings, but instead of extracting the word once, to extract every occurrence of it. The characters of each extracted word should be arranged in the order I want.
But there are some annoying conditions: you can't delete all the elements in the list and then just replace them with the word 'april' (you can't replace the whole string with the word 'april'); you can only extract 'april' from the string, not replace it. You also can't delete the list containing the string. Think of the string as very important data: we only want some of it, in order, and we need to delete all the data that doesn't match our "data chain" (the word 'april'). But once you delete the whole string, you lose all the important data. You don't know how to make another one of these "data chains", so you can't just put the word 'april' back in the list.
If anyone knows how to solve my weird problem, please help me out. I am a beginner Python programmer. Thank you!

One way is to use itertools.groupby, which groups consecutive identical characters. Unpacking those groups into zip then yields one tuple per complete word: zip stops after n tuples, where n is the size of the smallest group (i.e. the group with the fewest characters).
from itertools import groupby
text = 'aaaaaaappppppprrrrrriiiiiilll'
result = ''
for each in zip(*[list(g) for k, g in groupby(text)]):
    result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to write a custom counter that counts the length of the first run of each distinct character. (Note that this relies on dictionaries preserving insertion order, which the language guarantees only from Python 3.7; CPython 3.6 preserves it as an implementation detail, and earlier versions do not.)
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts:
            # Count c only while it is still the most recently started character.
            if current == c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return list(counts), min(counts.values())
result = ''
counts = getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])
# result = 'aprilaprilapril'

How about using regex?
import re
word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm
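The pattern above is built from the raw letters of the word, and re.match only matches at the very start of the text. As a hedged sketch (the helper name extract_word is made up for this example), the same idea can be wrapped in a function that escapes each letter with re.escape and uses re.search so that leading noise is skipped. It assumes the word has no repeated adjacent letters; for a word like 'appril' the greedy groups would split the run of p's unevenly.

```python
import re

def extract_word(word, text):
    """Return `word` repeated as many times as its shortest letter run allows."""
    if not word:
        return ""
    # re.escape guards against letters such as "." or "+" that are regex metacharacters.
    pattern = "".join(f"({re.escape(c)}+)" for c in word)
    # re.search scans the whole text, so characters before the first run are skipped.
    match = re.search(pattern, text)
    if match is None:
        return ""
    return word * min(len(g) for g in match.groups())

print(extract_word("april", "aaaaaaappppppprrrrrriiiiiilll"))     # aprilaprilapril
print(extract_word("april", "zzzaaaaaaappppppprrrrrriiiiiilll"))  # aprilaprilapril
```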

Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It first skips any leading characters that come before the first letter of the key. It then uses an outer loop to iterate over the characters of the search key, and an inner while loop that consumes all consecutive occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    # Skip leading characters that precede the first letter of the key.
    while key and left < len(searchString) and searchString[left] != key[0]:
        left += 1
    for i, searchChar in enumerate(key):
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount
Testing:
testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}
key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril

A word can only occur as many times as its least frequent letter allows. To account for the possibility of repeated letters in the word (for example, appril), you need to divide each letter's count by the number of times it appears in the word. Here is one way of doing this using collections.Counter:
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # now get the effective count by dividing the occurrence in string by the
    # occurrence in kernel (floor division keeps only complete copies)
    effective_counter = {
        k: string_counter.get(k, 0) // v
        for k, v in kernel_counter.items()
    }
    # min occurrence of kernel is the min of the effective counter
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
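As a hedged illustration of the counting idea above, here is a compact, self-contained variant together with a usage example. The helper name copies_available is made up for this sketch, it returns the number of copies rather than the repeated word, and it assumes a non-empty word. Note that counting letters this way ignores the order in which they appear in the string.

```python
from collections import Counter

def copies_available(kernel, string):
    """How many full copies of `kernel` the letters of `string` can supply."""
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # Floor-divide each letter's supply by how often the kernel needs it.
    return min(string_counter[k] // v for k, v in kernel_counter.items())

text = "aaaaaaappppppprrrrrriiiiiilll"
print("april" * copies_available("april", text))    # aprilaprilapril
print("appril" * copies_available("appril", text))  # apprilapprilappril
```

In the second call the seven p's only cover three copies, because each copy of 'appril' needs two of them.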

Related

Replace sequence of the same letter with single one

I am trying to replace each run of repeated letters with a single one, but either it is hard or I am totally blocked on how this should be done.
Example input:
aaaabbcddefff
The output should be abcdef
Here is what I was able to do, but I can't get the last part of the string right. I tried different variants, but I am stuck. Can someone help me finish this code?
text = "aaaabbcddefff"
new_string = ""
count = 0
while text:
    for i in range(len(text)):
        l = text[i]
        for n in range(len(text)):
            if text[n] == l:
                count += 1
                continue
            new_string += l
            text = text.replace(l, "", count)
            break
        count = 0
        break

Using regex
re.sub(r"(.)(?=\1+)", "", text)
>>> import re
>>> text = "aaaabbcddefff"
>>> re.sub(r"(.)(?=\1+)", "", text)
'abcdef'

Side note: You should consider building your string up in a list and then joining the list, because it is expensive to append to a string, since strings are immutable.
One way to do this is to check if every letter you look at is equal to the previous letter, and only append it to the new string if it is not equal:
def remove_repeated_letters(s):
    if not s:
        return ""
    ret = [s[0]]
    for index, char in enumerate(s[1:], 1):
        if s[index-1] != char:
            ret.append(char)
    return "".join(ret)
Then, remove_repeated_letters("aaaabbcddefff") gives 'abcdef'.
remove_repeated_letters("aaaabbcddefffaaa") gives 'abcdefa'.
Alternatively, use itertools.groupby, which groups consecutive equal elements together, and join the keys of that operation:
import itertools
def remove_repeated_letters(s):
    return "".join(key for key, group in itertools.groupby(s))

How to find the amount of equal characters that are next to eachother in a string?

I just started using Python and I'm a beginner.
This is an example of the string I have to work with: "--+-+++----------------+-+"
The program needs to find the longest + "chain", i.e. how many times + appears consecutively. I don't really know how to explain this, but I need it to find that chain of 3 + symbols, so I can print that the longest + chain contains 3 + symbols.

a = "--+-+++----------------+-+"
count = 0
most = 0
for x in range(len(a)):
    if a[x] == "+":
        count += 1
    else:
        count = 0
    if count > most:
        most = count
print(f"longest + chain includes {most} symbols")
There might be a better way, but this one is fairly self-explanatory.

Try this. It uses regular expressions and a list comprehension, so you may need to read up on them.
The idea is to find all the + chains, calculate their lengths, and take the maximum length:
import re
s = '+++----------------+-+'
occurs = re.findall(r'\++', s)
print(max([len(i) for i in occurs]))
Output:
3

You can use a regular expression to specify "one or more + characters". The character for specifying this kind of repetition in a regex is itself +, so to specify the actual + character you have to escape it.
import re
haystack = "--+-+++----------------+-+"
needle = re.compile(r"\++")
Now we can use findall to find all the occurrences of this pattern in the original string, and max to find the longest of these.
longest = max(len(x) for x in needle.findall(haystack))
If you instead need the position of the longest sequence in the target string, you can use:
pos = haystack.index(max(needle.findall(haystack), key=len))
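As a hedged alternative sketch, re.finditer returns match objects that already carry the position of each run, so the position and the length of the longest run can be read off in one pass without searching the string a second time. It also handles the case where the string contains no + at all, in which case max over an empty sequence would raise ValueError.

```python
import re

haystack = "--+-+++----------------+-+"

# Each match object remembers where its run starts and ends.
runs = list(re.finditer(r"\++", haystack))
if runs:
    longest = max(runs, key=lambda m: m.end() - m.start())
    print(longest.start(), longest.end() - longest.start())  # 4 3
else:
    print("no + found")
```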

A simple solution is to iterate over the string one character at a time. When the character is the same as the previous one, add one to a counter; each time the character differs from the previous one, restart the count at one. Note that this finds the longest run of any character, not just of +.
s = "--+-+++----------------+-+"
p = s[0]
longest, count = 0, 0
for c in s:
    if c == p:
        count = count + 1
    else:
        count = 1  # c starts a new run of length one
    if count > longest:
        longest = count
    p = c
s is the string, c is the character being checked, p is the previous character, count is the counter, and longest is the highest value found so far.

If the only other character in your string is a minus sign, you can split the string on the minus sign and get maximum length of the resulting substrings:
a = "--+-+++----------------+-+"
r = max(map(len,a.split('-')))
print(r) # 3

Python count max character in a string?

I am trying to find the character with the maximum count in a string using a loop. So far, this is the code I have written:
def max_char_count(string):
    max_char = ''
    max_count = 0
    for char in string:
        count = string.count(char)
        if count > max_count:
            max_count = count
            max_char = char
    return max_char
print(max_char_count('apple hellooo'))
But the issue I am running into is that even though there are 3 l's and 3 o's, I only get l as the output. How can I adjust the code to show the right count for the characters? Thank you.

Your approach is inefficient because string.count(char) iterates over the whole string, and you call it once for every character of the string, which gives a time complexity of O(n^2).
Instead, I suggest you calculate the count of each character in a single pass through the string, and then select the one(s) with the maximum count.
def max_char_count(string):
    ret = []
    counter = dict()  # create an empty dict to store the character counts
    # Iterate over the string once to count the characters
    # N iterations
    for char in string:
        # `.get()` returns the given default value (0) if the `char` key doesn't exist
        # If it does, it returns the value for that key
        # Then you increment it
        counter[char] = counter.get(char, 0) + 1
    # Find the max value
    # (another loop over the dict, worst-case N iterations)
    max_count = max(counter.values())
    # Iterate over the dict one last time to get the keys that have a value == max_count
    # Again, worst-case N iterations
    for char, count in counter.items():
        if count == max_count:
            ret.append((char, count))
    return ret
Now, print(max_char_count('apple hellooo')) prints [('l', 3), ('o', 3)] (on Python 3.7+, where dictionaries keep insertion order).
If you don't want to reinvent the counting wheel, use collections.Counter instead: counter = collections.Counter(string)
Since we have three loops and none of them are nested loops, we get a time complexity of O(3*n) (or just O(n))
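For reference, here is a minimal sketch of the collections.Counter variant mentioned above. It keeps the function name from the question and returns every character that shares the highest count; the empty-string guard is an addition for this sketch.

```python
from collections import Counter

def max_char_count(string):
    """Return every (char, count) pair that shares the highest count."""
    if not string:
        return []
    counter = Counter(string)  # one pass over the string
    max_count = max(counter.values())
    return [(char, count) for char, count in counter.items() if count == max_count]

print(max_char_count("apple hellooo"))  # [('l', 3), ('o', 3)]
```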

You can use a dictionary of characters to count the occurrences, then find the largest count and return all keys that have that count. The fromkeys() constructor will allow you to initialize the letter counts to zero thus making the counting loop much simpler. Selecting the letters with the maximum count can be done using a list comprehension.
This will compute the result in very few lines of code:
string = 'apple hellooo'
counts = dict.fromkeys(string,0) # initialize counts to zero
for c in string: counts[c] += 1 # compute characters counts
max_count = max(counts.values()) # find maximum count
result = [c for c,n in counts.items() if n==max_count] # matching characters
print(result)
['l', 'o']

Most frequent character in Python 3.3

This program lets the user enter a string and displays the character that appears most frequently in a string.
I need help explaining frequent = i.
# This program displays the character that appears most frequently in the string
def main():
    # Local variables.
    count = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    index = 0
    frequent = 0
    # Get input.
    user_string = input('Enter a string: ')
    for ch in user_string:
        ch = ch.upper()
        # Determine which letter this character is.
        index = letters.find(ch)
        if index >= 0:
            # Increase counting array for this letter.
            count[index] = count[index] + 1
    # Please help me explain this entire part!
    for i in range(len(count)):
        if count[i] > count[frequent]:
            frequent = i
    print('The character that appears most frequently' \
          ' in the string is ', letters[frequent], '.', \
          sep='')
# Call main
main()

The code snippet in question:
for i in range(len(count)):
    if count[i] > count[frequent]:
        frequent = i
First, the for loop iterates over the indices of count, which has 26 entries.
The if statement:
if count[i] > count[frequent]:
checks whether the count of the current letter in the for loop is larger than the count of the current most frequent letter. If it is, it records the index of the for loop as the new most frequent letter.
For example, if A appears 12 times and B appears 14 times, then on the second iteration, when i = 1 (and frequent is still 0), the if statement would effectively be:
if 14 > 12:
    frequent = 1
This sets frequent to 1, which can then be used to look up the frequency in count, e.g.
count[1] == 14

There are 26 different items in the list count, and 26 letters in the charset. The code iterates through the count list (that's the for i in range(len(count)) part) and checks whether the value of each item is greater than the value of the largest item found so far. Simply put, it finds the largest value in the array, but instead of keeping the value it keeps the index: frequent = i stores the index of the largest value found so far in the variable frequent. It's simpler and more Pythonic to just do
frequent = count.index(max(count))
which has EXACTLY the same effect
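To make the equivalence concrete, here is a small sketch that runs both versions on a fixed sample string (the sample text is made up for illustration; the variable names follow the question). Both pick the first index holding the largest count, because the loop only replaces frequent on a strictly greater count and list.index returns the first match.

```python
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
count = [0] * 26
for ch in "Hello world".upper():
    index = letters.find(ch)
    if index >= 0:
        count[index] += 1

# Manual scan, as in the question.
frequent = 0
for i in range(len(count)):
    if count[i] > count[frequent]:
        frequent = i

# One-line equivalent: list.index returns the first position of the maximum.
assert frequent == count.index(max(count))
print(letters[frequent])  # L
```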

10 most frequent words in a string Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you

The above problem can easily be solved using Python's collections module.
Below is the solution.
from collections import Counter
data_set = ("Welcome to the world of Geeks "
            "This portal has been created to provide well written well "
            "thought and well explained solutions for selected questions "
            "If you like Geeks for Geeks and would like to contribute "
            "here is your chance You can write article and mail your article "
            " to contribute at geeksforgeeks org See your article appearing on "
            "the Geeks for Geeks main page and help thousands of other Geeks. ")
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()
# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)
# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                  # Initialize the count to zero.
    for word in words:         # Iterate over the words.
        if word == unique:     # Is this word equal to the current unique?
            count += 1         # If so, increment the count
    counts.append((count, unique))
counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
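The trie mentioned above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names TrieNode and count_with_trie are made up for this example, and children are stored in parallel lists so that no dictionary is needed. Each word is inserted in time proportional to its length, so the total work grows with the number of characters rather than with (words x unique words).

```python
class TrieNode:
    """One node per character; children live in parallel lists, not a dict."""
    def __init__(self):
        self.chars = []     # characters leading to child nodes
        self.children = []  # child nodes, aligned with self.chars
        self.count = 0      # number of words ending at this node

    def child(self, ch):
        # Linear scan over at most (alphabet size) entries.
        for i, existing in enumerate(self.chars):
            if existing == ch:
                return self.children[i]
        node = TrieNode()
        self.chars.append(ch)
        self.children.append(node)
        return node


def count_with_trie(words):
    """Return a list of (count, word) tuples, highest count first."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.child(ch)
        node.count += 1

    # Walk the trie iteratively and collect every word that was inserted.
    counts = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            counts.append((node.count, prefix))
        for ch, nxt in zip(node.chars, node.children):
            stack.append((nxt, prefix + ch))
    counts.sort(reverse=True)
    return counts


words = "the cat and the hat and the bat".split()
for count, word in count_with_trie(words)[:10]:
    print(word, count)
```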

from string import punctuation  # you will need it to strip the punctuation
from urllib.request import urlopen  # Python 3 location of urlopen
txtFile = urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    # urlopen yields bytes in Python 3, so decode each line first
    words = line.decode("utf-8", errors="replace").split()
    for word in words:
        k = word.strip(punctuation).lower()  # "the"/"The" or "you"/"You" are counted together
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
# and sort the counter by the value, which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print(word, "\t", counter[word])

from urllib.request import urlopen  # Python 3 location of urlopen
lines = urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
# urlopen yields bytes in Python 3, so decode each line, then join the lines into one string
txtFile = " ".join(line.decode("utf-8", errors="replace") for line in lines)
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or whitespace
word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # first time we see 'word': set its count to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # 'word' already seen: increment its count
for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the words by their counts, highest first, and takes the top 10
    print("%s: %s - %s" % (i + 1, word, word_counter[word]))
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces end up in the counter. It doesn't matter that much, though.

Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
import collections
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that by frequency using the key keyword argument of sorted, and return the first 10 items of that list. However, that doesn't help you much because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It can be expressed more compactly with dict.get:
def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
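One possible way to put those pieces together is sketched below: a dict-based counter, str.strip with string.punctuation, casefolding, and sorted with a key function. The function name most_frequent and the sample text are made up for this example, and the sketch is only one of several reasonable ways to finish the exercise.

```python
import string

def counter(iterable):
    """Count how often each element occurs, using a plain dict."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

def most_frequent(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    words = (w.strip(string.punctuation).casefold() for w in text.split())
    counts = counter(w for w in words if w)  # drop tokens that were only punctuation
    # Sort the (word, count) pairs by count, highest first.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

text = "Some words, some WORDS... and some more words!"
print(most_frequent(text, 2))  # [('some', 3), ('words', 3)]
```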

You can also do it with pandas DataFrames and get the result in a convenient form, as a table of words and their frequencies, ordered by frequency.
import pandas as pn
def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(words_df["word"].unique())
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in words_df_unique["unique"].tolist():
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res

To do the same operation on a pandas DataFrame, you can use the Counter class from collections (this assumes a DataFrame df with a 'text' column of strings):
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
