Counting words starting with a character - python

Write a function that accepts a string and a character as input and
returns the count of all the words in the string which start with the
given character. Assume that capitalization does not matter here. You
can assume that the input string is a sentence i.e. words are
separated by spaces and consists of alphabetic characters.
This is my code:
def count_input_character (input_str, character):
input_str = input_str.lower()
character = character.lower()
count = 0
for i in range (0, len(input_str)):
if (input_str[i] == character and input_str[i - 1] == " "):
count += 1
return (count)
#Main Program
input_str = input("Enter a string: ")
character = input("Enter character whose occurances are to be found in the given input string: ")
result = count_input_character(input_str, character)
#print(result)
The only part missing here is that how to check if the first word of the sentence is stating with the user given character. consider this output:
Your answer is NOT CORRECT Your code was tested with different inputs. > For example when your function is called as shown below:
count_input_character ('the brahman the master of the universe', 't')
####### Your function returns ############# 2 The returned variable type is: type 'int'
### Correct return value should be ######## 3 The returned variable type is: type 'int'

You function misses the first t because in this line
if (input_str[i] == character and input_str[i - 1] == " "):
when i is 0, then input_str[i - 1] is input_str[-1] which Python will resolve as the last character of the string!
To fix this, you could change your condition to
if input_str[i] == character and (i == 0 or input_str[i - 1] == " "):
Or use str.split with a list comprehension. Or a regular expression like r'(?i)\b%s', with (?i) meaning "ignore case", \b is word boundary and %s a placeholder for the character..

Instead of looking for spaces, you could split input_str on whitespace, this would produce a list of words that you could then test against character. (Pseudocode below)
function F sentence, character {
l = <sentence split by whitespace>
count = 0
for word in l {
if firstchar(word) == character {
count = count + 1
}
}
return count
}

Although it doesn't fix your specific bug, for educational purposes, please note you could rewrite your function like this using list comprehension:
def count_input_character (input_str, character):
return len([x for x in input_str.lower().split() if x.startswith(character.lower())])
or even more efficiently(thanks to tobias_k)
def count_input_character (input_str, character):
sum(w.startswith(character.lower()) for w in input_str.lower().split())

def c_upper(text, char):
text = text.title() #set leading char of words to uppercase
char = char.upper() #set given char to uppercase
k = 0 #counter
for i in text:
if i.istitle() and i == char: #checking conditions for problem, where i is a char in a given string
k = k + 1
return k

Related

How to use a list of numbers as index inputs

So I have a list of numbers (answer_index) which correlate to the index locations (indicies) of a characters (char) in a word (word). I would like to use the numbers in the list as index inputs later (indexes) on in code to replace every character except my chosen character(char) with "*" so that the final print (new_word) in this instance would be (****ee) instead of (coffee). it is important that (word) maintains it's original value while (new_word) becomes the modified version. Does anyone have a solution for turning a list into valid index inputs? I will also except easier ways to meet my goal. (Note: I am extremely new to python so I'm sure my code looks horrendous) Code below:
word = 'coffee'
print(word)
def find(string, char):
for i, c in enumerate(string):
if c == char:
yield i
string = word
char = "e"
indices = (list(find(string, char)))
answer_index = (list(indices))
print(answer_index)
for t in range(0, len(answer_index)):
answer_index[t] = int(answer_index[t])
indexes = [(answer_index)]
new_character = '*'
result = ''
for i in indexes:
new_word = word[:i] + new_character + word[i+1:]
print(new_word)
You hardly ever need to work with indices directly:
string = "coffee"
char_to_reveal = "e"
censored_string = "".join(char if char == char_to_reveal else "*" for char in string)
print(censored_string)
Output:
****ee
If you're trying to implement a game of hangman, you might be better off using a dictionary which maps characters to other characters:
string = "coffee"
map_to = "*" * len(string)
mapping = str.maketrans(string, map_to)
translated_string = string.translate(mapping)
print(f"All letters are currently hidden: {translated_string}")
char_to_reveal = "e"
del mapping[ord(char_to_reveal)]
translated_string = string.translate(mapping)
print(f"'{char_to_reveal}' has been revealed: {translated_string}")
Output:
All letters are currently hidden: ******
'e' has been revealed: ****ee
The easiest and fastest way to replace all characters except some is to use regular expression substitution. In this case, it would look something like:
import re
re.sub('[^e]', '*', 'coffee') # returns '****ee'
Here, [^...] is a pattern for negative character match. '[^e]' will match (and then replace) anything except "e".
Other options include decomposing the string into an iterable of characters (#PaulM's answer) or working with bytearray instead
In Python, it's often not idiomatic to use indexes, unless you really want to do something with them. I'd avoid them for this problem and instead just iterate over the word, read each character and and create a new word:
word = "coffee"
char_to_keep = "e"
new_word = ""
for char in word:
if char == char_to_keep:
new_word += char_to_keep
else:
new_word += "*"
print(new_word)
# prints: ****ee

How to find the largest repeating substring given character in Python?

Given some string say 'aabaaab', how would I go about finding the largest substring of a. So it should return 'aaa'. Any help would be greatly appreciated.
def sub_string(s):
best_run = 0
current_run = 0
for char in s:
if char == 'a'
current_run += 1
else:
current_letter = char
return(best_run)
I have something like the one above. Not sure where I can fix it up.
not the most efficient, but a straightforward solution:
word = "aasfgaaassaasdsddaaaaaafff"
substr_count = 0
substr_counts = []
character = "f"
for i, letter in enumerate(word):
if (letter == character):
substr_count += 1
else:
substr_counts.append(substr_count)
substr_count = 0
if (i == len(word) - 1):
substr_counts.append(substr_count)
print(max(substr_counts))
If you want a short method using standard python tools (and avoid writing loops to reconstruct the string as you iterate), you can use regex to split the string by any non-a characters than get the max() according to len:
import re
test_string = 'aabaaab'
split_string_list = re.split( '[^a]', test_string )
longest_string_subset = max( split_string_list, key=len )
print( longest_string_subset )
The re library is for regex, the '[^a]' is a regex statement for any non-a character. Basically, the 'aabaaab' is being split into a list according to any matches on the regex statement, so that it becomes [ 'aa' 'aaa' '' ]. Then, the max() statement looks for the longest string based on len (aka length).
You can read more about functions like re.split() in the docs: https://docs.python.org/2/library/re.html

Find out word at specific index

I have a string with multiple words separated by underscores like this:
string = 'this_is_my_string'
And let's for example take string[n] which will return a letter.
Now for this index I want to get the whole word between the underscores.
So for string[12] I'd want to get back the word 'string' and for string[1] I'd get back 'this'
Very simple approach using string slicing is to:
slice the list in two parts based on position
split() each part based on _.
concatenate last item from part 1 and first item from part 2
Sample code:
>>> my_string = 'this_is_my_sample_string'
# ^ index 14
>>> pos = 14
>>> my_string[:pos].split('_')[-1] + my_string[pos:].split('_')[0]
'sample'
This shuld work:
string = 'this_is_my_string'
words = string.split('_')
idx = 0
indexes = {}
for word in words:
for i in range(len(word)):
idx += 1
indexes[idx] = word
print(indexes[1]) # this
print(indexes[12]) #string
The following code works. You can change the index and string variables and adapt to new strings. You can also define a new function with the code to generalize it.
string = 'this_is_my_string'
sp = string.split('_')
index = 12
total_len = 0
for word in sp:
total_len += (len(word) + 1) #The '+1' accounts for the underscore
if index < total_len:
result = word
break
print result
A little bit of regular expression magic does the job:
import re
def wordAtIndex(text, pos):
p = re.compile(r'(_|$)')
beg = 0
for m in p.finditer(text):
#(end, sym) = (m.start(), m.group())
#print (end, sym)
end = m.start()
if pos < end: # 'pos' is within current split piece
break
beg = end+1 # advance to next split piece
if pos == beg-1: # handle case where 'pos' is index of split character
return ""
else:
return text[beg:end]
text = 'this_is_my_string'
for i in range(0, len(text)+1):
print ("Text["+str(i)+"]: ", wordAtIndex(text, i))
It splits the input string at '_' characters or at end-of-string, and then iteratively compares the given position index with the actual split position.

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
# if the word is repeated - start count
if w in d:
d[w] += 1
# if the word is only used once then give it a count of 1
else:
d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
print('%2s. %4s %s' % (i, count, word))
i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a function that doesn't have any side effects, meaning it can't change the value of variables, it can only return a value.
So get_alphabetical_characters is a function that we know due to the suggestive name, that it takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom which takes a list of strings and concatenates them (in other words, it producing a single string by joining them together in the given order).
And the some_list here is provided by the generator expression [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is it steps through every character in the given word, and puts it into the list if it's alphebetical, or if it isn't it puts a blank string in it's place.
For example
[char if char in 'abcdefghijklmnopqrstuvwyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluates by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimal we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos] == '.' and a_string[pos-1:pos].isdigit()
words = "".join(map(include_char, fname)).split()
What we're doing here is that the include_char function checks if a character is "alphanumeric" (i.e. is a letter or a digit) or that it's a period and that the character preceding it is numeric, and using this function to strip out all the characters in the string we want, and joining them into a single string, which we then separate into a list of strings using the str.split method.
This program may help you:
#I created a list of characters that I don't want \
# them to be considered as words!
char2remove = (".",",",";","!","?","*",":")
#Received an string of the user.
string = raw_input("Enter your string: ")
#Make all the letters lower-case
string = string.lower()
#replace the special characters with white-space.
for char in char2remove:
string = string.replace(char," ")
#Extract all the words in the new string (have repeats)
words = string.split(" ")
#creating a dictionary to remove repeats
to_count = dict()
for word in words:
to_count[word]=0
#counting the word repeats.
for word in to_count:
#if there is space in a word, it is white-space!
if word.isalpha():
print word, string.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is using Regex to remove all non-letter chars (to get rid off char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
to_count[word]=0
for word in to_count:
if word.isalpha():
print word, your_str.count(word)

Most common character in a string

Write a function that takes a string consisting of alphabetic
characters as input argument and returns the most common character.
Ignore white spaces i.e. Do not count any white space as a character.
Note that capitalization does not matter here i.e. that a lower case
character is equal to a upper case character. In case of a tie between
certain characters return the last character that has the most count
This is the updated code
def most_common_character (input_str):
input_str = input_str.lower()
new_string = "".join(input_str.split())
print(new_string)
length = len(new_string)
print(length)
count = 1
j = 0
higher_count = 0
return_character = ""
for i in range(0, len(new_string)):
character = new_string[i]
while (length - 1):
if (character == new_string[j + 1]):
count += 1
j += 1
length -= 1
if (higher_count < count):
higher_count = count
return (character)
#Main Program
input_str = input("Enter a string: ")
result = most_common_character(input_str)
print(result)
The above is my code. I am getting an error of string index out of bound which I can't understand why. Also the code only checks the occurrence of first character I am confused about how to proceed to the next character and take the maximum count?
The error i get when I run my code:
> Your answer is NOT CORRECT Your code was tested with different inputs.
> For example when your function is called as shown below:
>
> most_common_character ('The cosmos is infinite')
>
> ############# Your function returns ############# e The returned variable type is: type 'str'
>
> ######### Correct return value should be ######## i The returned variable type is: type 'str'
>
> ####### Output of student print statements ###### thecosmosisinfinite 19
You can use a regex patter to search for all characters. \w matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. The + after [\w] means to match one or more repetitions.
Finally, you use Counter to total them and most_common(1) to get the top value. See below for the case of a tie.
from collections import Counter
import re
s = "Write a function that takes a string consisting of alphabetic characters as input argument and returns the most common character. Ignore white spaces i.e. Do not count any white space as a character. Note that capitalization does not matter here i.e. that a lower case character is equal to a upper case character. In case of a tie between certain characters return the last character that has the most count"
>>> Counter(c.lower() for c in re.findall(r"\w", s)).most_common(1)
[('t', 46)]
In the case of a tie, it is a little more tricky.
def top_character(some_string):
joined_characters = [c for c in re.findall(r"\w+", some_string.lower())]
d = Counter(joined_characters)
top_characters = [c for c, n in d.most_common() if n == max(d.values())]
if len(top_characters) == 1:
return top_characters[0]
reversed_characters = joined_characters[::-1]
for c in reversed_characters:
if c in top_characters:
return c
>>> top_character(s)
't'
>>> top_character('the the')
'e'
In the case of your code above and your sentence "The cosmos is infinite", you can see that 'i' occurs more frequently that 'e' (the output of your function):
>>> Counter(c.lower() for c in "".join(re.findall(r"[\w]+", 'The cosmos is infinite'))).most_common(3)
[('i', 4), ('s', 3), ('e', 2)]
You can see the issue in your code block:
for i in range(0, len(new_string)):
character = new_string[i]
...
return (character)
You are iterating through a sentence and assign that letter to the variable character, which is never reassigned elsewhere. The variable character will thus always return the last character in your string.
Actually your code is almost correct. You need to move count, j, length inside of your for i in range(0, len(new_string)) because you need to start over on each iteration and also if count is greater than higher_count, you need to save that charater as return_character and return it instead of character which is always last char of your string because of character = new_string[i].
I don't see why have you used j+1 and while length-1. After correcting them, it now covers tie situations aswell.
def most_common_character (input_str):
input_str = input_str.lower()
new_string = "".join(input_str.split())
higher_count = 0
return_character = ""
for i in range(0, len(new_string)):
count = 0
length = len(new_string)
j = 0
character = new_string[i]
while length > 0:
if (character == new_string[j]):
count += 1
j += 1
length -= 1
if (higher_count <= count):
higher_count = count
return_character = character
return (return_character)
If we ignore the "tie" requirement; collections.Counter() works:
from collections import Counter
from itertools import chain
def most_common_character(input_str):
return Counter(chain(*input_str.casefold().split())).most_common(1)[0][0]
Example:
>>> most_common_character('The cosmos is infinite')
'i'
>>> most_common_character('ab' * 3)
'a'
To return the last character that has the most count, we could use collections.OrderedDict:
from collections import Counter, OrderedDict
from itertools import chain
from operator import itemgetter
class OrderedCounter(Counter, OrderedDict):
pass
def most_common_character(input_str):
counter = OrderedCounter(chain(*input_str.casefold().split()))
return max(reversed(counter.items()), key=itemgetter(1))[0]
Example:
>>> most_common_character('ab' * 3)
'b'
Note: this solution assumes that max() returns the first character that has the most count (and therefore there is a reversed() call, to get the last one) and all characters are single Unicode codepoints. In general, you might want to use \X regular expression (supported by regex module), to extract "user-perceived characters" (eXtended grapheme cluster) from the Unicode string.

Categories

Resources