Letter Frequency - Python

I have the following set of instructions:
Create a variable to store the given string "You can have data without information, but you cannot have information without data."
Convert the given string to lowercase
Create a list containing every lowercase letter of the English alphabet
for every letter in the alphabet list:
    Create a variable to store the frequency of each letter in the string and assign it an initial value of zero
    for every letter in the given string:
        if the letter in the string is the same as the letter in the alphabet list
            increase the value of the frequency variable by one.
    if the value of the frequency variable does not equal zero:
        print the letter in the alphabet list followed by a colon and the value of the frequency variable
I am currently stuck in the Bold points.
So far, my code looks as follows:
import string
sentence = "You can have data without information, but you cannot have information without data."
sentence = sentence.lower()
alphabet_string = string.ascii_lowercase
alphabet = list(alphabet_string)
for i in alphabet:
    frequency = {i: 0}
    for i in sentence:
        if i in frequency.keys():
            frequency[i] = frequency[i] + 1

What you are missing is an extra conditional statement that prints only the key-value pairs with non-zero values:
import string
sentence = "You can have data without information, but you cannot have information without data."
sentence = sentence.lower()
alphabet_string = string.ascii_lowercase
alphabet = list(alphabet_string)
for i in alphabet:
    frequency = {i: 0}
    for j in sentence:
        if j in frequency.keys():
            frequency[i] = frequency[i] + 1
    if frequency[i] != 0:
        print(i, ',', frequency[i])
Outputs:
a , 10
b , 1
c , 2
d , 2
e , 2
f , 2
h , 4
i , 6
m , 2
n , 7
o , 9
r , 2
t , 10
u , 5
v , 2
w , 2
y , 2
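For comparison, here is a minimal sketch that follows the written instructions more literally: the frequency variable is a plain integer that is reset for each alphabet letter, and the output uses a colon as the assignment asks. The function name letter_frequencies is my own choice and not part of the assignment.

```python
import string

def letter_frequencies(text):
    """Return (letter, count) pairs for each ASCII letter that occurs in text."""
    text = text.lower()
    results = []
    for letter in string.ascii_lowercase:
        frequency = 0  # plain integer counter, reset for every alphabet letter
        for ch in text:
            if ch == letter:
                frequency += 1
        if frequency != 0:
            results.append((letter, frequency))
    return results

sentence = "You can have data without information, but you cannot have information without data."
for letter, frequency in letter_frequencies(sentence):
    print(f"{letter}: {frequency}")
```

Returning the pairs instead of printing inside the loop keeps the counting logic easy to check separately from the output format.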

As has already been pointed out, collections.Counter is ideal for this.
However, your specification states that you're only interested in lowercase letters, and the sentence also contains ',', '.' and ' '.
So there are two approaches. You can use the standard Counter and then ignore any returned values that don't fit your rules. Alternatively, you can write a custom class that handles those rules internally, so that the output only contains lowercase ASCII letters and irrelevant characters are ignored.
So, here's an idea:
class MyCounter:
    def __init__(self, iterable):
        self._iterable = iterable
        self._result = None
    def items(self):
        if self._result is None:
            self._result = {}
            _alphabet = set('abcdefghijklmnopqrstuvwxyz')
            for v in self._iterable:
                if v in _alphabet:
                    self._result[v] = self._result.get(v, 0) + 1
        return self._result.items()
sentence = "You can have data without information, but you cannot have information without data."
for k, v in MyCounter(sentence.lower()).items():
    print(f'{k}:{v}')
Output:
y:2
o:9
u:5
c:2
a:10
n:7
h:4
v:2
e:2
d:2
t:10
w:2
i:6
f:2
r:2
m:2
b:1
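The first approach mentioned above (a plain collections.Counter, filtered afterwards) might look roughly like this. It is only a sketch; the name letter_counts is mine.

```python
from collections import Counter
import string

sentence = "You can have data without information, but you cannot have information without data."

# Count every character, then keep only the lowercase ASCII letters that occur.
counts = Counter(sentence.lower())
letter_counts = {ch: counts[ch] for ch in string.ascii_lowercase if counts[ch]}

for letter, n in letter_counts.items():
    print(f"{letter}:{n}")
```

Because the dictionary is built by walking string.ascii_lowercase, the output comes out in alphabetical order rather than in order of first appearance.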

It is as simple as adding if frequency[i] != 0: print(f'{i}: {frequency[i]}')
Since any non-zero integer is truthy, you can drop the explicit != 0 comparison and write it like this:
import string
sentence = "You can have data without information, but you cannot have information without data."
sentence = sentence.lower()
alphabet = list(string.ascii_lowercase)
for i in alphabet:
    frequency = {i: 0}
    for j in sentence:
        if j in frequency: frequency[i] += 1
    if frequency[i]: print(f'{i}: {frequency[i]}')
You can also change frequency.keys() to frequency (a membership test on a dict checks its keys) and change frequency[i] = frequency[i] + 1 to frequency[i] += 1

Related

Finding a word from a text dictionary with given random letters

When a person calls a function (e.g. find_from_dict(letters)), the function searches dictionary.txt for a word that can be made from the letters the user has entered, specifically the word that uses the most of those letters.
For example, letters is input as random typing such as "BAJPPNLE" which will then find "APPLE" from the dictionary since "APPLE" has the most letters from "BAJPPNLE".
def find_from_dict(letters):
    n = 0
    y = 0
    x = 0
    dictFile = [line.rstrip('\n') for line in open("dictionary.txt")]
    listLetters = list(letters)
    final = []
    while True:
        if n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] in listLetters:
            x = x + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] not in listLetters:
            x = 0
            n = n + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x == len(list(dictFile[n])):
            final.append(dictFile[n])
        elif n < len(dictFile) and len(list(dictFile[n])) > len(listLetters):
            n = n + 1
        else:
            print(final)
            break
I have this code at the moment, but since my dictionary.txt file is huge and the code is inefficient, it takes forever to go through.
Does anyone have any idea how I could make this code efficient?
You can speed this up by preparing a word index formed of the sorted letters in your word list. Then look for sorted combinations of the letters in that index:
For example:
from collections import defaultdict
from itertools import combinations
with open("/usr/share/dict/words","r") as wordList:
    words = defaultdict(list)
    for word in wordList.read().upper().split("\n"):
        words[tuple(sorted(word))].append(word) # index by sorted letters
def findWords(letters):
    for size in range(len(letters),2,-1): # from large to small (minimum 3 letters)
        for combo in combinations(sorted(letters),size): # combinations of that size
            for word in words.get(combo, []): # matching words from index (.get avoids adding empty entries)
                yield word # return as you go (iterator)
                # If you only want one, change this to: return word
Testing:
while True:
    letters = input("Enter letters:")
    if not letters: break
    for word in findWords(letters.upper()):
        stop = input(word)
        if stop: break
    print("")
sample output:
Enter letters:BAJPPNLE
JELAB
BEJAN
LEBAN
NABLE
PEBAN
PEBAN
ALPEN
NEPAL
PANEL
PENAL
PLANE
ALPEN
NEPAL
PANEL
PENAL
PLANE
APPLE
NAPPE.
Enter letters:EPROING
PERIGON
PIGEON
IGNORE
REGION
PROGNE
OPINER.
Enter letters:
If you need a solution without using libraries, you can use a recursive approach that explores the combination tree one size at a time, largest size first:
with open("/usr/share/dict/words","r") as wordList:
    words = dict()
    for word in wordList.read().upper().split("\n"):
        words.setdefault(tuple(sorted(word)),list()).append(word) # index by sorted letters
def findWords(letters,size=None):
    if size is None:
        letters = sorted(letters)
        for size in range(len(letters),2,-1):
            for word in findWords(letters,size): yield word
    elif len(letters) == size:
        for word in words.get(tuple(letters),[]): yield word
    elif len(letters)>size:
        for i in range(len(letters)):
            for word in findWords(letters[:i]+letters[i+1:],size):
                yield word
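If you want to try the sorted-letter index without a dictionary file, a self-contained sketch could look like the following. The tiny word list stands in for dictionary.txt, and the names build_index and longest_words are made up for this illustration. Unlike the generators above, it returns only the longest matches and removes the duplicates that repeated letters (such as the two P's) would otherwise produce.

```python
from collections import defaultdict
from itertools import combinations

def build_index(word_list):
    """Map each sorted-letter tuple to the words that use exactly those letters."""
    index = defaultdict(list)
    for word in word_list:
        index[tuple(sorted(word.upper()))].append(word.upper())
    return index

def longest_words(letters, index, min_size=3):
    """Return all longest words that can be built from a subset of letters."""
    pool = sorted(letters.upper())
    for size in range(len(pool), min_size - 1, -1):
        found = set()
        # set() drops duplicate combinations caused by repeated letters
        for combo in set(combinations(pool, size)):
            found.update(index.get(combo, []))
        if found:
            return sorted(found)
    return []

# Hypothetical word list standing in for dictionary.txt
index = build_index(["apple", "plane", "panel", "nab", "jab"])
print(longest_words("BAJPPNLE", index))
```

The number of combinations still grows quickly with the number of input letters, so this is meant for short inputs like the example.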
You can kind of "cheat" your way through it by pre-processing the dictionary file.
The idea is: instead of having a list of words, you have a list of groups which is determined by the sorted letters of the words.
For example, something like:
"aeegr": [
    "agree",
    "eager",
],
"alps": [
    "alps",
    "laps",
    "pals",
]
Then if you wanted to just find the exact match, you could sort the letters from the input and search in the processed file.
But you want the one that matches the most letters, so what you could do is number the letters with prime numbers (I'm only considering lowercase ascii characters), so that a is 2, b is 3, c is 5, d is 7 and so on.
Then, you can get a number by multiplying all the letters, so for example for alps you'd get 2*37*53*67.
In your dictionary file you then have the numbers obtained the same way for each word.
Like:
262774: [
    "alps",
    "laps",
    "pals",
]
You then go through your dictionary, and if the number for the input letters divided by the dictionary number has a remainder of 0, that's a possible match.
Among those matches, the one you want is the word with the most letters. Note that the largest number is not guaranteed to be the longest word, because letters late in the alphabet map to larger primes, so compare word lengths among the matches.
Keep in mind that the numbers can get very big very quickly, depending on how many letters you use.
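A rough sketch of the prime-number idea, assuming lowercase ASCII letters only and a small in-memory word list instead of a pre-processed dictionary file. The names first_primes, PRIME_OF, signature and best_matches are invented for this example. Python integers have arbitrary precision, so the large products are not a correctness problem, only a speed one.

```python
import string

def first_primes(n):
    """Return the first n primes by trial division (fine for n = 26)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# a -> 2, b -> 3, c -> 5, ... z -> 101
PRIME_OF = dict(zip(string.ascii_lowercase, first_primes(26)))

def signature(word):
    """Multiply the primes of all letters in word (non-letters are ignored)."""
    product = 1
    for ch in word.lower():
        if ch in PRIME_OF:
            product *= PRIME_OF[ch]
    return product

def best_matches(letters, word_list):
    """Return the longest words whose signature divides that of letters."""
    target = signature(letters)
    fits = [w for w in word_list if target % signature(w) == 0]
    longest = max((len(w) for w in fits), default=0)
    return [w for w in fits if len(w) == longest]

print(signature("alps"))
print(best_matches("bajppnle", ["apple", "plane", "jab", "zebra"]))
```

Here the winner is chosen by word length among the divisible candidates rather than by the size of the product.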

multiplying letter of string by digits of number

I want to multiply each letter of a string by the digits of a number. For example, for the word "number" and the number "123"
the output would be "nuummmbeerrr". How do I create a function that does this? My code is not useful, because it doesn't work.
I have only this:
def new_word(s):
    b=""
    for i in range(len(s)):
        if i % 2 == 0:
            b = b + s[i] * int(s[i+1])
    return b
For new_word('a3n5z1') the output is aaannnnnz.
Using list comprehension and without itertools:
number = 123
word = "number"
new_word = "".join([character*n for (n, character) in zip(([int(c) for c in str(number)]*len(word))[0:len(word)], word)])
print(new_word)
# > 'nuummmbeerrr'
What it does (with more details) is the following:
number = 123
word = "number"
# the first trick is to link each character in the word to the number that we want
# for this, we repeat the list of digits and slice it so that we get a list...
# ... with length equal to the length of the word (repeating len(word) times is always enough)
numbers_to_characters = ([int(c) for c in str(number)]*len(word))[0:len(word)]
print(numbers_to_characters)
# > [1, 2, 3, 1, 2, 3]
# then, we initialize an empty list to contain the repeated characters of the new word
repeated_characters_as_list = []
# we loop over each number in numbers_to_characters and each character in the word
for (n, character) in zip(numbers_to_characters, word):
    repeated_characters_as_list.append(character*n)
print(repeated_characters_as_list)
# > ['n', 'uu', 'mmm', 'b', 'ee', 'rrr']
new_word = "".join(repeated_characters_as_list)
print(new_word)
# > 'nuummmbeerrr'
This will solve your issue, feel free to modify it to fit your needs.
from itertools import cycle
numbers = cycle("123")
word = "number"
output = []
for letter in word:
    output += [letter for _ in range(int(next(numbers)))]
string_output = ''.join(output)
EDIT:
Since you're a beginner, this will be easier for you to understand, although I suggest reading up on the itertools module, since it's the right tool for this kind of thing.
number = "123"
word = "number"
output = []
i = 0
for letter in word:
    if(i == len(number)):
        i = 0
    output += [letter for _ in range(int(number[i]))]
    i += 1
string_output = ''.join(output)
print(string_output)
You can use zip to match each digit to its respective character in the word (using itertools.cycle for the case where the word is longer than the number), then multiply the character by that digit, and finally join everything into a single string.
Try this:
from itertools import cycle
word = "number"
number = 123
number_digits = [int(d) for d in str(number)]
result = "".join(letter*num for letter,num in zip(word,cycle(number_digits)))
print(result)
Output:
nuummmbeerrr

Python, split user input twice. At pair and space

I'm a Python beginner and would like to know how to split a user input at pair and at space and add it to a list.
E.g:
user = input('A1 KW')
user.split(" " ) # split on space
Then I'd like to print the input at index 0, which should be A1, and also print the letter and the number/letter at each index.
E.g:
input[0] = A1
alphabet = A
number = 1
input[1] = KW
alphabet1 = K
alphabet2 = W
Then add it to a list.
list = ['A1, KW']
I hope you guys know what I mean.
Basic String manipulation.
There are lots of tutorials out there on that, go look them up.
From your question, it looks like you would want to use the str.isalpha() method.
Here's a function that should do the string manipulation like you said.
def pair(user):
    user=user.split(" ")
    for x in range(len(user)):
        print ("\nPair part "+str(x)+":")
        for char in user[x]:
            if char.isalpha():
                print ("Alphabet: "+char)
            else:
                print ("Number: "+char)
then you can call it with:
print("example pair was 'A1 KW'")
pair("A1 KW")
pair(input("\nEnter your pair: "))
output:
example pair was 'A1 KW'
Pair part 0:
Alphabet: A
Number: 1
Pair part 1:
Alphabet: K
Alphabet: W
Enter your pair: AB 3F
Pair part 0:
Alphabet: A
Alphabet: B
Pair part 1:
Number: 3
Alphabet: F
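Since the question also asks for the parts to end up in a list, here is a small sketch that collects the results instead of printing them. The function name parse_parts and the dictionary keys are my own invention, so rename them as you see fit.

```python
def parse_parts(user_input):
    """Split on whitespace and classify each character of every part."""
    parsed = []
    for part in user_input.split():
        letters = [ch for ch in part if ch.isalpha()]
        digits = [ch for ch in part if ch.isdigit()]
        parsed.append({"part": part, "letters": letters, "digits": digits})
    return parsed

result = parse_parts("A1 KW")
print(result[0]["part"])  # first part of the input
print(result)
```

Calling split() without an argument also copes with several spaces in a row, which split(" ") does not.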

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a small anonymous function whose body is a single expression: it cannot contain statements such as assignments, so it simply returns the value of that expression.
So get_alphabetical_characters is a function that, as the name suggests, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
And the some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is step through every character in the given word and put it into the list if it's alphabetical; if it isn't, it puts a blank string in its place.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluated by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimals, we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(c for pos, c in enumerate(fname) if include_char(pos, fname) or c.isspace()).split()
What we're doing here is that the include_char function checks whether a character is "alphanumeric" (i.e. a letter or a digit), or whether it is a period and the character preceding it is a digit. We use this function to keep only the characters we want (plus the whitespace that separates words), join them into a single string, and then separate that string into a list of words using the str.split method.
This program (written for Python 2) may help you:
# I created a list of characters that I don't want
# to be considered as words!
char2remove = (".",",",";","!","?","*",":")
# Receive a string from the user.
string = raw_input("Enter your string: ")
# Make all the letters lower-case
string = string.lower()
# Replace the special characters with white-space.
for char in char2remove:
    string = string.replace(char," ")
# Extract all the words in the new string (have repeats)
words = string.split(" ")
# Creating a dictionary to remove repeats
to_count = dict()
for word in words:
    to_count[word]=0
# Counting the word repeats.
for word in to_count:
    # empty strings left over from repeated spaces are skipped by isalpha()
    if word.isalpha():
        # count whole words in the list, not substrings of the string
        print word, words.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter characters (which gets rid of the char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
# sub() returns a new string, so the result must be assigned back
your_str = regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word]=0
for word in to_count:
    if word.isalpha():
        # count whole words in the list, not substrings of the string
        print word, words.count(word)
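For Python 3, the same idea can be sketched with re.findall and collections.Counter. This assumes that a word is simply a run of ASCII letters, which may be too strict for text containing numbers or apostrophes. The function name count_words is mine.

```python
import re
from collections import Counter

def count_words(text):
    """Count lowercase words, ignoring punctuation and stray symbols."""
    # [a-z]+ keeps runs of ASCII letters only; digits and symbols split words
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

counts = count_words("Hello, I am Bob. Hello to Bob *")
for word, n in counts.most_common():
    print(n, word)
```

most_common() already returns the words sorted from most to least frequent, so the separate sort and reverse steps are not needed.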

How do I program bigram as a table in python?

I'm doing this homework, and I am stuck at this point.
I can't work out how to program bigram frequencies for the English language, i.e. the 'conditional probability', in Python.
That is, the probability of a token given the preceding token is equal to the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token.
I have a text with many letters, and I have calculated the probability of each letter in this text; for example, the letter 'a' accounts for 0.015% of the letters in the text.
The letters are restricted to a-zA-Z, and what I want is:
How can I make a table with the dimensions (alphabet) x (alphabet), and how do I calculate the conditional probability for every single case?
It's like:
[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
[(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
... ...
[(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]
and for this I should calculate the probability, like: what are the chances that you get the letter 'a' if the letter you have at this point is an 'a', and so on.
I can't get started, hope you can kickstart me, and hope that it's clear what I need to solve.
Assuming your file has no other punctuation (easy enough to strip out):
import itertools
import string
index = {c: i for i, c in enumerate(string.ascii_letters)} # 'a'-'z' -> 0-25, 'A'-'Z' -> 26-51
def pairwise(s):
    a,b = itertools.tee(s)
    next(b, None) # advance the second iterator by one; tolerate empty input
    return zip(a,b)
counts = [[0 for _ in range(52)] for _ in range(52)] # nothing has occurred yet
with open('path/to/input') as infile:
    for a,b in pairwise(char for line in infile for word in line.split() for char in word): # get pairwise characters from the text
        given = index[a] # index (in `counts`) of the "given" character
        char = index[b] # index of the character that follows the "given" character
        counts[given][char] += 1
# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]
I haven't tested this, but it should be a good start
Here's a dictionary version, which should be easier to read and debug:
counts = {}
with open('path/to/input') as infile:
    for a,b in pairwise(char for line in infile for word in line.split() for char in word):
        # key the counts by the characters themselves, so lookups like answer['b']['a'] work
        if a not in counts:
            counts[a] = {}
        if b not in counts[a]:
            counts[a][b] = 0
        counts[a][b] += 1
answer = {}
for given, chardict in counts.items():
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count/total
Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a']
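To check the idea on a small input without a file, here is a compact sketch that works on a string. It estimates P(next | given) as count(given, next) / count(given), keeps only characters for which str.isalpha() is true (which also admits non-ASCII letters), and, like the code above, pairs letters across word boundaries. The function name bigram_probabilities is made up for this example.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next | given) for consecutive letters in text."""
    letters = [ch for ch in text if ch.isalpha()]
    pair_counts = Counter(zip(letters, letters[1:]))
    given_counts = Counter(letters[:-1])  # every letter that has a successor
    probs = defaultdict(dict)
    for (given, nxt), n in pair_counts.items():
        probs[given][nxt] = n / given_counts[given]
    return probs

probs = bigram_probabilities("abab ac")
print(probs["a"])
```

Each inner dictionary sums to 1, which is a quick sanity check that the rows really are conditional distributions.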
