How do I get the output of this code into one dictionary with total number of key:value pairs?
import re
from collections import Counter
splitfirst = open('output.txt', 'r')
input = splitfirst.read()
output = re.split('\n', input)
for line in output:
    counter = Counter(line)
    ab = counter.items() #gives list of tuples will be converted to dict
    abdict = dict(ab)
    print abdict
Here is a sample of what I get:
{' ': 393, '-': 5, ',': 1, '.': 1}
{' ': 382, '-': 4, ',': 5, '/': 1, '.': 5, '|': 1, '_': 1, '~': 1}
{' ': 394, '-': 1, ',': 2, '.': 3}
{'!': 1, ' ': 386, 'c': 1, '-': 1, ',': 3, '.': 3, 'v': 1, '=': 1, '\\': 1, '_': 1, '~': 1}
{'!': 3, ' ': 379, 'c': 1, 'e': 1, 'g': 1, ')': 1, 'j': 1, '-': 3, ',': 2, '.': 1, 't': 1, 'z': 2, ']': 1, '\\': 1, '_': 2}
I have 400 such dictionaries and ideally need to merge them into one, but if I understand correctly, Counter does not give me a single combined result; it gives me a separate dictionary for each line, one after another.
Any help would be appreciated.
The + operator merges counters:
>>> Counter('hello') + Counter('world')
Counter({'l': 3, 'o': 2, 'e': 1, 'r': 1, 'h': 1, 'd': 1, 'w': 1})
so you can use sum to combine a collection of them:
from collections import Counter
with open('output.txt', 'r') as f:
    lines = list(f)
counters = [Counter(line) for line in lines]
combined = sum(counters, Counter())
(You also don’t need to use regular expressions to split files into lines; they’re already iterables of lines.)
Here's an MCVE of your problem:
import re
from collections import Counter
data = """this is the first line
this is the second one
this is the last one
"""
output = Counter()
for line in re.split('\n', data):
    output += Counter(line)
print output
Applying the method in your example you'd get this:
import re
from collections import Counter
with open('output.txt', 'r') as f:
    data = f.read()
output = Counter()
for line in re.split('\n', data):
    output += Counter(line)
print output
Related
I hope you are doing well. I'm working on a dictionary program:
strEntry = str(input("Enter a string: ").upper())
strEntry = strEntry.replace(",", "")
strEntry = strEntry.replace(" ", "")
print('')
def freq_finder(strFinder):
    dict = {}
    for i in strFinder:
        keys = dict.keys()
        if i in keys:
            dict[i] += 1
        else:
            dict[i] = 1
    return dict
newDic = freq_finder(strEntry)
print(newDic)
newLetter = str(input("Choose a letter: ").upper())
if newLetter in newDic.values():
    print("Number of occurrence of", message.count(newLetter))
    newDic.pop(newLetter)
    print("Dictinary after that letter removed:", newDic)
else:
    print("Letter not in dictionary")
sortedDic = sorted(newDic)
print(sortedDic)
Everything works fine before this part:
newLetter = str(input("Choose a letter: ").upper())
if newLetter in newDic.values():
    print("Number of occurrence of", message.count(newLetter))
    newDic.pop(newLetter)
    print("Dictinary after that letter removed:", newDic)
else:
    print("Letter not in dictionary")
I'm trying to figure out how to check whether the letter is in the dictionary. If it is not, display the message “Letter not in dictionary”. Otherwise, display the frequency count of that letter, remove the letter from the dictionary and display the dictionary after that letter has been removed.
It should look something like this:
Enter a string: Magee, Mississippi
Dictionary: {'M': 2, 'A': 1, 'G': 1, 'E': 2, 'I': 4, 'S': 4, 'P': 2}
Choose a letter: s
Frequency count of that letter: 4
Dictionary after that letter removed: {'M': 2, 'A': 1, 'G': 1, 'E': 2, 'I': 4, 'P': 2}
Letters sorted: ['A', 'E', 'G', 'I', 'M', 'P']
I would highly appreciate it if you could tell me what's wrong and how to fix it.
Check the keys, not the values (the values are numbers, not letters). Also note that message is never defined in your code; the count is already stored in the dictionary:
if newLetter in newDic:  # same as: if newLetter in newDic.keys()
    print("Number of occurrence of", newDic[newLetter])
To build a dictionary like {'M': 2, 'A': 1, 'G': 1, 'E': 2, 'I': 4, 'S': 4, 'P': 2}, you could use collections.Counter instead of a hand-written loop.
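A minimal sketch of the whole lookup-and-remove step, assuming the input has already been read (the string is hard-coded here instead of calling input(), and remove_letter is a hypothetical helper name, not something from the question):

```python
from collections import Counter

def remove_letter(freq, letter):
    """Return the count stored for `letter` and remove it, or None if absent."""
    if letter in freq:           # membership test looks at the keys
        return freq.pop(letter)  # pop returns the removed value
    return None

# Same clean-up as in the question: upper-case, drop commas and spaces.
text = "Magee, Mississippi".upper().replace(",", "").replace(" ", "")
freq = Counter(text)

count = remove_letter(freq, "S")
if count is None:
    print("Letter not in dictionary")
else:
    print("Frequency count of that letter:", count)
    print("Dictionary after that letter removed:", dict(freq))
print("Letters sorted:", sorted(freq))
```

Returning None for a missing letter is just one possible design; raising KeyError or returning 0 would work equally well depending on how the caller wants to report it.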
I have a list that contains dictionaries with Letters and Frequencies. Basically, I have 53 dictionaries each for every alphabet (lowercase and uppercase) and space.
adict = {'Letter':'a', 'Frequency':0}
bdict = {'Letter':'b', 'Frequency':0}
cdict = {'Letter':'c', 'Frequency':0}
If you input a word, it will scan the word and update the frequency for its corresponding letter.
for ex in range(0, len(temp)):
    if temp[count] == 'a': adict['Frequency']+=1
    elif temp[count] == 'b': bdict['Frequency']+=1
    elif temp[count] == 'c': cdict['Frequency']+=1
For example, if I enter the word "Hello", the letters H, e, l, l, o are detected and their frequencies updated. Entries with non-zero frequencies are then transferred to a new list.
if adict['Frequency'] != 0 : newArr.append(adict)
if bdict['Frequency'] != 0 : newArr.append(bdict)
if cdict['Frequency'] != 0 : newArr.append(cdict)
After this, I had the newArr sorted by Frequency and transferred to a new list called finalArr. Below is a sample list contents for the word "Hello"
{'Letter': 'H', 'Frequency': 1}
{'Letter': 'e', 'Frequency': 1}
{'Letter': 'o', 'Frequency': 1}
{'Letter': 'l', 'Frequency': 2}
Now what I want is to transfer only the values to 2 separate lists: letterArr and numArr. How do I do this? My desired output is:
letterArr = [H,e,o,l]
numArr = [1,1,1,2]
Why don't you just use a collections.Counter? For example:
from collections import Counter
from operator import itemgetter
word = input('Enter a word: ')
c = Counter(word)
letter_arr, num_arr = zip(*sorted(c.items(), key=itemgetter(1,0)))
print(letter_arr)
print(num_arr)
Note the use of sorted() to sort by increasing frequency. itemgetter(1, 0) swaps the order of the fields in the sort key, so the sort is performed first on the frequency and then on the letter. The sorted (letter, frequency) pairs are then separated into two tuples using zip() on the unpacked list.
Demo
Enter a word: Hello
('H', 'e', 'o', 'l')
(1, 1, 1, 2)
The results are tuples, but you can convert to lists if you want with list(letter_arr) and list(num_arr).
I have a hard time understanding your data structure choice for this problem.
Why don't you just go with a dictionary like this:
frequencies = { 'H': 1, 'e': 1, 'l': 2, 'o': 1 }
Which is even easier to implement with a Counter:
from collections import Counter
frequencies = Counter("Hello")
print(frequencies)
>>> Counter({ 'H': 1, 'e': 1, 'l': 2, 'o': 1 })
Then to add another word, you simply use the update method:
frequencies.update("How")
print(frequencies)
>>> Counter({'l': 2, 'H': 2, 'o': 2, 'w': 1, 'e': 1})
Finally, to get your 2 arrays, you can do:
letterArr, numArr = zip(*frequencies.items())
This will give you tuples, though. If you really need lists, just do: list(letterArr)
You wanted a simple answer without extras like zip, collections, itemgetter, etc. This does the minimum to get it done: three lines in a loop.
finalArr= [{'Letter': 'H', 'Frequency': 1},
{'Letter': 'e', 'Frequency': 1},
{'Letter': 'o', 'Frequency': 1},
{'Letter': 'l', 'Frequency': 2}]
letterArr = []
numArr = []
for i in range(len(finalArr)):
    letterArr.append(finalArr[i]['Letter'])
    numArr.append(finalArr[i]['Frequency'])
print letterArr
print numArr
Output is
['H', 'e', 'o', 'l']
[1, 1, 1, 2]
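The same extraction can also be sketched with two list comprehensions, still without zip or any imports. The snake_case names are only for illustration; the data is the finalArr sample shown above:

```python
# Sample data: the sorted list of letter/frequency dictionaries for "Hello".
final_arr = [
    {'Letter': 'H', 'Frequency': 1},
    {'Letter': 'e', 'Frequency': 1},
    {'Letter': 'o', 'Frequency': 1},
    {'Letter': 'l', 'Frequency': 2},
]

# Pull one field out of every dictionary, preserving the list order.
letter_arr = [entry['Letter'] for entry in final_arr]
num_arr = [entry['Frequency'] for entry in final_arr]

print(letter_arr)
print(num_arr)
```

Iterating over the dictionaries directly also removes the need for range(len(...)) and index lookups.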
This is what I did. The questions will be at the end.
1) I first opened a .txt document using open().read() to run a function as follows:
def clean_text_passage(a_text_string):
    new_passage=[]
    p=[line+'\n' for line in a_text_string.split('\n')]
    passage = [w.lower().replace('</b>\n', '\n') for w in p]
    if len(passage[0].strip())>0:
        if len(passage[1].strip())>0:
            new_passage.append(passage[0])
    return new_passage
2) Using the returned new_passage, I converted words into lines of words using the following command:
newone = "".join(new_passage)
3) Then, ran another function as follows:
def replace(filename):
    match = re.sub(r'[^\s^\w+]risk', 'risk', filename)
    match2 = re.sub(r'risk[^\s^\-]+', 'risk', match)
    match3 = re.sub(r'risk\w+', 'risk', match2)
    return match3
Up to this point, everything works fine. Now here is the problem. When I print match3:
i agree to the following terms regarding my employment or continued employment
with dell computer corporation or a subsidiary or affiliate of dell computer
corporation (collectively, "dell").
It looks like the words are arranged in lines. But:
4) I ran the last function by convert = count_words(match3) as follows:
def count_words(newstring):
    from collections import defaultdict
    word_dict=defaultdict(int)
    for line in newstring:
        words=line.lower().split()
        for word in words:
            word_dict[word]+=1
When I print word_dict, it shows as follows:
defaultdict(<type 'int'>, {'"': 2, "'": 1, '&': 4, ')': 3, '(': 3, '-': 4, ',': 4, '.': 9, '1': 7, '0': 8, '3': 2, '2': 3, '5': 2, '4': 2, '7': 2, '9': 2, '8': 1, ';': 4, ':': 2, 'a': 67, 'c': 34, 'b': 18, 'e': 114, 'd': 44, 'g': 15, 'f': 23, 'i': 71, 'h': 22, 'k': 10, 'j': 2, 'm': 31, 'l': 43, 'o': 79, 'n': 69, 'p': 27, 's': 56, 'r': 72, 'u': 19, 't': 81, 'w': 4, 'v': 3, 'y': 16, 'x': 3})
Because the objective of my code is to count a particular word, I need whole words like 'risk' taken from lines (e.g., I like to take risk) instead of single characters like 'I', 'l', 'i'.
Question: how can I make match3 contain words in the same fashion that we get by using readlines() so that I can count words in a line??
When I save match3 as a .txt file, reopen it using readlines(), and then run the count function, it works fine. I do want to know how to make it work without saving and reopening it using readlines()?
Thanks. I hope I could figure this out so that I could sleep.
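Try this. for line in newstring iterates over the string one character at a time, so split it into lines first (and return the dictionary so the caller can use it):
def count_words(newstring):
    from collections import defaultdict
    word_dict=defaultdict(int)
    for line in newstring.split('\n'):
        words=line.lower().split()
        for word in words:
            word_dict[word]+=1
    return word_dict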
tl;dr, the question is how do you split a text by lines?
Then it’s rather simple:
>>> text = '''This is a
longer text going
over multiple lines
until the string
ends.'''
>>> text.split('\n')
['This is a', 'longer text going', 'over multiple lines', 'until the string', 'ends.']
Your match3 is a string, so
for line in newstring:
iterates over the characters in newstring, not the lines. You could simply write
words = newstring.lower().split()
for word in words:
    word_dict[word]+=1
or if you preferred
for line in newstring.splitlines():
    words=line.lower().split()
    for word in words:
        word_dict[word]+=1
or whatever. [I'd use a Counter myself, but defaultdict(int) is almost as good.]
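For comparison, here is a minimal sketch of the Counter variant mentioned above. The sample text is invented for illustration; in the question it would be the match3 string:

```python
from collections import Counter

def count_words(text):
    """Count whitespace-separated words in a (possibly multi-line) string."""
    # str.split() with no arguments splits on any whitespace, including
    # newlines, so there is no need to split into lines first.
    return Counter(text.lower().split())

# Hypothetical stand-in for match3.
sample = "i like to take risk\nrisk is not something i avoid\n"
word_counts = count_words(sample)

print(word_counts["risk"])
print(word_counts.most_common(3))
```

A Counter returns 0 for words it has never seen, so looking up a missing word does not raise KeyError.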
NOTE: in
def replace(filename):
the parameter filename is not a filename! It holds the text itself, so a name like text would be clearer.
This is a question from pyschools.
I did get it right, but I'm guessing that there would be a simpler method. Is this the simplest way to do this?
def countLetters(word):
    letterdict={}
    for letter in word:
        letterdict[letter] = 0
    for letter in word:
        letterdict[letter] += 1
    return letterdict
This should look something like this:
>>> countLetters('google')
{'e': 1, 'g': 2, 'l': 1, 'o': 2}
In 2.7+:
import collections
letters = collections.Counter('google')
Earlier (2.5+, that's ancient by now):
import collections
letters = collections.defaultdict(int)
for letter in word:
    letters[letter] += 1
>>> import collections
>>> print collections.Counter("google")
Counter({'o': 2, 'g': 2, 'e': 1, 'l': 1})
Objective: Convert binary to string
Example: 0111010001100101011100110111010001100011011011110110010001100101 -> testcode (without spaces)
I currently use a dictionary and the function below; I am looking for a better, more efficient way.
from textwrap import wrap
DICO = {'\x00': '00', '\x04': '0100', '\x08': '01000', '\x0c': '01100',
'\x10': '010000', '\x14': '010100', '\x18': '011000', '\x1c': '011100',
' ': '0100000', '$': '0100100', '(': '0101000', ',': '0101100', '0': '0110000',
'4': '0110100', '8': '0111000', '<': '0111100', '@': '01000000',
'D': '01000100', 'H': '01001000', 'L': '01001100', 'P': '01010000',
'T': '01010100', 'X': '01011000', '\\': '01011100', '`': '01100000',
'd': '01100100', 'h': '01101000', 'l': '01101100', 'p': '01110000',
't': '01110100', 'x': '01111000', '|': '01111100', '\x03': '011',
'\x07': '0111', '\x0b': '01011', '\x0f': '01111', '\x13': '010011',
'\x17': '010111', '\x1b': '011011', '\x1f': '011111', '#': '0100011',
"'": '0100111', '+': '0101011', '/': '0101111', '3': '0110011', '7': '0110111',
';': '0111011', '?': '0111111', 'C': '01000011', 'G': '01000111',
'K': '01001011', 'O': '01001111', 'S': '01010011', 'W': '01010111',
'[': '01011011', '_': '01011111', 'c': '01100011', 'g': '01100111',
'k': '01101011', 'o': '01101111', 's': '01110011', 'w': '01110111',
'{': '01111011', '\x7f': '01111111', '\x02': '010', '\x06': '0110',
'\n': '01010', '\x0e': '01110', '\x12': '010010', '\x16': '010110',
'\x1a': '011010', '\x1e': '011110', '"': '0100010', '&': '0100110',
'*': '0101010', '.': '0101110', '2': '0110010', '6': '0110110', ':': '0111010',
'>': '0111110', 'B': '01000010', 'F': '01000110', 'J': '01001010',
'N': '01001110', 'R': '01010010', 'V': '01010110', 'Z': '01011010',
'^': '01011110', 'b': '01100010', 'f': '01100110', 'j': '01101010',
'n': '01101110', 'r': '01110010', 'v': '01110110', 'z': '01111010',
'~': '01111110', '\x01': '01', '\x05': '0101', '\t': '01001', '\r': '01101',
'\x11': '010001', '\x15': '010101', '\x19': '011001', '\x1d': '011101',
'!': '0100001', '%': '0100101', ')': '0101001', '-': '0101101',
'1': '0110001', '5': '0110101', '9': '0111001', '=': '0111101',
'A': '01000001', 'E': '01000101', 'I': '01001001', 'M': '01001101',
'Q': '01010001', 'U': '01010101', 'Y': '01011001', ']': '01011101',
'a': '01100001', 'e': '01100101', 'i': '01101001', 'm': '01101101',
'q': '01110001', 'u': '01110101', 'y': '01111001', '}': '01111101'}
def decrypt(binary):
    """Function to convert binary into string"""
    binary = wrap(binary, 8)
    ch = ''
    for b in binary:
        for i, j in DICO.items():
            if j == b:
                ch += i
    return ch
Thanks in advance.
''.join([ chr(int(p, 2)) for p in wrap(binstr, 8) ])
What this does: wrap first splits your string into chunks of 8. Then I iterate through each chunk and convert it to an integer (base 2). Each of those integers is then converted to a character with chr. Finally, ''.join concatenates all the characters into one string.
A bit more of a breakdown of each step of the chr(int(p, 2)):
>>> int('01101010', 2)
106
>>> chr(106)
'j'
To make it fit into your pattern above:
def decrypt(binary):
    """Function to convert binary into string"""
    binary = wrap(binary, 8)
    ch = ''
    for b in binary:
        ch += chr(int(b, 2))
    return ch
or
def decrypt(binary):
    """Function to convert binary into string"""
    return ''.join([ chr(int(p, 2)) for p in wrap(binary, 8) ])
This is definitely faster since it is just doing the math in place, not iterating through the dictionary over and over. Plus, it is more readable.
If execution speed is the most important thing for you, why not invert the roles of keys and values in your dict? (If you also need the current dict, you could create an inverted version like this: DICO_INVERTED = {v: k for k, v in DICO.items()})
Now, you find directly the searched translation by key instead of looping through the whole dict.
Your new function would look like this:
def decrypt2(binary):
    """Function to convert binary into string"""
    binary = wrap(binary, 8)
    ch = ''
    for b in binary:
        if b in DICO_INVERTED:
            ch += DICO_INVERTED[b]
    return ch
Depending on the size of your binary string, you could gain some time by changing the way you construct your output-string (see Efficient String Concatenation in Python or performance tips - string concatenation). Using join seems promising. I would give it a try: ''.join(DICO_INVERTED.get(b, '') for b in binary)
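A minimal sketch of the inverted-dictionary approach combined with join. It assumes every code is exactly 8 bits wide, so the lookup table is generated with format() for ASCII 0-127 rather than copied from the DICO in the question (whose codes vary in width):

```python
from textwrap import wrap

# Assumption: fixed-width, zero-padded 8-bit codes for ASCII 0-127.
# This table is built for illustration and is not the question's DICO.
DICO_INVERTED = {format(code, '08b'): chr(code) for code in range(128)}

def decrypt2(binary):
    """Convert a string of 8-bit groups into text, one dict lookup per group."""
    # .get(chunk, '') silently skips any group that is not in the table.
    return ''.join(DICO_INVERTED.get(chunk, '') for chunk in wrap(binary, 8))

bits = '0111010001100101011100110111010001100011011011110110010001100101'
print(decrypt2(bits))
```

Each group now costs a single hash lookup instead of a scan over the whole dictionary, and join builds the result string in one pass.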
Did you try this?
def decrypt(binary):
    """Function to convert binary into string"""
    return ''.join(chr(int(''.join(p), 2)) for p in grouper(8, binary, ''))
where grouper is taken from here http://docs.python.org/library/itertools.html#recipes (it yields tuples of characters, hence the inner ''.join)
or
def decrypt2(binary):
    """Function to convert binary into string"""
    return ''.join(DICO_INVERTED[''.join(p)] for p in grouper(8, binary, ''))
This avoids creating a temporary list.
EDIT
As I was chosen as the "right" answer, I have to confess that I used the other answers. The point here is to use a generator expression and iterators rather than a list comprehension.
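The itertools grouper recipe yields tuples of characters, so each group has to be joined back into a string before int() can parse it. A minimal Python 3 sketch, assuming the older recipe argument order grouper(n, iterable, fillvalue) used in this answer (in Python 2 the import would be izip_longest instead of zip_longest):

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length tuples (itertools recipe, old argument order)."""
    # n references to the same iterator, so zip_longest pulls n items per tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def decrypt(binary):
    """Convert a string of 8-bit groups into text without building a list."""
    return ''.join(chr(int(''.join(bits), 2)) for bits in grouper(8, binary, ''))

print(decrypt('0110101001101010'))
```

With '' as the fill value, a trailing group shorter than 8 bits is simply parsed as a shorter binary number rather than raising an error.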