How to write a function that measures the frequency of each line (objects) - Python

Write a function create_dictionary(filename) that reads the named file and returns a dictionary mapping from object names to occurrence counts (the number of times the particular object was guessed). For example, given a file mydata.txt containing the following:
abacus
calculator
modern computer
abacus
modern computer
large white thing
modern computer
So, when I enter this:
dictionary = create_dictionary('mydata.txt')
for key in dictionary:
    print(key + ': ' + str(dictionary[key]))
The function must return the following dictionary format:
{'abacus': 2, 'calculator': 1, 'modern computer': 3, 'large white thing': 1}
Among other things, I know how to count the frequency of words. But how does one count the frequency of each line as above?
Here are some constraints:
You may assume the given file exists, but it may be empty (i.e. containing no lines).
Keys must be inserted into the dictionary in the order in which they appear in the input file. In some of the tests we display the keys in insertion order; in others we sort the keys alphabetically.
Leading and trailing whitespace should be stripped from object names.
Empty object names (e.g. blank lines or lines with only whitespace) should be ignored.

An easier way to achieve this is to use collections.Counter.
Let the file be named a.txt:
from collections import Counter
s = open('a.txt','r').read().strip()
print(Counter(s.split('\n')))
The output will be as follows:
Counter({'abacus': 2,
'calculator': 1,
'large white thing': 1,
'modern computer': 3})

Further to what @bigbounty has suggested, here is what I could come up with.
from collections import Counter
def create_dictionary(filename):
    """Blah"""
    keys = Counter()
    s = open(filename,'r').read().strip()
    keys = (Counter(s.split('\n')))
    return keys
So, if I type:
dictionary = create_dictionary('mydata.txt')
for key in dictionary:
    print(key + ': ' + str(dictionary[key]))
I get:
abacus: 2
calculator: 1
modern computer: 3
large white thing: 1
But I need some help with how to print nothing if the text file is empty.
For example, consider an empty text file ('nothing.txt'). The expected output is blank, but I don't know how to omit the default value ' : 1' for keys. Any advice?
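The stray ': 1' appears because reading an empty file yields an empty string, and ''.split('\n') returns [''], so the Counter receives a single empty key. A minimal sketch that avoids this, assuming the constraints listed in the question (strip each line, skip blank lines, keep first-appearance order) and Python 3.7+ where dictionaries preserve insertion order, reads the file line by line:

```python
from collections import Counter


def create_dictionary(filename):
    """Return a dict mapping each stripped, non-blank line to its count."""
    counts = Counter()
    with open(filename) as infile:
        for line in infile:
            name = line.strip()
            if name:  # skip blank or whitespace-only lines
                counts[name] += 1
    # Counter is a dict subclass; convert so the result is a plain dict
    return dict(counts)
```

With an empty file the loop body never runs, so the function returns {} and the printing loop from the question prints nothing.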

Related

How to combine list items into a dictionary where some list items have the same key?

This is the file that I am working with called file1.txt
20
Gunsmoke
30
The Simpsons
10
Will & Grace
14
Dallas
20
Law & Order
12
Murder, She Wrote
And here is my code so far:
file = open('file1.txt')
lines = file.readlines()
print(lines)
new_list=[]
for i in lines:
    new = i.strip()
    new_list.append(new)
print(new_list)
new_dict = {}
for i in range(0,len(new_list),2):
    new_dict[new_list[i]]=new_list[i+1]
    if i in new_dict:
        i[key] = i.values()
new_dict = dict(sorted(new_dict.items()))
print(new_dict)
file_2 = open('output_keys.txt', 'w')
for x, y in new_dict.items():
    print(x, y)
    file_2.write(x + ': ')
    file_2.write(y)
    file_2.write('\n')
file_2.close()
file_3 = open('output_titles.txt', 'w')
new_list2 = []
for x, y in new_dict.items():
    new_list2.append(y)
    new_list2.sort()
    print(new_list2)
print(new_list2)
for i in new_list2:
    file_3.write(i)
    file_3.write('\n')
    print(i)
file_3.close()
The instructions state:
Write a program that first reads in the name of an input file and then reads the input file using the file.readlines() method. The input file contains an unsorted list of number of seasons followed by the corresponding TV show. Your program should put the contents of the input file into a dictionary where the number of seasons are the keys, and a list of TV shows are the values (since multiple shows could have the same number of seasons).
Sort the dictionary by key (least to greatest) and output the results to a file named output_keys.txt. Separate multiple TV shows associated with the same key with a semicolon (;), ordering by appearance in the input file. Next, sort the dictionary by values (alphabetical order), and output the results to a file named output_titles.txt.
I am having trouble with two parts:
First is "Separate multiple TV shows associated with the same key with a semicolon (;)".
What I have written so far just replaces the new item in the dictionary.
for i in range(0,len(new_list),2):
    new_dict[new_list[i]]=new_list[i+1]
    if i in new_dict:
        i[key] = i.values()
The 2nd part is that in the Zybooks program it seems to add onto output_keys.txt and output_title.txt every time it iterates. But my code does not seem to add to output_keys and output_title. For example, if after I run file1.txt I then try to run file2.txt, it replaces output_keys and output_title instead of adding to it.
Try to break down the problem into smaller sub-problems. Right now, it seems like you're trying to solve everything at once. E.g., I'd suggest you omit the file input and output and focus on the basic functionality of the program. Once that is set, you can go for the I/O.
You first need to create a dictionary with numbers of seasons as keys and a list of tv shows as values. You almost got it; here's a working snippet (I renamed some of your variables: it's always a good idea to have meaningful variable names):
lines = file.readlines()
# formerly "new_list"
clean_lines = []
for line in lines:
    line = line.strip()
    clean_lines.append(line)
# formerly "new_dict"
seasons = {}
for i in range(0, len(clean_lines), 2):
    season_num = int(clean_lines[i])
    series = clean_lines[i+1]
    # there are only two options: either
    # the season_num is already in the dict...
    if season_num in seasons:
        # append to the existing entry
        seasons[season_num].append(series)
    # ...or it isn't
    else:
        # make a new entry with a list containing
        # the series
        seasons[season_num] = [series]
Here's how you can print the resulting dictionary with the tv shows separated by semicolon using join. Adapt to your needs:
for season_num, series in seasons.items():
    print(season_num, '; '.join(series))
Output:
20 Gunsmoke; Law & Order
30 The Simpsons
10 Will & Grace
14 Dallas
12 Murder, She Wrote
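To get the ordering the instructions ask for in output_keys.txt, the dictionary can be sorted by key before the titles are joined. The following is only a sketch: the helper name format_by_key is hypothetical, and the 'key: titles' line layout is borrowed from the asker's own code rather than from a confirmed specification.

```python
def format_by_key(seasons):
    """Return one 'key: title; title' line per key, keys in ascending order."""
    lines = []
    for season_num, series in sorted(seasons.items()):
        # Titles keep their order of appearance within each key
        lines.append(f"{season_num}: {'; '.join(series)}")
    return lines


# Same data as the sample input file in the question
seasons = {20: ['Gunsmoke', 'Law & Order'], 30: ['The Simpsons'],
           10: ['Will & Grace'], 14: ['Dallas'], 12: ['Murder, She Wrote']}

for line in format_by_key(seasons):
    print(line)
```

The returned lines can be written to a file instead of printed. Note that opening a file with mode 'w' replaces its previous contents, while mode 'a' appends to them.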
I see you are trying to check whether the key already exists in the dictionary, but there is a mistake: you are testing the loop index rather than the key value. You also need to do the check before writing to the dictionary; if the key already exists, update the current value by appending ';' and the new title to it.
for i in range(0, len(new_list), 2):
    if new_list[i] not in new_dict:
        new_dict[new_list[i]] = new_list[i+1]
    else:
        # Key already exists: append the new title after a semicolon
        new_dict[new_list[i]] = new_dict[new_list[i]] + ";" + new_list[i+1]

loop through Pandas DF and append values to a list which is a value of a dictionary where conditional value is the key

It is very hard to make a short but descriptive title for this, but I have a dataframe where each row is a character's line, with the entire corpus being the entire show. I want to create a dictionary where the keys are the top characters, loop through the DF, and append each dialogue line to that key's value, which I want as a list.
I have a column called 'Character' and a column called 'dialogue':
Character  dialogue
PICARD     'You will agree Data that Starfleets order are...'
DATA       'Difficult? Simply solve the mystery of Farpoint Station.'
PICARD     'As simple as that.'
TROI       'Farpoint Station. Even the name sounds mysterious.'
And so on... There are many minor characters, so I just want the top 10 characters by dialogue count; I have a list of them called main_chars. I want a final dictionary where each character is the key and the value is a list of all their lines.
I don't know how to append to an empty list set up as the value for each key. My code thus far is:
char_corpuses = {}
for label, row in df.iterrows():
    for char in main_chars:
        if row['Character'] == char:
            char_corpuses[char] = [row['dialogue']]
But the end result is only the last line each Character says in the corpus:
{'PICARD': [' so five card stud nothing wild and the skys the limit'],
'DATA': [' would you care to deal sir'],
'TROI': [' you were always welcome'],
'WORF': [' agreed'],
'Q': [' youll find out in any case ill be watching and if youre very lucky ill drop by to say hello from time to time see you out there'],
'RIKER': [' of course have a seat'],
'WESLEY': [' i will bye mom'],
'CRUSHER': [' you know i was thinking about what the captain told us about the future about how we all changed and drifted apart why would he want to tell us whats to come'],
'LAFORGE': [' sure goes against everything weve heard about not polluting the time line doesnt it'],
'GUINAN': [' thank you doctor this looks like a great racquet but er i dont play tennis never have']}
How do I get it to keep the earlier lines instead of keeping only the last line for each character?
Try something like this:
char_corpuses = {}
for char in main_chars:
    char_corpuses[char] = df[df['Character'] == char]['dialogue'].values
This line char_corpuses[char] = [row['dialogue']] overwrites the contents of the list with current dialogue line each time the loop runs. It writes a single element rather than appending.
For a 'vanilla' dictionary try:
import pandas
d = {'Character': ['PICARD', 'DATA', 'PICARD'], 'dialogue': ['You will agree Data that Starfleets order are...', 'Difficult? Simply solve the mystery of Farpoint Station.', 'As simple as that.']}
df = pandas.DataFrame(data=d)
main_chars = ['PICARD', 'DATA']
char_corpuses = {}
for label, row in df.iterrows():
    for char in main_chars:
        if row['Character'] == char:
            try:
                # Try to append the current dialogue line to array
                char_corpuses[char].append(row['dialogue'])
            except KeyError:
                # The key doesn't exist yet, create empty list for the key [char]
                char_corpuses[char] = []
                char_corpuses[char].append(row['dialogue'])
Output
{'PICARD': ['You will agree Data that Starfleets order are...', 'As simple as that.'], 'DATA': ['Difficult? Simply solve the mystery of Farpoint Station.']}
TopHowmany = 10 # This you can change as you want.
# Keep only the rows spoken by the most frequent characters
subDF = df[df.Character.isin(df.Character.value_counts()[0:TopHowmany].index)]
char_corpuses = {}
for x in subDF.index:
    char = subDF.loc[x,'Character']
    dialogue = subDF.loc[x,'dialogue']
    if char in char_corpuses:
        # Append the dialogue text itself, not the literal string 'dialogue'
        char_corpuses[char].append(dialogue)
    else:
        char_corpuses[char] = [dialogue]
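The same grouping can be sketched without the try/except or if/else branches by using collections.defaultdict. This is an illustration under assumptions: the helper name group_dialogue is hypothetical, and the rows are passed in as (character, dialogue) pairs, which for the DataFrame in the question could be produced with zip(df['Character'], df['dialogue']).

```python
from collections import defaultdict


def group_dialogue(rows, main_chars):
    """Collect every dialogue line per character, in order of appearance."""
    wanted = set(main_chars)  # set membership test is cheap
    corpuses = defaultdict(list)
    for character, dialogue in rows:
        if character in wanted:
            # defaultdict creates the empty list on first access
            corpuses[character].append(dialogue)
    return dict(corpuses)


# Sample rows taken from the question
rows = [
    ('PICARD', 'You will agree Data that Starfleets order are...'),
    ('DATA', 'Difficult? Simply solve the mystery of Farpoint Station.'),
    ('PICARD', 'As simple as that.'),
    ('TROI', 'Farpoint Station. Even the name sounds mysterious.'),
]
print(group_dialogue(rows, ['PICARD', 'DATA']))
```

Because each row is visited once and appended to the list for its own character, no earlier line is overwritten.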

Python Splitting A String to make several keys for a dictionary

I am trying to write a function that will read a text file, extract the information it needs from a line of text, and then assign that information to a key in a Python dictionary. However, here is the problem I have.
def read_star_names(filename):
    """
    Given the name of a file containing a star catalog in CSV format, produces a dictionary
    where the keys are the names of the stars and the values are Henry Draper numbers as integers.
    If a star has more than one name, each name will appear as a key
    in the dictionary. If a star does not have a name it will not be
    represented in this dictionary.
    example return: {456: 'BETA', 123: 'ALPHA', 789: 'GAMMA;LITTLE STAR'}
    """
    result_name = {}
    starfile = open(filename, 'r')
    for dataline in starfile:
        items = dataline.strip().split(',')
        draper = int(items[3])
        name = str(items[6])
        result_name[name] = draper
    starfile.close()
    return result_name
This is attempting to read this:
0.35,0.45,0,123,2.01,100,ALPHA
-0.15,0.25,0,456,3.2,101,BETA
0.25,-0.1,0,789,4.3,102,GAMMA;LITTLE STAR
The problem I am having is that what it returns is this:
{'ALPHA': 123, 'GAMMA;LITTLE STAR': 789, 'BETA': 456}
I want GAMMA and LITTLE STAR to be separate keys, but still refer to the same number, 789.
How should I proceed?
I tried splitting the line of text at the semicolon but then that added indexes and I had a hard time managing them.
Thanks.
You have already isolated the part that contains all the names. All you need to do is separate the names and make a separate key for each of them, like so:
for i in name.split(";"):
    result_name[i] = draper
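Put together, the whole function might look like the sketch below. It assumes, as in the sample data, that the Henry Draper number is in the fourth column and the names are in the seventh; skipping short rows and empty names is an assumption based on the docstring's statement that unnamed stars are not represented.

```python
def read_star_names(filename):
    """Map every star name to its Henry Draper number (an int)."""
    result_name = {}
    with open(filename, 'r') as starfile:
        for dataline in starfile:
            items = dataline.strip().split(',')
            if len(items) < 7:
                continue  # blank or incomplete row: no name column
            draper = int(items[3])
            # A star with several names stores them separated by ';'
            for name in items[6].split(';'):
                name = name.strip()
                if name:  # skip stars without a name
                    result_name[name] = draper
    return result_name
```

For the three sample rows this gives GAMMA and LITTLE STAR as separate keys that both map to 789.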

retrieving name from number ID

I have code that takes data from online where items are referred to by a number ID, compares data about those items, and builds a list of item ID numbers based on some criteria. What I'm struggling with is taking this list of numbers and turning it into a list of names. I have a text file with the numbers and corresponding names but am having trouble using it because it contains multi-word names and retains the \n at the end of each line when I try to parse the file in any way with Python. The text file looks like this:
number name\n
14 apple\n
27 anjou pear\n
36 asian pear\n
7645 langsat\n
I have tried split(), as well as replacing the white space between with several different things, to no avail. I asked a question earlier which yielded a lot of progress but still didn't quite work. The two methods that were suggested were:
d = dict()
f=open('file.txt', 'r')
for line in f:
    number, name = line.split(None,1)
    d[number] = name
this almost worked but still left me with the \n so if I call d['14'] i get 'apple\n'. The other method was:
import re
f=open('file.txt', 'r')
fr=f.read()
r=re.findall(r"(\w+)\s+(.+)", fr)
this seemed to have gotten rid of the \n at the end of every name but leaves me with the problem of having a tuple with each number-name combo being a single entry so if i were to say r[1] i would get ('14', 'apple'). I really don't want to delete each new line command by hand on all ~8400 entries...
Any recommendations on how to get the corresponding name given a number from a file like this?
In your first method, change the line d[number] = name to d[number] = name[:-1]. This simply strips off the last character, which removes your \n. (If the last line of the file might lack a trailing newline, name.rstrip('\n') is safer.)
names = {}
with open("id_file.txt") as inf:
    header = next(inf, '') # skip header row
    for line in inf:
        id, name = line.split(None, 1)
        names[int(id)] = name.strip()
names[27] # => 'anjou pear'
Use this to modify your first approach:
raw_dict = dict()
cleaned_dict = dict()
Assuming you've imported the file into a dictionary:
raw_dict = {14:"apple\n",27:"anjou pear\n",36 :"asian pear\n" ,7645:"langsat\n"}
for keys in raw_dict:
    cleaned_dict[keys] = raw_dict[keys][:len(raw_dict[keys])-1]
So now, cleaned_dict is equal to:
{27: 'anjou pear', 36: 'asian pear', 7645: 'langsat', 14: 'apple'}
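Once the lookup dictionary exists, turning a list of ID numbers into names is a single pass. A minimal sketch, assuming the whitespace-separated layout with one header row shown in the question; the helper names load_names and ids_to_names are hypothetical, and load_names takes any iterable of lines so that an open file can be passed to it directly.

```python
def load_names(lines):
    """Build {id: name} from 'number name' lines, skipping the header row."""
    names = {}
    lines = iter(lines)
    next(lines, None)  # skip the header row
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) == 2:
            names[int(parts[0])] = parts[1].strip()  # strip() drops the '\n'
    return names


def ids_to_names(ids, names):
    """Translate IDs to names, silently skipping IDs that are not in the file."""
    return [names[i] for i in ids if i in names]


# Sample lines as they would be read from the file in the question
sample = ["number name\n", "14 apple\n", "27 anjou pear\n",
          "36 asian pear\n", "7645 langsat\n"]
names = load_names(sample)
print(ids_to_names([27, 7645, 14], names))
```

Skipping unknown IDs is a design choice made for this sketch; raising an error instead may be preferable if every ID is expected to be present.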

best way to compare sequence of letters inside file?

I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all against all.
I'm doing something like this, but this isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know whether any sequence is exactly equal to 1, or to 2, and so on.
If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
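from itertools import groupby
from Bio import SeqIO
# SeqIO.parse accepts a filename; the Python 2 built-in file() no longer exists
records = list(SeqIO.parse('spoo.fa', 'fasta'))
def seq_getter(s): return str(s.seq)
# groupby only merges adjacent items, so sort by sequence first
records.sort(key=seq_getter)
for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print('>%s' % ids)
    print(seq)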
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built in data types that use hash: set and dict. It's best to use dict here as we can store the line numbers of all the matches.
I've assumed the file has identifiers and labels on alternate lines, so if we split the file text on new lines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
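from collections import defaultdict
# filetext is assumed to hold the full contents of the file as one string
lines = filetext.splitlines()
sequences = defaultdict(list)
# Consume the lines in pairs: an id line followed by its sequence line
while len(lines) >= 2:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)
results = [match for match in sequences.values() if len(match) > 1]
print(results)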
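A self-contained Python 3 version of this idea, without Biopython, could look like the sketch below. It assumes each '>' header line is followed by exactly one sequence line, as in the sample file; the function names group_ids_by_sequence and duplicates are hypothetical.

```python
from collections import defaultdict


def group_ids_by_sequence(lines):
    """Map each distinct sequence to the list of ids that carry it."""
    sequences = defaultdict(list)
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            current_id = line[1:]  # header line: remember the id
        elif current_id is not None:
            sequences[line].append(current_id)
    return dict(sequences)


def duplicates(sequences):
    """Return only the id groups that share a sequence."""
    return [ids for ids in sequences.values() if len(ids) > 1]


# Shortened stand-in data; real input would be the lines of the FASTA file
sample = [">1", "GTCA", ">2", "GTCG", ">3", "GTCA", ">4", "AAAA"]
groups = group_ids_by_sequence(sample)
print(duplicates(groups))
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.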
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the numbers (the first part of each line) where these sequences occur.
#!/usr/bin/python
import sys
from collections import defaultdict
def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result
if __name__ == '__main__':
    filename = sys.argv[1]
    # items() replaces the Python 2 iteritems(); print is a function in Python 3
    for sequence, occurrences in count_sequences(filename).items():
        print("%s: %s, found in %s" % (sequence, len(occurrences), occurrences))
Sample output:
etc@etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed code to use defaultdict and for loop. Thanks @KennyTM.
Update 2
Changed code to use append rather than +. Thanks @Dave Webb.
