Sorting and Organizing a Dictionary - python

I have a dictionary with many, many key/value pairs.
The keys are dates and the values are worldwide top-level domains.
I want to output the dictionary to a text file so that it counts the values and sorts them alphabetically, but only within the same key.
For example:
*key: value1:count value2:count*
date1: au:4 be:12 com:44
date2: az:4 com:14 net:5
Code:
with open('access_logshort.txt','rU') as f:
    for line in f:
        list1 = re.search(r'(?P<Date>[0-9]{2}/[a-zA-Z]{3}/[0-9]{4})(.+)(GET|POST)\s(http://|https://)([a-zA-Z.]+)(\.)(?P<tld>[a-zA-Z]+)(/).+?"\s200',line)
        if list1 != None:
            print list1.groupdict()
            one_tuple = list1.group(1,7)
            my_dict[one_tuple[0]]=one_tuple[1]
output:
print my_dict
{'09/Mar/2004': 'hu'}
{'09/Mar/2004': 'hu'}
{'09/Mar/2004': 'com'}
{'09/Mar/2004': 'ru'}
{'09/Mar/2004': 'ru'}
{'09/Mar/2004': 'com'}

This should suit your case.
from collections import defaultdict
from dateutil.parser import parse
import csv
import re
data = defaultdict(lambda: defaultdict(int))
with open('access_logshort.txt','rU') as f:
    for line in f:
        list1 = re.search(r'(?P<Date>[0-9]{2}/[a-zA-Z]{3}/[0-9]{4})(.+)(GET|POST)\s(http://|https://)([a-zA-Z.]+)(\.)(?P<tld>[a-zA-Z]+)(/).+?"\s200',line)
        if list1 is not None:
            date, domain = list1.group(1,7)
            data[date.lower()][domain.lower()] += 1
with open('my_data.csv', 'wb') as ofile:
    # add delimiter='\t' to the argument list of csv.writer if you want
    # tsv rather than csv
    writer = csv.writer(ofile)
    for key, value in sorted(data.iteritems(), key=lambda x: parse(x[0])):
        domains = sorted(value.iteritems())
        writer.writerow([key] + ['{}:{}'.format(*d) for d in domains])
Output:
10/Mar/2004,com:2,hu:2,ru:2
09/Mar/2004,com:2,hu:2,ru:2
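For anyone on Python 3, here is a minimal sketch of the same idea that produces the "date: tld:count" layout asked for in the question. The (date, tld) pairs are made-up stand-ins for the regex matches, and format_report is a hypothetical helper name, not part of the answer above. It also assumes English month abbreviations, since strptime's %b is locale-dependent.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical (date, tld) pairs standing in for the regex matches.
pairs = [
    ("10/Mar/2004", "com"), ("09/Mar/2004", "hu"), ("09/Mar/2004", "com"),
    ("09/Mar/2004", "hu"), ("10/Mar/2004", "ru"), ("09/Mar/2004", "ru"),
]

# One Counter of TLDs per date.
counts = defaultdict(Counter)
for date, tld in pairs:
    counts[date][tld.lower()] += 1


def format_report(counts):
    """Return one 'date: tld:count ...' line per date, oldest date first."""
    lines = []
    for date in sorted(counts, key=lambda d: datetime.strptime(d, "%d/%b/%Y")):
        # Sorting the (tld, count) pairs orders them alphabetically by TLD.
        tlds = " ".join(f"{tld}:{n}" for tld, n in sorted(counts[date].items()))
        lines.append(f"{date}: {tlds}")
    return lines


report = format_report(counts)
print("\n".join(report))
```

Writing the lines to a text file is then just a matter of opening the file and writing "\n".join(report).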

Related

Updating dictionary values of keys from a .txt file

I've created a dictionary from the text file "PM.txt", where the key is the player and the value is their penalty minutes. I know dictionary keys have to be unique, so how would I update the value for a key by adding to its current value when the key appears more than once in the file?
"PM.txt"
Neil,2
Paul,5
Neil,10
Santos,2
Neil,2
Santos,10
Paul,2
Alex,2
So far I have this which returns:
{'Alex': 2, 'Santos': 10, 'Paul': 2, 'Neil': 2}
def pm_dict(filename):
    f = open(filename, 'r')
    dict = {}
    for line in f:
        x = line.split(",")
        player = x[0]
        minutes = x[1]
        c = len(minutes)-2
        minutes = minutes[0:c]
        dict[player] = minutes
    return dict
But how would I create a function or a helper for it to return:
{'Alex': 2, 'Santos': 12, 'Paul': 7, 'Neil': 14}
A couple of standard library modules make this problem straightforward. A defaultdict creates a missing key with the default value of its type (0 for int), so you can use D[key] += value even when the key doesn't exist yet. The csv module parses .csv files automatically; the default separator is a comma. Also, avoid using dict as a variable name, because it shadows the built-in dict type.
from collections import defaultdict
import csv
def pm_dict(filename):
    D = defaultdict(int)
    with open(filename, 'r', newline='') as f:
        r = csv.reader(f)
        for key,value in r:
            D[key] += int(value)
    return dict(D) # converts back to a standard dict, but not required.
print(pm_dict('PM.txt'))
Output:
{'Neil': 14, 'Paul': 7, 'Alex': 2, 'Santos': 12}
The values make more sense as numbers, but if you want value strings as in your example the last line of the function can be the following to convert values back to strings. This is a dictionary comprehension.
return {k:str(v) for k,v in D.items()}
First convert the minutes to an int:
minutes = int(x[1])
Then add it to your dictionary:
if player in dict:
    dict[player] += minutes
else:
    dict[player] = minutes
In your for loop, instead of dict[player] = minutes:
if player not in dict:
    dict[player] = 0
dict[player] += minutes
Also, instead of:
minutes = x[1]
c = len(minutes)-2
minutes = minutes[0:c]
You can do:
minutes = x[1].strip()
Just use defaultdict:
from collections import defaultdict
def pm_dict(filename):
    f = open(filename, 'r')
    dict = defaultdict(int)
    for line in f:
        x = line.split(",")
        player = x[0]
        # strip() drops the trailing newline before the conversion
        minutes = x[1].strip()
        dict[player] += int(minutes)
    return dict
After refactoring:
import csv
from collections import defaultdict
...
def get_penalty_minutes_dict(filename):
    result_dict = defaultdict(int)
    with open(filename, 'r') as f:
        for player, minutes in csv.reader(f):
            result_dict[player] += int(minutes)
    return result_dict
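If you would rather not import defaultdict, a plain dict with .get() does the same accumulation. This is only a sketch: the sample from the question is inlined through io.StringIO so it runs without PM.txt, and sum_penalty_minutes is a made-up name.

```python
import csv
import io

# The sample from the question, inlined so the sketch runs without PM.txt.
SAMPLE = "Neil,2\nPaul,5\nNeil,10\nSantos,2\nNeil,2\nSantos,10\nPaul,2\nAlex,2\n"


def sum_penalty_minutes(lines):
    """Sum the minutes per player from an iterable of 'player,minutes' lines."""
    totals = {}
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        player, minutes = row
        # get() returns 0 the first time a player is seen.
        totals[player] = totals.get(player, 0) + int(minutes)
    return totals


totals = sum_penalty_minutes(io.StringIO(SAMPLE))
print(totals)
```

With a real file you would pass the open file object instead of the StringIO.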

Check key, value of nested dictionary in python?

I'm generating a nested dictionary in my program. After generating, I want to iterate through that dictionary, and check for the dictionary key and value.
Program-Code
This is the dictionary I want to iterate whose value contains another dictionary.
main_dict = {101: {1234: [11111,11111],5678: [44444,44444]},
102: {9100: [55555,55555],1112: [77777,88888]}}
I'm reading a csv file and storing contents in this dictionary. Like this :
Input.csv -
lineno,item,total
101,1234,11111
101,1234,11111
101,5678,44444
101,5678,44444
102,9100,55555
102,9100,55555
102,1112,77777
102,1112,88888
This is the input CSV file. I'm reading it and I want to know how many times each total repeats for a given item.
To do that, I'm doing this:
for line in reader:
    if line[0] in main_dict:
        if line[1] in main_dict[line[0]]:
            main_dict[line[0]][line[1]].append(line[2])
        else:
            main_dict[line[0]].update({line[1]:[line[2]]})
    else:
        main_dict[line[0]] = {line[1]:[line[2]]}
print main_dict
Output of above program :
{101: {1234: [11111,11111],5678: [44444,44444]},
102: {9100: [55555,55555],1112: [77777,88888]}}
But I'm getting the following error on this line:
if line[1] in main_dict[line[0]]:
IndexError: list index out of range
Iteration over main_dict:
for key,value in main_dict.iteritems():
    f1 = open(outputfile + op_directory +'/'+ key+'.csv', 'w')
    writer1 = csv.DictWriter(f1, delimiter=',', fieldnames = fieldname)
    writer1.writeheader()
    if type(value) == type({}):
        for k,v in value.iteritems():
            if type(v) == type([]):
                set1 = set(v)
                for se in set1:
                    writer1.writerow({'item':k,'total':se,'total_count':v.count(se)})
What is the best way to iterate over this type of dictionary?
Sometimes I get the correct result, as in the dictionary above, but many times I hit this error. What am I missing?
Thanks in advance!
As the comments pointed out, you are not checking if line is of length 3:
for line in reader:
    if not len(line) == 3:
        continue
Concerning your algorithm, I would use nested defaultdict to avoid the if/else lines.
EDIT: I added a new defaultdict and the csv writing part after the question edit:
from collections import defaultdict
import csv
counter = defaultdict(lambda: defaultdict(list))
main_dict= defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
fieldnames=['item', 'total', 'total_count']
# reader is a csv.reader object
with open('input.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        if not len(line) == 3:
            continue
        # Remove unwanted spaces
        lineno, item, total = [el.strip() for el in line]
        # Do not deal with non digit entries (title for example)
        if not lineno.isdigit():
            continue
        counter[lineno][item].append(total)
        csvdict = {'item': item,
                   'total': total,
                   'total_count': counter[lineno][item].count(total)}
        main_dict[lineno][item][total].update(csvdict)
# The writing part
for lineno in sorted(main_dict):
    itemdict = main_dict[lineno]
    output = 'output_%s.csv' % lineno
    with open(output, 'wb') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=',')
        writer.writeheader()
        for totaldict in itemdict.values():
            for csvdict in totaldict.values():
                writer.writerow(csvdict)
You can then use the following function to print a readable representation of the result:
def myprint(obj, ntab=0):
    if isinstance(obj, (dict, defaultdict)):
        for k in sorted(obj):
            myprint('%s%s'%(ntab*' ', k), ntab+1)
            myprint(obj[k], ntab+1)
    else:
        print('%s%s'%(ntab*' ', obj))
myprint(main_dict)
But if you want to count the item totals, I would use another defaultdict with the total as the key and a list of (lineno, item) tuples as the value:
from collections import defaultdict
import csv
total_dict = defaultdict(list)
# reader is a csv.reader object
with open('input.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        if not len(line) == 3:
            continue
        # Remove unwanted spaces
        lineno, item, total = [el.strip() for el in line]
        # Do not deal with non digit entries (title for example)
        if not lineno.isdigit():
            continue
        total_dict[total].append((lineno, item))
You can then get the number of occurrences of each total very easily:
>>> print len(total_dict['55555'])
2
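If all you need is how often each total repeats for a given lineno and item, a Counter keyed by the (lineno, item, total) tuple may be enough. This is a rough Python 3 sketch, not the answer's code: the CSV from the question is inlined, and count_totals is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# The Input.csv sample from the question, inlined for a self-contained run.
SAMPLE = """lineno,item,total
101,1234,11111
101,1234,11111
101,5678,44444
101,5678,44444
102,9100,55555
102,9100,55555
102,1112,77777
102,1112,88888
"""


def count_totals(lines):
    """Count occurrences of each (lineno, item, total) row."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) != 3:  # short or blank rows caused the IndexError
            continue
        lineno, item, total = (field.strip() for field in row)
        if not lineno.isdigit():  # skips the header row
            continue
        counts[(lineno, item, total)] += 1
    return counts


counts = count_totals(io.StringIO(SAMPLE))
for (lineno, item, total), n in sorted(counts.items()):
    print(lineno, item, total, n)
```

The flat tuple key avoids the nested if/else entirely, at the cost of having to group by lineno yourself when writing one output file per lineno.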

Count unique values per unique keys in python dictionary

I have a dictionary like this:
yahoo.com|98.136.48.100
yahoo.com|98.136.48.105
yahoo.com|98.136.48.110
yahoo.com|98.136.48.114
yahoo.com|98.136.48.66
yahoo.com|98.136.48.71
yahoo.com|98.136.48.73
yahoo.com|98.136.48.75
yahoo.net|98.136.48.100
g03.msg.vcs0|98.136.48.105
in which I have repeated keys and values. What I want is a final dictionary with unique keys (IPs) and a count of unique values (domains). I already have the code below:
for dirpath, dirs, files in os.walk(path):
    for filename in fnmatch.filter(files, '*.txt'):
        with open(os.path.join(dirpath, filename)) as f:
            for line in f:
                if line.startswith('.'):
                    ip = line.split('|',1)[1].strip('\n')
                    semi_domain = (line.rsplit('|',1)[0]).split('.',1)[1]
                    d[ip]= semi_domains
                    if ip not in d:
                        key = ip
                        val = [semi_domain]
                        domains_per_ip[key]= val
but this is not working properly. Can somebody help me out with this?
Use a defaultdict:
from collections import defaultdict
d = defaultdict(set)
with open('somefile.txt') as the_file:
    for line in the_file:
        if line.strip():
            # strip first so the trailing newline does not end up in the key
            value, key = line.strip().split('|')
            d[key].add(value)
for k,v in d.iteritems(): # use d.items() in Python3
    print('{} - {}'.format(k, len(v)))
You can use the zip function to separate the domains and IPs into two lists, then use set to get the unique entries:
>>> f=open('words.txt','r').readlines()
>>> zip(*[i.split('|') for i in f])
[('yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.net', 'g03.msg.vcs0'), ('98.136.48.100\n', '98.136.48.105\n', '98.136.48.110\n', '98.136.48.114\n', '98.136.48.66\n', '98.136.48.71\n', '98.136.48.73\n', '98.136.48.75\n', '98.136.48.100\n', '98.136.48.105')]
>>> [set(dom) for dom in zip(*[i.split('|') for i in f])]
[set(['yahoo.com', 'g03.msg.vcs0', 'yahoo.net']), set(['98.136.48.71\n', '98.136.48.105\n', '98.136.48.100\n', '98.136.48.105', '98.136.48.114\n', '98.136.48.110\n', '98.136.48.73\n', '98.136.48.66\n', '98.136.48.75\n'])]
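Then len gives the number of unique entries in each, all in one line with a list comprehension. Note that the lines are not stripped here, so '98.136.48.105\n' and '98.136.48.105' are counted as two different values: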
>>> [len(i) for i in [set(dom) for dom in zip(*[i.split('|') for i in f])]]
[3, 9]
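To get the count per IP on the sample from the question, something along these lines should work in Python 3. It is a sketch under assumptions: the lines are inlined instead of read from files, and unique_domains_per_ip is a made-up helper name.

```python
from collections import defaultdict

# Sample lines from the question, in 'domain|ip' form.
SAMPLE_LINES = [
    "yahoo.com|98.136.48.100\n",
    "yahoo.com|98.136.48.105\n",
    "yahoo.com|98.136.48.110\n",
    "yahoo.com|98.136.48.114\n",
    "yahoo.com|98.136.48.66\n",
    "yahoo.com|98.136.48.71\n",
    "yahoo.com|98.136.48.73\n",
    "yahoo.com|98.136.48.75\n",
    "yahoo.net|98.136.48.100\n",
    "g03.msg.vcs0|98.136.48.105",
]


def unique_domains_per_ip(lines):
    """Map each IP to the number of distinct domains seen with it."""
    domains_by_ip = defaultdict(set)
    for line in lines:
        line = line.strip()  # drop the newline so equal IPs compare equal
        if not line:
            continue
        domain, ip = line.rsplit("|", 1)
        domains_by_ip[ip].add(domain)  # a set ignores repeated domains
    return {ip: len(domains) for ip, domains in domains_by_ip.items()}


result = unique_domains_per_ip(SAMPLE_LINES)
for ip, n in result.items():
    print(ip, n)
```

Because each value is a set, the same domain appearing twice for one IP is only counted once.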

Python - make a dictionary from a csv file with multiple categories

I am trying to make a dictionary from a csv file in python, but I have multiple categories. I want the keys to be the ID numbers, and the values to be the name of the items. Here is the text file:
"ID#","name","quantity","price"
"1","hello kitty","4","9999"
"2","rilakkuma","3","999"
"3","keroppi","5","1000"
"4","korilakkuma","6","699"
and this is what I have so far:
txt = open("hk.txt","rU")
file_data = txt.read()
lst = [] #first make a list, and then convert it into a dictionary.
for key in file_data:
    k = key.split(",")
    lst.append((k[0],k[1]))
dic = dict(lst)
print(dic)
This just prints an empty list though. I want the keys to be the ID#, and then the values will be the names of the products. I will make another dictionary with the names as the keys and the ID#'s as the values, but I think it will be the same thing but the other way around.
Use the csv module to handle your data; it'll remove the quoting and handle the splitting:
import csv
results = {}
with open('hk.txt', 'r', newline='') as txt:
    reader = csv.reader(txt)
    next(reader, None) # skip the header line
    for row in reader:
        results[row[0]] = row[1]
For your sample input, this produces:
{'4': 'korilakkuma', '1': 'hello kitty', '3': 'keroppi', '2': 'rilakkuma'}
You can use csv DictReader:
import csv
result={}
with open('/tmp/test.csv', 'r', newline='') as f:
    for d in csv.DictReader(f):
        result[d['ID#']]=d['name']
print(result)
# {'1': 'hello kitty', '3': 'keroppi', '2': 'rilakkuma', '4': 'korilakkuma'}
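You can fill a dictionary directly while iterating over the file object (not over the string returned by read()):
dictionary = {}
with open("hk.txt") as file_data:
    file_data.readline() # skip the first line
    for key in file_data:
        key = key.replace('"', '').strip()
        k = key.split(",")
        dictionary[k[0]] = k[1]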
Try this, or use a library to read the file.
txt = open("hk.txt","rU")
file_data = txt.read()
# splitlines() avoids an empty last entry when the file ends with a newline
file_lines = file_data.splitlines()
lst = [] #first make a list, and then convert it into a dictionary.
for linenumber in range(1,len(file_lines)):
    k = file_lines[linenumber].split(",")
    # the slices drop the surrounding double quotes
    lst.append((k[0][1:len(k[0])-1],k[1][1:len(k[1])-1]))
dic = dict(lst)
print(dic)
but you can use the dict directly as well.
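Since the question also mentions wanting a second dictionary with names as keys and IDs as values, here is a small sketch of building both in one pass. The file contents are inlined through io.StringIO so it runs without hk.txt; name_by_id and id_by_name are names chosen for the example. The reverse mapping assumes product names are unique, otherwise later rows overwrite earlier ones.

```python
import csv
import io

# Contents of hk.txt from the question, inlined for a self-contained run.
SAMPLE = '''"ID#","name","quantity","price"
"1","hello kitty","4","9999"
"2","rilakkuma","3","999"
"3","keroppi","5","1000"
"4","korilakkuma","6","699"
'''

# DictReader uses the header row for the keys and removes the quoting.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# ID -> name
name_by_id = {row["ID#"]: row["name"] for row in rows}

# name -> ID: the same mapping flipped (assumes names are unique).
id_by_name = {name: id_ for id_, name in name_by_id.items()}

print(name_by_id)
print(id_by_name)
```

Note that the IDs stay strings here; wrap them in int() if you need numbers.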

Building count dictionary from statistics file

I have a statistics file like this:
dict-count.txt
apple 15
orange 12
mango 10
apple 1
banana 14
mango 4
I need to count the number of each element and create a dictionary like this: {'orange': 12, 'mango': 14, 'apple': 16, 'banana': 14}. I do the following to achieve this:
from __future__ import with_statement
with open('dict-count.txt') as f:
    lines = f.readlines()
output = {}
for line in lines:
    key, val = line.split('\t')
    output[key] = output.get(key, 0) + int(val)
print output
I am particularly concerned about this part:
key, val = line.split('\t')
output[key] = output.get(key, 0) + int(val)
Is there a better way to do this? Or is this the only way?
Thanks.
For a small file, you can use .readlines(), but that will slurp the entire contents of the file into memory in one go. You can write this using the file object f as an iterator; when you iterate it, you get one line of input at a time.
So, the easiest way to write this is to use a defaultdict, as @Amber already showed, but my version doesn't build a list of input lines; it just builds the dictionary as it goes.
I used terse variable names, like d for the dict instead of output.
from __future__ import with_statement
from collections import defaultdict
from operator import itemgetter
d = defaultdict(int)
with open('dict-count.txt') as f:
    for line in f:
        k, v = line.split()
        d[k] += int(v)
# list() so that .sort() also works on Python 3, where items() is a view
lst = list(d.items())
# sort twice: once for alphabetical order, then for frequency (descending).
# Because the Python sort is "stable", we will end up with descending
# frequency, but alphabetical order for any frequency values that are equal.
lst.sort(key=itemgetter(0))
lst.sort(key=itemgetter(1), reverse=True)
for key, value in lst:
    print("%10s| %d" % (key, value))
Use a defaultdict:
from __future__ import with_statement
from collections import defaultdict
output = defaultdict(int)
with open('dict-count.txt') as f:
    for line in f:
        key, val = line.split('\t')
        output[key] += int(val)
print output
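On Python 3, collections.Counter is another option for the same accumulation, and a single sort with a compound key gives descending counts with alphabetical tie-breaking. This is just a sketch: the file contents are inlined (assuming tab-separated columns), and count_items is a made-up name.

```python
from collections import Counter

# Contents of dict-count.txt from the question, assumed tab-separated.
SAMPLE = "apple\t15\norange\t12\nmango\t10\napple\t1\nbanana\t14\nmango\t4\n"


def count_items(lines):
    """Sum the counts per item from 'name<whitespace>count' lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        key, val = line.split()  # splits on any whitespace, tabs included
        counts[key] += int(val)
    return counts


counts = count_items(SAMPLE.splitlines())

# Highest count first; names in alphabetical order when counts are equal.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for key, value in ranked:
    print("%10s| %d" % (key, value))
```

The (-count, name) key does in one pass what the two stable sorts do in the earlier answer.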
