Python: dictionary to collection - python

I have a file with 2 columns:
Anzegem Anzegem
Gijzelbrechtegem Anzegem
Ingooigem Anzegem
Aalst Sint-Truiden
Aalter Aalter
The first column is a town and the second column is the district of that town.
I made a dictionary of that file like this:
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    verzameling = set()
    for line in file:
        tmp = line.split()
        dict[tmp[0]] = tmp[1]
    return dict
If I set a variable writeTown equal to readTowns(text) and look up writeTown['Anzegem'], I want to get the collection {'Anzegem', 'Gijzelbrechtegem', 'Ingooigem'}.
Does anybody know how to do this?

I think you can just create another function that builds the data structure you need. Otherwise you will end up writing code that manipulates the dictionary returned by readTowns to generate the data you want, so why not keep the code clean and write a separate function for it? Just build a dictionary that maps each district name to a list of towns and you are all set.
def writeTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    for line in file:
        tmp = line.split()
        # tmp[0] is the town, tmp[1] is the district
        dict[tmp[1]] = dict.get(tmp[1]) or []
        dict.get(tmp[1]).append(tmp[0])
    return dict

writeTown = writeTowns('file.txt')
print(writeTown['Anzegem'])
And if you are concerned about reading the same file twice, you can build both dictionaries in a single pass:
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict2town = {}  # town -> district
    town2dict = {}  # district -> list of towns
    for line in file:
        tmp = line.split()
        dict2town[tmp[0]] = tmp[1]
        town2dict[tmp[1]] = town2dict.get(tmp[1]) or []
        town2dict.get(tmp[1]).append(tmp[0])
    return dict2town, town2dict

dict2town, town2dict = readTowns('file.txt')
print(town2dict['Anzegem'])

You could do something like this, although please have a look at @ubadub's answer; there are better ways to organise your data.
[town for town, region in dic.items() if region == 'Anzegem']

It sounds like you want to make a dictionary where the keys are the districts and the values are a list of towns.
A basic way to do this is:
def readTowns(text):
    with open(text, 'r') as f:
        lines = f.readlines()
    my_dict = {}
    for line in lines:
        tmp = line.split()
        # tmp[0] is the town, tmp[1] is the district
        if tmp[1] in my_dict:
            my_dict[tmp[1]].append(tmp[0])
        else:
            my_dict[tmp[1]] = [tmp[0]]
    return my_dict
The if/else block can also be replaced with Python's defaultdict (a dict subclass in the collections module), but I've used if/else here for readability.
Some other points: dict and (in Python 2) file are names of built-in types, so it is bad practice to overwrite them with your own local variables (notice I've changed dict to my_dict in the code above).
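For comparison, here is a minimal sketch of the defaultdict variant mentioned above. It collects towns into sets, since the question asked for a collection like {'Anzegem', ...}. The function name group_towns_by_district and the in-memory sample lines are assumptions made for illustration; in practice the lines would come from the open file.

```python
from collections import defaultdict


def group_towns_by_district(lines):
    """Map each district to the set of towns that belong to it."""
    towns_by_district = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        town, district = parts
        # A missing district gets an empty set automatically.
        towns_by_district[district].add(town)
    return towns_by_district


# Sample rows copied from the question; a file object works the same way.
sample = [
    "Anzegem Anzegem",
    "Gijzelbrechtegem Anzegem",
    "Ingooigem Anzegem",
    "Aalst Sint-Truiden",
    "Aalter Aalter",
]
towns = group_towns_by_district(sample)
print(sorted(towns["Anzegem"]))
```

Using a set rather than a list also drops duplicate towns, which may or may not be what you want.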

If you build your dictionary as {town: district}, so that the town is the key and the district is the value, you can't do this easily*, because a dictionary is not meant to be used in that way. Dictionaries let you easily find the value associated with a given key. So if you want to find all the towns in a district, you are better off building your dictionary as:
{district: [list_of_towns]}
So for example the district Anzegem would appear as {'Anzegem': ['Anzegem', 'Gijzelbrechtegem', 'Ingooigem']}
And of course the value is your collection.
*you could probably do it by iterating through the entire dict and checking where your matches occur, but this isn't very efficient.

Related

How to parse very big files in python?

I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and is still going. I don't know how to speed up this step, or whether there is another way to parse such a file.
Several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() returns a list, so every test is a linear scan. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but...
why not use a collections.defaultdict instead?
why not use the csv module?
overriding the eval built-in (well, not really an issue, seeing how dangerous eval is)
My proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
    return evalIDs
The magic evalIDs[ids[0]].append(ids[1]) creates a list if one doesn't already exist. It's also portable, very fast whatever the Python version, and saves an if.
I don't think it could be faster with the standard library, but a pandas solution probably would be.
Some suggestions:
Use a defaultdict(list) instead of creating the inner lists yourself or using dict.setdefault().
dict.setdefault() builds its default value on every call, even when the key already exists, which wastes time. defaultdict(list) only creates a list when a key is missing:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names, you might want to investigate awk, which may be much faster than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. You would need to collect the created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
If your keys are not filenames in their own right, consider storing your different lines into different files and only keep a dictionary of key,filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open("file.txt", "w") as w:
    w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_; you might want to keep file handles
        # in your dict instead, but that depends on the key/data/lines ratio in
        # your data: if you have few keys, file handles ought to be better, if you
        # have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]
print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
Change evalIDs to a collections.defaultdict(list). That way you can drop the if that checks whether a key is already there.
Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use multiprocessing.Pool to parallelise the loading.
Maybe you can make it somewhat faster. Change this:
if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0],[]).append(ids[1])
The first version looks up the key in the evalIDs dictionary three times.
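To make the comparison between these answers concrete, the sketch below builds the same mapping three ways: with an explicit membership test, with dict.setdefault, and with collections.defaultdict. The rows list is made-up sample data standing in for the split lines of the TSV file, and nothing here measures speed.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs standing in for row.split("\t")[:2].
rows = [("1", "tata"), ("2", "yipp"), ("1", "TTtata")]

# Variant 1: explicit membership test on the dict itself (not on .keys()).
explicit = {}
for key, value in rows:
    if key not in explicit:
        explicit[key] = []
    explicit[key].append(value)

# Variant 2: setdefault returns the existing list or stores a new one.
with_setdefault = {}
for key, value in rows:
    with_setdefault.setdefault(key, []).append(value)

# Variant 3: defaultdict creates the list only when the key is missing.
with_defaultdict = defaultdict(list)
for key, value in rows:
    with_defaultdict[key].append(value)

# All three variants produce the same mapping.
print(explicit == with_setdefault == dict(with_defaultdict))
```

Which variant is fastest depends on the data and the Python version, so it is worth timing them on a slice of the real file before committing to one.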

Smaller program will print out key from values in dictionary, but stops when incorporated into larger function?

So I have a problem.
I am wanting to do something similar to this, where I call out a value, and it prints out the keys associated with that value. And I can even get it working:
def test(pet):
    dic = {'Dog': ['der Hund', 'der Katze'] , 'Cat' : ['der Katze'] , 'Bird': ['der Vogel']}
    items = dic.items()
    key = dic.keys()
    values = dic.values()
    for x, y in items:
        for item in y:
            if item == pet:
                print x
However, when I incorporate this same code format into a larger program it stops working:
def movie(movie):
    file = open('/Users/Danrex/Desktop/Text.txt' , 'rt')
    read = file.read()
    list = read.split('\n')
    actorList=[]
    for item in list:
        actorList = actorList + [item.split(',')]
    actorDict = dict()
    for item in actorList:
        if item[0] in actorDict:
            actorDict[item[0]].append(item[1])
        else:
            actorDict[item[0]] = [item[1]]
    items = actorDict.items()
    for x, y in items:
        for item in y:
            if item == movie:
                print x
I have print(ed) out actorDict, items, x, y, and item and they all seem to follow the same format as the previous code so I can't figure out why this isn't working! So confused. And, please, when you explain it to me do it as if I am a complete idiot, which I probably am.
Cleaning up the code with some more idiomatic Python will sometimes clarify things. This is how I would write it in Python 2.7:
from collections import defaultdict

def movie(movie):
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            # drop the trailing newline so the last field compares cleanly
            split_items = line_item.rstrip('\n').split(',')
            actorDict[split_items[0]].append(split_items[1])
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print(actor)
In this case, what mostly boiled out were temporary objects created for making the actorDict. defaultdict creates a dictionary-like object that allows one to specify a function to generate the default value for a key that isn't currently present. See the collections documentation for more info.
What it looks like you're trying to do is print out some actor value for each time they are listed with a particular movie in your text file.
If you're going to check more than one movie, make the actorDict once and reference your movies against that existing actorDict. This will save you trips to disk.
from collections import defaultdict

def make_actor_dict():
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            # drop the trailing newline so the last field compares cleanly
            split_items = line_item.rstrip('\n').split(',')
            actorDict[split_items[0]].append(split_items[1])
    return actorDict  # without this, main() would receive None

def movie(movie, actorDict):
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print(actor)

def main():
    actorDict = make_actor_dict()
    movie('Star Wars', actorDict)
    movie('Indiana Jones', actorDict)
If you only care that the actor was in that movie, you don't have to iterate through the movie list manually, you can just check that movie is in actor_info:
def movie(movie, actorDict):
    for actor in actorDict:
        if movie in actorDict[actor]:
            print(actor)
Of course, you already figured out that the problem was the movie name not being an exact match for the text you read from the file. If you want to allow less-than-exact matches, you should consider normalizing both your movie string and the data strings from the file. The string methods strip() and lower() can be really helpful there.
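As a rough sketch of that normalization idea, the code below strips whitespace and lowercases both the titles read from the data and the title being searched for. The helper names normalize, build_actor_dict and actors_in, as well as the sample lines, are invented for illustration and are not from the original posts.

```python
from collections import defaultdict


def normalize(text):
    """Trim surrounding whitespace and ignore case for comparisons."""
    return text.strip().lower()


def build_actor_dict(lines):
    """Map each actor to the normalized titles listed for them."""
    actor_dict = defaultdict(list)
    for line in lines:
        parts = line.split(",")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        actor, title = parts[0].strip(), normalize(parts[1])
        actor_dict[actor].append(title)
    return actor_dict


def actors_in(movie, actor_dict):
    """Return the actors whose title list contains the normalized movie."""
    wanted = normalize(movie)
    return [actor for actor, titles in actor_dict.items() if wanted in titles]


# Made-up sample lines with inconsistent spacing and capitalization.
sample = [
    "Harrison Ford, Star Wars\n",
    "Mark Hamill,star wars \n",
    "Harrison Ford, Indiana Jones\n",
]
actor_dict = build_actor_dict(sample)
print(sorted(actors_in("Star Wars", actor_dict)))
```

Normalizing once while building the dictionary keeps each later lookup cheap, at the cost of losing the original capitalization of the titles.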

Creating a dictionary of lists from a file

I have a list in the following format in a txt file :
Shoes, Nike, Addias, Puma,...other brand names
Pants, Dockers, Levis,...other brand names
Watches, Timex, Tiesto,...other brand names
how to put these into dictionary like this format:
dictionary={Shoes: [Nike, Addias, Puma,.....]
Pants: [Dockers, Levis.....]
Watches:[Timex, Tiesto,.....]
}
How to do this in a for loop rather than manual input.
i have tried
clothes=open('clothes.txt').readlines()
clothing=[]
stuff=[]
for line in clothes:
    items=line.replace("\n","").split(',')
    clothing.append(items[0])
    stuff.append(items[1:])
Clothing:{}
for d in clothing:
    Clothing[d]= [f for f in stuff]
Here's a more concise way to do things, though you'll probably want to split it up a bit for readability
wordlines = [line.split(', ') for line in open('clothes.txt').read().split('\n')]
d = {w[0]:w[1:] for w in wordlines}
How about:
file = open('clothes.txt')
clothing = {}
for line in file:
    items = [item.strip() for item in line.split(",")]
    clothing[items[0]] = items[1:]
Try this; it strips the line break from each line and is simple but effective:
clothes = {}
with open('clothes.txt', 'r') as clothesfile:
    for line in clothesfile:
        # strip() removes the trailing line break before splitting
        fields = line.strip().split(',')
        key = fields[0]
        value = fields[1:]
        clothes[key] = value
The 'with' statement makes sure the file is closed once the block that builds the dictionary has run. From there you can use the dictionary to your heart's content!
Using a list comprehension you could do:
clothes=[line.strip() for line in open('clothes.txt').readlines()]
clothingDict = {}
for line in clothes:
    arr = line.split(",")
    clothingDict[arr[0]] = [arr[i] for i in range(1,len(arr))]
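Another option, not mentioned in the answers above, is the standard csv module, which can skip the space after each comma for you. This is only a sketch: the io.StringIO sample stands in for an open clothes.txt, and its contents are shortened from the question.

```python
import csv
import io

# Stand-in for open('clothes.txt'); includes a blank line on purpose.
sample = io.StringIO(
    "Shoes, Nike, Puma\n"
    "Pants, Dockers, Levis\n"
    "\n"
    "Watches, Timex\n"
)

clothing = {}
# skipinitialspace=True ignores the whitespace that follows each comma.
for row in csv.reader(sample, skipinitialspace=True):
    if not row:
        continue  # csv.reader yields an empty list for a blank line
    category, *brands = row
    clothing[category] = brands

print(clothing)
```

Note that skipinitialspace only handles whitespace after the delimiter, so trailing spaces before a comma would still need an explicit strip().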

Python Replacing Words from Definitions in Text File

I've got an old informix database that was written for cobol. All the fields are in code so my SQL queries look like.
SELECT uu00012 FROM uu0001;
This is pretty hard to read.
I have a text file with the field definitions like
uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name
I would like to swap out the code for the more english name. Run a python script on it maybe and have a file with the english names saved.
What's the best way to do this?
If each piece is definitely a separate word, re.sub is definitely the way to go here:
import re
import sys

# create a mapping of old vars to new vars.
with open('definitions') as f:
    d = dict(x.split() for x in f)

def my_replace(match):
    # if the match is in the dictionary, replace it; otherwise return the match unchanged.
    return d.get(match.group(), match.group())

with open('inquiry') as f:
    for line in f:
        # each line already ends with a newline, so write it without adding another
        sys.stdout.write(re.sub(r'\w+', my_replace, line))
Conceptually,
I would probably first build a mapping of codings -> english (in memory or o.
Then, for each coding in your map, scan your file and replace it with the code's mapped English equivalent.
infile = open('filename.txt','r')
namelist = []
for each in infile.readlines():
    # split() without arguments also drops the trailing newline
    code, name = each.split()
    namelist.append((code, name))
This will give you a list of (key, value) pairs.
I don't know what you want to do with the results from there, though; you need to be more explicit.
dictionary = '''uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name'''
dictionary = dict(map(lambda x: (x[1], x[0]), [x.split() for x in dictionary.split('\n')]))
def process_sql(sql, d):
    for k, v in d.items():
        sql = sql.replace(k, v)
    return sql
sql = process_sql('SELECT f_name FROM client;', dictionary)
This builds the dictionary:
{'date': 'uu00013', 'l_name': 'uu00015', 'f_name': 'uu00014', 'client': 'uu00012'}
Then run through your SQL and replace the human-readable names with the coded ones. The result is:
SELECT uu00014 FROM uu00012;
import re
f = open("dictfile.txt")
d = {}
for mapping in f.readlines():
    l, r = mapping.split(" ")
    d[re.compile(l)] = r.strip("\n")
sql = open("orig.sql")
out = open("translated.sql", "w")  # file() only exists in Python 2
for line in sql.readlines():
    for r in d.keys():
        line = r.sub(d[r], line)
    out.write(line)
out.close()
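One thing to watch with per-code substitution is that a short code such as uu0001 is also a prefix of uu00012. The sketch below is one possible way around that: it compiles a single pattern with word boundaries and translates in one pass. The definitions string reuses the sample from the question, except for the uu0001 line, whose English name clients_table is invented here for illustration.

```python
import re

# Field definitions from the question, plus a hypothetical name for uu0001.
definitions = """uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name
uu0001 clients_table"""

mapping = dict(line.split() for line in definitions.splitlines())

# One alternation of all codes; \b stops uu0001 from matching inside uu00012.
codes = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, codes)) + r")\b")


def translate(sql):
    """Replace every known code in sql with its English name."""
    return pattern.sub(lambda m: mapping[m.group(0)], sql)


print(translate("SELECT uu00012 FROM uu0001;"))
```

Words that are not in the mapping are left untouched, so the function can be applied to a whole SQL file line by line.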

how to use string as list's indices in Python

for line in f.readlines():
    (addr, vlanid, videoid, reqs, area) = line.split()
    if vlanid not in dict:
        dict[vlanid] = []
    video_dict = dict[vlanid]
    if videoid not in video_dict:
        video_dict[videoid] = []
    video_dict[videoid].append((addr, vlanid, videoid, reqs, area))
Here is my code. I want to use videoid as the index to create a list. The real videoid values are different strings like this: FYFSYJDHSJ
I got this error message:
video_dict[videoid] = []
TypeError: list indices must be integers, not str
But now how to add identifier like 1,2,3,4 for different strings in this case?
Use a dictionary instead of a list:
if vlanid not in dict:
    dict[vlanid] = {}
P.S. I recommend that you call dict something else so that it doesn't shadow the built-in dict.
Don't use dict as a variable name. Try this (d instead of dict):
d = {}
for line in f.readlines():
    (addr, vlanid, videoid, reqs, area) = line.split()
    video_dict = d.setdefault(vlanid, {})
    video_dict.setdefault(videoid, []).append((addr, vlanid, videoid, reqs, area))
As suggested above, dictionaries are the right tool here (although you should avoid calling them dict, as that name means something important to Python).
Your code may look something like what @aix has already posted above:
for line in f.readlines():
    d = dict(zip(("addr", "vlanid", "videoid", "reqs", "area"), tuple(line.split())))
You would be able to do something with the dictionary d later in your code. Just remember that d is rebuilt on every iteration, so if you don't use d until after the loop is complete, you'll only get the values from the last line of the file.
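The nested setdefault calls in the earlier answer can also be written with a nested defaultdict. This is a small sketch under the assumption that every line has exactly five whitespace-separated fields; the sample lines and the name records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input lines: addr, vlanid, videoid, reqs, area.
lines = [
    "10.0.0.1 100 FYFSYJDHSJ 3 north",
    "10.0.0.2 100 FYFSYJDHSJ 1 south",
    "10.0.0.3 200 ABCDEFGHIJ 2 north",
]

# Outer dict: vlanid -> inner dict; inner dict: videoid -> list of records.
records = defaultdict(lambda: defaultdict(list))
for line in lines:
    addr, vlanid, videoid, reqs, area = line.split()
    records[vlanid][videoid].append((addr, vlanid, videoid, reqs, area))

print(len(records["100"]["FYFSYJDHSJ"]))
```

Because string keys index dictionaries directly, there is no need to invent numeric identifiers such as 1, 2, 3, 4 for the different videoid strings.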
