The task I have at hand is to parse a large text file (several hundred thousand rows) and accumulate some statistics, which will then be visualized in plots. Each row contains the results of some prior analysis.
I wrote a custom class to define the objects that are to be accumulated. The class contains two string fields, three sets and two integer counters. It has an __init__(self, name) which initializes a new object with a name and empty fields, and a method called addRow() which adds information to the object. The sets accumulate data to be associated with the object, and the counters keep track of a couple of conditions.
My original idea was to iterate over the rows of the file in main and call a method like parseRow():
reader = csv.reader(f)
acc = {}  # or set()
for row in reader:
    parseRow(row, acc)
which would look something like:
def parseRow(row, acc):
    if row[id] not in acc:  # row[id] is the column where the object names/ids are
        a = MyObj(row[id])
    else:
        a = acc.get(row[id])  # or equivalent
    a.addRow(...)
The issue here is that the accumulating collection acc cannot be a set since sets are apparently not indexable in Python. Edit: for clarification, by indexable I didn't mean getting the nth element but rather being able to retrieve a specific element.
One workaround would be to have a dict with an {obj_name: obj} mapping, but it feels like an ugly solution. Considering the elegance of the language otherwise, I guess there is a better solution to this. It's surely not a particularly rare situation...
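Roughly, the dict workaround I have in mind would look something like this (a quick, untested sketch using the MyObj class described above):

acc = {}  # maps object name/id -> MyObj instance
for row in reader:
    if row[id] not in acc:
        acc[row[id]] = MyObj(row[id])  # first time we see this name: create and store it
    acc[row[id]].addRow(...)           # accumulate this row into the stored object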
Any suggestions?
You could also try an ordered-set, which is a set AND ordered.
I have the following code that works perfectly. It searches a txt file for an ID number, and if it exists, returns the first and last name.
full listing: https://repl.it/Jau3/0
import csv

#==========Search by ID number. Return Just the Name Fields for the Student
with open("studentinfo.txt", "r") as f:
    studentfileReader = csv.reader(f)
    id = input("Enter Id:")
    for row in studentfileReader:
        for field in row:
            if field == id:
                currentindex = row.index(id)
                print(row[currentindex+1] + " " + row[currentindex+2])
File contents
001,Joe,Bloggs,Test1:99,Test2:100,Test3:33
002,Ash,Smith,Test1:22,Test2:63,Test3:99
For teaching and learning purposes, I would like to know if there are any other methods to achieve the same thing (elegant, simple, pythonic) or if perhaps this is indeed the best solution?
My question arises from the fact that it seems possible there may be a built-in method or some function that more efficiently retrieves the current index and searches the fields... perhaps not, though.
Thanks in advance for the discussion and any explanations, which I will accept as answers.
If the file keeps this format, you could access the fields of the row by index to condense it a bit:
for row in studentfileReader:
    if row[0] == id:
        print(row[1] + " " + row[2])
It also avoids a false match when the ID appears not at the beginning but somewhere in the middle, e.g. "Test1:002".
I don't really know if there is such a thing as a "pythonic" way of finding a record on a matching key, but here is an example that adds a couple of interesting things over your own example and the other answers, like the use of generators and comprehensions. Besides, what is more pythonic than a one-liner?
any is a Python built-in; it might interest you to know that it exists, since checking whether any row matches is essentially what your inner loop does.
with open("studentinfo.txt","r") as f:
sid=input("Enter Id:")
print any((line.split(",")[0] == sid for line in f.readlines()))
You should probably consider using csv.DictReader for this usage, since you have tabular data with consistent columns.
If you only want to retrieve data once, then you can simply iterate through the file until the first occurrence of the desired id, as follows:
import csv

def search_by_student_id(id):
    with open('studentinfo.txt', 'r') as f:
        reader = csv.DictReader(f, ['id', 'first_name', 'surname'],
                                restkey='results')
        for line in reader:
            if line['id'] == id:
                return line['first_name'], line['surname']

print(search_by_student_id('001'))
# ('Joe', 'Bloggs')
If however, you plan on looking up entries from this data multiple times, it would pay to create a dictionary, which is more expensive to create but significantly reduces lookup times. Then you could look up data like this:
def build_student_id_dict():
    with open('studentinfo.txt', 'r') as f:
        reader = csv.DictReader(f, ['id', 'first_name', 'surname'],
                                restkey='results')
        student_id_dict = {}
        for line in reader:
            student_id_dict[line['id']] = line['first_name'], line['surname']
        return student_id_dict
student_by_id_dict = build_student_id_dict()
print(student_by_id_dict['002'])
# ('Ash', 'Smith')
You could read it into a list, or even better a dictionary in terms of lookup time, and then simply use the following:
if key in l or if key in d (l and d being the list and the dictionary, respectively)
It is an interesting discussion, however, whether this would be the simplest method or whether your existing solution is.
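For this particular file, a hedged sketch of the dictionary version (the id -> (first, last) structure is just my assumption about how you might build it):

import csv

with open("studentinfo.txt", "r") as f:
    d = {row[0]: (row[1], row[2]) for row in csv.reader(f)}  # id -> (first name, last name)

sid = input("Enter Id:")
if sid in d:
    print(d[sid][0] + " " + d[sid][1])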
Dictionaries:
# retrieve the value for a particular key
value = d[key]
A note on time complexity and efficiency in the use of dictionaries:
Python mappings must be able to, given a particular key object, determine which (if any) value object is associated with that key. One simple approach would be to store a LIST of (key, value) pairs, and then search the list sequentially every time a value was requested. Immediately you can see that this would be very slow with a large number of items - in complexity terms, this algorithm would be O(n), where n refers to the number of items in the mapping.
Python's dictionary is the answer here, although it isn't always the best solution - the implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function. In your case, and because structurally the data you are dealing with is not terribly complex, it may be easiest to stick to your existing solution, although a dictionary should certainly be considered if it is time efficiency you are after.
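As a rough, illustrative sketch of that difference using timeit (the data size and the looked-up key are arbitrary choices of mine):

import timeit

pairs = [(str(i), i) for i in range(100000)]  # list of (key, value) pairs
d = dict(pairs)                               # the same data as a dict

# sequential scan of the list: O(n) on average
list_time = timeit.timeit(lambda: next(v for k, v in pairs if k == "99999"), number=100)
# hashed lookup in the dict: O(1) on average
dict_time = timeit.timeit(lambda: d["99999"], number=100)

print(list_time, dict_time)  # the list scan should come out orders of magnitude slower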
I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000, but I expect the list length may eventually exceed 1,000,000. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text-string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the list each time for each object) is already too slow and will only get slower, since it is O(n^2) in the list length, so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it: hashing the objects above on their text and using the set function will return a set where the object kept from each group of duplicates is chosen arbitrarily (e.g., above, either (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
change to using a database ...
import sqlite3

db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test", 1), ("test", 2), ("asdasd", 3)]
db.executemany("INSERT OR REPLACE INTO my_items (text, id) VALUES (?, ?)", my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
This may have marginally higher overhead in terms of time ... but you will save on RAM.
You could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
if obj.text not in unique or obj.ID > unique[obj.text].ID:
unique[obj.text] = obj
>>> unique.values()
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
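For example (a sketch of mine, not your actual class), an object can hash and compare on just its text attribute, so that duplicates collide in a set or dict:

class Record(object):
    def __init__(self, text, ID):
        self.text = text
        self.ID = ID

    def __hash__(self):
        return hash(self.text)  # hash only on the text attribute

    def __eq__(self, other):
        # two records count as "the same" if their text matches
        return isinstance(other, Record) and self.text == other.text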
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as part of a dictionary. For example, you might stored a dictionary that maps the field in question to the entire record, or a list of records that all share the same field value.
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {}  # Empty dictionary to start with.
while ... something :
    info = read_next_record()  # You figure this out.
    fld = info.fields[some_field]  # The value you care about
    hv = hash(fld)  # Compute hash value for field.
    if hv in seen_before:
        print("This field value has been seen before")
    else:
        seen_before[hv] = True  # Never seen ... until NOW!
I'm new to R and need to keep a dataset that contains, for each observation (let's say a user), a list of class instances (let's say events).
For example, for each user_ID I hold a list of events; each event contains the fields name, time, and type.
My question is: what is the optimal way to hold such data in R? I have several million such observations, so I need to hold it in an optimal manner (in terms of space).
In addition, after I decide how to hold it, I need to create it from within Python, as my original data is in a Python dict. What is the best way to do it?
Thanks!
You can save your dict as a .csv using the csv module for Python.
mydict = {"a":1, "b":2, "c":3}
with open("test.csv", "wb") as myfile:
w = csv.writer(myfile)
w.writerows(mydict.items())
Then just load it into R with read.csv.
Of course, depending on what your Python dict looks like, you may need some more post processing, but without a reproducible example it's hard to say what that would be.
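As one hedged guess at what that post-processing could look like for your structure (user_ID -> list of events with name/time/type; the example data and column names below are mine):

import csv

# hypothetical example data: user_id -> list of (name, time, type) events
events_by_user = {
    "u1": [("login", "2017-01-01 10:00", "auth"), ("click", "2017-01-01 10:05", "ui")],
    "u2": [("login", "2017-01-02 09:30", "auth")],
}

with open("events.csv", "w", newline="") as f:  # Python 3 style open
    w = csv.writer(f)
    w.writerow(["user_id", "name", "time", "type"])  # header row, so read.csv gets column names
    for user_id, events in events_by_user.items():
        for name, time, type_ in events:
            w.writerow([user_id, name, time, type_])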
I have used dictionaries in Python before, but I am still new to Python. This time I am using a dictionary of a dictionary of a dictionary, i.e. a three-layer dict, and wanted to check before programming it.
I want to store all the data in this three-layer dict, and was wondering what would be a nice, pythonic way to initialize it, and then read a file and write to such a data structure.
The dictionary I want is of the following type:
{'geneid':
{'transcript_id':
{col_name1:col_value1, col_name2:col_value2}
}
}
The data is of this type:
geneid\ttx_id\tcolname1\tcolname2\n
hello\tNR432\t4.5\t6.7
bye\tNR439\t4.5\t6.7
Any ideas on how to do this in a good way?
Thanks!
First, let's start with the csv module to handle parsing the lines:
import csv

with open('mydata.txt', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print row
This will print:
{'geneid': 'hello', 'tx_id': 'NR432', 'colname1': '4.5', 'colname2': '6.7'}
{'geneid': 'bye', 'tx_id': 'NR439', 'colname1': '4.5', 'colname2': '6.7'}
So, now you just need to reorganize that into your preferred structure. This is almost trivial, except that you have to deal with the fact that the first time you see a given geneid you have to create a new empty dict for it, and likewise for the first time you see a given tx_id within a geneid. You can solve that with setdefault:
import csv

genes = {}
with open('mydata.txt', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        gene = genes.setdefault(row['geneid'], {})
        transcript = gene.setdefault(row['tx_id'], {})
        transcript['colname1'] = row['colname1']
        transcript['colname2'] = row['colname2']
You can make this a bit more readable with defaultdict:
import csv
from collections import defaultdict
from functools import partial

genes = defaultdict(partial(defaultdict, dict))
with open('mydata.txt', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        genes[row['geneid']][row['tx_id']]['colname1'] = row['colname1']
        genes[row['geneid']][row['tx_id']]['colname2'] = row['colname2']
The trick here is that the top-level dict is a special one that creates an empty defaultdict whenever it first sees a new key, and that defaultdict in turn creates an empty dict for each new key it sees. The only hard part is that defaultdict takes a function that returns the right kind of object, and a function that returns a defaultdict(dict) has to be written with a partial, a lambda, or an explicit function. (There are recipes on ActiveState and modules on PyPI that will give you an even more general version of this that creates new dictionaries as needed all the way down, if you want.)
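For instance, the partial above could just as well be written with a lambda (same behaviour, purely a matter of taste):

from collections import defaultdict

genes = defaultdict(lambda: defaultdict(dict))  # same two-level default as partial(defaultdict, dict)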
I was also trying to find alternatives and came across this other great answer on Stack Overflow:
What's the best way to initialize a dict of dicts in Python?
Basically in my case:
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
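A quick usage sketch with the sample data from the question (the assignments are mine):

genes = AutoVivification()
genes['hello']['NR432']['colname1'] = '4.5'  # intermediate dicts are created on first access
genes['hello']['NR432']['colname2'] = '6.7'
print(genes['hello']['NR432'])  # {'colname1': '4.5', 'colname2': '6.7'} (key order may vary)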
I have to do this routinely in coding for my research. You'll want to use defaultdict from the collections module, because it lets you add key:value pairs at any level by simple assignment. I'll show you after answering your question. This is sourced directly from one of my programs. Focus on the last four lines (that aren't comments) and trace the variables back up through the rest of the block to see what it's doing:
from astropy.io import fits  # this package handles the image data I work with
import numpy as np
import os
from collections import defaultdict

klist = ['hdr','F','Ferr','flag','lmda','sky','skyerr','tel','telerr','wco','lsf']
dtess = []

for file in os.listdir(os.getcwd()):
    if file.startswith("apVisit"):
        meff = fits.open(file, mode='readonly', ignore_missing_end=True)
        hdr = meff[0].header
        oid = str(hdr["OBJID"])  # object ID
        mjd = int(hdr["MJD5"].strip(' '))  # 5-digit observation date
        for k, v in enumerate(klist):
            if k == 0:
                dtess = dtess + [[oid, mjd, v, hdr]]
            else:
                dtess = dtess + [[oid, mjd, v, meff[k].data]]

# header extension works differently from the rest of the image cube
# it's not relevant to populating dictionaries
# HDUs in order of extension no.: header, flux, flux error, flag mask,
#   wavelength, sky flux, error in sky flux, telluric flux, telluric flux errors,
#   wavelength solution coefficients, & line-spread function

dtree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for s, t, u, v in dtess:
    dtree[s][t][u].append(v)

# once you've added all the keys you want to your dictionary,
# set the default_factory attribute to None
dtree.default_factory = None
Here's the digest version.
First, for an n-level dictionary, you have to sort and dump everything into a list of (n+1)-tuples in the form [key_1, key_2, ..., key_n, value].
Then, to initialize the n-level dictionary, you just type "defaultdict(lambda: " (minus the quotes) n-1 times, stick "defaultdict(list)" (or some other data type) at the end, and close the parentheses.
Append to the list with a for loop. Note: when you go to access data values at the lowest level, you will probably have to type my_dict[key_1][key_2][...][key_n][0] to get actual values and not just descriptions of the data type therein.
Edit: When your dictionary is as big as you want to make it, set the default_factory attribute to None.
If you haven't set default_factory to None, you can add to your nested dictionary later by either typing something like my_dict[key_1][key_2][...][new_key]=new_value, or using an append() command. You can even add additional dictionaries as long as the ones you add by these forms of assignment aren't nested themselves.
WARNING! The newly added last line of that code snippet, where you set the default_factory attribute to None, is super important. Your program needs to know when you're done adding to your dictionary; otherwise every lookup of a key that doesn't exist yet will silently create a new empty entry, eating up your RAM until the program grinds to a halt. This is effectively a memory leak. I learned this the hard way a while after I wrote this answer. The problem plagued me for several months, and I don't even think I was the one to figure it out in the end, because I didn't understand anything about memory allocation.
I'm wondering: what are the pros/cons of nested dicts versus hashing a tuple in Python?
The context is a small script to assign "workshops" to attendees at a conference.
Each workshop has several attributes (e.g. week, day, subject etc.). We use four of these attributes in order to assign workshops to each attendee - i.e. each delegate will also have a week/day/subject etc.
Currently, I'm using a nested dictionary to store my workshops:
workshops = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(dict))))

with open(input_filename) as input_file:
    workshop_reader = csv.DictReader(input_file, dialect='excel')
    for row in workshop_reader:
        workshops[row['Week ']][row['Stream']][row['Strand']][row['Day']] = row
return workshops
I can then use the above data-structure to assign each attendee their workshop.
The issue is that later I need to iterate through every workshop and assign an ID (this ID is stored in a different system), which necessitates unwrapping the structure layer by layer.
First question - is there some other way of creating a secondary index to the same values, using a string (workshop name) as a key? I.e. I'd still have the four-level nested dicts, but I could also search for individual entries based on just a name.
Secondly - how can I achieve a similar effect using tuples as the key? Are there any advantages you can think of to this approach? Would it be much cleaner, or easier to use? (This whole unwrapping thing is a bit of a pain, and I don't think it's very approachable.)
Or are there any other data structures that you could recommend that might be superior/easier to access/manipulate?
Thanks,
Victor
class Workshop(object):
    def __init__(self, week, stream, strand, day, name):
        self.week = week
        self.stream = stream
        self.day = day
        self.strand = strand
        self.name = name

...

for row in workshop_reader:
    # key each Workshop by its (unique) name, taken from the row
    workshops[name] = Workshop(...)
That works only if the name is a unique attribute of the workshops (that is, there are no workshops with repeated names).
Also, you'll be able to easily assign an ID to each Workshop object in the workshops dictionary.
Generally, if you need to nest dictionaries, you should use classes instead. Nested dictionaries get complicated to keep track of when you are debugging. With classes you can tell different types of dictionaries apart, so it gets easier. :)
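That said, if you do want to try the flat tuple-keyed dict from your second question, a minimal sketch might look like the following (the column names follow your reader; the 'Name' column used for the secondary index is hypothetical):

import csv

workshops_by_key = {}   # (week, stream, strand, day) -> row
workshops_by_name = {}  # secondary index: workshop name -> the same row

with open(input_filename) as input_file:
    for row in csv.DictReader(input_file, dialect='excel'):
        key = (row['Week '], row['Stream'], row['Strand'], row['Day'])
        workshops_by_key[key] = row
        workshops_by_name[row['Name']] = row  # hypothetical 'Name' column

# assigning IDs later is a flat loop - no layer-by-layer unwrapping
for key, row in workshops_by_key.items():
    pass  # look up / assign the externally stored ID here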