How to write dictionary comprehension in this complicated case? - python

An example is artificial, but I had similar problems many times.
db_file_names = ['f1', 'f2']  # list of database files

def make_report(filename):
    # read the database and prepare some report object
    return report_object
Now I want to construct a dictionary: db_version -> number_of_tables. The report object contains all the information I need.
The dictionary comprehension could look like:
d = {
    make_report(filename).db_version: make_report(filename).num_tables
    for filename in db_file_names
}
This approach sometimes works, but is very inefficient: the report is prepared twice for each database.
To avoid this inefficiency I usually use one of the following approaches:
Use temporary storage:
reports = [make_report(filename) for filename in db_file_names]
d = {r.db_version: r.num_tables for r in reports}
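A small variant of the same idea, not in the original post: using a generator expression instead of a list comprehension avoids materialising the intermediate list of reports:
reports = (make_report(filename) for filename in db_file_names)
d = {r.db_version: r.num_tables for r in reports}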
Or use some adaptor-generator:
def gen_data():
    for filename in db_file_names:
        report = make_report(filename)
        yield report.db_version, report.num_tables

d = {dat[0]: dat[1] for dat in gen_data()}
But it's usually only after I write some wrong comprehension, think it over and realize that a clean and simple comprehension isn't possible in this case.
The question is: is there a better way to create the required dictionary in such situations?
Since yesterday (when I decided to post this question) I invented one more approach, which I like more than all the others:
d = {
    report.db_version: report.num_tables
    for filename in db_file_names
    for report in [make_report(filename), ]
}
but even this one doesn't look very good.
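Worth noting (not in the original post): on Python 3.8+, an assignment expression gives a single-pass comprehension directly, since dict comprehensions there evaluate the key before the value:
# Python 3.8+ only: the walrus operator binds the report once per filename.
d = {
    (report := make_report(filename)).db_version: report.num_tables
    for filename in db_file_names
}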

You can use:
d = {
    r.db_version: r.num_tables
    for r in map(make_report, db_file_names)
}
Note that in Python 3, map gives an iterator, thus there is no unnecessary storage cost.

Here's a functional way:
from operator import attrgetter
res = dict(map(attrgetter('db_version', 'num_tables'),
               map(make_report, db_file_names)))
Unfortunately, functional composition is not part of the standard library, but the 3rd party toolz does offer this feature:
from toolz import compose
foo = compose(attrgetter('db_version', 'num_tables'), make_report)
res = dict(map(foo, db_file_names))
Conceptually, you can think of these functional solutions outputting an iterable of tuples, which can then be fed directly to dict.
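To make that data flow concrete, here is a small self-contained sketch with a hypothetical stand-in for make_report (the namedtuple is only for illustration):
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for the real make_report, just to make the data flow visible.
Report = namedtuple('Report', ['db_version', 'num_tables'])

def make_report(filename):
    return Report(db_version='v-' + filename, num_tables=42)

db_file_names = ['f1', 'f2']
pairs = map(attrgetter('db_version', 'num_tables'), map(make_report, db_file_names))
print(dict(pairs))  # {'v-f1': 42, 'v-f2': 42}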

Related

How do I build multiple lists automatically

I have hundreds of dataframes, let's say named df1, ..., df250. I need to build a list from a column of each of those dataframes. Usually I do this manually, but today there is too much data and it's too prone to mistakes.
Here's what I did
list1 = df1['customer_id'].tolist()
list2 = df2['customer_id'].tolist()
..
list250 = df250['customer_id'].tolist()
This is so manual; can we do this in an easier way?
The easier way is to take a step back and make sure you put your dataframes in a collection such as a list or dict. You can then perform operations easily in a scalable way.
For example:
dfs = {1: df1, 2: df2, 3: df3, ... , 250: df250}
lists = {k: v['customer_id'].tolist() for k, v in dfs.items()}
You can then access the results as lists[1], lists[2], etc.
There are other benefits. For example, you are no longer polluting the namespace, you save the effort of explicitly defining variable names, and you can easily store and pass around related collections of objects.
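As a sketch of how such a collection might be built in the first place (the file paths here are made up, and each CSV is assumed to have a customer_id column):
import pandas as pd

# Hypothetical paths; build the dict of dataframes directly instead of df1 ... df250.
paths = ['data/customers_{}.csv'.format(i) for i in range(1, 251)]
dfs = {i: pd.read_csv(p) for i, p in enumerate(paths, start=1)}
lists = {k: v['customer_id'].tolist() for k, v in dfs.items()}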
Using the exec function enables you to execute Python code stored in a string:
for i in range(1, 251):
    s = "list" + str(i) + " = df" + str(i) + "['customer_id'].tolist()"
    exec(s)
I'd use the following code. In this case there's no need to manually create a list of DataFrames.
cust_lists = {'list{}'.format(i): globals()['df{}'.format(i)]['customer_id'].tolist()
              for i in range(1, 251)}
Now you can access your lists from the cust_lists dict by name, like this:
`cust_lists['list1']`
or
`list1`

Python complete ordering of a nested dict

I have some sort of trie composed of OrderedDicts (but in the wrong order) which looks like this:
test = {
    'ab': {
        '1': {},
        '2': {
            '002': {},
            '001': {}}},
    'aa': {
        '02': {
            'ac': {},
            '01': {},
            'ca': {},
            'ab': {}},
        '01': {
            'b': {},
            'z': {
                '0': {},
                '1': {}}}}
}
How can I get a complete ordering of this dict in all subsequent levels?
If I use collections.OrderedDict(sorted(test.iteritems())) I get it sorted only for the first level.
I feel that I need to create a function that will somehow call itself recursively down to the deepest level, but after spending many hours trying different ways to solve the problem, I am still stuck.
Eventually it has to look like this:
test = {
    'aa': {
        '01': {
            'b': {},
            'z': {
                '0': {},
                '1': {}}},
        '02': {
            '01': {},
            'ab': {},
            'ac': {},
            'ca': {}}},
    'ab': {
        '1': {},
        '2': {
            '001': {},
            '002': {}}}
}
With recursion, remember there are two cases: the branch, and the leaf. Be sure to account for both.
from collections import OrderedDict

def make_ordered(d):
    if isinstance(d, dict):
        return OrderedDict(sorted((key, make_ordered(value)) for key, value in d.iteritems()))
    else:
        return d
If you can afford an extra dependency I would recommend blist package. It provides many sorted containers including sorteddict. Then your dictionary would just always stay sorted.
Check the sorteddict class docs for exact usage. The package itself is production quality and BSD licence, so not a problem to use in any proprietary code.
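A minimal sketch of the idea, assuming the third-party blist package is installed:
from blist import sorteddict

d = sorteddict()
d['ab'] = {}
d['aa'] = {}
print(list(d))  # keys come back in sorted order: ['aa', 'ab']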
from collections import OrderedDict as OD

def order(X):
    retval = OD()
    # The standard iterator of a dict is its keys.
    for k in sorted(X):
        # In case it's something we can't handle.
        if isinstance(X[k], dict):
            # I want my child dicts in order too.
            retval[k] = order(X[k])
        else:
            retval[k] = X[k]
    return retval
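For reference, on Python 3.7+ plain dicts preserve insertion order, so the same recursion can be written without OrderedDict (a sketch, not from the original answers):
def deep_sort(d):
    # Rebuild each dict from its items in sorted key order, recursing into nested dicts.
    return {k: deep_sort(v) if isinstance(v, dict) else v
            for k, v in sorted(d.items())}

ordered = deep_sort(test)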

Elegantly Generalising Sorting into Dictionaries in Python?

The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
I have the following functions:
# takes in 3 lists of lists and a column specification by which to group
def custom_groupby(atts, zmat, zmat2, col):
    result = dict()
    for i in range(0, len(atts)):
        val = atts[i][col]
        row = (atts[i], zmat[i], zmat2[i])
        try:
            result[val].append(row)
        except KeyError:
            result[val] = list()
            result[val].append(row)
    return result
# organises samples into dictionaries using the groupby
def organise_samples(attributes, z_matrix, original_z_matrix):
    strucdict = custom_groupby(attributes, z_matrix, original_z_matrix, 'SecStruc')
    strucfrontdict = dict()
    for k, v in strucdict.iteritems():
        strucfrontdict[k] = custom_groupby([x[0] for x in strucdict[k]],
            [x[1] for x in strucdict[k]], [x[2] for x in strucdict[k]], 'Front')
    samples = dict()
    for k in strucfrontdict:
        samples[k] = dict()
        for k2 in strucfrontdict[k]:
            samples[k][k2] = dict()
            samples[k][k2] = custom_groupby([x[0] for x in strucfrontdict[k][k2]],
                [x[1] for x in strucfrontdict[k][k2]], [x[2] for x in strucfrontdict[k][k2]], 'Back')
    return samples
It seems like this is unwieldy. There being elegant ways to do almost everything in Python, I'm inclined to think I'm using Python wrongly.
More importantly, I'd like to be able to generalise this function better so that I can specify how many "layers" should be in the dictionary (without using several lambdas and approaching the problem in a Lisp style). I would like a function:
# organises samples into a dictionary by specified columns
# number of layers could also be assumed by number of criterion
def organise_samples(number_layers, list_of_strings_for_column_ids)
Is this possible to do in Python?
Thank you! Even if there isn't a way to do it elegantly in Python, any suggestions towards making the above code more elegant would be really appreciated.
::EDIT::
For context, the attributes object, z_matrix, and original_zmatrix are all lists of Numpy arrays.
Attributes might look like this:
Type,Num,Phi,Psi,SecStruc,Front,Back
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,137,-20.2349,-129.396,2,0,1
11,163,-34.75,-59.1221,0,1,9
The Z-matrices might both look like this:
CA-1, CA-2, CA-CB-1, CA-CB-2, N-CA-CB-SG-1, N-CA-CB-SG-2
-16.801, 28.993, -1.189, -0.515, 118.093, 74.4629
-24.918, 27.398, -0.706, 0.989, 112.854, -175.458
-1.01, 37.855, 0.462, 1.442, 108.323, -72.2786
61.369, 113.576, 0.355, -1.127, 111.217, -69.8672
Samples is a dict{num => dict {num => dict {num => tuple(attributes, z_matrix)}}}, having one row of the z-matrix.
The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
Have you tried using dictionary comprehensions?
See this great question about dictionary comprehensions.
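For what it's worth, here is a hedged sketch of how the layering could be generalised with recursion, assuming rows are the same (attributes, z_matrix, original_z_matrix) tuples used above and that attribute rows can be indexed by column name:
from collections import defaultdict

def nested_groupby(rows, cols):
    # Group rows into nested dicts, one level per column name in cols.
    if not cols:
        return list(rows)
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0][cols[0]]].append(row)
    return {key: nested_groupby(group, cols[1:]) for key, group in grouped.items()}

# samples = nested_groupby(zip(attributes, z_matrix, original_z_matrix),
#                          ['SecStruc', 'Front', 'Back'])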

Python 3 - cumulative functions alternatives

I was wondering if there was a more pythonic, or alternative, way to do this. I want to compare results out of cumulative functions. Each functions modifies the output of the previous and I would like to see, after each of the functions, what the effect is. Beware that in order to get the actual results after running the main functions, one last function is needed to calculate something. In code, the thing looks like this (just kind of pseudocode):
for textfile in path:
    data = doStuff1(textfile)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
    data = doStuff3(data)
calculateandPrint()
As you can see, for n functions I would need n(n+1)/2 manually written loops. Is there, like I said, something more pythonic (for example a list of functions?) that would clean up the code and make it much shorter and more manageable as more and more functions are added?
The actual code, where documents is a custom object:
for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
    doc.list_strippedtext = abbreviations(doc.list_strippedtext)
bow = createBOW(documents)
While this is only a small part, more functions will need to be added.
You could define a set of chains, applied with functools.reduce()
from functools import reduce
chains = (
    (doStuff1,),
    (doStuff1, doStuff2),
    (doStuff1, doStuff2, doStuff3),
)

for textfile in path:
    for chain in chains:
        data = reduce(lambda data, func: func(data), chain, textfile)
        calculateandPrint(data)
The reduce() call effectively does func3(func2(func1(textfile))) if the chain contains 3 functions.
I assumed here that you wanted to apply calculateandPrint() per textfile in path after the chain of functions has been applied.
Each iteration of the for chain in chains loop represents one of your doStuffx loop bodies in your original example, but we only loop through for textfile in path once.
You can also swap the loops; adjusting to your example:
for chain in chains:
    for doc in documents:
        doc.list_strippedtext = reduce(lambda data, func: func(data), chain, doc.text)
    bow = createBOW(documents)
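A follow-up note on the design: since each chain here is just a longer prefix of one pipeline, the chains tuple can also be generated rather than written out by hand (a sketch, assuming the doStuff functions above exist):
pipeline = [doStuff1, doStuff2, doStuff3]
chains = [tuple(pipeline[:i]) for i in range(1, len(pipeline) + 1)]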

SQL Alchemy ORM returning a single column, how to avoid common post processing

I'm using SQL Alchemy's ORM and I find when I return a single column I get the results like so:
[(result,), (result_2,)] # etc...
With a set like this I find that I have to do this often:
results = [r[0] for r in results] # So that I just have a list of result values
This isn't that "bad" because my result sets are usually small, but if they weren't this could add significant overhead. The biggest thing is I feel it clutters the source, and missing this step is a pretty common error I run into.
Is there any way to avoid this extra step?
A related aside: this behaviour of the ORM seems inconvenient in this case, but in another case, where my result set was [(id, value)], it ends up like this:
[(result_1_id, result_1_val), (result_2_id, result_2_val)]
I then can just do:
results = dict(results) # so I have a map of id to value
This one has the advantage of making sense as a useful step after returning the results.
Is this really a problem, or am I just being a nitpick, and the post processing after getting the result set makes sense for both cases? I'm sure we can think of some other common post processing operations to make the result set more usable in the application code. Are there high-performance and convenient solutions across the board, or is post processing unavoidable, and merely required for varying application usages?
When my application can actually take advantage of the objects that are returned by SQL Alchemy's ORM it seems extremely helpful, but in cases where I can't or don't, not so much. Is this just a common problem of ORMs in general? Am I better off not using the ORM layer in cases like this?
I suppose I should show an example of the actual ORM queries I'm talking about:
session.query(OrmObj.column_name).all()
or
session.query(OrmObj.id_column_name, OrmObj.value_column_name).all()
Of course, in a real query there'd normally be some filters, etc.
One way to decrease the clutter in the source is to iterate like this:
results = [r for (r, ) in results]
Although this solution is one character longer than using the [] operator, I think it's easier on the eyes.
For even less clutter, remove the parentheses. This makes it harder, when reading the code, to notice that you're actually handling tuples, though:
results = [r for r, in results]
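For example, on a small sample:
>>> results = [('result',), ('result_2',), ('result_3',)]
>>> [r for (r,) in results]
['result', 'result_2', 'result_3']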
Python's zip combined with the * inline expansion operator is a pretty handy solution to this:
>>> results = [('result',), ('result_2',), ('result_3',)]
>>> zip(*results)
[('result', 'result_2', 'result_3')]
Then you only have to [0] index in once. For such a short list your comprehension is faster:
>>> timeit('result = zip(*[("result",), ("result_2",), ("result_3",)])', number=10000)
0.010490894317626953
>>> timeit('result = [ result[0] for result in [("result",), ("result_2",), ("result_3",)] ]', number=10000)
0.0028390884399414062
However for longer lists zip should be faster:
>>> timeit('result = zip(*[(1,)]*100)', number=10000)
0.049577951431274414
>>> timeit('result = [ result[0] for result in [(1,)]*100 ]', number=10000)
0.11178708076477051
So it's up to you to determine which is better for your situation.
I struggled with this too until I realized it's just like any other query:
for result in results:
    print result.column_name
Starting from version 1.4 SQLAlchemy provides a method to retrieve results for a single column as a list of values:
# ORM
>>> session.scalars(select(User.name)).all()
['ed', 'wendy', 'mary', 'fred']
# or
>>> query = session.query(User.name)
>>> session.scalars(query).all()
['ed', 'wendy', 'mary', 'fred']
# Core
>>> with engine.connect() as connection:
...     result = connection.execute(text("select name from users"))
...     result.scalars().all()
...
['ed', 'wendy', 'mary', 'fred']
See the SQLAlchemy documentation.
I found the following more readable; it also includes the answer for the dict case (in Python 2.7):
d = {id_: name for id_, name in session.query(Customer.id, Customer.name).all()}
l = [r.id for r in session.query(Customer).all()]
For the single value, borrowing from another answer:
l = [name for (name, ) in session.query(Customer.name).all()]
Compare with the built-in zip solution, adapted to the list:
l = list(zip(*session.query(Customer.id).all())[0])
which in my timeits provides only about a 4% speed improvement.
My solution looks like this ;)
def column(self):
    for column, *_ in Model.query.with_entities(Model.column).all():
        yield column
NOTE: py3 only.
Wow, guys, why strain? There's an even simpler way, faster and more elegant:
>>> results = [('result',), ('result_2',), ('result_3',)]
>>> sum(results, tuple())
('result', 'result_2', 'result_3')
Speed:
>>> timeit('result = zip(*[("result",), ("result_2",), ("result_3",)])', number=10000)
0.004222994000883773
>>> timeit('result = sum([("result",), ("result_2",), ("result_3",)], ())', number=10000)
0.0038205889868550003
But if there are more elements in the list, use zip instead; zip is faster.
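One more flattening option, not mentioned in the original answers: itertools.chain.from_iterable avoids rebuilding the tuple on every step the way sum does, and scales better to long result lists:
>>> from itertools import chain
>>> results = [('result',), ('result_2',), ('result_3',)]
>>> list(chain.from_iterable(results))
['result', 'result_2', 'result_3']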
