I mainly program in Java, and I have found that Python is more convenient for data analysis.
I am looking for a way to pipe operations, equivalent to Java streams. For example, I would like to do something like this (mixing Java and Python syntax):
    (key, value) = Files.lines(Paths.get(path))
        .map(line -> new Angle(line))
        .filter(angle -> foo(angle))
        .map(angle -> (angle, cosine(angle)))
        .max(Comparator.comparing(Pair::getValue));
Here I take a list of lines from a file, convert each line into an Angle object, filter the angles by some parameter, then create a list of pairs, and finally find the maximal pair. There may be several more operations, but the point is that this is one pipe passing the output of each operation into the next.
I know about Python list comprehensions; however, they seem limited to a single "map" and a single "filter". If I need to pipe several maps, the expression soon becomes complicated (I have to nest one comprehension inside another).
Is there a syntax construct in python that allows adding multiple operations in one command?
It is not difficult to build this yourself, for example:
    class BasePipe:
        def __init__(self, data):
            self.data = data

        def filter(self, f):
            self.data = [d for d in self.data if f(d)]
            return self

        def map(self, f):
            self.data = [*map(f, self.data)]
            return self

        def __iter__(self):
            yield from self.data

        def __str__(self):
            return str(self.data)

        def max(self):
            return max(self.data)

        def min(self):
            return min(self.data)
    value = (
        BasePipe([1, 2, 3, 4])
        .map(lambda x: x * 2)
        .filter(lambda x: x > 4)
        .max()
    )
And gives:
8
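If you don't want a wrapper class at all, a stdlib-only alternative is to chain generator expressions: each step stays lazy and nothing is materialized until the terminal operation. This is only a sketch of the same pipeline as above, not a full stream library:

```python
# Lazy pipeline built from chained generator expressions.
# Each name is bound to a new generator; no intermediate list is built
# until the terminal max() call consumes the chain.
data = [1, 2, 3, 4]
doubled = (x * 2 for x in data)           # "map" step
filtered = (x for x in doubled if x > 4)  # "filter" step
result = max(filtered)                    # terminal operation
print(result)  # 8
```

This scales to any number of steps without nesting comprehensions, at the cost of naming each intermediate stage.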
I am looking to build fairly detailed annotations for methods in a Python class. These are to be used in troubleshooting, documentation, tooltips for a user interface, etc. However, it's not clear how I can keep these annotations associated with the functions.
For context, this is a feature engineering class, so two example methods might be:
    def create_feature_momentum(self):
        return self.data['mass'] * self.data['velocity']

    def create_feature_kinetic_energy(self):
        return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
For example:
It'd be good to tell easily what core features were used in each engineered feature.
It'd be good to track arbitrary metadata about each method
It'd be good to embed non-string data as metadata about each function, e.g. some example calculations on sample dataframes.
So far I've been manually creating docstrings like:
    def create_feature_kinetic_energy(self) -> pd.Series:
        '''Calculate the non-relativistic kinetic energy.

        Depends on: ['mass', 'velocity']
        Supports NaN Values: False
        Unit: Energy (J)
        Example:
            self.data = pd.DataFrame({'mass': [0, 1, 2], 'velocity': [0, 1, 2]})
            self.create_feature_kinetic_energy()
            >>> pd.Series([0, 0.5, 4])
        '''
        return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
And then I'm using regex to get the data about a function by inspecting the __doc__ attribute. However, is there a better place than __doc__ where I could store information about a function? In the example above, it's fairly easy to parse the Depends on list, but in my use case it'd be good to also embed some example data as dataframes somehow (and I think writing them as markdown in the docstring would be hard).
Any ideas?
I ended up writing a class as follows:
    class ScubaDiver(pd.DataFrame):
        accessed = None

        def __getitem__(self, key):
            if self.accessed is None:
                self.accessed = set()
            self.accessed.add(key)
            return pd.Series(dtype=float)

        @property
        def columns(self):
            return list(self.accessed)
The way my code is written, I can do this:

    sd = ScubaDiver()
    foo(sd)
    sd.columns

and sd.columns contains all the columns accessed by foo.
Though this might not work in your codebase.
I also wrote this decorator:
    def add_note(notes: dict):
        '''Adds k:v pairs to a .notes attribute.'''
        def _(f):
            if not hasattr(f, 'notes'):
                f.notes = {}
            f.notes |= notes  # dict merge (Python 3.9+)
            return f
        return _
You can use it as follows:
    @add_note({'Units': 'J', 'Relativity': False})
    def create_feature_kinetic_energy(self):
        return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
and then you can do:
create_feature_kinetic_energy.notes['Units'] # J
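Once notes live on the function objects, you can also gather them for all methods of a class with inspect, which is handy for generating documentation or tooltips. A self-contained sketch (the Features class and collect_notes helper are illustrative names, not part of the original code):

```python
import inspect

def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute on the decorated function.'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes.update(notes)
        return f
    return _

class Features:
    @add_note({'Units': 'J', 'Relativity': False})
    def create_feature_kinetic_energy(self):
        pass

def collect_notes(cls):
    # Map method name -> notes dict for every annotated method on the class.
    return {name: f.notes
            for name, f in inspect.getmembers(cls, inspect.isfunction)
            if hasattr(f, 'notes')}

print(collect_notes(Features))
# {'create_feature_kinetic_energy': {'Units': 'J', 'Relativity': False}}
```

Because the notes are plain Python objects, they can hold dataframes or any other non-string metadata directly, unlike a docstring.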
I'm looking for a fast, clean and pythonic way of slicing custom-made objects while preserving their type after the operation.
To give you some context: I deal with a lot of semi-unstructured data, and to handle it I work with lists of dictionaries. To streamline some operations I have created an "ld" object that inherits from "list". Among its many capabilities, it checks that the data was provided in the correct format. Let's simplify by saying it ensures that all entries of the list are dictionaries containing some key "a", as shown below:
    class ld(list):
        def __init__(self, x):
            list.__init__(self, x)
            self.__init_check()

        def __init_check(self):
            for record in self:
                if isinstance(record, dict) and "a" in record:
                    pass
                else:
                    raise TypeError("not all entries are dictionaries or have the key 'a'")
            return
This behaves correctly when the data is as desired and initialises ld:
    tt = ld([{"a": 1, "b": 2}, {"a": 4}, {"a": 6, "c": 67}])
    type(tt)  # ld
It also does the right thing when the data is incorrect:

    ld([{"w": 1}])  # raises TypeError
    ld([1, 2, 3])   # raises TypeError
However, the problem comes when I proceed to slice the object:

    type(tt[:2])  # list
tt[:2] is a list and no longer has all the methods and attributes that I created in the full-fledged ld object. I could convert the slice back into an ld, but that means it would have to go through the entire initial data check again, slowing down computations a lot.
Here is the solution I came up with to speed things up:
    class ld(list):
        def __init__(self, x, safe=True):
            list.__init__(self, x)
            self.__init_check(safe)

        def __init_check(self, is_safe):
            if not is_safe:
                return
            for record in self:
                if not (isinstance(record, dict) and "a" in record):
                    raise TypeError("not all entries are dictionaries or have the key 'a'")

        def __getitem__(self, index):
            # __getslice__ is Python 2 only; in Python 3, slices go
            # through __getitem__ with a slice object as the index.
            result = list.__getitem__(self, index)
            if isinstance(index, slice):
                return ld(result, safe=False)
            return result
Is there a cleaner and more pythonic way of going about it?
Thanks in advance for your help.
I don't think subclassing list to verify the shape or type of its contents in general is the right approach. The list pointedly doesn't care about its contents, and implementing a class whose constructor behavior varies based on flags passed to it is messy. If you need a constructor that verifies inputs, just do your check logic in a function that returns a list.
    def make_verified_list(items):
        """
        :type items: list[object]
        :rtype: list[dict]
        """
        new_list = []
        for item in items:
            if not verify_item(item):
                raise InvalidItemError(item)
            new_list.append(item)
        return new_list

    def verify_item(item):
        """
        :type item: object
        :rtype: bool
        """
        return isinstance(item, dict) and "a" in item
Take this approach and you won't find yourself struggling with the behavior of core data structures.
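A quick usage sketch of that approach (InvalidItemError is assumed to be a plain Exception subclass; it is not a builtin):

```python
# Hypothetical exception type for the verified-list approach.
class InvalidItemError(Exception):
    pass

def verify_item(item):
    return isinstance(item, dict) and "a" in item

def make_verified_list(items):
    new_list = []
    for item in items:
        if not verify_item(item):
            raise InvalidItemError(item)
        new_list.append(item)
    return new_list

tt = make_verified_list([{"a": 1, "b": 2}, {"a": 4}])
print(type(tt[:1]))  # <class 'list'> -- slices are plain lists, no re-check needed
try:
    make_verified_list([1, 2, 3])
except InvalidItemError as e:
    print("rejected:", e)
```

Since the result is an ordinary list, slicing, copying and every other list operation behaves exactly as usual, with validation paid only once at construction.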
I have a class that wraps around Python's deque from collections. When I create a deque x = deque() and want to reference the first element...
In[78]: x[0]
Out[78]: 0
My question is: how can I use [] for referencing in the following example wrapper?

    class deque_wrapper:
        def __init__(self):
            self.data_structure = deque()

        def newCustomAddon(self):
            return len(self.data_structure)

        def __repr__(self):
            return repr(self.data_structure)
I.e., continuing from the above example:
In[75]: x[0]
Out[76]: TypeError: 'deque_wrapper' object does not support indexing
I want to customize my own referencing, is that possible?
You want to implement the __getitem__ method:
    class DequeWrapper:
        def __init__(self):
            self.data_structure = deque()

        def newCustomAddon(self):
            return len(self.data_structure)

        def __repr__(self):
            return repr(self.data_structure)

        def __getitem__(self, index):
            # etc
Whenever you do my_obj[x], Python will actually call my_obj.__getitem__(x).
You may also want to consider implementing the __setitem__ method, if applicable. (When you write my_obj[x] = y, Python will actually run my_obj.__setitem__(x, y).)
The documentation on Python data models will contain more information on which methods you need to implement in order to make custom data structures in Python.
I'm implementing a disjoint set system in Python, but I've hit a wall. I'm using a tree implementation for the system and am implementing Find(), Merge() and Create() functions for the system.
I am implementing a rank system and path compression for efficiency.
The catch is that these functions must take the set of disjoint sets as a parameter, making traversing hard.
    class Node(object):
        def __init__(self, value):
            self.parent = self
            self.value = value
            self.rank = 0

    def Create(values):
        l = [Node(value) for value in values]
        return l
The Create function takes in a list of values and returns a list of singular Nodes containing the appropriate data.
I'm thinking the Merge function would look similar to this:

    def Merge(set, value1, value2):
        value1Root = Find(set, value1)
        value2Root = Find(set, value2)
        if value1Root == value2Root:
            return
        if value1Root.rank < value2Root.rank:
            value1Root.parent = value2Root
        elif value1Root.rank > value2Root.rank:
            value2Root.parent = value1Root
        else:
            value2Root.parent = value1Root
            value1Root.rank += 1
but I'm not sure how to implement the Find() function since it is required to take the list of Nodes and a value (not just a node) as the parameters. Find(set, value) would be the prototype.
I understand how to implement path compression when a Node is taken as a parameter for Find(x), but this method is throwing me off.
Any help would be greatly appreciated. Thank you.
Edited for clarification.
The implementation of this data structure becomes simpler when you realize that the operations union and find can also be implemented as methods of a disjoint set forest class, rather than on the individual disjoint sets.
If you can read C++, then have a look at my take on the data structure; it hides the actual sets from the outside world, representing them only as numeric indices in the API. In Python, it would be something like
    class DisjSets(object):
        def __init__(self, n):
            self._parent = list(range(n))  # a list, so entries can be reassigned
            self._rank = [0] * n

        def find(self, i):
            if self._parent[i] == i:
                return i
            else:
                self._parent[i] = self.find(self._parent[i])  # path compression
                return self._parent[i]

        def union(self, i, j):
            root_i = self.find(i)
            root_j = self.find(j)
            if root_i != root_j:
                if self._rank[root_i] < self._rank[root_j]:
                    self._parent[root_i] = root_j
                elif self._rank[root_i] > self._rank[root_j]:
                    self._parent[root_j] = root_i
                else:
                    self._parent[root_i] = root_j
                    self._rank[root_j] += 1
(Not tested.)
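A quick usage sketch of the index-based forest, restated compactly so it runs standalone:

```python
# Disjoint-set forest over integer indices, with union by rank
# and path compression.
class DisjSets:
    def __init__(self, n):
        self._parent = list(range(n))
        self._rank = [0] * n

    def find(self, i):
        if self._parent[i] != i:
            self._parent[i] = self.find(self._parent[i])  # path compression
        return self._parent[i]

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self._rank[ri] < self._rank[rj]:
            self._parent[ri] = rj
        elif self._rank[ri] > self._rank[rj]:
            self._parent[rj] = ri
        else:
            self._parent[ri] = rj
            self._rank[rj] += 1

ds = DisjSets(5)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(0) == ds.find(2))  # True: 0, 1, 2 are now one set
print(ds.find(3) == ds.find(4))  # False: 3 and 4 were never merged
```

Client code never touches nodes at all; it only ever sees indices, which is exactly what hides the tree structure from the outside world.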
If you choose not to follow this path, the client of your code will indeed have to have knowledge of Nodes and Find must take a Node argument.
Clearly, the merge function should be applied to a pair of nodes, so the find function should take a single node parameter and look like this:
    def find(node):
        if node.parent != node:
            node.parent = find(node.parent)
        return node.parent
Wikipedia also has pseudocode that is easily translatable to Python.
Find is always done on an item: Find(item) is defined as returning the set to which the item belongs. Merge as such must not take nodes; merge always takes two items/sets. Merge or union(item1, item2) must first call find(item1) and find(item2), which return the sets to which each of these belongs. After that, the smaller up-tree must be attached to the taller one. Whenever a find is issued, retrace the path and compress it.
A tested implementation with path compression is here.
    class Matrix:
        def __init__(self, data):
            self.data = data

        def __repr__(self):
            return repr(self.data)

        def __add__(self, other):
            data = []
            for j in range(len(self.data)):
                for k in range(len(self.data[0])):
                    data.append([self.data[k] + other.data[k]])
                data.append([self.data[j] + other.data[j]])
                data = []
            return Matrix(data)

    x = Matrix([[1, 2, 3], [2, 3, 4]])
    y = Matrix([[10, 10, 10], [10, 10, 10]])
    print(x + y, x + x + y)
I was able to get Matrices to add for 1 row by n columns, but when I tried to improve it for all n by n matrices by adding in a second loop I got this error.
    Traceback (most recent call last):
      line 24, in <module>
        print(x + y, x + x + y)
      line 15, in __add__
        data.append([self.data[k] + other.data[k]])
    IndexError: list index out of range
How about this:
    class Matrix:
        def __init__(self, data):
            self.data = data

        def __repr__(self):
            return repr(self.data)

        def __add__(self, other):
            data = []
            for j in range(len(self.data)):
                data.append([])
                for k in range(len(self.data[0])):
                    data[j].append(self.data[j][k] + other.data[j][k])
            return Matrix(data)
Your code has a few problems. The first is in the basic logic of the addition algorithm:

    data.append([self.data[k] + other.data[k]])

This statement is highly suspect: data is a two-dimensional matrix, but here you are accessing it with a single index. data[k] is therefore a whole row, and using + you are concatenating rows (probably not what you wanted, correct?). Probably the solution of highBandWidth is what you were looking for.
The second problem is more subtle, and is about the statement
self.data = data
This may be a problem because Python uses so-called "reference semantics". Your matrix will use the passed data parameter for its content without copying it: it stores a reference to the same data list object you passed to the constructor.
Maybe this is intentional, and maybe it's not; it is not clear. Is it OK for you that if you build two matrices from the same data and then change a single element in the first, the content of the second also changes? If not, then you should copy the elements of data rather than just assigning the data member, for example using
self.data = [row[:] for row in data]
or using copy.deepcopy from the standard copy module.
A third problem is that you are using just two spaces for indenting. This is not smart... when working in Python you should use 4-space indents and never hard tab characters. Note that I said that doing this (using two spaces) is not smart, not that you are not smart, so please don't take this personally (I made the very same error myself when starting with Python). If you really want to be different, do so by writing amazing bug-free software in Python, not by using bad indentation or choosing bad names for functions or variables. Focus on higher-level beauty.
One last problem is that (once you really understand why your code didn't work) you should read about Python list comprehensions, a tool that can greatly simplify your code if used judiciously. Your addition code could, for example, become:
    return Matrix([[a + b for a, b in zip(my_row, other_row)]
                   for my_row, other_row in zip(self.data, other.data)])
To a trained eye this is easier to read than your original code (and it's also faster).
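Putting the comprehension-based __add__ together with a defensive row copy (the copy is one of the suggestions above, not something the original question had), the whole class runs like this:

```python
class Matrix:
    def __init__(self, data):
        # Copy each row so two matrices never share the same inner lists.
        self.data = [row[:] for row in data]

    def __repr__(self):
        return repr(self.data)

    def __add__(self, other):
        # Pair up rows, then pair up elements within each row.
        return Matrix([[a + b for a, b in zip(my_row, other_row)]
                       for my_row, other_row in zip(self.data, other.data)])

x = Matrix([[1, 2, 3], [2, 3, 4]])
y = Matrix([[10, 10, 10], [10, 10, 10]])
print((x + y).data)      # [[11, 12, 13], [12, 13, 14]]
print((x + x + y).data)  # [[12, 14, 16], [14, 16, 18]]
```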