Sorting list in python - Give priority - python

I have two lists, each is made up of objects having a date. I am trying to combine them and then order by date:
combined = invoices + payments
combined.sort(key = lambda x: x.date)
All well and good. However, if there is both an invoice object and payment object on the same day, I want the payment to be placed in the list before the invoice.

Just do this instead:
combined = payments + invoices
Python's list.sort method is guaranteed to be stable (see the Python docs on standard types, section 5.6.4, note 9).
That means if there are 2 elements a and b on your list such that key(a) == key(b), then they'll keep their relative order (that means, if a was placed before b on the unsorted list, it'll still be like that after it's sorted).
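A minimal sketch of that stability guarantee at work. The Invoice and Payment namedtuples and their ref field are hypothetical stand-ins for the real objects, and the dates are plain integers to keep the example short:

```python
from collections import namedtuple

# Hypothetical stand-ins for the real invoice and payment objects.
Invoice = namedtuple("Invoice", ["date", "ref"])
Payment = namedtuple("Payment", ["date", "ref"])

invoices = [Invoice(2, "inv-a"), Invoice(1, "inv-b")]
payments = [Payment(2, "pay-a"), Payment(3, "pay-b")]

# Payments come first in the input, so the stable sort keeps each payment
# ahead of any invoice that shares its date.
combined = payments + invoices
combined.sort(key=lambda x: x.date)

refs = [x.ref for x in combined]
print(refs)  # ['inv-b', 'pay-a', 'inv-a', 'pay-b']
```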

You should be able to do something like this to get the sorting you want:
combined.sort(key = lambda x: (x.date, 1 if x in invoices else 0))
The idea is that, as long as the objects are distinct, you can create a sorting tuple that includes an indicator of which list the object came from. It sorts by the dates first, then falls back to the 2nd field if the dates match.
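A variation on the same tuple key, sketched under the assumption that invoices and payments are instances of two different classes (the Invoice and Payment classes here are hypothetical). Checking the type avoids scanning the invoices list for every element, which is what x in invoices does:

```python
class Invoice:
    def __init__(self, date):
        self.date = date

class Payment:
    def __init__(self, date):
        self.date = date

invoices = [Invoice(2), Invoice(1)]
payments = [Payment(2), Payment(3)]
combined = invoices + payments

# The second tuple field is the tie-breaker: False (0) for payments sorts
# before True (1) for invoices when the dates are equal.
combined.sort(key=lambda x: (x.date, isinstance(x, Invoice)))

order = [(type(x).__name__, x.date) for x in combined]
print(order)
```

With this key the order of concatenation no longer matters.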

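In addition to key=, you can also use cmp= in the sort function. Note that this works in Python 2 only: both the cmp= argument and the built-in cmp() were removed in Python 3.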
class Invoice(object):
    P = 1
    def __init__(self, date):
        self.date = date
class Payment(object):
    P = 0
    def __init__(self, date):
        self.date = date
l = [Invoice(10), Payment(10), Invoice(10)]
def xcmp(x, y):
    c0 = cmp(x.date, y.date)
    return c0 if c0 != 0 else cmp(x.__class__.P, y.__class__.P)
l.sort(cmp=xcmp)
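For Python 3, a rough equivalent of the same comparison can be sketched with functools.cmp_to_key. The local cmp() helper is an assumption added here, since Python 3 has no built-in of that name:

```python
from functools import cmp_to_key

class Invoice:
    P = 1  # tie-break priority: higher sorts later
    def __init__(self, date):
        self.date = date

class Payment:
    P = 0
    def __init__(self, date):
        self.date = date

def cmp(a, b):
    # Stand-in for the Python 2 built-in: returns -1, 0 or 1.
    return (a > b) - (a < b)

def xcmp(x, y):
    c0 = cmp(x.date, y.date)
    return c0 if c0 != 0 else cmp(type(x).P, type(y).P)

l = [Invoice(10), Payment(10), Invoice(10), Payment(5)]
l.sort(key=cmp_to_key(xcmp))

names = [(type(x).__name__, x.date) for x in l]
print(names)
```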

Related

for loop with same dataframe on both side of the operator

I have defined 10 different DataFrames (A06_df, A07_df, etc.), each of which picks up six different data inputs in a daily time series spanning a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc for a few more formatting operations
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i=i.fillna(0)
But this does not do the trick.
Any help is appreciated
Because i.fillna() returns a new object (an updated copy of your original dataframe), i=i.fillna(0) rebinds the loop variable i but does not change the list contents A06_df, A07_df, ...
I suggest you copy the updated content in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i=i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k,i in dfs.items():
    i=i.fillna(0)
    # More code here
    dfs_updated[k] = i
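The rebinding behaviour is a property of Python names rather than of pandas, so it can be sketched without any dataframes. In this illustration plain strings stand in for the DataFrames, and clean() is a hypothetical helper that would hold all the formatting steps:

```python
# Plain strings stand in for DataFrames here: like DataFrame.fillna(),
# str.strip() returns a new object instead of modifying the original.
raw = {"A06": "  oil ", "A07": " gas  "}

# Rebinding the loop variable does not touch the dictionary.
for value in raw.values():
    value = value.strip()
print(raw)  # still contains the unstripped strings

def clean(value):
    """Hypothetical helper collecting all formatting steps in one place."""
    return value.strip().upper()

# Storing the returned objects under the same keys keeps the results.
cleaned = {name: clean(value) for name, value in raw.items()}
print(cleaned)  # {'A06': 'OIL', 'A07': 'GAS'}
```

The same pattern applies to a dictionary of DataFrames, with clean() calling fillna() and the other formatting operations and returning the resulting frame.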

Django - Iterate over queryset and add static values

In the following queryset I am filtering planned hours per week (displayval is my week in this queryset) by employee. I would like to add an item for planned hours = 0 when the employee has no hours planned for a week I'm filtering by.
What's the easiest way to achieve this?
def DesignHubR(request):
    emp3_list = (Projectsummaryplannedhours.objects.values_list('displayval', 'employeename')
                 .filter(businessunit='a')
                 .filter(billinggroup__startswith='PLS - Project')
                 .filter(Q(displayval=sunday2)|Q(displayval=sunday))
                 .annotate(plannedhours__sum=Sum('plannedhours')))
    emp3 = map(lambda x: {'date': x[0], 'employee_name': x[1], 'planned_hours': x[2]}, emp3_list)
    context = {'sunday': sunday, 'sunday2': sunday2, 'emp3': emp3}
    return render(request,'department_hub_ple.html', context)
I think that you can use the Coalesce(*expressions, **extra) function to solve your problem.
Accepts a list of at least two field names or expressions and returns the first non-null value (note that an empty string is not considered a null value).
So your query will be looking like:
from django.db.models import Q, Sum, Value
from django.db.models.functions import Coalesce
emp3_list = \
    Projectsummaryplannedhours.objects.\
    filter(
        Q(businessunit='a') &
        Q(billinggroup__startswith='PLS - Project') &
        (Q(displayval=sunday2) | Q(displayval=sunday))
    ).\
    annotate(plannedhours__sum=Coalesce(
        Sum('plannedhours'), Value(0)
        )
    ).\
    values_list('displayval', 'employeename')
See https://docs.djangoproject.com/en/1.9/ref/models/database-functions/#coalesce for more information.
This gives you plannedhours__sum = 0 if no entries to sum exist. If you also want to add an additional parameter to each entry where plannedhours__sum = 0, you can use a Django conditional expression. Read about the Case expression for more information (https://docs.djangoproject.com/en/1.11/ref/models/conditional-expressions/#case).
Case() accepts any number of When() objects as individual arguments. Other options are provided using keyword arguments. If none of the conditions evaluate to TRUE, then the expression given with the default keyword argument is returned. If a default argument isn’t provided, None is used.
from django.db.models import Case, IntegerField, Q, Sum, Value, When
from django.db.models.functions import Coalesce
emp3_list = \
    Projectsummaryplannedhours.objects.\
    filter(
        Q(businessunit='a') &
        Q(billinggroup__startswith='PLS - Project') &
        (Q(displayval=sunday2) | Q(displayval=sunday))
    ).\
    annotate(plannedhours__sum=Coalesce(
        Sum('plannedhours'), Value(0)
        ),
        x=Case(When(plannedhours__sum=0, then=Value(0)),
               output_field=IntegerField())
    ).\
    values_list('displayval', 'employeename')
This gives each entry an additional parameter x, which equals 0 if planned hours = 0 and None otherwise. You can also filter emp3_list by annotated values.
As a result you can pass your queryset to a template context = {'sunday': sunday, 'sunday2': sunday2, 'emp3': emp3_list}, iterate over it there and get the attributes you need:
for q in emp3_list:
    print(q[0], q[1], q[2])
Hope it will help you.

Python built-in max(), when 2 values are the same which one gets picked?

I have code that processes data based on some dates.
Let's say:
case1:
values1 with date1 = '2002-02-01'
values2 with date2 = '2004-02-01'
case2:
values1 with date1 ='2001-01-01'
values2 with date2 ='2001-01-01'
I need to get the most recent record. Everything works fine when my values have different dates, but I am not sure what happens when records have the same date and I call max(date1, date2).
Question. Which max value is returned when the values are equal, like in case 2?
For multiple values that are all the maximum, the first such value is returned:
>>> class Equal:
...     def __init__(self, id):
...         self.id = id
...     def __repr__(self):
...         return f"Equal({self.id!r})"
...     def __gt__(self, other):
...         return False
...
>>> max([Equal(1), Equal(2), Equal(3)])
Equal(1)
This is explicitly documented:
"If multiple items are maximal, the function returns the first one encountered."
Source: https://docs.python.org/3/library/functions.html#max
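Applied to records that carry a date, the tie rule matters once a key function is involved. A small sketch, assuming hypothetical (label, date) tuples as the records:

```python
from datetime import date

# Hypothetical records: (label, date) pairs with identical dates.
records = [("values1", date(2001, 1, 1)), ("values2", date(2001, 1, 1))]

# With equal keys, max() returns the first record it encounters.
first = max(records, key=lambda r: r[1])

# Reversing the input makes the last of the tied records win instead.
last = max(reversed(records), key=lambda r: r[1])

print(first[0], last[0])  # values1 values2
```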

Iterating through a list of Pandas DF's to then iterate through each DF's row

This may be a slightly insane question...
I've got a single Pandas DF of articles which I have then split into multiple DF's so each DF only contains the articles from a particular year. I have then put these variables into a list called box_of_years.
indexed_df = article_db.set_index('date')
indexed_df = indexed_df.sort_index()
year_2004 = indexed_df.truncate(before='2004-01-01', after='2004-12-31')
year_2005 = indexed_df.truncate(before='2005-01-01', after='2005-12-31')
year_2006 = indexed_df.truncate(before='2006-01-01', after='2006-12-31')
year_2007 = indexed_df.truncate(before='2007-01-01', after='2007-12-31')
year_2008 = indexed_df.truncate(before='2008-01-01', after='2008-12-31')
year_2009 = indexed_df.truncate(before='2009-01-01', after='2009-12-31')
year_2010 = indexed_df.truncate(before='2010-01-01', after='2010-12-31')
year_2011 = indexed_df.truncate(before='2011-01-01', after='2011-12-31')
year_2012 = indexed_df.truncate(before='2012-01-01', after='2012-12-31')
year_2013 = indexed_df.truncate(before='2013-01-01', after='2013-12-31')
year_2014 = indexed_df.truncate(before='2014-01-01', after='2014-12-31')
year_2015 = indexed_df.truncate(before='2015-01-01', after='2015-12-31')
year_2016 = indexed_df.truncate(before='2016-01-01', after='2016-12-31')
box_of_years = [year_2004, year_2005, year_2006, year_2007,
year_2008, year_2009, year_2010, year_2011,
year_2012, year_2013, year_2014, year_2015,
year_2016]
I've written various functions to tokenize, clean up and convert the tokens into a FreqDist object and wrapped those up into a single function called year_prep(). This works fine when I do
year_2006 = year_prep(year_2006)
...but is there a way I can iterate across every year variable, apply the function and have it transform the same variable, short of just repeating the above for every year?
I know repeating myself would be the simplest way, but not necessarily the cleanest. I may perhaps have this backwards and do the slicing later on but at that point I feel like the layers of lists will be out of hand as I'm going from a list of years to a list of years, containing a list of articles, containing a list of every word in the article.
I think you can use groupby by year with a custom function:
import pandas as pd
start = pd.to_datetime('2004-02-24')
rng = pd.date_range(start, periods=30, freq='50D')
df = pd.DataFrame({'Date': rng, 'a':range(30)})
#print (df)
def f(x):
    print (x)
    #return year_prep(x)
    #some custom output
    return x.a + x.Date.dt.month
print (df.groupby(df['Date'].dt.year).apply(f))

How can I do windowed query on multiple columns primary key?

Based on example found here but I guess I'm not understanding it. This works for single column primary keys but fails on multiple ones.
This is my code
@classmethod
def column_windows(cls, q, columns, windowsize, where = None):
    """Return a series of WHERE clauses against
    a given column that break it into windows.
    Result is an iterable of tuples, consisting of
    ((start, end), whereclause), where (start, end) are the ids.
    Requires a database that supports window functions,
    i.e. Postgresql, SQL Server, Oracle.
    Enhance this yourself ! Add a "where" argument
    so that windows of just a subset of rows can
    be computed.
    """
    #Here is the thing... how to compare...
    def int_for_range(start_id, end_id):
        if end_id:
            return and_(
                columns>=start_id,
                columns<end_id
            )
        else:
            return columns>=start_id
    if isinstance(columns, Column):
        columns_k=(columns,)
    else:
        columns_k=tuple(columns)
    q2=None
    cols=()
    for c in columns:
        cols = cols + (c,)
        if not q2:
            q2=q.session.query(c)
        else:
            q2=q2.add_column(c)
    q2 = q2.add_column(func.row_number().over(order_by=columns_k).label('rownum'))
    q2=q2.filter(q._criterion).from_self(cols)
    if windowsize > 1:
        q2 = q2.filter("rownum %% %d=1" % windowsize)
    for res in q2:
        print res
    intervals = [id for id, in q2]
    while intervals:
        start = intervals.pop(0)
        if intervals:
            end = intervals[0]
        else:
            end = None
        yield int_for_range(start, end)
@classmethod
def windowed_query(cls, q, columns, windowsize):
    """Break a Query into windows on a given column."""
    for whereclause in cls.column_windows(q,columns, windowsize):
        for row in q.filter(whereclause).order_by(columns):
            yield row
My problem is comparing against the set of columns that make up the primary key. I guess some kind of recursive clause-generating function should do it... Let's try it...
Well, the result is not what I expected, but I got it to work. It now windows any query while keeping everything in place: multi-column unique ordering, and so on.
Here is my code; I hope it may be useful for someone else:
@classmethod
def window_query(cls, q, windowsize, windows=None):
    """
    q=Query object we want to window results
    windowsize=The number of elements each window has
    windows=The window, or window list, numbers: 1-based to query
    """
    windowselect=False
    if windows:
        if not isinstance(windows,list):
            windows=list(windows)
        windowselect=True
    #Appending u_columns to ordered counting subquery will ensure unique ordering
    u_columns=list([col for col in cls.getBestUniqueColumns()])
    #o_columns is the list of order by columns for the query
    o_columns=list([col for col in q._order_by])
    #we append columns from u_columns not in o_columns to ensure unique ordering but keeping the desired one
    sq_o_columns=list(o_columns)
    for col in u_columns:
        if not col in sq_o_columns:
            sq_o_columns.append(col)
    sub=None
    #we select unique columns in subquery that we'll need to join in parent query
    for col in u_columns:
        if not sub:
            sub=q.session.query(col)
        else:
            sub=sub.add_column(col)
    #Generate a tuple from sq_o_columns list (I don't know why over() won't accept list itself) TODO: more elegant
    sq_o_col_tuple=()
    for col in sq_o_columns:
        sq_o_col_tuple=sq_o_col_tuple + (col,)
    #we add row counting column, counting on generated combined ordering+unique columns tuple
    sub = sub.add_column(func.row_number().over(order_by=sq_o_col_tuple).label('rownum')).filter(q._criterion)
    #Prepare sub query to use as subquery (LOL)
    sub=sub.subquery('lacrn')
    #Prepare join ON clauses expression comparing unique columns defined by u_columns
    joinclause=expression.BooleanClauseList()
    for col in u_columns:
        joinclause=joinclause.__and__(col == sub.c[col.key])
    #Make the joining
    q=q.join(sub,joinclause)
    i=-1
    while True:
        #We try to query windows defined by windows list
        if windowselect:
            #We want selected-windows-results to be returned
            if windows:
                i=windows.pop(0)-1
            else:
                break
        else:
            #We want all-windows-results to be returned
            i=i+1
        res=q.filter(and_(sub.c.rownum > (i*windowsize), sub.c.rownum <= ((i+1)*windowsize))).all()
        if not (res or windowselect):
            #We end an all-windows-results query because there are no more results; we must check whether this is a
            #selected-window query, because selected windows may not exist and they are unordered
            #EX: [1,2,9999999999999,3] : Assuming the third page required has no results it will return pages 1, 2, and 3
            break
        for row in res:
            yield row
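The row-number bounds used in the final filter can be checked in isolation. This is only a plain-Python sketch of the arithmetic, with no database involved; window_bounds and rows_in_window are hypothetical helper names, not part of SQLAlchemy:

```python
def window_bounds(window_number, windowsize):
    """Return (low, high) so that low < rownum <= high selects a 1-based window."""
    i = window_number - 1
    return i * windowsize, (i + 1) * windowsize

def rows_in_window(rownums, window_number, windowsize):
    # Mirrors the filter: rownum > i*windowsize AND rownum <= (i+1)*windowsize
    low, high = window_bounds(window_number, windowsize)
    return [r for r in rownums if low < r <= high]

rownums = list(range(1, 11))  # row_number() values 1..10
print(rows_in_window(rownums, 1, 4))  # [1, 2, 3, 4]
print(rows_in_window(rownums, 3, 4))  # [9, 10]
```

Because row_number() starts at 1, the lower bound is exclusive and the upper bound inclusive, so consecutive windows neither overlap nor skip rows.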
