query for values based on date w/ Django ORM - python

I have a bunch of objects that have a value and a date field:
obj1 = Obj(date='2009-8-20', value=10)
obj2 = Obj(date='2009-8-21', value=15)
obj3 = Obj(date='2009-8-23', value=8)
I want this returned:
[10, 15, 0, 8]
or better yet, an aggregate of the total up to that point:
[10, 25, 25, 33]
It would be best to get this data directly from the database, but otherwise I can do the totaling pretty easily with a for loop.
I'm using Django's ORM and Postgres.
Edit:
Just to note that my example only covers a few days, but in practice I have hundreds of objects covering a couple of decades... What I'm trying to do is create a line graph showing how the sum of all my objects has grown over time (a very long time).

This one isn't tested, since it's a bit too much of a pain to set up a Django table to test with:
from datetime import date, timedelta

# http://www.ianlewis.org/en/python-date-range-iterator
def datetimeRange(from_date, to_date=None):
    while to_date is None or from_date <= to_date:
        yield from_date
        from_date = from_date + timedelta(days=1)

start = date(2009, 8, 20)
end = date(2009, 8, 23)

objects = Obj.objects.filter(date__gte=start)
objects = objects.filter(date__lte=end)

results = {}
for o in objects:
    results[o.date] = o.value

return [results.get(day, 0) for day in datetimeRange(start, end)]
This avoids running a separate query for every day.

result_list = []
for day in range(20, 24):
    result = Obj.objects.get(date=datetime(2009, 8, day))
    if result:
        result_list.append(result.value)
    else:
        result_list.append(0)
return result_list
If you have more than one Obj per date, you'll need to check the length of the result and iterate over the matches in case there is more than one.
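For example, a small hedged sketch of that per-date handling, assuming the Obj model from the question (filter() returns an empty queryset instead of raising when a date has no objects):
from datetime import date

day_objs = Obj.objects.filter(date=date(2009, 8, 20))
if len(day_objs) > 1:
    # several objects share this date, so combine their values
    day_total = sum(o.value for o in day_objs)
elif day_objs:
    day_total = day_objs[0].value
else:
    day_total = 0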

If you loop through Obj.objects.get 100 times, you're doing 100 SQL queries. Obj.objects.filter will return the results in one SQL query, but it also selects all model fields. The right way to do this is to use Obj.objects.values_list, which will do this in a single query and select only the 'value' field.
start_date = date(2009, 8, 20)
end_date = date(2009, 8, 23)
objects = Obj.objects.filter(date__range=(start_date,end_date))
# values_list and 'value' aren't related. 'value' should be whatever field you're querying
val_list = objects.values_list('value',flat=True)
# val_list = [10, 15, 8]
To do a running aggregate of val_list, you can do this (not certain that this is the most Pythonic way):
val_list = list(val_list)  # a QuerySet doesn't support item assignment, so make it a list first
for i in xrange(len(val_list)):
    if i > 0:
        val_list[i] = val_list[i] + val_list[i-1]
# val_list = [10, 25, 33]
EDIT: If you need to account for missing days, @Glenn Maynard's answer is actually pretty good, although I prefer the dict() syntax:
objects = Obj.objects.filter(date__range=(start_date, end_date)).values('date', 'value')
val_dict = dict((obj['date'], obj['value']) for obj in objects)
# I'm stealing datetimeRange from @Glenn Maynard
val_list = [val_dict.get(day, 0) for day in datetimeRange(start_date, end_date)]
# val_list = [10,15,0,8]
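Since the original poster said it would be best to get this directly from the database: on Django 2.0 or newer, a window function can make Postgres compute the running total itself. A hedged sketch, not tested against the question's model (missing days would still need to be filled in with something like datetimeRange above):
from django.db.models import F, Sum, Window

qs = (Obj.objects
      .filter(date__range=(start_date, end_date))
      .annotate(running_total=Window(expression=Sum('value'),
                                     order_by=F('date').asc()))
      .values_list('date', 'running_total'))
# -> [(date(2009, 8, 20), 10), (date(2009, 8, 21), 25), (date(2009, 8, 23), 33)]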

Related

Django filter query using Q

Can anyone help me? I'm trying to use a Django filter with Q.
This is my function:
def get_first_time_customer_ids(year: int, month: int) -> QuerySet:
    return Customer.objects.filter(
        Q(bookings__status=Booking.STATUS.completed,
          bookings__pickup_time__year=year,
          bookings__pickup_time__month=month) &
        ~Q(bookings__status=Booking.STATUS.completed,
           bookings__pickup_time__lt=date(year, month, 1))
    ).distinct().values_list('id', flat=True)
What I'm trying to achieve is to get all the customer IDs whose first booking falls in the given year and month.
But it's failing on my test case.
My test case:
def test_get_first_time_customer_ids(self) -> None:
    customer_1 = Customer.objects.create(name="Customer 1")
    customer_2 = Customer.objects.create(name="Customer 2")
    Booking.objects.bulk_create([
        Booking(number="A", customer=customer_1, price=100_000, status=Booking.STATUS.completed,
                pickup_time=dt(2023, 2, 4, 12), route_id=1, vehicle_category_id=1),
        Booking(number="B", customer=customer_1, price=100_000, status=Booking.STATUS.completed,
                pickup_time=dt(2023, 1, 5, 12), route_id=1, vehicle_category_id=1),
        Booking(number="E", customer=customer_2, price=100_000, status=Booking.STATUS.completed,
                pickup_time=dt(2023, 2, 10, 12), route_id=1, vehicle_category_id=1)
    ])
    ids = get_first_time_customer_ids(2023, 2)
    self.assertTrue(customer_2.id in ids)
    self.assertFalse(customer_1.id in ids)
It's failing on the last line: the ID for customer_1 is included in the query result, and it shouldn't be. Any help is appreciated.
def get_first_time_customer_ids(year: int, month: int) -> QuerySet:
    qs1 = Customer.objects.filter(
        bookings__status=Booking.STATUS.completed,
        bookings__pickup_time__year=year,
        bookings__pickup_time__month=month,
    ).distinct().values("id")
    qs2 = Customer.objects.filter(
        bookings__status=Booking.STATUS.completed,
        bookings__pickup_time__lt=date(year, month, 1),
    ).distinct().values("id")
    return qs1.exclude(id__in=qs2).values_list('id', flat=True)
Try this code.
I added distinct to both querysets (qs1 and qs2), whereas in the original code distinct is applied only at the end.
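As a further hedged sketch (not part of the answer above, and assuming Django 2.0 or newer, where aggregates accept a filter argument): you could annotate each customer with the timestamp of their first completed booking and filter on that, which expresses "first booking in this month" directly:
from datetime import date
from django.db.models import Min, Q

def get_first_time_customer_ids(year: int, month: int) -> QuerySet:
    month_start = date(year, month, 1)
    next_month = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    # earliest completed booking per customer, then keep only customers whose
    # earliest booking falls inside the requested month
    return (Customer.objects
            .annotate(first_pickup=Min('bookings__pickup_time',
                                       filter=Q(bookings__status=Booking.STATUS.completed)))
            .filter(first_pickup__gte=month_start, first_pickup__lt=next_month)
            .values_list('id', flat=True))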

How to parallelize for loops in Pyspark?

I am trying to convert some Pandas code to Pyspark, which will run on an EMR cluster. This is my first time working with Pyspark, and I am not sure what the optimal way to code the objective is. The job is trying to achieve the following:
There is a base dataframe with schema like so:
institution_id, user_id, st_date
For every unique institution_id, get all users
For every user for the institution_id, take all unique st_dates in sorted order, get the difference between pairs of consecutive st_dates and output a dictionary
Here is what the code looks like as of now:
def process_user(current_user, inst_cycles):
    current_user_dates = np.sort(current_user.st_date.unique())
    if current_user_dates.size > 1:
        prev_date = pd.to_datetime(current_user_dates[0]).date()
        for current_datetime in current_user_dates[1:]:
            current_date = pd.to_datetime(current_datetime).date()
            month = current_date.month
            delta = current_date - prev_date
            cycle_days = delta.days
            inst_cycles[month][cycle_days] += 1
            prev_date = current_date
    return inst_cycles

def get_inst_monthly_distribution(current_inst):
    inst_cycles = defaultdict(lambda: defaultdict(int))
    inst_user_ids = current_inst.select('user_id').distinct().collect()
    for _, user_id in enumerate(inst_user_ids):
        user_id_str = user_id[0]
        current_user = current_inst.filter(current_inst.user_id == user_id_str)
        inst_cycles = process_user(current_user, inst_cycles)
    return inst_cycles

def get_monthly_distributions(inst_ids, df):
    cycles = {}
    for _, inst_id_str in enumerate(inst_ids.keys()):
        current_inst = df.filter(df.inst_id == inst_id_str)
        cycles[inst_id_str] = get_inst_monthly_distribution(current_inst)
    return cycles

def execute():
    df = load_data()  # df is a Spark dataframe
    inst_names = get_inst_names(df)
    monthly_distributions = get_monthly_distributions(inst_names, df)
I think this code is not taking advantage of the parallelism of Spark, and can be coded in a much better way without the for loops. Is that correct?
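For reference, a hedged sketch of how the consecutive-date differences could be computed with Spark window functions instead of Python-side loops, assuming the base DataFrame has the institution_id, user_id and st_date columns described above (column names are taken from the question; this is a sketch, not a tested rewrite):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('institution_id', 'user_id').orderBy('st_date')

cycles = (df
          .select('institution_id', 'user_id', 'st_date').distinct()
          .withColumn('prev_date', F.lag('st_date').over(w))
          .withColumn('cycle_days', F.datediff('st_date', 'prev_date'))
          .where(F.col('cycle_days').isNotNull())
          .withColumn('month', F.month('st_date'))
          .groupBy('institution_id', 'month', 'cycle_days')
          .count())
This keeps the per-user work inside Spark, so it can be distributed across the cluster instead of being looped over in the driver.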

Adding rows in bulk to PyTables array

I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10). It's a little cumbersome in the code to add one row at a time via the normal method, e.g.:
data_batch = experiment.read()
last_time = time.time()
for data_row in data_batch:
    row = table.row
    row['timestamp'] = last_time
    last_time += dt
    row['column1'] = data_row[0]
    row['column2'] = data_row[1]
    row.append()
table.flush()
I would much rather do something like this:
data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)
table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()
Does anyone know if there is some function that does what table.append_n_rows would do? Right now, all I can do is [table.row for i in range(num_rows)], which I feel is hacky and inefficient.
You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array. This includes: "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy arrays because I routinely work with them. Your answer shows how to use a list of tuples.)
There is a significant performance advantage to adding data in batches instead of one row at a time. I ran some tests and posted them to SO a few years ago. I/O performance is primarily related to the number of batches, not the batch size. Take a look at this answer for details: pytables writes much faster than h5py
Also, if you are going to create a large table, consider setting the expectedrows parameter when you create the table. This will also improve I/O performance, and it has the side benefit of setting an appropriate chunksize.
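For example, a minimal sketch of passing expectedrows at table creation time (the file name, the MyDescription class and the row estimate are placeholders, not taken from the question):
import tables as tb

h5f = tb.open_file('experiment.h5', mode='w')
# expectedrows is an estimate, not a hard limit; it lets PyTables pick a sensible chunksize
table = h5f.create_table('/', 'readings', MyDescription, "Experiment data",
                         expectedrows=1_000_000)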
Recommended approach with your data.
data_batch = experiment.read()
last_time = time.time()
row_list = []
for data_row in data_batch:
    row_list.append((last_time, data_row[0], data_row[1]))
    last_time += dt
your_table.append(row_list)
your_table.flush()
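Since table.append() also accepts NumPy structured arrays, here is a hedged sketch of that variant, closer to the column-wise style the question asked for (the float64 dtypes are an assumption; the field names must match your table description):
import numpy as np

num_rows = len(data_batch)
recarr = np.empty(num_rows, dtype=[('timestamp', 'f8'),
                                   ('column1', 'f8'),
                                   ('column2', 'f8')])
recarr['timestamp'] = last_time + np.arange(num_rows) * dt
recarr['column1'] = data_batch[:, 0]
recarr['column2'] = data_batch[:, 1]
last_time += dt * num_rows

your_table.append(recarr)
your_table.flush()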
There is an example in the source code. I'm going to paste it here to avoid a dead link in the future.
import tables as tb

class Particle(tb.IsDescription):
    name = tb.StringCol(16, pos=1)     # 16-character String
    lati = tb.IntCol(pos=2)            # integer
    longi = tb.IntCol(pos=3)           # integer
    pressure = tb.Float32Col(pos=4)    # float (single-precision)
    temperature = tb.FloatCol(pos=5)   # double (double-precision)

fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle, "A table")

# Append several rows in only one call
table.append([("Particle: 10", 10, 0, 10 * 10, 10**2),
              ("Particle: 11", 11, -1, 11 * 11, 11**2),
              ("Particle: 12", 12, -2, 12 * 12, 12**2)])
fileh.close()

How to append rows to pandas DataFrame efficiently

I am trying to create a dummy file to make some ML predictions afterwards. The input is about 2000 'routes' and I want to create a dummy that contains year-month-day-hour combinations for 7 days, meaning 168 rows per route, about 350k rows in total.
The problem I am facing is that pandas becomes terribly slow at appending rows beyond a certain size.
I am using the following code:
DAYS = [0, 1, 2, 3, 4, 5, 6]
HODS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
ISODOW = {
    1: "monday",
    2: "tuesday",
    3: "wednesday",
    4: "thursday",
    5: "friday",
    6: "saturday",
    7: "sunday"
}

def createMyPredictionDummy(start=datetime.datetime.now(), sourceFile=(utils.mountBasePath + 'routeProperties.csv'), destFile=(utils.outputBasePath + 'ToBePredictedTTimes.csv')):
    '''Generate a dummy file that can be used for predictions'''
    data = ['route', 'someProperties']
    dataFile = data + ['yr', 'month', 'day', 'dow', 'hod']
    # New DataFrame with all required columns
    file = pd.DataFrame(columns=dataFile)
    # Old data frame that has only the target columns
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    df = df[data]
    # Counter - To avoid constant lookup for length of the DF
    ix = 0
    routes = df['route'].drop_duplicates().tolist()
    # Iterate through all routes and create a row for every route-yr-month-day-hour combination for 7 days --> about 350k rows
    for no, route in enumerate(routes):
        print('Current route is %s which is no. %g out of %g' % (str(route), no+1, len(routes)))
        routeDF = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            day = tmpDate.day
            month = tmpDate.month
            year = tmpDate.year
            dow = ISODOW[tmpDate.isoweekday()]
            for hod in HODS:
                file.loc[ix] = routeDF + [year, month, day, dow, hod]  # This is becoming terribly slow
                ix += 1
    file.to_csv(destFile, index=False)
    print('Wrote file')
I think the main problem lies in appending the row with .loc[] - Is there any way to append a row more efficiently?
If you have any other suggestions, I am happy to hear them all!
Thanks and best,
carbee
(this is more of a long comment than an answer, sorry but without example data I can't run much...)
Since it seems to me that you're adding rows one at a time sequentially (i.e. the dataframe is indexed by integers accessed sequentially) and you always know the order of the columns, you're probably much better off creating a list of lists and then transforming it to a DataFrame, that is, define something like file_list = [] and then replace the line file.loc[ix] = ... by:
file_list.append(routeDF + [year, month, day, dow, hod])
In the end, you can then define
file = pd.DataFrame(file_list, columns=dataFile)
If furthermore all your data is of a fixed type (e.g. int, depending on what is your routeDF and by not converting dow until after creating the dataframe) you might be even better off by pre-allocating a numpy array and writing into it, but I'm quite sure that adding elements to a list will not be the bottleneck of your code, so this is probably excessive optimization.
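Put together, a hedged sketch of that list-of-lists change applied to the loop from the question (all names are taken from the question's code; only the inner assignment and the final DataFrame construction differ):
file_list = []
for no, route in enumerate(routes):
    routeDF = df.loc[df['route'] == route].iloc[0].tolist()
    for i in range(0, 7):
        tmpDate = start + datetime.timedelta(days=i)
        for hod in HODS:
            # append a plain Python list instead of writing into the DataFrame
            file_list.append(routeDF + [tmpDate.year, tmpDate.month, tmpDate.day,
                                        ISODOW[tmpDate.isoweekday()], hod])

file = pd.DataFrame(file_list, columns=dataFile)
file.to_csv(destFile, index=False)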
Another alternative to minimize changes in your code, simply preallocate enough space by creating a DataFrame full of NaN instead of a DataFrame with no lines, i.e. change the definition of file to (after moving the line with drop_duplicates up):
file = pd.DataFrame(columns=dataFile, index=range(len(routes)*168))
I'm quite sure this is faster than your code, but it might still be slower than the list of lists approach above since it won't know which data types to expect until you fill in data (it might e.g. convert your ints to float which is not ideal). But again, once you get rid of the continuous reallocations due to expanding a DataFrame at each step, this will probably not be your bottleneck anymore (the double loop will likely be.)
You create an empty dataframe named file and then fill it by appending rows; this seems to be the problem. If you do something like this instead:
def createMyPredictionDummy(...):
    ...
    # make it yield a dict of attributes from the for loop
    for hod in HODS:
        yield data

# then use this to create the *file* dataframe outside that function
newDF = pd.DataFrame([r for r in createMyPredictionDummy()])
newDF.to_csv(destFile, index=False)
print('Wrote file')
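A fuller, hedged sketch of that generator approach, reusing the loop structure and names from the question (routes, df, dataFile, HODS, ISODOW and destFile are assumed to exist as in the original function):
def generate_rows(routes, df, start):
    for route in routes:
        route_vals = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            for hod in HODS:
                # one dict per route/day/hour combination, keyed by the output columns
                yield dict(zip(dataFile,
                               route_vals + [tmpDate.year, tmpDate.month, tmpDate.day,
                                             ISODOW[tmpDate.isoweekday()], hod]))

newDF = pd.DataFrame([r for r in generate_rows(routes, df, start=datetime.datetime.now())])
newDF.to_csv(destFile, index=False)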

combine 2 tables and aggregate data

Could you please help with the following?
I need to extract data from a MySQL database and aggregate it.
There are two tables in the database, and both contain data at a different timestep.
I now need to make one new table (txt) where all data of table 1 are combined with the table 2 data.
I only need the data of table 2 whose times correspond most closely to the timesteps of table 1.
For a better understanding, see an example of the tables here:
https://www.dropbox.com/s/mo2q0hj72ilx05n/data%20aggregation.xlsx?dl=0
I already have Python code which extracts the hexadecimal data and builds table 2.
I also have code which builds table 1.
I need to combine both now.
Thank you very much for your advice!
After copying your data tables into Python lists, I had to split up the values in table 2 back into independent series. Overall you may be able to skip the step where you consolidate these values into the single table Table2.
The key to solving this is to write a simple class that implements __getitem__, taking a single key argument and returning the corresponding value. For instance, in the case of a regular Python dict, __getitem__ returns the entry that exactly matches the key, or raises a KeyError if there is no match. In your case, I implemented __getitem__ to just return the entry whose timestamp has the minimum difference from the given timestamp, in this line:
closest = min(self.data, key=lambda x: abs(x[0]-key))
(Left as an exercise to the OP - how to handle the case where the key falls exactly between two entries.) If you need to adjust the lookup logic, just change the implementation of __getitem__ - everything else in the code will remain the same.
Here is my sample implementation:
# t1 and t2 are lists of tab-delimited strings copy-pasted
# from the OP's spreadsheet
TAB = '\t'
t1data = [t.split(TAB) for t in t1]
t2data = [t.split(TAB) for t in t2]

# split table 2 into individual (time, value) series per parameter
readings = {'A': [], 'B': [], 'C': []}
for trec in t2data:
    t, a, b, c = trec
    t = int(t)
    if a: readings['A'].append((t, int(a)))
    if b: readings['B'].append((t, int(b)))
    if c: readings['C'].append((t, int(c)))

# define class for retrieving value with "closest" key if
# there is not an exact match
class LookupClosest(object):
    def __init__(self, pairs):
        self.data = pairs
    def __getitem__(self, key):
        # implement logic here to find closest matching item in series
        # TODO - what if key is exactly between two different values?
        closest = min(self.data, key=lambda x: abs(x[0]-key))
        return closest[1]

# convert each data series to LookupClosest
for key in "ABC":
    readings[key] = LookupClosest(readings[key])

# extract and display data
for vals in t1data:
    t = int(vals[0])
    gps = vals[1]
    a = readings['A'][t]
    b = readings['B'][t]
    c = readings['C'][t]
    rec = t, gps, a, b, c
    print(rec)
This prints (I modified the Table1 data so that you can tell the difference from one record to the next):
( 1, 'x01', 1, 10, 44)
(10, 'x10', 2, 11, 47)
(21, 'x21', 4, 12, 45)
(30, 'x30', 3, 12, 44)
(41, 'x41', 4, 12, 47)
(52, 'x52', 2, 10, 48)
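If the series ever get large, scanning the whole list with min() on every lookup is O(n); here is a hedged sketch of a bisect-based __getitem__ as an alternative (not part of the original answer; it assumes the (time, value) pairs can be kept sorted by time):
import bisect

class LookupClosestSorted(object):
    def __init__(self, pairs):
        self.data = sorted(pairs)
        self.times = [t for t, _ in self.data]
    def __getitem__(self, key):
        # find the insertion point, then compare the neighbours on either side
        i = bisect.bisect_left(self.times, key)
        if i == 0:
            return self.data[0][1]
        if i == len(self.times):
            return self.data[-1][1]
        before, after = self.data[i - 1], self.data[i]
        return before[1] if key - before[0] <= after[0] - key else after[1]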
