Creating histogram bins from Django queries - python

I'm trying to create bins with the count of prices to be used for a histogram.
I want the bins to be 0-1000, 1000-2000, 2000-3000 and so forth. If I just group by the price I get far too many different bins.
The code I've written seems to end up in an infinite loop (or at least the script is still running after an hour). I'm not sure how to do it correctly. Here is the code I wrote:
from itertools import zip_longest

from django.db.models import Count

def price_histogram(area_id, agency_id):
    # Get prices and total count for competitors
    query = HousePrice.objects.filter(area_id=area_id, cur_price__range=(1000, 30000)).exclude(agency_id=agency_id)
    count = query.values('cur_price').annotate(count=Count('cur_price')).order_by('cur_price')
    total = query.count()
    # Get prices and total count for the selected agency
    query_agency = HousePrice.objects.filter(area_id=area_id, agency_id=agency_id, cur_price__range=(1000, 30000))
    count_agency = query_agency.values('cur_price').annotate(count=Count('cur_price')).order_by('cur_price')
    total_agency = query_agency.count()
    # Make lists for x and y values
    x_comp = []
    y_comp = []
    x_agency = []
    y_agency = []
    bin_start = 0
    bin_end = 1000
    _count_comp = 0
    _count_agency = 0
    for row_comp, row_agency in zip_longest(count, count_agency, fillvalue={}):
        while bin_start < int(row_comp['cur_price']) < bin_end:
            _count_comp += row_comp['count']
            _count_agency += row_agency.get('count', 0)
        bin_start += 1000
        bin_end += 1000
        x_comp.append(str(bin_start) + "-" + str(bin_end) + " USD")
        x_agency.append(str(bin_start) + "-" + str(bin_end) + " USD")
        y_comp.append(_count_comp / total)
        y_agency.append(_count_agency / total_agency)
    return {'x_comp': x_comp, 'y_comp': y_comp, 'x_agency': x_agency, 'y_agency': y_agency}
I'm using Python 3.5 and Django 1.10.
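For reference, here is a minimal sketch (only covering the competitor side, and not a drop-in replacement) of how the bucketing can be done without the inner while loop: integer division maps each price straight to its 1000-wide bin. The model and field names follow the question.

from collections import Counter

def price_histogram_sketch(area_id, agency_id):
    # Bucket each price with integer division; there is no inner loop,
    # so nothing can spin forever.
    prices = HousePrice.objects.filter(
        area_id=area_id, cur_price__range=(1000, 30000)
    ).exclude(agency_id=agency_id).values_list('cur_price', flat=True)
    bins = Counter(int(p) // 1000 for p in prices)
    total = sum(bins.values())
    x = ["{}-{} USD".format(b * 1000, (b + 1) * 1000) for b in sorted(bins)]
    y = [bins[b] / total for b in sorted(bins)]
    return {'x': x, 'y': y}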

I'm a little late, but maybe the django-pivot library does what you want.
from django_pivot.histogram import histogram
query = HousePrice.objects.filter(area_id=area_id, cur_price__range=(1000, 30000)).exclude(agency_id=agency_id)
hist = histogram(query, 'cur_price', bins=list(range(1000, 30000, 1000)))

Is it possible to dynamically assign condition statements to a list in Python?

I am trying to create a list of conditions to use with numpy.select to create a 'Week #' column based on the date a record was created. However, it doesn't quite seem to work. Any suggestions?
from datetime import timedelta

import numpy as np

# demographic is an existing DataFrame with a 'Date' column.

# Creating list for start dates
weekStartDay = []
weekValues = []
weekConditions = []
counter = 1
demoStartDate = min(demographic['Date'])
demoEndDate = max(demographic['Date'])
while demoStartDate <= demoEndDate:
    weekStartDay.append(demoStartDate)
    demoStartDate += timedelta(days=7)
weekStartDay.append(demoStartDate)
while counter <= len(weekConditions):
    weekValues.append(counter + 1)
    counter += 1
# Assigning condition statements for numpy conditions
for i in range(len(weekStartDay)):
    weekConditions.append((demographic['Date'] >= weekStartDay[i]) & (demographic['Date'] < weekStartDay[i+1]))
# Creating week value assignment column
demographic['Week'] = np.select(weekConditions, weekValues)
I believe I've found a solution to the problem.
# Creating list for start dates
weekStartDay = []
weekValues = []
weekConditions = []
counter = 1
i = 0
demoStartDate = min(demographic['Date'])
demoEndDate = max(demographic['Date'])
while demoStartDate <= demoEndDate:
    weekStartDay.append(demoStartDate)
    demoStartDate += timedelta(days=7)
weekStartDay.append(demoStartDate)
while counter <= len(weekStartDay):
    weekValues.append(counter)
    counter += 1
# Assigning condition statements for numpy conditions
while i != len(weekStartDay):
    for i in range(len(weekStartDay)):
        weekConditions.append((demographic['Date'] >= weekStartDay[i-1]) & (demographic['Date'] < weekStartDay[i]))
    i += 1
# Creating week value assignment column
demographic['Week'] = np.select(weekConditions, weekValues)
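As an aside, if demographic is a pandas DataFrame (which the snippets suggest), the same week numbers can be computed without building condition lists at all. A sketch, assuming 'Date' is a datetime column; the frame below is a stand-in:

import pandas as pd

# Stand-in frame; in the question, demographic already exists.
demographic = pd.DataFrame(
    {'Date': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-01-15'])}
)
start = demographic['Date'].min()
# Days since the first record, floor-divided into 7-day buckets;
# +1 makes the earliest week 1 rather than 0.
demographic['Week'] = (demographic['Date'] - start).dt.days // 7 + 1
print(demographic)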

Get two separate values and add them together every hour while the code is running using Python

Here I have two values, 40 and 50, in a CSV file. I want to decrement each value every hour, separately, with the code running continuously for up to 24 hours.
What I need to do is read the two changing values every hour, sum them, and treat the result as a single output. This should happen every hour, for up to 24 hours.
Here is the code I wrote to read the two values and decrement them every hour, up to 24 hours.
50 decays every hour with constant 2.567.
40 decays every hour with constant 1.234.
Below is the code for the decrement process of the two values. Unfortunately, I have no idea how to take the two values from the running code, add them, and read them as one input.
Can anyone help me solve this?
import math
from datetime import timedelta

# data is an existing DataFrame with 'a', 'b' and 'date' columns.
x = data[data['a'] == 50]
x_now = x.iloc[0].loc['a']
last_x_value = 0
current_time = x.iloc[0].loc['date']
x_decrement_constant = math.exp(-2.567 * 1)
required_hours_of_generated_data = 24  # Generate data for this many hours.
x_records = [{'date': current_time, 'a': x_now}]
while True:
    next_record_time = current_time + timedelta(0, 3600)
    if last_x_value < len(x) - 1:
        if next_record_time < x.iloc[last_x_value + 1].loc['date']:
            new_x = x_now * x_decrement_constant
        else:
            new_x = x_now * x_decrement_constant + x.iloc[last_x_value + 1].loc['a']
            last_x_value = last_x_value + 1
    else:
        break

y = data[data['b'] == 40]
y_now = y.iloc[0].loc['b']
last_y_value = 0
current_time = y.iloc[0].loc['date']
y_decrement_constant = math.exp(-1.234 * 1)
required_hours_of_generated_data = 24  # Generate data for this many hours.
y_records = [{'date': current_time, 'b': y_now}]
while True:
    next_record_time = current_time + timedelta(0, 3600)
    if last_y_value < len(y) - 1:
        if next_record_time < y.iloc[last_y_value + 1].loc['date']:
            new_y = y_now * y_decrement_constant
        else:
            new_y = y_now * y_decrement_constant + y.iloc[last_y_value + 1].loc['b']
            last_y_value = last_y_value + 1
    else:
        break
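For the summation the question actually asks about, here is a minimal standalone sketch. It assumes the starting values (50 and 40) and hourly decay constants (2.567 and 1.234) given above, and it simulates the 24 hours rather than reading the CSV:

import math

x_now, y_now = 50.0, 40.0
x_factor = math.exp(-2.567)  # hourly decay factor for the 50 series
y_factor = math.exp(-1.234)  # hourly decay factor for the 40 series

for hour in range(1, 25):
    x_now *= x_factor
    y_now *= y_factor
    combined = x_now + y_now  # the single summed output for this hour
    print(hour, combined)

If the loop has to run in real time rather than simulated time, a time.sleep(3600) at the end of each iteration would pace it to one step per hour.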

Increasing speed for comparative analysis using Python

I have been thinking about this problem for a while now and I was hoping that someone here would have a suggestion to considerably increase the speed of this analysis using Python.
I basically have two files. File (1) contains coordinates composed of a letter, a start and an end, e.g. "a 1000 1100", and file (2) contains a dataset in which each datapoint is composed of a letter and a coordinate, e.g. "p 1350". The script counts how many datapoints fall within the borders of the coordinates, but only if the letter of the datapoint from file (2) matches the letter of the coordinate from file (1). In real-life datasets, file (1) contains > 50K coordinates and file (2) > 50 million datapoints, and the running time grows quickly with the number of datapoints. So I wonder if someone could come up with a more time-efficient approach.
Thanks!
My script starts at # script strategy, but I first simulate a minimal example dataset:
import numpy as np
import random
import string

# simulate data
c_size = 10
d_size = 1000000
# letters
letters = list(string.ascii_lowercase)
# coordinates
c1 = np.random.randint(low=100000, high=2000000, size=c_size)
c2 = np.random.randint(low=100, high=1000, size=c_size)
# data
data = np.random.randint(low=100000, high=2000000, size=d_size)

# script strategy
# create coordinate and count dicts
c_dict = {}
count_dict = {}
for start, end in zip(c1, c2):
    end = start + end
    c_l = random.choice(letters)
    ID = c_l + '_' + str(start) + '_' + str(end)
    count_dict[ID] = 0
    if c_l not in c_dict:
        c_dict[c_l] = [[start, end]]
    else:
        c_dict[c_l].append([start, end])

# count how many datapoints (x) are within the borders of the coordinates
for i in range(d_size):
    d_l = random.choice(letters)
    x = data[i]
    if d_l in c_dict:
        # increasing speed by only comparing data and coordinates with identical letter identifiers
        for coordinates in c_dict[d_l]:
            start = coordinates[0]
            end = coordinates[1]
            ID = d_l + '_' + str(start) + '_' + str(end)
            if x >= start and x <= end:
                count_dict[ID] += 1

# print output
for ID in count_dict:
    count = count_dict[ID]
    print(ID + '\t' + str(count))
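One possible speed-up, sketched under the assumption that the data can be grouped and sorted per letter once up front: replace the inner linear scan with two binary searches per coordinate range via numpy.searchsorted, so each range costs O(log n) after an O(n log n) sort. The container names (points_by_letter, ranges_by_letter) are hypothetical.

import numpy as np

def count_in_ranges(points_by_letter, ranges_by_letter):
    # points_by_letter: letter -> sorted 1-D array of datapoint coordinates
    # ranges_by_letter: letter -> list of (start, end) tuples
    counts = {}
    for letter, ranges in ranges_by_letter.items():
        pts = points_by_letter.get(letter)
        if pts is None:
            continue
        for start, end in ranges:
            # Number of points in [start, end], both borders inclusive.
            lo = np.searchsorted(pts, start, side='left')
            hi = np.searchsorted(pts, end, side='right')
            counts['{}_{}_{}'.format(letter, start, end)] = hi - lo
    return counts

Sorting the 50 million datapoints is a one-off cost; after that the 50K ranges each need only two lookups instead of a scan over every point that shares their letter.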

How to optimize an O(N*M) to be O(n**2)?

I am trying to solve USACO's Milking Cows problem. The problem statement is here: https://train.usaco.org/usacoprob2?S=milk2&a=n3lMlotUxJ1
Given a series of intervals in the form of a 2d array, I have to find the longest interval of continuous milking and the longest interval in which no milking occurs.
Ex. Given the array [[500,1200],[200,900],[100,1200]], the longest interval would be 1100 as there is continuous milking and the longest interval without milking would be 0 as there are no rest periods.
I have tried looking at whether utilizing a dictionary would decrease run times but I haven't had much success.
f = open('milk2.in', 'r')
w = open('milk2.out', 'w')

# getting the input
farmers = int(f.readline().strip())
schedule = []
for i in range(farmers):
    schedule.append(f.readline().strip().split())
# schedule = data

minvalue = 0
maxvalue = 0
# getting the minimums and maximums of the data
for time in range(farmers):
    schedule[time][0] = int(schedule[time][0])
    schedule[time][1] = int(schedule[time][1])
    if minvalue == 0:
        minvalue = schedule[time][0]
    if maxvalue == 0:
        maxvalue = schedule[time][1]
    minvalue = min(schedule[time][0], minvalue)
    maxvalue = max(schedule[time][1], maxvalue)

filled_thistime = 0
filled_max = 0
empty_max = 0
empty_thistime = 0
# goes through all the possible points between the minimum and the maximum
for point in range(minvalue, maxvalue):
    isfilled = False
    # goes through all the data for each point value in order to find the best values
    for check in range(farmers):
        if point >= schedule[check][0] and point < schedule[check][1]:
            filled_thistime += 1
            empty_thistime = 0
            isfilled = True
            break
    if isfilled == False:
        filled_thistime = 0
        empty_thistime += 1
    if filled_max < filled_thistime:
        filled_max = filled_thistime
    if empty_max < empty_thistime:
        empty_max = empty_thistime

print(filled_max)
print(empty_max)
if filled_max < filled_thistime:
    filled_max = filled_thistime
w.write(str(filled_max) + " " + str(empty_max) + "\n")
f.close()
w.close()
The program works fine, but I need to decrease the time it takes to run.
A less pretty but more efficient approach would be to solve this like a free list, though it is a bit more tricky since the ranges can overlap. This method only requires looping through the input list a single time.
def insert(start, end):
    for existing in times:
        existing_start, existing_end = existing
        # New time is a subset of an existing time
        if start >= existing_start and end <= existing_end:
            return
        # New time ends during an existing time
        elif end >= existing_start and end <= existing_end:
            times.remove(existing)
            return insert(start, existing_end)
        # New time starts during an existing time
        elif start >= existing_start and start <= existing_end:
            # existing[1] = max(existing_end, end)
            times.remove(existing)
            return insert(existing_start, end)
        # New time is a superset of an existing time
        elif start <= existing_start and end >= existing_end:
            times.remove(existing)
            return insert(start, end)
    times.append([start, end])

data = [
    [500, 1200],
    [200, 900],
    [100, 1200],
]

times = [data[0]]
for start, end in data[1:]:
    insert(start, end)

longest_milk = 0
longest_gap = 0
for i, time in enumerate(times):
    duration = time[1] - time[0]
    if duration > longest_milk:
        longest_milk = duration
    # Gap to the next interval (assumes times ends up ordered by start time).
    if i != len(times) - 1 and times[i+1][0] - times[i][1] > longest_gap:
        longest_gap = times[i+1][0] - times[i][1]

print(longest_milk, longest_gap)
As stated in the comments, if the input is sorted the complexity could be O(n); if not, we need to sort it first and the complexity is O(n log n):
from itertools import groupby

lst = [[300, 1000],
       [700, 1200],
       [1500, 2100]]

longest_milking = 0
longest_idle = 0

l = sorted(lst, key=lambda k: k[0])
for v, g in groupby(zip(l[::1], l[1::1]), lambda k: k[1][0] <= k[0][1]):
    l = [*g][0]
    if v:
        mn, mx = min(i[0] for i in l), max(i[1] for i in l)
        if mx - mn > longest_milking:
            longest_milking = mx - mn
    else:
        mx = max(i2[0] - i1[1] for i1, i2 in zip(l[::1], l[1::1]))
        if mx > longest_idle:
            longest_idle = mx

# corner case, N=1 (only one interval)
if len(lst) == 1:
    longest_milking = lst[0][1] - lst[0][0]

print(longest_milking)
print(longest_idle)
Prints:
900
300
For input:
lst = [[500, 1200],
       [200, 900],
       [100, 1200]]
Prints:
1100
0
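For comparison, the textbook approach to this problem, sketched here for the same input format: sort by start, merge overlapping intervals in a single pass, then take the longest merged interval and the longest gap between consecutive merged intervals. Overall O(n log n) because of the sort.

def milking_stats(intervals):
    # Sort by start, then merge overlapping or touching intervals.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    longest_milk = max(e - s for s, e in merged)
    # Longest gap between consecutive merged intervals (0 if there is none).
    longest_gap = max(
        (b[0] - a[1] for a, b in zip(merged, merged[1:])), default=0
    )
    return longest_milk, longest_gap

print(milking_stats([[500, 1200], [200, 900], [100, 1200]]))   # (1100, 0)
print(milking_stats([[300, 1000], [700, 1200], [1500, 2100]]))  # (900, 300)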

How Python internally retrieves elements from a list and finds the minimum

For this question http://www.spoj.com/problems/ACPC10D/ on SPOJ, I wrote the Python solution below:
count = 1
while True:
    no_rows = int(raw_input())
    if no_rows == 0:
        break
    grid = [[None for x in range(3)] for y in range(2)]
    input_arr = map(int, raw_input().split())
    grid[0][0] = 10000000
    grid[0][1] = input_arr[1]
    grid[0][2] = input_arr[1] + input_arr[2]
    r = 1
    for i in range(0, no_rows - 1):
        input_arr = map(int, raw_input().split())
        _r = r ^ 1
        grid[r][0] = input_arr[0] + min(grid[_r][0], grid[_r][1])
        grid[r][1] = input_arr[1] + min(min(grid[_r][0], grid[r][0]), min(grid[_r][1], grid[_r][2]))
        grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
        r = _r
    print str(count) + ". " + str(grid[(no_rows - 1) & 1][1])
    count += 1
The above code exceeds the time limit. However, when I change the line
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
to
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[_r][2]), grid[r][1])
the solution is accepted. If you notice the difference, the first line compares, grid[_r][1], grid[r][1] for minimum (i.e. the row number are different) and second line compares grid[_r][1], grid[_r][2] for minimum(i.e. the row number are same)
This is a consistent behaviour. I want to understand, how python is processing those two lines - so that one results in exceeding time limit, while other is fine.
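No answer is attached here, but as a sketch of how one could investigate: time the two expressions in isolation and inspect their bytecode. The grid values below are made-up stand-ins for the real DP state.

import dis
import timeit

setup = "grid = [[3, 7, 11], [5, 2, 9]]; _r, r = 0, 1"
slow = "min(min(grid[_r][1], grid[r][1]), grid[_r][2])"
fast = "min(min(grid[_r][1], grid[_r][2]), grid[r][1])"

# Compare wall-clock time of the two orderings.
print(timeit.timeit(slow, setup=setup, number=1000000))
print(timeit.timeit(fast, setup=setup, number=1000000))

# Inspect the bytecode of each expression; apart from operand order,
# the two should compile to the same sequence of loads and calls.
dis.dis(compile(slow, "<slow>", "eval"))
dis.dis(compile(fast, "<fast>", "eval"))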
