Let's say I have a model like this:
+-----------+--------+--------------+
| Name | Amount | Availability |
+-----------+--------+--------------+
| Milk | 100 | True |
+-----------+--------+--------------+
| Chocolate | 200 | False |
+-----------+--------+--------------+
| Honey | 450 | True |
+-----------+--------+--------------+
Now in a second model I want to have a field (also named 'Amount') which is always equal to the sum of the amounts of the rows that have Availability = True. For example:
+-----------+-----------------------------------------------+
| Inventory | Amount |
+-----------+-----------------------------------------------+
| Groceries | 550 #this is the field I want to be dependent |
+-----------+-----------------------------------------------+
Is that possible? Or is there a better way of doing this?
Of course that is possible. I would recommend one of two things:
1. Do this "on the fly", as one person commented, then store the result in Django's cache mechanism so that it is only recalculated once in a while, saving database/computation resources (see the sketch below).
2. Create a database view that does the summation; again, this lets the database cache the results to save resources.
That said, I think #1 or #2 is only needed for a very large record set on a very busy site.
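A minimal sketch of option #1, assuming an Item model matching the first table (all names here are illustrative, not from the question):

from django.core.cache import cache
from django.db import models
from django.db.models import Sum

class Item(models.Model):  # illustrative model matching the first table
    name = models.CharField(max_length=100)
    amount = models.IntegerField()
    availability = models.BooleanField(default=True)

def grocery_amount():
    """Sum of 'amount' over available items, recomputed at most every 5 minutes."""
    total = cache.get('grocery_amount')
    if total is None:
        total = Item.objects.filter(availability=True).aggregate(
            total=Sum('amount'))['total'] or 0
        cache.set('grocery_amount', total, timeout=300)
    return total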
I am fairly new to Python. I am working with the nycflights13 data.
My goal is to determine the most common and least common origin for every destination:
+---------+--------------------+-----------+---------------------+------------+
| dest    | most_common_origin | max_count | least_common_origin | min_count  |
+---------+--------------------+-----------+---------------------+------------+
| MIA | LGA | 5781 | EWR | 2633 |
+---------+--------------------+-----------+---------------------+------------+
I managed to do it for one destination, but I'm having trouble with the loop and with binding the rows, because sometimes I get a Series instead of a DataFrame.
My working example:
from nycflights13 import flights
flights_to_MIA = flights[flights['dest']=='MIA']
flights_to_MIA.groupby(flights_to_MIA['origin']).size().reset_index(name='counts')
Note that I used size() and then reset_index() to make it a DataFrame.
You may check agg + value_counts:
s = flights.groupby('dest').agg(
    most_common_origin=('origin', lambda x: x.mode()[0]),
    max_count=('origin', lambda x: x.value_counts().sort_values().iloc[-1]),
    least_common_origin=('origin', lambda x: x.value_counts().sort_values().index[0]),
    min_count=('origin', lambda x: x.value_counts().sort_values().iloc[0]))
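If you then want dest back as a regular column, matching the desired output above, reset the index on the result (a small follow-up to the snippet above):

result = s.reset_index()
print(result[result['dest'] == 'MIA'])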
I made a grid search that contains 36 models.
For each model the confusion matrix is available with:
grid_search.get_grid(sort_by='a_metrics', decreasing=True)[index].confusion_matrix(valid=valid_set)
My problem is that I only want to access some parts of this confusion matrix, in order to build my own ranking, which is not natively available in h2o.
Let's say we have the confusion_matrix of the first model of the grid_search below:
+-------+--------+--------+--------+------------------+
|       |    0   |    1   |  Error | Rate             |
+-------+--------+--------+--------+------------------+
| 0     |  766.0 | 2718.0 | 0.7801 | (2718.0/3484.0)  |
| 1     |  351.0 | 6412.0 | 0.0519 | (351.0/6763.0)   |
| Total | 1117.0 | 9130.0 | 0.2995 | (3069.0/10247.0) |
+-------+--------+--------+--------+------------------+
Actually, the only thing that really interests me is the precision of class 0, i.e. 766/1117 = 0.685765443, while h2o computes precision metrics over all classes, to the detriment of what I am looking for.
I tried to convert it to a dataframe with:
model = grid_search.get_grid(sort_by='a_metrics', decreasing=True)[0]
model.confusion_matrix(valid=valid_set).as_data_frame()
Although some posts on the internet suggest this works, it does not (or no longer does):
AttributeError: 'ConfusionMatrix' object has no attribute 'as_data_frame'
I have searched for a way to list the attributes of the confusion_matrix, without success.
According to the H2O documentation, there is no as_data_frame method: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/confusion_matrix.html
I assume the easiest way is to call to_list().
The .table attribute gives an object with an as_data_frame method:
model.confusion_matrix(valid=valid_set).table.as_data_frame()
If you need to access the table header, you can do
model.confusion_matrix(valid=valid_set).table._table_header
Hint: you can use dir() to check the valid attributes of a Python object.
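Putting the two hints together, a sketch of computing the class-0 precision from the question (the row/column positions are assumed to match the matrix printed above; check them against your own output):

# Convert the table behind the confusion matrix to a pandas DataFrame.
cm = model.confusion_matrix(valid=valid_set).table.as_data_frame()

# In the layout above, column 1 holds the counts predicted as class 0;
# row 0 is actual class 0, and the last row is the totals.
true_pred_0 = float(cm.iloc[0, 1])    # 766.0
total_pred_0 = float(cm.iloc[-1, 1])  # 1117.0

precision_0 = true_pred_0 / total_pred_0  # ~0.6858
print(precision_0)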
I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row for each glacier id with latitude longitude, whereas my glims data frame has multiple rows for each glacier id with varying data for each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] ==
glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work:
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
This is a classic merging problem. One way to solve it is with straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coordinates.set_index('glacier_id').lat
glims.loc[:, 'long'] = coordinates.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge:
pd.merge(glims,
         coordinates.rename(columns={'glacier_id': 'GLAC_ID'}),
         on='GLAC_ID')
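Since you're worried about silent mistakes, note that merge can check its own assumptions; a sketch using the validate and indicator options of pandas merge:

merged = glims.merge(coordinates,
                     how='left',
                     left_on='GLAC_ID', right_on='glacier_id',
                     validate='m:1',   # raises if glacier_id is not unique in coordinates
                     indicator=True)   # adds a _merge column: 'both' or 'left_only'

# Rows that found no coordinates:
missing = merged[merged['_merge'] == 'left_only']
print(len(missing), 'rows have no matching glacier_id')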
I am playing a little with PrettyTable in Python, and I noticed completely different behavior in Python 2 and Python 3. Can somebody explain the difference in output to me? Nothing in the docs gave me a satisfying answer. Let's start with a little code, creating my_table:
from prettytable import PrettyTable
my_table = PrettyTable()
my_table.field_names = ['A','B']
This creates a two-column table with columns A and B. Let's add one row to it, but assume that a cell value can have multiple lines, separated by the Python newline '\n', for example some properties of the parameter from column A:
row = ['parameter1', 'component: my_component\nname:somename\nmode: magic\ndate: None']
my_table.add_row(row)
Generally the information in the row can be anything; it is just a string retrieved from another function. As you can see, it has '\n' inside. The thing that I don't completely understand is the output of the print function.
In Python 2 I have:
print(my_table.get_string().encode('utf-8'))
which gives me output like this:
+------------+-------------------------+
|     A      |            B            |
+------------+-------------------------+
| parameter1 | component: my_component |
| | name:somename |
| | mode: magic |
| | date: None |
+------------+-------------------------+
But in Python3 I have:
b'+------------+-------------------------+\n|     A      |            B            |\n+------------+-------------------------+\n| parameter1 | component: my_component |\n|            |      name:somename      |\n|            |       mode: magic       |\n|            |        date: None       |\n+------------+-------------------------+'
If I completely remove the encode part, the output looks OK on both versions of Python.
So when I have
print(my_table.get_string())
It works on both Python 2 and Python 3. Should I remove the encode part from the code? Is it safe to assume it is not necessary? Where exactly is the problem?
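For what it's worth, the difference is not specific to PrettyTable: in Python 3, str.encode() returns a bytes object, and print() shows its repr on a single line, while in Python 2 encode() returns a str that prints normally. A minimal sketch (not from the original thread):

s = 'a\nb'
print(s)                  # two lines on both Python 2 and Python 3
print(s.encode('utf-8'))  # Python 2: prints two lines (bytes is str)
                          # Python 3: prints b'a\nb' (the bytes repr, one line)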
I'm having trouble understanding under what circumstances .values() or .values_list() are better than just using Model instances.
I think the following are all equivalent:
results = SomeModel.objects.all()
for result in results:
    print(result.some_field)

results = SomeModel.objects.all().values()
for result in results:
    print(result['some_field'])

results = SomeModel.objects.all().values_list()
for some_field, another_field in results:
    print(some_field)
Obviously these are stupid examples; could anyone point out a good reason for using .values() / .values_list() over just using Model instances directly?
Edit:
I did some simple profiling, using a toy model containing two CharField(max_length=100) fields.
Iterating over just 500 instances to copy 'first' to another variable, and taking the average of 200 runs, I got the following results:
Test.objects.all() time: 0.010061947107315063
Test.objects.all().values('first') time: 0.00578328013420105
Test.objects.all().values_list('first') time: 0.005257354974746704
Test.objects.all().values_list('first', flat=True) time: 0.0052023959159851075
Test.objects.all().only('first') time: 0.011166254281997681
So the answer is definitively: performance! (Mostly; see knbk's answer below.)
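For reference, a sketch of the kind of harness that would produce numbers like these (the exact benchmark code was not in the post; Test and 'first' are the names used above):

import timeit

def iterate_values_list():
    # Swap in .all(), .values('first'), .only('first'), etc. for each row of the table.
    for first in Test.objects.values_list('first', flat=True):
        value = first  # copy the field to another variable

print(timeit.timeit(iterate_values_list, number=200) / 200)  # average seconds per run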
.values() and .values_list(), when followed by an annotation, translate to a GROUP BY query. This means that rows with duplicate values for the selected fields will be grouped into a single row. So say you have a model People with the following data:
+----+---------+-----+
| id | name | age |
+----+---------+-----+
| 1 | Alice | 23 |
| 2 | Bob | 42 |
| 3 | Bob | 23 |
| 4 | Charlie | 30 |
+----+---------+-----+
Then People.objects.values('name').annotate(age_sum=Sum('age')) will return 3 rows, because the two rows with name 'Bob' are grouped together, whereas People.objects.all() returns 4 rows. This is what makes values() and values_list() especially useful for annotations; the query above returns the following results:
+---------+---------+
| name | age_sum |
+---------+---------+
| Alice | 23 |
| Bob | 65 |
| Charlie | 30 |
+---------+---------+
As you can see, the ages of the two Bobs have been summed and are returned in a single row. This is different from distinct(), which only applies after the annotations.
Performance is just a side-effect, albeit a very useful one.
values() and values_list() are both intended as optimizations for a specific use case: retrieving a subset of data without the overhead of creating a model instance. A good explanation is given in the Django documentation.
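To make the contrast concrete, a sketch of what each call style returns, using the People model from the answer above (results shown as comments):

from django.db.models import Sum

people = People.objects.all()                # model instances: <People: Alice>, ...
rows = People.objects.values('name', 'age')  # dicts: {'name': 'Alice', 'age': 23}, ...
pairs = People.objects.values_list('name', 'age')      # tuples: ('Alice', 23), ...
names = People.objects.values_list('name', flat=True)  # flat values: 'Alice', 'Bob', ...

# values() before annotate() groups by the selected fields:
sums = People.objects.values('name').annotate(age_sum=Sum('age'))
# {'name': 'Bob', 'age_sum': 65}, ... -- one row per distinct name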
I use "values_list()" to create a Custom Dropdown Single Select Box for Django Admin as shown below:
# "admin.py"
from django.contrib import admin
from django import forms
from .models import Favourite, Food, Fruit, Vegetable
class FoodForm(forms.ModelForm):
# Here
FRUITS = Fruit.objects.all().values_list('id', 'name')
fruits = forms.ChoiceField(choices=FRUITS)
# Here
VEGETABLES = Vegetable.objects.all().values_list('id', 'name')
vegetables = forms.ChoiceField(choices=VEGETABLES)
class FoodInline(admin.TabularInline):
model = Food
form = FoodForm
#admin.register(Favourite)
class FavouriteAdmin(admin.ModelAdmin):
inlines = [FoodInline]