So basically I have a database table with photos. Each photo has a rating in the range <0, 1> and one or more categories. I need a way to efficiently choose x elements from this table at weighted random, but with respect to categories. I have to do this in Python 3 + Django (or a microservice communicating through Redis or exposing a REST API).
eg:
table:
.---------.--------.------------.
| photo | rating | categories |
:---------+--------+------------:
| Value 1 | 0.8 | art, cats |
:---------+--------+------------:
| value 2 | 0.5 | cats |
:---------+--------+------------:
| value 3 | 0.9 | night |
'---------'--------'------------'
And when I ask for 1 photo with categories (cats, dogs), the algorithm should return something like
numpy.random.choice([value_1, value_2], 1, replace=False, p=[0.8/1.3, 0.5/1.3])
Currently, every time I am asked for it I do something as follows:
photos = Photos.objects.filter(category__in=[list of wanted categories])
photos, weights = zip(*photos.values_list('photo', 'rating'))
weights = numpy.array(weights) / sum(weights)  # numpy.random.choice requires p to sum to 1
res = numpy.random.choice(photos, amount_wanted, replace=False, p=weights)
Is there a more efficient approach to this? I can use any AWS service to achieve it.
You may be able to use something like
photo = random.sample(list(Photos.objects.filter(category__in=[list of wanted categories], rating__gte=random.random())), 1)
This line basically selects all categories you want, filters out entries according to their probability, and returns a random one.
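A runnable version of that idea might look like the sketch below; wanted_categories is an assumed variable name, and note that a single random.random() threshold only approximates per-photo weighting rather than performing an exact weighted draw:
import random

def pick_photos(wanted_categories, amount=1):
    # Keep only photos in the wanted categories whose rating clears a random threshold.
    candidates = list(
        Photos.objects.filter(
            category__in=wanted_categories,
            rating__gte=random.random(),
        )
    )
    # Fall back to the unfiltered candidates if the threshold removed too many photos.
    if len(candidates) < amount:
        candidates = list(Photos.objects.filter(category__in=wanted_categories))
    return random.sample(candidates, min(amount, len(candidates)))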
I have a table of about 250,000 by 300 that contains data from an inertial measurement unit.
+----------+-----+-----------+----------+-----+-----------+----------+-----+-----------+
| acce_x_0 | ... | acce_x_99 | acce_y_0 | ... | acce_y_99 | acce_z_0 | ... | acce_z_99 |
+----------+-----+-----------+----------+-----+-----------+----------+-----+-----------+
| 1.3435 | ... | 1.7688 | -0.4566 | ... | -1.4554 | 9.6564 | ... | 9.5768 |
+----------+-----+-----------+----------+-----+-----------+----------+-----+-----------+
I would like to get a tensor like in the picture.
But when I try to change the shape of the array with np.reshape(data_imu.to_numpy(), newshape=(-1, 100, 3)), the values do not end up where I expect.
For example, data_imu[0][0].shape gives 3 instead of 100 as I expected.
From what I understood, you have several samples of time series for each acceleration component x, y, z.
A solution would be to separate the data for each component, then reassemble them in order to build a 3D array.
Here is a simple example:
import numpy as np
import pandas as pd

# creating data: 10 samples of time series with 100 values each
data = np.random.uniform(-1, 1, size=(100, 10))
df = pd.DataFrame(data=data, columns=['device_sample_' + str(i) for i in range(10)])
"""
device_0 device_1 device_2 device_3 device_4 device_5 device_6 device_7 device_8 device_9
0 0.846339 0.014831 0.380373 0.910142 0.283169 0.926771 0.651504 0.267011 -0.735348 -0.563671
1 -0.076040 -0.107705 0.783594 -0.731901 0.328230 0.104527 0.373363 0.135972 0.145868 -0.068370
2 -0.914331 -0.106772 -0.111691 -0.747672 -0.367210 0.293646 0.278765 -0.659683 0.464896 0.675855
3 0.008376 0.823489 0.017261 0.540690 -0.052503 0.396828 -0.219417 -0.872403 -0.631343 0.288238
4 -0.317125 0.662676 -0.912503 -0.047759 0.286468 -0.938535 -0.962357 0.922892 0.168540 0.847411
"""
# To keep it simple, let's say we have 2 devices: the even-numbered samples belong
# to the first device and the odd-numbered ones to the second.
# First we regroup the samples corresponding to each device.
df_device1 = df[['device_sample_0', 'device_sample_2', 'device_sample_4', 'device_sample_6', 'device_sample_8']]
# You could also build these column lists in a loop.
df_device2 = df[['device_sample_1', 'device_sample_3', 'device_sample_5', 'device_sample_7', 'device_sample_9']]
data_dev1 = df_device1.to_numpy()
data_dev2 = df_device2.to_numpy()
print(data_dev1.shape)
# (100, 5): device 1 has 5 samples of time series
print(data_dev2.shape)
# (100, 5): device 2 also has 5 samples
# Now you build your 3D array
final_data = np.dstack((data_dev1, data_dev2))
print(final_data.shape)
#(100, 5, 2)
# rows : time steps
# columns : samples
# depth : devices
# Different from the picture; to reorder the axes (e.g. to (5, 100, 2)) use
# final_data.transpose(1, 0, 2) rather than reshape, which would scramble the values
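Applied to the original 250,000 × 300 table, where the columns are ordered acce_x_0…acce_x_99, acce_y_0…acce_y_99, acce_z_0…acce_z_99, a reshape followed by a transpose should give the desired layout directly; this is a minimal sketch under that column-order assumption:
import numpy as np

arr = data_imu.to_numpy()        # (250000, 300)
arr = arr.reshape(-1, 3, 100)    # (250000, 3, 100): per row, the x, y and z blocks of 100 values
tensor = arr.transpose(0, 2, 1)  # (250000, 100, 3): 100 time steps, each with its x/y/z triple
print(tensor[0].shape)           # (100, 3)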
I'm new to pandas and I'm currently trying to use it on a data set I have on my tablet using qPython (temporary situation, my laptop's being fixed). I have a CSV file with a set of data organised by country, region, market and item label, with additional columns for price, year and month. These are set out in the following manner:
Country | Region | Market | Item Label | ... | Price | Year | Month |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 1 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 2 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 3 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 4 |
and so on. I'm looking for a way to plot these prices against time (I've taken to adding the month/12 to the year to effectively merge the last columns).
Originally I had a code to take the csv data and put it in a Dictionary, like so:
{Country_Name: {Region_Name: {Market_Name: {Item_Name: {"Price": price_list, "Time": time_list}}}}}
and used for loops over the keys to access each price and time list.
However, I'm having difficulty using pandas to get a similar result: I've tried a fair few different approaches, such as iloc and chained filters like data[data.Country == "Canada"][data.Region == "Quebec"]..., etc., to filter the data for each country, region, market and item, but all of them were particularly slow. The data set is fairly hefty (approx. 12000 by 12), so I wouldn't expect instant results, but is there something obvious I'm missing? Or should I just wait until I have my laptop back?
Edit: to try and provide more context, I'm trying to get the prices over the course of the years and months, to plot how the prices fluctuate. I want to separate them based on the country, region, market and item label, so each line plotted will be a different item in a market in a region in a country. So far, I have the following code:
def abs_join_paths(*args):
    return os.path.abspath(os.path.join(*args))

def get_csv_data_frame(*path, memory = True):
    return pandas.read_csv(abs_join_paths(*path[:-1], path[-1] + ".csv"), low_memory = memory)

def get_food_data(*path):
    food_price_data = get_csv_data_frame(*path, memory = False)
    return food_price_data[food_price_data.cm_name != "Fuel (diesel) - Retail"]

food_data = get_food_data(data_path, food_price_file_name)

def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    data["mp_year"] += data["mp_month"]/12
    for country in data["adm0_name"].unique():
        for region in data[data.adm0_name == country]["adm1_name"].unique():
            for market in data[data.adm0_name == country][data.adm1_name == region]["mkt_name"]:
                for item_label in data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market]["cm_name"]:
                    current_data = data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market][data.cm_name == item_label]
                    #year = list(current_data["mp_year"])
                    #month = list(current_data["mp_month"])
                    #time = [float(y) + float(m)/12 for y, m in zip(year, month)]
                    plt.plot(list(current_data["mp_year"]), list(current_data["mp_price"]), plot_style)
                    print(list(current_data["mp_price"]))
    plt.savefig(abs_join_paths(imagepath, title + ".png"))
Edit2/tl;dr: I have a bunch of prices and times, one after the other in one long list. How do I use pandas to split them up based on the contents of the other columns?
Cheers!
I hesitate to guess, but it seems that you are probably iterating through rows (you said you were using iloc). This is the slowest operation in pandas; data frames are optimized for column-wise (Series) access.
If you're plotting, you can use matplotlib directly with pandas data frames and use the groupby method to combine data, without having to iterate through the rows of your data frame.
Without more information it's difficult to answer your question specifically. Please take a look at the comments on your question.
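That said, a rough sketch of the groupby idea might look like the following (column names are taken from the question's edit, and the plotting details are only illustrative):
import matplotlib.pyplot as plt

# One plotted line per (country, region, market, item) group, with no row iteration.
for key, group in data.groupby(["adm0_name", "adm1_name", "mkt_name", "cm_name"]):
    plt.plot(group["mp_year"] + group["mp_month"] / 12, group["mp_price"], 'k-')
plt.savefig("prices.png")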
The groupby function did the trick:
def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    group_data = data.groupby(["adm0_name", "adm1_name", "mkt_name", "cm_name"])
    for i in range(len(data)):
        print(data.iloc[i, [1, 3, 5, 7]])
        specific_data = group_data.get_group(tuple(data.iloc[i, [1, 3, 5, 7]]))
        plt.plot(specific_data["mp_price"], specific_data["mp_year"] + specific_data["mp_month"]/12)
I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row for each glacier id with latitude longitude, whereas my glims data frame has multiple rows for each glacier id with varying data for each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] ==
glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
This is a classic merging problem. One way to solve it is using straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coord.set_index('glacier_id').lat
glims.loc[:, 'long'] = coord.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
coord.rename(columns={'glacier_id': 'GLAC_ID'}),
on='GLAC_ID')
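Since the question mentions worrying about silent mistakes, it may also help to let pandas check the merge; this sketch reuses the same glims and coordinates frames and relies on merge's validate and indicator options:
merged = glims.merge(
    coordinates,
    how='left',
    left_on='GLAC_ID',
    right_on='glacier_id',
    validate='m:1',      # many glims rows may match one coordinate row, but never several
    indicator=True,      # adds a _merge column showing which rows found a match
)
print(merged['_merge'].value_counts())  # rows marked 'left_only' got no coordinates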
I'm having trouble understanding under what circumstances are .values() or .values_list() better than just using Model instances?
I think the following are all equivalent:
results = SomeModel.objects.all()
for result in results:
print(result.some_field)
results = SomeModel.objects.all().values()
for result in results:
print(result['some_field'])
results = SomeModel.objects.all().values_list()
for some_field, another_field in results:
print(some_field)
Obviously these are stupid examples; could anyone point out a good reason for using .values() / .values_list() over using Model instances directly?
Edit:
I did some simple profiling, using a noddy model that contained two CharField(max_length=100) fields.
Iterating over just 500 instances to copy 'first' to another variable, and taking the average of 200 runs, I got the following results:
Test.objects.all() time: 0.010061947107315063
Test.objects.all().values('first') time: 0.00578328013420105
Test.objects.all().values_list('first') time: 0.005257354974746704
Test.objects.all().values_list('first', flat=True) time: 0.0052023959159851075
Test.objects.all().only('first') time: 0.011166254281997681
So the answer is definitively: performance! (mostly; see knbk's answer below)
.values() and .values_list(), when followed by .annotate(), translate to a GROUP BY query. This means that rows with duplicate values in the selected fields will be grouped into a single row. So say you have a model People with the following data:
+----+---------+-----+
| id | name | age |
+----+---------+-----+
| 1 | Alice | 23 |
| 2 | Bob | 42 |
| 3 | Bob | 23 |
| 4 | Charlie | 30 |
+----+---------+-----+
Then People.objects.values_list('name').annotate(age_sum=Sum('age')) will return 3 rows: the rows with name 'Bob' are grouped into a single row. People.objects.all() will still return 4 model instances.
This is especially useful when doing annotations. The query above returns the following results:
+---------+---------+
| name | age_sum |
+---------+---------+
| Alice | 23 |
| Bob | 65 |
| Charlie | 30 |
+---------+---------+
As you can see, the ages of both Bobs have been summed and are returned in a single row. This is different from distinct(), which only applies after the annotations.
Performance is just a side-effect, albeit a very useful one.
values() and values_list() are both intended as optimizations for a specific use case: retrieving a subset of data without the overhead of creating model instances. A good explanation is given in the Django documentation.
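As a quick illustration of what each call yields (reusing the question's SomeModel and some_field; purely a sketch):
# Each of these returns a QuerySet; they differ in what iterating over it yields.
SomeModel.objects.all()                                  # full model instances
SomeModel.objects.values('some_field')                   # dicts: {'some_field': ...}
SomeModel.objects.values_list('some_field')              # tuples: ('...',)
SomeModel.objects.values_list('some_field', flat=True)   # bare values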
I use "values_list()" to create a Custom Dropdown Single Select Box for Django Admin as shown below:
# "admin.py"
from django.contrib import admin
from django import forms
from .models import Favourite, Food, Fruit, Vegetable
class FoodForm(forms.ModelForm):
# Here
FRUITS = Fruit.objects.all().values_list('id', 'name')
fruits = forms.ChoiceField(choices=FRUITS)
# Here
VEGETABLES = Vegetable.objects.all().values_list('id', 'name')
vegetables = forms.ChoiceField(choices=VEGETABLES)
class FoodInline(admin.TabularInline):
model = Food
form = FoodForm
#admin.register(Favourite)
class FavouriteAdmin(admin.ModelAdmin):
inlines = [FoodInline]
Let's say I have a model like this:
+-----------+--------+--------------+
| Name | Amount | Availability |
+-----------+--------+--------------+
| Milk | 100 | True |
+-----------+--------+--------------+
| Chocolate | 200 | False |
+-----------+--------+--------------+
| Honey | 450 | True |
+-----------+--------+--------------+
Now in a second model I want to have a field (also named 'Amount') which is always equal to the sum of the amounts of the rows which have Availability = True. For example like this:
+-----------+-----------------------------------------------+
| Inventory | Amount |
+-----------+-----------------------------------------------+
| Groceries | 550 #this is the field I want to be dependent |
+-----------+-----------------------------------------------+
Is that possible? Or is there a better way of doing this?
Of course that is possible. I would recommend one of two things:
1. Do this "on the fly", as one person commented, then store the result in Django's cache mechanism so that it is only recalculated once in a while (saving database/computation resources); a sketch of this is shown after the list.
2. Create a database view that does the summation; again, it will let the database cache the results to save resources.
That said, I only think #1 or #2 is needed for a very large record set on a very busy site.
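A minimal sketch of option 1, assuming a model named Item holding the rows of the first table (with amount and availability fields) and Django's default cache; the module path, model and cache key names are illustrative, not taken from the question:
from django.core.cache import cache
from django.db.models import Sum

from myapp.models import Item  # hypothetical model for the Milk/Chocolate/Honey table

def groceries_amount():
    total = cache.get('groceries_amount')
    if total is None:
        # Sum the amounts of all available rows in a single query.
        total = Item.objects.filter(availability=True).aggregate(
            total=Sum('amount'))['total'] or 0
        cache.set('groceries_amount', total, timeout=300)  # recompute at most every 5 minutes
    return total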