Converting Django QuerySet to pandas DataFrame - python

I am going to convert a Django QuerySet to a pandas DataFrame as follows:
qs = SomeModel.objects.select_related().filter(date__year=2012)
q = qs.values('date', 'OtherField')
df = pd.DataFrame.from_records(q)
It works, but is there a more efficient way?

import pandas as pd
import datetime
from myapp.models import BlogPost
df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(list(BlogPost.objects.filter(date__gte=datetime.datetime(2012, 5, 1)).values()))
# limit which fields
df = pd.DataFrame(list(BlogPost.objects.all().values('author', 'date', 'slug')))
The above is how I do the same thing. The most useful addition is specifying which fields you are interested in; if you only need a subset of the available fields, I imagine this would give a performance boost.

Converting the queryset with values_list() is more memory efficient than using values() directly. The values() method returns a queryset of dicts (key:value pairs), whereas values_list() returns only tuples (pure data). This saves about 50% of the memory; you just need to pass the column names when you call pd.DataFrame().
Method 1:
queryset = models.xxx.objects.values("A","B","C","D")
df = pd.DataFrame(list(queryset)) ## consumes much memory
#df = pd.DataFrame.from_records(queryset) ## works, but not much change in memory usage
Method 2:
queryset = models.xxx.objects.values_list("A","B","C","D")
df = pd.DataFrame(list(queryset), columns=["A","B","C","D"]) ## this will save 50% memory
#df = pd.DataFrame.from_records(queryset, columns=["A","B","C","D"]) ## does not work here: it crashes because the input is a queryset, not a list
I tested this on a project with >1 million rows of data; peak memory usage dropped from 2 GB to 1 GB.
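If you want to verify the savings on your own data, here is a minimal sketch using the standard-library tracemalloc module (models.xxx is just the placeholder model from above; tracemalloc only tracks memory allocated by Python itself):
import tracemalloc
import pandas as pd

tracemalloc.start()
queryset = models.xxx.objects.values_list("A", "B", "C", "D")
df = pd.DataFrame(list(queryset), columns=["A", "B", "C", "D"])
current, peak = tracemalloc.get_traced_memory()  # bytes allocated since start()
print(f"peak memory: {peak / 1024 / 1024:.1f} MiB")
tracemalloc.stop()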

Django Pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/
From the README:
class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()
from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)
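If you only care about a few columns, read_frame also accepts fieldnames and index_col arguments (per the django-pandas README; exact behaviour may vary by version):
from django_pandas.io import read_frame

qs = MyModel.objects.all()
# pull only two columns and index the frame by full_name
df = read_frame(qs, fieldnames=['age', 'wage'], index_col='full_name')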

From the Django perspective (I'm not familiar with pandas) this is fine. My only concern is that if you have a very large number of records, you may run into memory problems. If this were the case, something along the lines of this memory efficient queryset iterator would be necessary. (The snippet as written might require some rewriting to allow for your smart use of .values()).
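A rough sketch of that chunked idea using the built-in QuerySet.iterator() rather than the linked snippet (assumes Django 2.0+ for the chunk_size argument; the chunk size is arbitrary, and the final DataFrame still has to fit in memory):
import pandas as pd

def queryset_to_df(qs, fields, chunk_size=2000):
    # iterator() streams rows without filling Django's queryset result cache
    rows = qs.values_list(*fields).iterator(chunk_size=chunk_size)
    return pd.DataFrame.from_records(rows, columns=fields)

df = queryset_to_df(SomeModel.objects.filter(date__year=2012), ['date', 'OtherField'])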

You can also use model_to_dict:
import datetime
from django.forms import model_to_dict
pallobjs = [ model_to_dict(pallobj) for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')]
df = pd.DataFrame(pallobjs)
df.head()
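model_to_dict also takes a fields argument if you only want a subset of columns (sketch; the field names here are made up for illustration):
pallobjs = [model_to_dict(pallobj, fields=['estado', 'fecha'])  # 'fecha' is a hypothetical field
            for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')]
df = pd.DataFrame(pallobjs)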

Related

Cannot filter my Django model depending on created_date

I am trying to make a chart. My database has a created_date column. I get product data about 150 times a day, and I want to see the daily increase and decrease of my data. I have no problem with my front end and Django template (I tried manual data and it works well); I just want to chart the last 7 days.
When I use the Products.objects.filter(created_dates=days) filter, I get an empty QuerySet.
I already tried created_dates__gte=startdate, created_dates__lte=enddate and it returns an empty QuerySet too.
I also tried created_dates__range, and that gives no results either.
I do get data from created_dates__gte=days, but that is not what I want.
view.py
from datetime import date,timedelta
import datetime
def data_chart(request):
    data = []
    week_days = [datetime.datetime.now().date() - timedelta(days=i) for i in range(1, 7)]
    for days in week_days:
        product_num = Products.objects.filter(created_dates=days)
        date = days.strftime("%d.%m")
        item = {"day": date, "value": len(product_num)}
        data.append(item)
    return render(request, 'chartpage.html', {'data': data})
In my database I have thousands of rows, about 150 per day. My created_dates column is formatted like this:
created_dates col:
2020-10-19 09:39:19.894184
So what is wrong with my code? Could you please help?
You are trying to compare a DateTimeField (created_dates) with a date (week_days is a list of dates), so you should try the __date lookup.
product_num = Products.objects.filter(created_dates__date=days)
https://docs.djangoproject.com/en/3.0/ref/models/querysets/#date
Furthermore, you might consider using the Count() database function with a group by instead of iterating over the days.
Here is a great explanation:
https://stackoverflow.com/a/19102493/5160341
You should be able to do this with a single aggregation query:
import datetime
from django.db.models import Count
def data_chart(request):
    cutoff = datetime.date.today() - datetime.timedelta(days=7)
    raw_data = (
        Products.objects.filter(created_dates__gte=cutoff)
        .values_list("created_dates__date")
        .annotate(count=Count("id"))
        .values_list("created_dates__date", "count")
    )
    data = [{"day": str(date), "value": value} for (date, value) in raw_data]
    return render(request, "chartpage.html", {"data": data})
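For reference, the context passed to the template would end up shaped roughly like this (values are made up):
data = [
    {"day": "2020-10-13", "value": 143},
    {"day": "2020-10-14", "value": 151},
]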

Using Dask to parallelize HDF read-translate-write

TL;DR: We're having issues parallelizing Pandas code with Dask that reads from and writes to the same HDF file
I'm working on a project that generally requires three steps: reading, translating (or combining data), and writing these data. For context, we're working with medical records, where we receive claims in different formats, translate them into a standardized format, then re-write them to disk. Ideally, I'm hoping to save intermediate datasets in some form that I can access via Python/Pandas later.
Currently, I've chosen HDF as my data storage format, however I'm having trouble with runtime issues. On a large population, my code currently can take upwards of a few days. This has led me to investigate Dask, but I'm not positive I've applied Dask best to my situation.
What follows is a working example of my workflow, hopefully with enough sample data to get a sense of runtime issues.
Read (in this case Create) data
import pandas as pd
import numpy as np
import dask
from dask import delayed
from dask import dataframe as dd
import random
from datetime import timedelta
from pandas.io.pytables import HDFStore
member_id = range(1, 10000)
window_start_date = pd.to_datetime('2015-01-01')
start_date_col = [window_start_date + timedelta(days=random.randint(0, 730)) for i in member_id]
# Eligibility records
eligibility = pd.DataFrame({'member_id': member_id,
                            'start_date': start_date_col})
eligibility['end_date'] = eligibility['start_date'] + timedelta(days=365)
eligibility['insurance_type'] = np.random.choice(['HMO', 'PPO'], len(member_id), p=[0.4, 0.6])
eligibility['gender'] = np.random.choice(['F', 'M'], len(member_id), p=[0.6, 0.4])
(eligibility.set_index('member_id')
    .to_hdf('test_data.h5',
            key='eligibility',
            format='table'))
# Inpatient records
inpatient_record_number = range(1, 20000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in inpatient_record_number]
inpatient = pd.DataFrame({'inpatient_record_number': inpatient_record_number,
                          'service_date': service_date})
inpatient['member_id'] = np.random.choice(list(range(1, 10000)), len(inpatient_record_number))
inpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(inpatient_record_number))
(inpatient.set_index('member_id')
    .to_hdf('test_data.h5',
            key='inpatient',
            format='table'))
# Outpatient records
outpatient_record_number = range(1, 30000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in outpatient_record_number]
outpatient = pd.DataFrame({'outpatient_record_number': outpatient_record_number,
                           'service_date': service_date})
outpatient['member_id'] = np.random.choice(range(1, 10000), len(outpatient_record_number))
outpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(outpatient_record_number))
(outpatient.set_index('member_id')
    .to_hdf('test_data.h5',
            key='outpatient',
            format='table'))
Translate/Write data
Sequential approach
def pull_member_data(member_i):
    inpatient_slice = pd.read_hdf('test_data.h5', 'inpatient', where='index == "{}"'.format(member_i))
    outpatient_slice = pd.read_hdf('test_data.h5', 'outpatient', where='index == "{}"'.format(member_i))
    return inpatient_slice, outpatient_slice

def create_visits(inpatient_slice, outpatient_slice):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER into medical 'visits'
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    visits_stacked = pd.concat([inpatient_slice, outpatient_slice]).reset_index().sort_values('service_date')
    visits_stacked.insert(0, 'visit_id', range(1, len(visits_stacked) + 1))
    return visits_stacked

def save_visits_to_hdf(visits_slice):
    with HDFStore('test_data.h5', mode='a') as store:
        store.append('visits', visits_slice)

# Read in the data by member_id, perform some operation
def translate_by_member(member_i):
    inpatient_slice, outpatient_slice = pull_member_data(member_i)
    visits_slice = create_visits(inpatient_slice, outpatient_slice)
    save_visits_to_hdf(visits_slice)

def run_translate_sequential():
    # Simple approach: Loop through each member sequentially
    for member_i in member_id:
        translate_by_member(member_i)

run_translate_sequential()
The above code takes ~9 minutes to run on my machine.
Dask approach
def create_visits_dask_version(visits_stacked):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    len_of_visits = visits_stacked.shape[0]
    visits_stacked_1 = (visits_stacked
                        .sort_values('service_date')
                        .assign(visit_id=range(1, len_of_visits + 1))
                        .set_index('visit_id')
                        )
    return visits_stacked_1

def run_translate_dask():
    # Approach 2: Dask, with individual writes to HDF
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    visits.to_hdf('test_data_dask.h5', 'visits')

run_translate_dask()
This Dask approach takes 13 seconds(!)
While this is a great improvement, we're generally curious about a few things:
Given this simple example, is the approach of using Dask dataframes, concatenating them, and using groupby/apply the best approach?
In reality, we have multiple processes like this that read from the same HDF, and write to the same HDF. Our original codebase was structured in a manner that allowed for running the entire workflow one member_id at a time. When we tried to parallelize them, it sometimes worked on small samples, but most of the time produced a segmentation fault. Are there known issues with parallelizing workflows like this, reading/writing with HDFs? We're working on producing an example of this as well, but figured we'd post this here in case this triggers suggestions (or if this code helps someone facing a similar problem).
Any and all feedback appreciated!
In general, groupby-apply will be fairly slow; it is challenging to re-sort data like this, especially in limited memory.
In general I recommend using the Parquet format (dask.dataframe has to_parquet and read_parquet functions). You are much less likely to get segfaults than with HDF files.
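For example, a minimal sketch of swapping the HDF output in run_translate_dask above for Parquet (this only changes the output format; the inputs are still read from HDF, and pyarrow or fastparquet must be installed; the output path is illustrative):
import dask.dataframe as dd

def run_translate_dask_parquet():
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    # Parquet writes one file per partition into a directory, which tends to be
    # friendlier to parallel workflows than a single shared HDF file
    visits.to_parquet('test_data_dask_visits.parquet')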

Pandas: fastest way to filter the DF by date

I have an efficiency question for you. I wrote some code to analyze a report that holds over 70k records and over 400+ unique organizations to allow my supervisor to enter in year/month/date they are interested in and have it pop out the information.
The beginning of my code is:
import pandas as pd
import numpy as np
import datetime
main_data = pd.read_excel("UpdatedData.xlsx", encoding= 'utf8')
#column names from DF
epi_expose = "EpitheliumExposureSeverity"
sloughing = "EpitheliumSloughingPercentageSurface"
organization = "OrgName"
region = "Region"
date = "DeathOn"
#list storage of definitions
sl_list = ["",'None','Mild','Mild to Moderate']
epi_list= ['Moderate','Moderate to Severe','Severe']
#Create DF with four columns
df = main_data[[region, organization, epi_expose, sloughing, date]]
#filter it down to months
starting_date = datetime.date(2017,2,1)
ending_date = datetime.date(2017,2,28)
df = df[(df[date] > starting_date) & (df[date] < ending_date)]
I am then performing conditional filtering below to get counts by region and organization. It works, but it is slow. Is there a more efficient way to query my DF and build a DF that only contains rows between the dates I specify? Or is this as efficient as it gets without altering how the database I am using is set up?
I can provide more of my code, but if I filter by month before exporting to Excel the code runs in a matter of seconds, so my only real concern is getting the date filtering right.
Thank you!
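One suggestion (not from the original thread): make sure the DeathOn column is an actual datetime dtype, then a vectorized mask such as Series.between should be both correct and fast. A sketch, reusing the variable names above:
import pandas as pd

df[date] = pd.to_datetime(df[date])
starting_date = pd.Timestamp(2017, 2, 1)
ending_date = pd.Timestamp(2017, 2, 28)
# between() is inclusive on both ends; adjust if you need strict inequalities
df = df[df[date].between(starting_date, ending_date)]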

How to either skip lines or check type of data within single construction line when processing csv input into Python dictionary

My input is a .csv file that happens to have headers.
I want to use a concise line, like this:
mydict = {custID:[parser.parse(str(date)), amount]
for transID, custID, amount, date in reader}
to create a dictionary from the input. However, the data isn't perfectly "clean". I want to check that each row of data is the sort of data that I want the dictionary to map.
Something like:
mydict = {if custID is type int custID:[parser.parse(str(date)), amount]
for transID, custID, amount, date in reader}
would be a nice fix, but, alas, it does not work.
Any suggestions that keep the short dictionary constructor while facilitating input processing?
I think you are on the right track and filtering with dictionary comprehension should work here:
mydict = {custID: [parser.parse(str(date)), amount]
for transID, custID, amount, date in reader
if isinstance(custID, int)}
In this case, though, you would silently ignore rows where custID is not an integer type.
Plus, things would go wrong if custID is not unique. If custIDs could repeat, you might want to switch to a defaultdict(list) collection, collecting date+amount pairs grouped by custID.
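A rough sketch of that defaultdict(list) idea (note that csv.reader yields strings, so the integer check here uses str.isdigit() rather than isinstance):
from collections import defaultdict
from dateutil import parser

grouped = defaultdict(list)
for transID, custID, amount, date in reader:
    if custID.isdigit():  # csv fields come in as strings
        grouped[custID].append([parser.parse(date), amount])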
For a similar task, I've personally used the CsvSchema third-party package - you can define what types you expect in the csv columns, plus extra validation rules:
CsvSchema is an easy to use module designed to make CSV file checking
easier. It allows you to create more complex validation rules faster
thanks to some predefined building blocks.
In your case, here is an example CSV structure class you may start with:
from datetime import datetime
from csv_schema.structure.base import BaseCsvStructure
from csv_schema.columns.base import BaseColumn
from csv_schema.exceptions import ImproperValueException
from csv_schema.columns import IntColumn, DecimalColumn, StringColumn

class DateColumn(BaseColumn):
    def convert(self, raw_val):
        try:
            return datetime.strptime(raw_val, '%Y-%m-%d') if raw_val else None
        except ValueError:
            raise ImproperValueException('Invalid date format')

class MyCsvStructure(BaseCsvStructure):
    transID = IntColumn(max_length=10)
    custID = IntColumn(max_length=10)
    amount = DecimalColumn(blank=True, fraction_digits=2)
    date = DateColumn(max_length=10, blank=True)

Looking for a python datastructure for cleaning/annotating large datasets

I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of datastructure that would contain column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
As commented, there is a chunksize argument for read_sql, which means you can work on the sql results piecemeal. I would probably use an HDFStore to save the intermediary results... or you could just append them back to another sql table.
dfs = pd.read_sql(..., chunksize=100000)
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
(see hdf5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
see the sql section of the docs.
