Making a string by accessing dictionary values - python

years, months, days, hours, minutes: these values are accessed from a dict. Now I want to create a string like years=12 months=1 if there are only years and months. Consider a case with only minutes and seconds: then the string should be minutes=1 seconds=1. How can I do this in an effective way?
I tried to do something like this, but it's not working:
years, months, days, hours, minutes = self.initial_data["years"], \
    self.initial_data["months"], \
    self.initial_data["days"], \
    self.initial_data["hours"], \
    self.initial_data["minutes"]
duration = if years: "years={}".format(years) + \
           if months: "months={}".format(months) + \
           ...and so on
So the string changes depending on whether each value is present or not.

Apart from the bad syntax (you cannot use if in such a way), you're making this more complicated than it needs to be. Just create an ordered list of field names and loop through it to extract the existing fields from self.initial_data:
fields = ["years", "months", "days", "hours", "minutes", "seconds"] # ordered list
data = self.initial_data # a shorthand for readability sake
duration = " ".join("{}={}".format(f, data[f]) for f in fields if f in data)
You can add an additional data[f] check to the generator if you also want to ignore fields whose values evaluate to False (e.g. 0, None, etc.).
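For example, adding that truthiness check skips zero-valued fields (a small sketch with made-up sample data):

```python
fields = ["years", "months", "days", "hours", "minutes", "seconds"]  # ordered list
data = {"years": 12, "months": 1, "days": 0}  # hypothetical sample input

# 'f in data and data[f]' also drops fields whose value is falsy (0, None, "")
duration = " ".join("{}={}".format(f, data[f]) for f in fields if f in data and data[f])
print(duration)  # years=12 months=1
```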

Python For Loop Create DataFrame

Hihi, I'm reasonably new to Python, more of an R guy, but I'm required to use Python for a task.
I've encountered a situation where I need to split data into different DataFrames by date (in future it won't be exactly 3, which is why I want a loop to do this). I want to create 3 DataFrames like:
df_train_30, which contains start days 0-30
df_train_60, which contains start days 30-60
df_train_90, which contains start days 60-90
but I'm not sure how to achieve this... please help. My idea in code below:
today = pd.to_datetime('now')
df_train['START_DATE'] = pd.to_datetime(df_train['START_DATE'])
previous_day_del = 0
for day_del in (30, 60, 90):
    # invalid: a variable name can't be built dynamically like this
    'df_train_' + str(day_del) = df_train[(df_train['START_DATE'] >= today - timedelta(days=day_del)) & (df_train['START_DATE'] < today - timedelta(days=previous_day_del))]
    previous_day_del = day_del
You could store these in a dictionary; it's easier to manage than dynamically generated variables. A dictionary is an object of key-value pairs, and the values can be of almost any type, including DataFrames. Here's a quick guide on Python dictionaries that you could look at.
In your example, you could probably go ahead with this:
from datetime import timedelta  # needed for the date arithmetic below

today = pd.to_datetime('now')
df_train['START_DATE'] = pd.to_datetime(df_train['START_DATE'])
previous_day_del = 0
# Creating an empty dictionary here called dict_train
dict_train = {}
for day_del in (30, 60, 90):
    dict_train[day_del] = df_train[(df_train['START_DATE'] >= today - timedelta(days=day_del)) &
                                   (df_train['START_DATE'] < today - timedelta(days=previous_day_del))]
    previous_day_del = day_del
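For instance, with a toy DataFrame (the column name taken from the question, the dates made up), the loop fills the dictionary and each window can then be pulled out by its key:

```python
import pandas as pd
from datetime import timedelta

today = pd.to_datetime('now')
# hypothetical sample data: one row per window, plus one row outside all windows
df_train = pd.DataFrame({
    'START_DATE': [today - timedelta(days=d) for d in (5, 35, 65, 100)]
})

dict_train = {}
previous_day_del = 0
for day_del in (30, 60, 90):
    mask = ((df_train['START_DATE'] >= today - timedelta(days=day_del)) &
            (df_train['START_DATE'] < today - timedelta(days=previous_day_del)))
    dict_train[day_del] = df_train[mask]
    previous_day_del = day_del

print({k: len(v) for k, v in dict_train.items()})  # {30: 1, 60: 1, 90: 1}
```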
Hope this helps, cheers! :)

Django prefetch_related and performance optimisation with multiple QuerySets within loops

Effectively I have multiple Queries within loops that I am just not happy with. I was wondering if someone had the expertise with prefetch_related and other Django Query construction optimisation to be able to help me on this issue.
I start with:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
Then, I start my loop over all days starting from a certain date "start_date":
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
I then, within this loop, loop over a filtered subset of the above users:
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
    for user in users.filter(created_date__lte=date).iterator():
Ok, so firstly, is there any way to optimise this?
What may make some hardcore Django-ers lose their cool: I do all of the above inside yet another loop!
for survey in Survey.objects.all().iterator():
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        for user in users.filter(created_date__lte=date).iterator():
Inside the last loop, I perform one final Query filter:
survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
I do this because I feel I need to have the user, survey and date variables ready to filter...
I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to wrap my head around how I could apply this to my situation.
Effectively, the query is taking far too long. For an average of 1000 users, 4 surveys and approximately 30 days, this query is taking 1 minute to complete.
Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.
Any expert tips on how to organise such horrific queries within loops within loops would be greatly appreciated!
Full "condensed" minimum viable snippet:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
datasets = []
for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0
        for user in users.filter(created_date__lte=date):
            participants += 1
            survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
            total_score += survey_result.score
        # An average is calculated from total_score and participants and appended to data.
        # divide() catches divide-by-zero errors; round() keeps two decimal places for front-end readability.
        data.append(round(divide(total_score, participants), 2))
    datasets.append(data)
ADDENDUM:
So, further to #dirkgroten's answer I am currently running with:
for survey in Survey.objects.all():
    results = SurveyResult.objects.filter(
        user__in=users, survey=survey, created_date__range=date_range
    ).values(
        'survey',
        'created_date',
    ).annotate(
        total_score=Sum('normalized_score'),
        participants=Count('user'),
        average_score=Avg('normalized_score'),
    ).order_by(
        'created_date'
    )
    for result in results:
        print(result)
As I (think I) need a breakdown by survey for each QuerySet.
Are there any other optimisations available to me?
You can actually combine queries and perform the calculations directly inside your query:
from django.db.models import Sum, Count, Avg
from django.utils import timezone
users = User.objects.filter(organisation=organisation).filter(is_active=True)
date_range = [start_date, timezone.now().date()]  # or adapt the end time to a different time zone
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .values('survey', 'created_date')\
    .annotate(total_score=Sum('score'), participants=Count('pk'))\
    .order_by('survey', 'created_date')
This will group the results by survey and created_date and add the total_score and participants to each result, something like:
[{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20},
{'survey': 1, ... } ... ]
I'm assuming there's only one SurveyResult per user so the number of SurveyResult in each group is the number of participants.
Note that Avg also gives you the average score at once, that is assuming only one possible score per user:
.annotate(average_score=Avg('score')) # instead of total and participants
This should shave off 99.9% of your query time :-)
If you want the dataset as a list of lists, you just do something like this:
dataset = []
data = []
current_survey = None
current_date = start_date
for result in results:
    if result['survey'] != current_survey:
        # results are ordered by survey, so when it changes, store and reset data
        if data:
            dataset.append(data)
        data = []
        current_survey = result['survey']
        current_date = start_date  # restart the date tracking for the new survey
    # results are ordered by date, so a date missing here won't turn up later;
    # assume a daterange function yielding the dates from current_date up to
    # (but not including) result['created_date']
    for date in daterange(current_date, result['created_date']):
        data.append(0)  # padding for days without results
    data.append(result['average_score'])
    current_date = result['created_date'] + datetime.timedelta(days=1)
if data:
    dataset.append(data)  # don't forget the last survey's data
The result will be a list of lists:
dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]
Not super-efficient Python, but with a few thousand results this will be fast anyway, way faster than performing more DB queries.
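The daterange function assumed in the padding step above isn't defined in the answer; a minimal end-exclusive version could look like this:

```python
import datetime

def daterange(start, end):
    """Yield each date from start up to, but not including, end."""
    for n in range((end - start).days):
        yield start + datetime.timedelta(days=n)
```

Being end-exclusive means daterange(d, d) yields nothing, so consecutive dates add no spurious padding.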
EDIT: Since created_date is DateTimeField, you first need to get the corresponding date:
from django.db.models.functions import TruncDate
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .annotate(date=TruncDate('created_date'))\
    .values('survey', 'date')\
    .annotate(average_score=Avg('score'))\
    .order_by('survey', 'date')

Neo4j cypher to filter date and hour ranges doesn't give required results

I am a bit new to Neo4j and struggling to construct a proper Cypher query for the following use case: my network has source ---> destination relations, where ---> is a GoesTo relationship with the properties date and hour (0-23).
Now I wish to construct a query that fetches all the nodes between 2 dates and 2 hours. I am using the query shared below, but it doesn't give correct results.
dates = [20150501, 20150502]
# date1 = 20150501
# date2 = 20150502
hour1 = 13  # starting hour for date1
hour2 = 12  # ending hour for date2
hours = [hour1, hour2]
q9a = """
MATCH (c:Gate)-[r:GoesTo]->(d:Gate)
WHERE (r.date >= {date1} AND r.hour >= {houra}) AND (r.date <= {date2} AND r.hour <= {hourb})
RETURN c.name AS FromGate, d.name AS ToGate, sum(r.weight) AS WDegree
ORDER BY WDegree DESC
"""
result = graph.cypher.execute(q9a, date1=dates[0], date2=dates[1], houra=hours[0], hourb=hours[1])
This query returns an empty result. As I understand it, that's because I am using a common 'r' for both conditions: the WHERE clause requires r.hour >= 13 AND r.hour <= 12 on every single relationship, which can never be true. Any tips please.
I ended up resolving it using Unix time. I converted the timestamps to Unix time, added unixtime as a property to each edge, and then the query was good to go.
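A sketch of that approach (the to_unix helper and the unixtime property name are my assumptions, not from the original post): convert each date/hour pair to epoch seconds, store it on the relationship, and then a single range check replaces the two (date, hour) comparisons that could never both hold:

```python
import datetime

def to_unix(date_int, hour):
    # date_int like 20150501, hour 0-23; assumes the timestamps are UTC
    d = datetime.datetime.strptime(str(date_int), "%Y%m%d")
    d = d.replace(hour=hour, tzinfo=datetime.timezone.utc)
    return int(d.timestamp())

start = to_unix(20150501, 13)
end = to_unix(20150502, 12)

q9a = """
MATCH (c:Gate)-[r:GoesTo]->(d:Gate)
WHERE r.unixtime >= {start} AND r.unixtime <= {end}
RETURN c.name AS FromGate, d.name AS ToGate, sum(r.weight) AS WDegree
ORDER BY WDegree DESC
"""
# result = graph.cypher.execute(q9a, start=start, end=end)
```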

Python image file manipulation

Python beginner here. I am trying to make use of some data stored in a dictionary.
I have some .npy files in a folder. My intention is to build a dictionary that encapsulates the following: the map read with np.load, the year, month, and day of the current map (as integers), the fractional time in years (given that a month has 30 days; it does not affect my calculations afterwards), the number of pixels, and the number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0': array(from np.load), year, month, day, fractional_time, pixels,
 'map1': ...}
What I managed until now is the following:
import glob
import numpy as np

file_list = glob.glob('*.npy')

def only_numbers(seq):  # for getting rid of '.npy' or any other non-digit characters
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

maps = {}
numbers = {}
for i in range(0, len(file_list)):
    maps[i] = np.load(file_list[i])
    numbers[i] = only_numbers(file_list[i])  # keep as a string so it can be sliced later
I have no idea how to get a dictionary to hold more values from inside the for loop. I can only manage to generate a new dictionary, or a list (e.g. numbers), for every task. As for the numbers dictionary, I have no idea how to manipulate the date in YYYYMMDD format to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: I managed to split the dates and make them integers using:
date = list(numbers.values())
year = int(date[i][0:4])
month = int(date[i][4:6])
day = int(date[i][6:8])
print(year, month, day)
If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed to manipulate further easier. I did the following:
file_list = glob.glob('data/...')  # files named YYYYMMDD.npy
file_list.sort()

def only_numbers(seq):  # remove all non-digit characters and symbols from the file name
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

numbers = {}
time = []
np_above_value = []
for i in range(0, len(file_list)):
    maps = np.load(file_list[i])
    maps[np.isnan(maps)] = 0  # had some NaNs that were causing errors
    numbers[i] = only_numbers(file_list[i])  # dictionary of file names reduced to just their dates
    date = list(numbers.values())  # the file names (only the digits) as a list
    year = int(date[i][0:4])   # first 4 characters (YYYY) as an integer
    month = int(date[i][4:6])  # next 2 characters (MM)
    day = int(date[i][6:8])    # next 2 characters (DD)
    time.append(year + ((month - 1) * 30 + day) / 360)  # fractional time
    print('Total pixel count for map ' + str(i) + ':', maps.size)  # total pixels for the current map
    c = (maps > value).astype(int)  # 'value' is the chosen threshold
    np_above_value.append(np.count_nonzero(c))  # pixels above the threshold
    print('Pixels with concentration >value% for map ' + str(i) + ':', np.count_nonzero(c))

plt.plot(time, np_above_value)  # pixels above the threshold as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)
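For what it's worth, the all-in-one dictionary from the original question is also achievable with a small helper. build_map_index and its field names are my own sketch, assuming files named YYYYMMDD.npy and a hypothetical threshold:

```python
import numpy as np

def build_map_index(file_list, threshold=50):
    """Build {filename: info} from .npy files whose names end in YYYYMMDD."""
    index = {}
    for path in sorted(file_list):
        digits = ''.join(ch for ch in path if ch.isdigit())[-8:]  # last 8 digits = the date
        data = np.load(path)
        data[np.isnan(data)] = 0  # zero out NaNs before counting
        year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:8])
        index[path] = {
            'map': data,
            'year': year, 'month': month, 'day': day,
            # fractional year, using the question's 30-day months
            'fractional_time': year + ((month - 1) * 30 + day) / 360,
            'pixels': data.size,
            'pixels_above': int(np.count_nonzero(data > threshold)),
        }
    return index

# index = build_map_index(glob.glob('*.npy'))
```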

Django: Which days have objects?

I'm currently creating a simple calendar for one of my Django projects. The calendar will display the current month and its days. Any day which has an item for that day will be highlighted in red, so that the user knows there are items for that day. The number of items, or which items they are, doesn't matter. All we care about is whether a day has items.
Lets say I have the following model.
class Items(models.Model):
    name = models.CharField(max_length=140)
    datetime = models.DateTimeField(auto_now_add=False)

    def save(self, *args, **kwargs):
        if self.datetime is None:
            self.datetime = datetime.now()
        super(Items, self).save(*args, **kwargs)
Here is my current logic for finding which days have items:
from calendar import monthrange

# Find number of days for June 2015
num_days = monthrange(2015, 6)[1]
days_with_items = []
# Increase num_days by 1 so that the last day of the month
# is included in the range
num_days += 1
for day in range(1, num_days):
    has_items = Items.objects.filter(datetime__day=day,
                                     datetime__month=6,
                                     datetime__year=2015).exists()
    if has_items:
        days_with_items.append(day)
return days_with_items
This returns me a list with all the days that have items. This works, however I'm looking for a more efficient way of doing this, since Django makes a separate trip to the DB for every .exists() call.
Any suggestions?
I see two possible options. The first one is to add counts at DB level, the second is to have an efficient loop over available data at python level. Depending on data size and db-efficiency you can choose which suits you best.
Counting in the database is explained here:
Django ORM, group by day
Or solution two, a simple script (not so elegant, but just as an example):
days_with_items_hash = {}
items = Items.objects.filter(
    datetime__month=6,
    datetime__year=2015
)
for item in items:
    days_with_items_hash[item.datetime.day] = True
days_with_items = days_with_items_hash.keys()
I would stick with the database solution because it can be optimised (sql views, extra column with just the day, etc)
First, let's get all the items for the required month.
items = Items.objects.filter(datetime__month=6, datetime__year=2015)
days = set([item.datetime.day for item in items]) # unique days
If you want to make a partial query, specify values you need, here's the concept:
items = Item.objects.filter(
    date_added__month=6, date_added__year=2015
).values('date_added')
days = set([item['date_added'].day for item in items])
It will result in the following SQL query:
QUERY: SELECT "main_item"."date_added" FROM "main_item" WHERE
       (django_datetime_extract('month', "main_item"."date_added", %s) = %s
       AND "main_item"."date_added" BETWEEN %s AND %s)
PARAMS: ('UTC', 6, datetime.datetime(2015, 1, 1, 0, 0, tzinfo=<UTC>),
        datetime.datetime(2015, 12, 31, 23, 59, 59, 999999, tzinfo=<UTC>))
If you are dealing with a big amount of Items, you can break your query into parts (e.g. < 15 and >= 15). This results in an extra database hit, but the peak memory usage will be smaller. You can also consider different methods.
Please also note:
- datetime is not the best name for a field. Name it meaningfully, like date_added, date_created or something similar.
- if self.datetime is None is 'almost' equal to if not self.datetime.
Use the dates method.
items = Item.objects.filter(date_added__month=6, date_added__year=2015)
dates = items.dates('date_added', 'day') # returns a queryset of datetimes
days = [d.day for d in dates] # convert to a list of days
