How to efficiently select entries by date in python?


I have emails and dates. I can use two nested for loops to pick out the emails sent on the same date, but how can I do it the 'smart way', i.e. efficiently?
# list of tuples - (email, date)
for entry in list_emails_dates:
    current_date = entry[1]
    for next_entry in list_emails_dates:
        if current_date == next_entry[1]:
            list_one_date_emails.append(next_entry)
I know it can be written more concisely, but I don't know itertools. Or should I use map or xrange?

You can just convert this to a dictionary, by collecting all emails related to a date into the same key.
To do this, you need to use defaultdict from collections. It is an easy way to give a new key in a dictionary a default value.
Here we are passing in the function list, so that each new key in the dictionary will get a list as the default value.
from collections import defaultdict
emails = defaultdict(list)
for email, email_date in list_of_tuples:
    emails[email_date].append(email)
Now you have emails['2014-07-01'], which will be a list of emails for that date.
If we don't use a defaultdict, and do a dictionary comprehension like this:
emails = {x[1]:x[0] for x in list_of_tuples}
You'll have one entry for each date, which will be the last email for that date, since assigning to the same key overrides its value. A dictionary is the most efficient way to look up a value by a key. A list is good if you want to look up a value by its position in a series of values (assuming you know its position).
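To make the difference concrete, here is a minimal sketch that runs both approaches on the same data. The addresses and dates are invented for illustration and are not from the question.

```python
from collections import defaultdict

# Hypothetical sample data: (email, date) tuples.
list_of_tuples = [
    ("a@example.com", "2014-07-01"),
    ("b@example.com", "2014-07-01"),
    ("c@example.com", "2014-07-03"),
]

# Grouping: every date maps to the list of all emails sent that day.
emails_by_date = defaultdict(list)
for email, email_date in list_of_tuples:
    emails_by_date[email_date].append(email)

# Plain comprehension: a later tuple overwrites an earlier one with the same date.
last_email_by_date = {email_date: email for email, email_date in list_of_tuples}

print(emails_by_date["2014-07-01"])      # ['a@example.com', 'b@example.com']
print(last_email_by_date["2014-07-01"])  # b@example.com
```

Building the dictionary is a single pass over the list, and each later lookup by date does not need to scan the list again.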
If for some reason you are not able to refactor it, you can use this template method, which will create a generator:
def find_by_date(haystack, needle):
    for email, email_date in haystack:
        if email_date == needle:
            yield email
Here is how you would use it:
>>> email_list = [('foo#bar.com','2014-07-01'), ('zoo#foo.com', '2014-07-01'), ('a#b.com', '2014-07-03')]
>>> all_emails = list(find_by_date(email_list, '2014-07-01'))
>>> all_emails
['foo#bar.com', 'zoo#foo.com']
Or, you can do this:
>>> july_first = find_by_date(email_list, '2014-07-01')
>>> next(july_first)
'foo#bar.com'
>>> next(july_first)
'zoo#foo.com'

I would use itertools.groupby (and it's good practice to try itertools):
itertools.groupby(list_of_tuples, lambda x: x[1])
which gives you the emails grouped by date (x[1]). Note that the input has to be sorted on the same component first (sorted(list_of_tuples, key=lambda x: x[1])), because groupby only groups adjacent items.
One nice thing (apart from telling the reader that we are grouping) is that it works lazily, and if the list is long, the cost is dominated by the O(n log n) sort instead of the O(n^2) nested loop.
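A minimal sketch of the sort-then-group pattern, assuming (email, date) tuples like those in the question. The sample data is made up.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: (email, date) tuples, deliberately unsorted.
list_of_tuples = [
    ("a@example.com", "2014-07-03"),
    ("b@example.com", "2014-07-01"),
    ("c@example.com", "2014-07-01"),
]

by_date = itemgetter(1)

# groupby only merges adjacent items, so sort by the same key first.
grouped = {
    date: [email for email, _ in group]
    for date, group in groupby(sorted(list_of_tuples, key=by_date), key=by_date)
}

print(grouped)
# {'2014-07-01': ['b@example.com', 'c@example.com'], '2014-07-03': ['a@example.com']}
```

Each group is consumed inside the comprehension because the group iterators returned by groupby become invalid once the outer iterator advances.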

Related

Sorting a list of dict from redis in python

In my current project I generate a list of data. Each entry comes from a key in redis, in a dedicated DB where only one type of key exists.
r = redis.StrictRedis(host=settings.REDIS_AD, port=settings.REDIS_PORT, db='14')
item_list = []
keys = r.keys('*')
for key in keys:
    item = r.hgetall(key)
    item_list.append(item)
newlist = sorted(item_list, key=operator.itemgetter('Id'))
The code above lets me retrieve the data and build a list of dicts, each containing the information for one entry. The problem is that I would like to sort them by ID so they come out in order when displayed in the HTML table in my template, but the sorted function doesn't seem to work, since the table isn't sorted.
Any idea why the sorted line doesn't work? I suppose I'm missing something to make it work, but I can't find what.
EDIT :
Thanks to the answer in the comments, the problem was that my 'Id' values come out of redis as strings and needed to be cast to int to be sorted:
key=lambda d: int(d['Id'])

All values returned from redis are apparently strings, and strings do not sort numerically ("10" < "2" evaluates to True).
Therefore you need to cast it to a numerical value, probably to int (since they seem to be IDs):
newlist = sorted(item_list, key=lambda d: int(d['Id']))
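The effect is easy to reproduce without redis. The sketch below uses hypothetical rows shaped like the dicts in the question, assuming the values have already been decoded to str.

```python
# Hypothetical rows standing in for the hgetall results.
item_list = [{"Id": "10"}, {"Id": "2"}, {"Id": "1"}]

# Comparing as text: "10" sorts before "2" because '1' < '2'.
as_strings = sorted(item_list, key=lambda d: d["Id"])

# Comparing as numbers: the order a reader expects.
as_ints = sorted(item_list, key=lambda d: int(d["Id"]))

print([d["Id"] for d in as_strings])  # ['1', '10', '2']
print([d["Id"] for d in as_ints])     # ['1', '2', '10']
```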

Combining two lists and sorting through reference to dictionary Python

I have (what seems to me is) a pretty convoluted problem. I'm going to try to be as succinct as possible - though in order to understand the issue fully, you might have to click on my profile and look at the (only other) two questions I've posted on StackOverflow. In short: I have two lists -- one is comprised of email strings that contain a facility name, and a date of incident. The other is comprised of the facility ids for each email (I use one of the following regex functions to get this list). I've used Regex to be able to search each string for these pieces of information. The 3 Regex functions are:
def find_facility_name(incident):
    pattern = re.compile(r'Subject:.*?for\s(.+?)\n')
    findPat1 = re.search(pattern, incident)
    facility_name = findPat1.group(1)
    return facility_name
def find_date_of_incident(incident):
    pattern = re.compile(r'Date of Incident:\s(.+?)\n')
    findPat2 = re.search(pattern, incident)
    incident_date = findPat2.group(1)
    return incident_date
def find_facility_id(incident):
    pattern = re.compile(r'(\d{3})\n')
    findPat3 = re.search(pattern, incident)
    f_id = findPat3.group(1)
    return f_id
I also have a dictionary that is formatted like this:
d = {'001' : 'Facility #1', '002' : 'Another Facility'...etc.}
I'm trying to COMBINE the two lists and sort by the key values in the dictionary, followed by the Date of Incident. Since the key values are attached to the facility name, this should automatically cause emails from the same facilities to be grouped together. In order to do that, I've tried to use these functions:
def get_facility_ids(incident_list):
    '''(lst) -> lst
    Return a new list from incident_list that inserts the facility IDs from the
    get_facilities dictionary into each incident.
    '''
    f_id = []
    for incident in incident_list:
        find_facility_name(incident)
        for k in d:
            if find_facility_name(incident) == d[k]:
                f_id.append(k)
    return f_id
id_list = get_facility_ids(incident_list)
def combine_lists(L1, L2):
    combo_list = []
    for i in range(len(L1)):
        combo_list.append(L1[i] + L2[i])
    return combo_list
combination = combine_lists(id_list, incident_list)
def get_sort_key(incident):
    '''(str) -> tup
    Return a tuple from incident containing the facility id as the first
    value and the date of the incident as the second value.
    '''
    return (find_facility_id(incident), find_date_of_incident(incident))
final_list = sorted(combination, key=get_sort_key)
Here is an example of what my input might be and the desired output:
d = {'001' : 'Facility #1', '002' : 'Another Facility'...etc.}
input: first_list = ['email_1', 'email_2', etc.]
first output: next_list = ['facility_id_for_1+email_1', 'facility_id_for_2 + email_2', etc.]
DESIRED OUTPUT: FINAL_LIST = sorted(next_list, key=facility_id, date of incident)
The only problem is, the key values are not matching properly with what's found in each individual email string. Some DO, others are completely random. I have no idea why this is happening, but I have a feeling it has something to do with the way I'm combining the two lists. Can anyone help this lowly n00b? Thanks!!!

First off, I would suggest reversing your ID-to-name dictionary. Looking up a value by key is very fast but finding a key by value is very slow.
rd = { name: id_num for id_num, name in d.items() }
Then your first function can be replaced by a list comprehension:
id_list = [rd[find_facility_name(incident)] for incident in incident_list]
This might also expose why you're getting messed up values in your results. If an incident has a facility name that's not in your dictionary, this code will raise a KeyError (whereas your old function would just skip it).
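Here is a small sketch of one plausible way the skipping behaviour could produce mismatched ids. The facility names are invented, and the list of names stands in for what find_facility_name would return for each email.

```python
# Hypothetical data; "Unknown Place" has no entry in the dictionary.
d = {"001": "Facility #1", "002": "Another Facility"}
names_in_emails = ["Another Facility", "Unknown Place", "Facility #1"]

# Old approach: scan the dictionary and silently skip names with no match.
ids_skipping = [k for name in names_in_emails for k, v in d.items() if v == name]
print(ids_skipping)  # ['002', '001'] (two ids for three emails)

# Pairing by position now attaches '001' to the "Unknown Place" email.
print(list(zip(ids_skipping, names_in_emails)))

# Reversed dictionary: a missing name fails loudly instead of shifting the list.
rd = {name: id_num for id_num, name in d.items()}
try:
    ids_strict = [rd[name] for name in names_in_emails]
except KeyError as missing:
    print("no id for", missing)
```

Once one email is skipped, every id after it is paired with the wrong email, which would look like some ids matching and others being random.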
Your combine function is very similar to Python's built in zip function. I'd replace it with:
combination = [id+incident for id, incident in zip(id_list, incident_list)]
However, since you're building the first list from the second one, it might make sense to build the combined version directly, rather than making separate lists and then combining them in a separate step. Here's an update to the list comprehension above that goes right to the combination result:
combination = [rd[find_facility_name(incident)] + incident
               for incident in incident_list]
To do the sort, you can use the ID string that we just prepended to the email message, rather than parsing to find the ID again:
combination.sort(key=lambda x: (x[0:3], find_date_of_incident(x)))
The 3 in the slice is based off of your example of "001" and "002" as the id values. If the actual ids are longer or shorter you'll need to adjust that.

So, here is what I think is going on. Please correct me if possible.
The 'incident_list' is a list of email strings. You go in and find the facility names in each email because you have an external dictionary that has the (key:value)=(facility id: facility name). From the dictionary, you can extract the facility id in this 'id_list'.
You combine the lists so that you get a list of strings [facility id + email,...]
Then you want it to sort by a tuple( facility id, date of incidence ).
It looks like you are searching for the facility id and the facility name twice. You can skip a step if they are the same. Then, the best way is to do it all at once with tuples:
incident_list = ['email1', 'email2',...]
unsorted_list = []
for email in incident_list:
    id = find_facility_id(email)
    date = find_date_of_incident(email)
    mytuple = (id, date, id + email)
    unsorted_list.append(mytuple)
final_list = sorted(unsorted_list, key=lambda mytup: (mytup[0], mytup[1]))
Then you get a list of tuples sorted by the first element (id as a string) and then by the second element (date as a string). If you just need a list of strings (id + email), take the last element of each tuple:
FINALLIST = [tup[-1] for tup in final_list]
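One caveat when sorting on the date as a string: text comparison is only chronological for formats such as YYYY-MM-DD. The sketch below assumes, purely for illustration, that the dates look like MM/DD/YYYY, and parses them with datetime.strptime before sorting. The rows are made up.

```python
from datetime import datetime

# Hypothetical (id, date) pairs; the MM/DD/YYYY format is an assumption.
rows = [("001", "12/31/2013"), ("001", "01/15/2014"), ("002", "02/01/2013")]

# Sorting on the raw strings puts January 2014 before December 2013.
as_text = sorted(rows)

# Parsing the date first gives chronological order within each id.
as_dates = sorted(rows, key=lambda r: (r[0], datetime.strptime(r[1], "%m/%d/%Y")))

print(as_text)   # [('001', '01/15/2014'), ('001', '12/31/2013'), ('002', '02/01/2013')]
print(as_dates)  # [('001', '12/31/2013'), ('001', '01/15/2014'), ('002', '02/01/2013')]
```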

Sort list of strings by substring using Python

I have a list of strings, each of which is an email formatted in almost exactly the same way. There is a lot of information in each email, but the most important info is the name of a facility, and an incident date.
I'd like to be able to take that list of emails, and create a new list where the emails are grouped together based on the "location_substring" and then sorted again for the "incident_date_substring" so that all of the emails from one location will be grouped together in the list in chronological order.
The facility substring can be found usually in the subject line of each email. The incident date can be found in a line in the email that starts with: "Date of Incident:".
Any ideas as to how I'd go about doing this?

Write a function that returns the two pieces of information you care about from each email:
def email_sort_key(email):
    """Find two pieces of info in the email, and return them as a tuple."""
    # ...search, search...
    return "location", "incident_date"
Then, use that function as the key for sorting:
emails.sort(key=email_sort_key)
The sort key function is applied to all the values, and the values are re-ordered based on the values returned from the key function. In this case, the key function returns a tuple. Tuples are ordered lexicographically: find the first unequal element, then the tuples compare as the unequal elements compare.
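A short illustration of how tuple keys order themselves, using invented (location, date) pairs of the kind such a key function might return. The dates are written as YYYY-MM-DD here so that string order matches chronological order.

```python
# Hypothetical keys: (location, ISO-formatted incident date).
keys = [
    ("Harbor Clinic", "2014-03-02"),
    ("Downtown", "2014-05-10"),
    ("Harbor Clinic", "2014-01-20"),
]

# The first element decides; the second only breaks ties.
ordered = sorted(keys)

print(ordered)
# [('Downtown', '2014-05-10'), ('Harbor Clinic', '2014-01-20'), ('Harbor Clinic', '2014-03-02')]
```

This is why a single sort with a tuple key both groups the emails by location and orders them by date within each location.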

Your solution might look something like this:
def getLocation (mail): pass
    #magic happens here
def getDate (mail): pass
    #here be dragons
emails = [...] #original list
#Group mails by location
d = {}
for mail in emails:
    loc = getLocation (mail)
    if loc not in d: d [loc] = []
    d [loc].append (mail)
#Sort mails inside each group by date
for k, v in d.items ():
    d [k] = sorted (v, key = getDate)

This is something you could do:
from collections import defaultdict
from datetime import datetime
import re
mails = ['list', 'of', 'emails']
mails2 = defaultdict(list)
for mail in mails:
    loc = re.search(r'Subject:.*?for\s(.+?)\n', mail).group(1)
    mails2[loc].append(mail)
for m in mails2.values():
    m.sort(key=lambda x: datetime.strptime(
        re.search(r'Date of Incident:\s(.+?)\n', x).group(1), '%m/%d/%Y'))
Please note that this has absolutely no error handling for cases where the regexes don't match.
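One possible way to add that error handling is sketched below: emails that do not match either regex, or whose date does not parse, are set aside instead of raising. The helper name parse_mail, the unparsed list, and the sample emails are hypothetical; the regexes and the date format are the ones used above.

```python
import re
from collections import defaultdict
from datetime import datetime

SUBJECT_RE = re.compile(r'Subject:.*?for\s(.+?)\n')
DATE_RE = re.compile(r'Date of Incident:\s(.+?)\n')

def parse_mail(mail):
    """Return (location, date), or None if a field is missing or malformed."""
    loc = SUBJECT_RE.search(mail)
    date = DATE_RE.search(mail)
    if loc is None or date is None:
        return None
    try:
        when = datetime.strptime(date.group(1), '%m/%d/%Y')
    except ValueError:
        return None
    return loc.group(1), when

# Hypothetical emails; the last one matches neither pattern.
mails = [
    "Subject: Report for North Depot\nDate of Incident: 03/02/2014\n",
    "Subject: Report for North Depot\nDate of Incident: 01/20/2014\n",
    "no recognizable fields here\n",
]

grouped = defaultdict(list)
unparsed = []
for mail in mails:
    parsed = parse_mail(mail)
    if parsed is None:
        unparsed.append(mail)  # keep for manual review
    else:
        loc, when = parsed
        grouped[loc].append((when, mail))

# Sort each location's emails chronologically.
for entries in grouped.values():
    entries.sort(key=lambda pair: pair[0])
```

Storing the parsed date next to each email also avoids running the regex again during the sort.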

How can I sort this list in Python?

I have a list which contains a simple object of a class, say Person i.e:
my_list = [Person<obj>, Person<obj>, ...]
This Person object is very simple, has various variables, values i.e:
Person_n.name = 'Philip'
Person_n.height = '180'
Person_n.lives_in = 'apartment'
So as you can see, all these Persons live somewhere whether: apartment, house or boat.
What I want to create is a new list, or a dictionary (does not matter which) in which I have this list sorted in a way that they are grouped by their lives_in values and the most populated choice is the number one in the new list (or dictionary, in which lives_in value would be the key).
E.g.:
new_list = [('apartment', [Person_1, Person_5, Person_6, Person_4]), ('house', [Person_2, Person_7]), ('boat', [Person_3])]
I am new to Python and I am stuck with endless loops. There must be a simple way to do this without looping through 4 times.
What is the Pythonic way to achieve this desired new list?

You need to sort it first before passing it to groupby:
sorted_list = sorted(my_list, key=lambda x: x.lives_in)
then use itertools.groupby:
from itertools import groupby
result = [(key, list(group))
          for key, group in groupby(sorted_list, key=lambda x: x.lives_in)]
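The question also asks for the most populated choice to come first, which grouping alone does not do. A minimal sketch, assuming a stand-in Person class with only the attributes needed here, sorts the grouped result by group size afterwards. The names are invented.

```python
from itertools import groupby
from operator import attrgetter

class Person:
    """Stand-in for the question's class; only the fields used here."""
    def __init__(self, name, lives_in):
        self.name = name
        self.lives_in = lives_in

my_list = [
    Person("Ann", "house"), Person("Bob", "apartment"),
    Person("Cy", "boat"), Person("Di", "apartment"),
    Person("Ed", "apartment"), Person("Flo", "house"),
]

key = attrgetter("lives_in")
groups = [(k, list(g)) for k, g in groupby(sorted(my_list, key=key), key=key)]

# Largest group first.
groups.sort(key=lambda pair: len(pair[1]), reverse=True)

summary = [(k, [p.name for p in g]) for k, g in groups]
print(summary)
# [('apartment', ['Bob', 'Di', 'Ed']), ('house', ['Ann', 'Flo']), ('boat', ['Cy'])]
```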

import itertools
import operator
people = my_list_of_people
people.sort(key=operator.attrgetter('lives_in')) # sort the people by where they live
groups = itertools.groupby(people, key=operator.attrgetter('lives_in')) # group the sorted people by where they live

Suppose your Person list is in myList and you want to create newList. Here's the code:
newList = []
for i in range(len(myList)):
    found = False
    for j in range(len(newList)):
        if newList[j][0] == myList[i].lives_in:
            found = True
            newList[j][1].append(myList[i])
    if not found:
        newList.append((myList[i].lives_in, [myList[i]]))

How to go from a values_list to a dictionary of lists

I have a django queryset that returns a list of values:
[(client pk, timestamp, value, task pk), (client pk, timestamp, value, task pk),....,].
I am trying to get it to return a dictionary of this format:
{client pk:[[timestamp, value],[timestamp, value],...,], client pk:[list of lists],...,}
The values_list may have multiple records for each client pk. I have been able to get dictionaries of lists for client or task pk using:
def dict_from_vl(vls_list):
    keys = [vls_list[x][3] for x in range(0, len(vls_list), 1)]
    values = [[vls_list[x][1], vls_list[x][2]] for x in range(0, len(vls_list), 1)]
    target_dict = dict(zip(keys, values))
    return target_dict
However using this method, values for the same key write over previous values as it iterates through the values_list, rather than append them to a list. So this works great for getting the most recent if the values list is sorted oldest records to newest, but not for the purpose of creating a list of lists for the dict value.

Instead of target_dict=dict(zip(keys,values)), do
target_dict = defaultdict(list)
for i, key in enumerate(keys):
    target_dict[key].append(values[i])
(defaultdict is available in the standard module collections.)

from collections import defaultdict
d = defaultdict(list)
for x in vls_list:
    d[x[0]].append(list(x[1:3]))
Although I'm not sure if I got the question right.

I know in Python you're supposed to cram everything into a single line, but you could do it the old fashioned way...
def dict_from_vl(vls_list):
    target_dict = {}
    for v in vls_list:
        if v[0] not in target_dict:
            target_dict[v[0]] = []
        target_dict[v[0]].append([v[1], v[2]])
    return target_dict

For better speed, I suggest you don't create the keys and values lists separately but simply use only one loop:
tgt_dict = defaultdict(list)
for row in vls_list:
    tgt_dict[row[0]].append([row[1], row[2]])
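For reference, here is the single-loop version as a self-contained sketch. The rows are invented and only mimic the (client pk, timestamp, value, task pk) shape described in the question.

```python
from collections import defaultdict

# Hypothetical rows shaped like (client pk, timestamp, value, task pk).
vls_list = [
    (1, "2014-07-01T10:00", 3.5, 11),
    (2, "2014-07-01T10:05", 1.0, 12),
    (1, "2014-07-01T11:00", 4.2, 13),
]

tgt_dict = defaultdict(list)
for client_pk, timestamp, value, _task_pk in vls_list:
    # Append instead of assign, so earlier records for a client are kept.
    tgt_dict[client_pk].append([timestamp, value])

print(dict(tgt_dict))
# {1: [['2014-07-01T10:00', 3.5], ['2014-07-01T11:00', 4.2]], 2: [['2014-07-01T10:05', 1.0]]}
```

Unpacking the tuple in the for statement names each field, which avoids mixing up index 0 (client pk) and index 3 (task pk).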
