How to transform list from database into dataframe? - python

I have the following problem. My database returns a list:
[Order(id=22617, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 10, 0), loc=Location(lat=14.491272455461, lng=50.130463596998), address='Makedonska 619/11, Praha', duration=600), datetime.datetime(2020, 6, 1, 11, 38, 46), Order(id=22615, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 14, 0), loc=Location(lat=14.681866313487, lng=50.007439571346), address='Výhledová 256, Říčany', duration=600), datetime.datetime(2020, 6, 1, 10, 33, 33)]
Every item returned by the database is of type routes_data_loading.data_structures.Order or datetime.datetime. I would like to save the list as a pandas dataframe.
The desired output for the first row is:
id;frm;to;lat;lng;address;duration;time
22617;2020-06-01 08:00;2020-06-01 10:00;14.491272455461;50.130463596998;Makedonska 619/11, Praha;600;2020-06-01 11:38:46
A semicolon marks a new column. Note that the last column, time, has to be created, because it has no name in the original list.
Can you help me convert this list into a pandas df, please? I know how to convert a simple list into a df, but not one this complicated. Thanks a lot.

Try this without guarantee of success:
import pandas as pd

data = []
# the list alternates Order, datetime, Order, datetime, ... so pair them up
for order, time in zip(lst[::2], lst[1::2]):
    data.append({'id': order.id, 'frm': order.frm, 'to': order.to,
                 'lat': order.loc.lat, 'lng': order.loc.lng,
                 'address': order.address, 'duration': order.duration,
                 'time': time})
df = pd.DataFrame(data)
Output:
>>> df
id frm to lat lng address duration time
0 22617 2020-06-01 08:00:00 2020-06-01 10:00:00 14.491272 50.130464 Makedonska 619/11, Praha 600 2020-06-01 11:38:46
1 22615 2020-06-01 08:00:00 2020-06-01 14:00:00 14.681866 50.007440 Výhledová 256, Říčany 600 2020-06-01 10:33:33
Here is how I set up the data:
from collections import namedtuple
import datetime
Order = namedtuple('Order', ['id', 'frm', 'to', 'loc', 'address', 'duration'])
Location = namedtuple('Location', ['lat', 'lng'])
lst = [Order(id=22617, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 10, 0), loc=Location(lat=14.491272455461, lng=50.130463596998), address='Makedonska 619/11, Praha', duration=600),
datetime.datetime(2020, 6, 1, 11, 38, 46),
Order(id=22615, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 14, 0), loc=Location(lat=14.681866313487, lng=50.007439571346), address='Výhledová 256, Říčany', duration=600),
datetime.datetime(2020, 6, 1, 10, 33, 33)]
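As a variation on the loop above, here is a minimal sketch that flattens each pair generically and writes the semicolon-separated layout asked for in the question. It assumes Order and Location behave like the namedtuples in the setup (the real routes_data_loading classes may differ); the helper name flatten and the COLUMNS list are made up for illustration.

```python
import csv
import datetime
import io
from collections import namedtuple

# Stand-ins for the real classes (assumption: they behave like namedtuples)
Order = namedtuple('Order', ['id', 'frm', 'to', 'loc', 'address', 'duration'])
Location = namedtuple('Location', ['lat', 'lng'])

lst = [
    Order(id=22617, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 10, 0),
          loc=Location(lat=14.491272455461, lng=50.130463596998),
          address='Makedonska 619/11, Praha', duration=600),
    datetime.datetime(2020, 6, 1, 11, 38, 46),
    Order(id=22615, frm=datetime.datetime(2020, 6, 1, 8, 0), to=datetime.datetime(2020, 6, 1, 14, 0),
          loc=Location(lat=14.681866313487, lng=50.007439571346),
          address='Výhledová 256, Říčany', duration=600),
    datetime.datetime(2020, 6, 1, 10, 33, 33),
]

# Column order taken from the desired output in the question
COLUMNS = ['id', 'frm', 'to', 'lat', 'lng', 'address', 'duration', 'time']

def flatten(order, time):
    """Turn one (Order, datetime) pair into a flat dict; 'loc' becomes lat/lng."""
    row = dict(order._asdict())
    loc = row.pop('loc')
    row.update(loc._asdict())
    row['time'] = time
    return row

# Even positions hold orders, odd positions hold the matching datetimes
rows = [flatten(order, time) for order, time in zip(lst[::2], lst[1::2])]

# Write the rows with ';' as the column separator
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter=';', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()
print(text)
```

The same rows list can be handed to pandas with pd.DataFrame(rows, columns=COLUMNS) if a dataframe rather than text is needed.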

Related

Issues transforming tuple to denormalized dataframe

I have a tuple which is a list of 200 dicts:
eg:
mytuple= ([{'reviewId': '1234', 'userName': 'XXX', 'userImage': 'imagelink', 'content': 'AAA', 'score': 1, 'thumbsUpCount': 1, 'reviewCreatedVersion': '3.31.0', 'at': datetime.datetime(2022, 12, 1, 11, 49, 34), 'replyContent': "replycontent", 'repliedAt': datetime.datetime(2022, 12, 1, 12, 19, 51)},
{'reviewId': '5678', 'userName': 'S L', 'userImage': 'imagelink2', 'content': "content2", 'score': 1, 'thumbsUpCount': 0, 'reviewCreatedVersion': '3.31.0', 'at': datetime.datetime(2022, 11, 29, 12, 27, 46), 'replyContent': "replycontent2", 'repliedAt': datetime.datetime(2022, 11, 29, 12, 30, 40)}])
Ideally, I'd like to transform this into a dataframe with the following column headers:
reviewId  userName  userImage
1234      XXXX      imagelink
5678      S L       imagelink2
and so on with the column headers as the key and the columns containing the values.
mytuple was initially of size 2, from which I removed the second index and brought it down to just a list of dicts.
I tried different possibilities which include:
df=pd.DataFrame(mytuple)
df=pd.DataFrame.from_dict(mytuple)
df=pd.json_normalize(mytuple)
However, in all these cases, I get a dataframe as below
1                2                3   4
{'reviewId':..}  {'reviewId':..}  {}  {}
I'd like to understand where I'm going wrong. Thanks in advance!
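One possible explanation, offered as a guess from the output shown: the list of dicts is still wrapped in an outer container, so each dict ends up as a cell instead of a row. The sketch below uses a shortened stand-in for the data and assumes the outer object is a 1-tuple; unwrapping it yields a plain list of dicts, which is the shape that gives one row per dict and one column per key.

```python
# Shortened stand-in for the data in the question.
# Assumption: the list of dicts is wrapped in a 1-tuple (note the trailing comma).
mytuple = ([{'reviewId': '1234', 'userName': 'XXX', 'userImage': 'imagelink'},
            {'reviewId': '5678', 'userName': 'S L', 'userImage': 'imagelink2'}],)

# Unwrap the outer container to get the list of dicts itself
records = mytuple[0]

# Column headers are the dict keys; each dict becomes one row of values
columns = list(records[0])
rows = [[record[column] for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

With pandas imported as pd, passing the unwrapped list (pd.DataFrame(records)) should then produce the desired layout; whether this matches the real data depends on how mytuple is actually nested.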

Check which value from my list is not in my dataframe column

I need to check if any of the values in my list is missing in my df column. I used this:
data_xls['date'].isin([datetime(2015, 7, 20, 11,7),datetime(2015, 7, 20, 11,13),datetime(2015, 7, 20, 11,14),datetime(2015, 7, 20, 11,16)])
But I also want to know which one amongst my list is missing. How can I do that?
You need the ~ symbol to index the dates that are not in that list:
lst = [datetime(2015, 7, 20, 11,7),datetime(2015, 7, 20, 11,13),datetime(2015, 7, 20, 11,14),datetime(2015, 7, 20, 11,16)]
data_xls['date'][~data_xls['date'].isin(lst)]
But since you want the dates in your list missing in data_xls, you can find that by:
set(lst).difference(data_xls['date'])
If you need the difference between the dates list and the data_xls['date'] column, use:
data_xls = pd.DataFrame({'date': pd.date_range(datetime(2015, 7, 20, 11, 11),
                                               freq='1Min', periods=5)})
print (data_xls)
date
0 2015-07-20 11:11:00
1 2015-07-20 11:12:00
2 2015-07-20 11:13:00
3 2015-07-20 11:14:00
4 2015-07-20 11:15:00
dates = [datetime(2015, 7, 20, 11,7),datetime(2015, 7, 20, 11,13),
datetime(2015, 7, 20, 11,14),datetime(2015, 7, 20, 11,16)]
missing = [x for x in dates if x not in set(data_xls['date'])]
print (missing)
[datetime.datetime(2015, 7, 20, 11, 7), datetime.datetime(2015, 7, 20, 11, 16)]
missing = list(set(dates) - set(data_xls['date']))
print (missing)
[datetime.datetime(2015, 7, 20, 11, 7), datetime.datetime(2015, 7, 20, 11, 16)]
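Two details of the snippets above are worth a small sketch: a set difference does not guarantee any particular order, and the list comprehension rebuilds the set for every element. The version below builds the set once and keeps the order of the input list. It uses a plain list named existing as a stand-in for data_xls['date'], which is an assumption made to keep the example free of pandas.

```python
from datetime import datetime

# Stand-in for data_xls['date'] (assumption: a plain list instead of a pandas column)
existing = [datetime(2015, 7, 20, 11, minute) for minute in range(11, 16)]

dates = [datetime(2015, 7, 20, 11, 7), datetime(2015, 7, 20, 11, 13),
         datetime(2015, 7, 20, 11, 14), datetime(2015, 7, 20, 11, 16)]

# Build the lookup set once so that every membership test is cheap
existing_set = set(existing)

# Keep the original order of `dates` while splitting it into two groups
missing = [d for d in dates if d not in existing_set]
present = [d for d in dates if d in existing_set]

print(missing)
print(present)
```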

How to create a nested list conditioned on a parameter in python

I have generated a day-wise nested list and want to calculate total duration between login and logout sessions and store that value individually in a duration nested list, organized by the day in which the login happened.
My python script is:
import datetime
import itertools
Logintime = [
datetime.datetime(2021,1,1,8,10,10),
datetime.datetime(2021,1,1,10,25,19),
datetime.datetime(2021,1,2,8,15,10),
datetime.datetime(2021,1,2,9,35,10)
]
Logouttime = [
datetime.datetime(2021,1,1,10,10,11),
datetime.datetime(2021,1,1,17,0,10),
datetime.datetime(2021,1,2,9,30,10),
datetime.datetime(2021,1,2,17,30,12)
]
Logintimedaywise = [list(group) for k, group in itertools.groupby(Logintime,
                    key=datetime.datetime.toordinal)]
Logouttimedaywise = [list(group) for j, group in itertools.groupby(Logouttime,
                     key=datetime.datetime.toordinal)]
print(Logintimedaywise)
print(Logouttimedaywise)
# calculate total duration
temp = []
l = []
for p,q in zip(Logintimedaywise,Logouttimedaywise):
    for a,b in zip(p, q):
        tdelta = (b-a)
        diff = int(tdelta.total_seconds()) / 3600
        if diff not in temp:
            temp.append(diff)
l.append(temp)
print(l)
This script generates the following output (the durations in variable l come out as a flat list inside a singleton list):
[[datetime.datetime(2021, 1, 1, 8, 10, 10), datetime.datetime(2021, 1, 1, 10, 25, 19)], [datetime.datetime(2021, 1, 2, 8, 15, 10), datetime.datetime(2021, 1, 2, 9, 35, 10)]]
[[datetime.datetime(2021, 1, 1, 10, 10, 11), datetime.datetime(2021, 1, 1, 17, 0, 10)], [datetime.datetime(2021, 1, 2, 9, 30, 10), datetime.datetime(2021, 1, 2, 17, 30, 12)]]
[[2.000277777777778, 6.5808333333333335, 1.25, 7.917222222222223]]
But my desired output format is the following nested list of durations (each item in the list should be the list of durations for a given login day):
[[2.000277777777778, 6.5808333333333335] , [1.25, 7.917222222222223]]
Can anyone help me store the total durations as a nested list according to the login day?
Thanks in advance.
Try changing this piece of code:
# calculate total duration
temp = []
l = []
for p,q in zip(Logintimedaywise,Logouttimedaywise):
    for a,b in zip(p, q):
        tdelta = (b-a)
        diff = int(tdelta.total_seconds()) / 3600
        if diff not in temp:
            temp.append(diff)
l.append(temp)
print(l)
To:
# calculate total duration
l = []
for p,q in zip(Logintimedaywise,Logouttimedaywise):
    l.append([])  # start a new sublist for this day
    for a,b in zip(p, q):
        tdelta = (b-a)
        diff = int(tdelta.total_seconds()) / 3600
        if diff not in l[-1]:
            l[-1].append(diff)
print(l)
Then the output would be:
[[datetime.datetime(2021, 1, 1, 8, 10, 10), datetime.datetime(2021, 1, 1, 10, 25, 19)], [datetime.datetime(2021, 1, 2, 8, 15, 10), datetime.datetime(2021, 1, 2, 9, 35, 10)]]
[[datetime.datetime(2021, 1, 1, 10, 10, 11), datetime.datetime(2021, 1, 1, 17, 0, 10)], [datetime.datetime(2021, 1, 2, 9, 30, 10), datetime.datetime(2021, 1, 2, 17, 30, 12)]]
[[2.000277777777778, 6.5808333333333335], [1.25, 7.917222222222223]]
I add a new sublist on every iteration of the outer loop, so each day gets its own list.
Your solution and the answer by @U11-Forward will break if the login and logout of the same session happen on different days, since the inner lists in Logintimedaywise and Logouttimedaywise will then have different numbers of elements.
To avoid that, a much simpler solution is to first calculate the duration for every (login, logout) pair and then build the nested lists based only on the login day (or the logout day, if you prefer), like this:
import datetime
import itertools
import numpy
# define the login and logout times
Logintime = [datetime.datetime(2021,1,1,8,10,10),datetime.datetime(2021,1,1,10,25,19),datetime.datetime(2021,1,2,8,15,10),datetime.datetime(2021,1,2,9,35,10)]
Logouttime = [datetime.datetime(2021,1,1,10,10,11),datetime.datetime(2021,1,1,17,0,10), datetime.datetime(2021,1,2,9,30,10),datetime.datetime(2021,1,2,17,30,12) ]
# calculate the duration and the unique days in the set
duration = [ int((logout - login).total_seconds())/3600 for login,logout in zip(Logintime,Logouttime) ]
login_days = numpy.unique([login.day for login in Logintime])
# create the nested list of durations
# each inner list correspond to a unique login day
Logintimedaywise = [[ login for login in Logintime if login.day == day ] for day in login_days ]
Logouttimedaywise = [[ logout for login,logout in zip(Logintime,Logouttime) if login.day == day ] for day in login_days ]
duration_daywise = [[ d for d,login in zip(duration,Logintime) if login.day == day ] for day in login_days ]
# check
print(Logintimedaywise)
print(Logouttimedaywise)
print(duration_daywise)
Outputs
[[datetime.datetime(2021, 1, 1, 8, 10, 10), datetime.datetime(2021, 1, 1, 10, 25, 19)], [datetime.datetime(2021, 1, 2, 8, 15, 10), datetime.datetime(2021, 1, 2, 9, 35, 10)]]
[[datetime.datetime(2021, 1, 1, 10, 10, 11), datetime.datetime(2021, 1, 1, 17, 0, 10)], [datetime.datetime(2021, 1, 2, 9, 30, 10), datetime.datetime(2021, 1, 2, 17, 30, 12)]]
[[2.000277777777778, 6.5808333333333335], [1.25, 7.917222222222223]]
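A related sketch, using only the standard library: grouping on login.date() instead of login.day. The day attribute is only the day of the month, so logins on 1 January and 1 February would share a group; the full date avoids that. The variable names below are made up for illustration, and unlike the original loop this version does not drop repeated duration values.

```python
import datetime
from collections import defaultdict

logins = [datetime.datetime(2021, 1, 1, 8, 10, 10), datetime.datetime(2021, 1, 1, 10, 25, 19),
          datetime.datetime(2021, 1, 2, 8, 15, 10), datetime.datetime(2021, 1, 2, 9, 35, 10)]
logouts = [datetime.datetime(2021, 1, 1, 10, 10, 11), datetime.datetime(2021, 1, 1, 17, 0, 10),
           datetime.datetime(2021, 1, 2, 9, 30, 10), datetime.datetime(2021, 1, 2, 17, 30, 12)]

# Map each login date to the list of session durations (in hours) that started on it
durations_by_day = defaultdict(list)
for login, logout in zip(logins, logouts):
    hours = int((logout - login).total_seconds()) / 3600
    durations_by_day[login.date()].append(hours)

# One inner list per login day, in chronological order
duration_daywise = [durations_by_day[day] for day in sorted(durations_by_day)]
print(duration_daywise)
```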

Pandas : first datetime field gets automatically converted to timestamp type

When creating a pandas dataframe object (python 2.7.9, pandas 0.16.2), the first datetime field gets automatically converted into a pandas timestamp. Why? Is it possible to prevent this so as to keep the field in the original type?
Please see code below:
import numpy as np
import datetime
import pandas
Create a dict:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
     'vstart': np.array([datetime.datetime(2001, 11, 16, 0, 0),
                         datetime.datetime(2012, 2, 28, 0, 0),
                         datetime.datetime(2014, 12, 22, 0, 0)],
                        dtype=object),
     'vstop': np.array([datetime.datetime(2012, 2, 28, 0, 0),
                        datetime.datetime(2014, 12, 22, 0, 0),
                        datetime.datetime(9999, 12, 31, 0, 0)],
                       dtype=object),
     'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'],
                    dtype='|S18')}
So, the vstart and vstop keys are datetime so far. However, after:
df = pandas.DataFrame(data = x)
the vstart becomes a pandas Timestamp automatically while vstop remains a datetime
type(df.vstart[0])
# <class 'pandas.tslib.Timestamp'>
type(df.vstop[0])
# <type 'datetime.datetime'>
I don't understand why the first datetime column that the constructor comes across gets converted to Timestamp by pandas, or how to tell pandas to keep the data types as they are. Can you help? Thank you.
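Actually, I've noticed something in your data: this has nothing to do with which date column comes first. Your vstop column contains a datetime with the value dt.datetime(9999, 12, 31, 0, 0), which lies outside the range a nanosecond-resolution pandas Timestamp can represent (it ends in the year 2262), so pandas leaves that whole column as plain datetime objects. If you change the year of this date to a normal year, 2020 for example, both columns are treated the same.
Just note that I'm importing the datetime module as dt and pandas as pd: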
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstop': np.array([dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0), dt.datetime(2020, 12, 31, 0, 0)], dtype=object),
'vstart': np.array([dt.datetime(2001, 11, 16, 0, 0),dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0)], dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'], dtype='|S18')}
In [27]:
df = pd.DataFrame(x)
df
Out[27]:
cusip id vstart vstop
10553M10 EQ0000000000041095 2001-11-16 2012-02-28
67085120 EQ0000000000041095 2012-02-28 2014-12-22
67085140 EQ0000000000041095 2014-12-22 2020-12-31
In [25]:
type(df.vstart[0])
Out[25]:
pandas.tslib.Timestamp
In [26]:
type(df.vstop[0])
Out[26]:
pandas.tslib.Timestamp
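To make the range limit concrete, here is a small standard-library sketch. It assumes the column would be stored as nanoseconds since 1970-01-01 in a signed 64-bit integer, which is how nanosecond-resolution timestamps work; the helper name fits_in_ns is made up for illustration and is not a pandas function.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest nanosecond offset a signed 64-bit integer can hold

# datetime only has microsecond resolution, so convert nanoseconds to microseconds
span = datetime.timedelta(microseconds=INT64_MAX // 1000)
upper = EPOCH + span
lower = EPOCH - span

def fits_in_ns(value):
    """Return True if `value` lies inside the nanosecond-resolution range."""
    return lower <= value <= upper

print(lower, upper)
print(fits_in_ns(datetime.datetime(2020, 12, 31)))
print(fits_in_ns(datetime.datetime(9999, 12, 31)))
```

A value such as the year-9999 sentinel falls outside this window, which is consistent with the vstop column staying as plain datetime objects.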

Value difference comparison within a list in python

I have a nested list that contains different variables. I am trying to check the difference between two consecutive items and, if a condition matches, group these items together.
For example:
Item 1 happened on 1-6-2012 1 pm
Item 2 happened on 1-6-2012 4 pm
Item 3 happened on 1-6-2012 6 pm
Item 4 happened on 3-6-2012 5 pm
Item 5 happened on 5-6-2012 5 pm
I want to group the items that have gaps of less than 24 hours. In this case, Items 1, 2 and 3 belong to one group, Item 4 belongs to a second group and Item 5 belongs to a third group. I tried the following code:
import csv
from collections import defaultdict
from datetime import datetime

Time = []
All_Traps = []
Traps = []
Dic_Traps = defaultdict(list)
Traps_CSV = csv.reader(open("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps_Generic_Features.csv"))
for rows in Traps_CSV:
    All_Traps.append(rows)
All_Traps.sort(key=lambda x: x[9])
for length in xrange(len(All_Traps)):
    if length == (len(All_Traps) - 1):
        break
    Node_Name_1 = All_Traps[length][2]
    Node_Name_2 = All_Traps[length + 1][2]
    Event_Type_1 = All_Traps[length][5]
    Event_Type_2 = All_Traps[length + 1][5]
    Time_1 = All_Traps[length][9]
    Time_2 = All_Traps[length + 1][9]
    Difference = datetime.strptime(Time_2[0:19], '%Y-%m-%dT%H:%M:%S') - datetime.strptime(Time_1[0:19], '%Y-%m-%dT%H:%M:%S')
    if Node_Name_1 == Node_Name_2 and \
       Event_Type_1 == Event_Type_2 and \
       float(Difference.seconds) / (60*60) < 24:
        Dic_Traps[length].append(All_Traps[length])
But I am missing some items. Ideas?
For a sorted list you can use itertools.groupby. Here is a simplified example (you should first convert your date strings to datetime objects) that shows the main idea:
from itertools import groupby
import datetime
SRC_DATA = [
    (1, datetime.datetime(2015, 6, 20, 1)),
    (2, datetime.datetime(2015, 6, 20, 4)),
    (3, datetime.datetime(2015, 6, 20, 5)),
    (4, datetime.datetime(2015, 6, 21, 1)),
    (5, datetime.datetime(2015, 6, 22, 1)),
    (6, datetime.datetime(2015, 6, 22, 4)),
]
# group consecutive items that fall on the same calendar date
for group_date, group in groupby(SRC_DATA, key=lambda x: x[1].date()):
    print("Group {}: {}".format(group_date, list(group)))
Output:
$ python python_groupby.py
Group 2015-06-20: [(1, datetime.datetime(2015, 6, 20, 1, 0)), (2, datetime.datetime(2015, 6, 20, 4, 0)), (3, datetime.datetime(2015, 6, 20, 5, 0))]
Group 2015-06-21: [(4, datetime.datetime(2015, 6, 21, 1, 0))]
Group 2015-06-22: [(5, datetime.datetime(2015, 6, 22, 1, 0)), (6, datetime.datetime(2015, 6, 22, 4, 0))]
First of all, change those horribly cased variable names. Python has its own conventions for naming variables, classes, methods and so on; for variables it is called snake case.
Now, on to what you need to do:
import datetime as dt
import pprint
item_timestamp_dict = {}
with open('timex.dat', 'r') as f:
    for line in f.read().splitlines():
        if line:
            # "Item 1 happened on 1-6-2012 1 pm" -> item "1", timestamp "1-6-2012 1"
            item = line.split('happened')[0].strip().split(' ')[1]
            timestamp_string = line.split('on')[-1].split('pm')[0]
            datetime_stamp = dt.datetime.strptime(timestamp_string.strip(), "%d-%m-%Y %H")
            item_timestamp_dict[item] = datetime_stamp
This is a hackish way of giving you this:
item_timestamp_dict= {
'1': datetime.datetime(2012, 6, 1, 1, 0),
'2': datetime.datetime(2012, 6, 1, 4, 0),
'3': datetime.datetime(2012, 6, 1, 6, 0),
'4': datetime.datetime(2012, 6, 3, 5, 0),
'5': datetime.datetime(2012, 6, 5, 5, 0)}
That is a dictionary with the item number as key and its datetime timestamp as value.
You can then use these datetime values (for example item_timestamp_dict['1'].hour) to do your calculation.
EDIT: It can be optimized a lot.
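For the grouping rule stated in the question (consecutive items less than 24 hours apart belong together), a minimal sketch could look like the following. The function name group_by_gap and the sample events are made up for illustration, and the dates are read as day-month-year with pm hours. It compares full timedelta objects; note that timedelta.seconds only holds the part of the difference below one day, so a test like Difference.seconds / 3600 < 24 is always true.

```python
import datetime

# Sample events from the question, read as day-month-year with pm hours (assumption)
events = [
    ('Item 1', datetime.datetime(2012, 6, 1, 13)),
    ('Item 2', datetime.datetime(2012, 6, 1, 16)),
    ('Item 3', datetime.datetime(2012, 6, 1, 18)),
    ('Item 4', datetime.datetime(2012, 6, 3, 17)),
    ('Item 5', datetime.datetime(2012, 6, 5, 17)),
]

def group_by_gap(events, max_gap=datetime.timedelta(hours=24)):
    """Group (name, timestamp) pairs so that consecutive members are < max_gap apart."""
    groups = []
    for event in sorted(events, key=lambda e: e[1]):
        # Compare with the last event of the current group, using the full timedelta
        if groups and event[1] - groups[-1][-1][1] < max_gap:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups

groups = group_by_gap(events)
for group in groups:
    print([name for name, _ in group])
```

The extra conditions from the question (same node name and same event type) could be added to the if test in the same way.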
