Pandas - Issue with plotting a graph after group by - python

I have a table "attendance" in sqlite3 which I am importing as a DataFrame using pandas. The DataFrame looks like this:
id name date time
0 12345 Pankaj 2020-09-12 1900-01-01 23:17:49
1 12345 Pankaj 2020-09-12 1900-01-01 23:20:28
2 12345 Pankaj 2020-09-13 1900-01-01 13:36:01
A person's 'id' can appear multiple times, which corresponds to a person going in and out of the door multiple times a day; we record each of those transitions.
I wish to find the difference between the last out time and the first in time, to get the number of hours a person was present at work.
Since we need only the data for one person at a time, I first filter the data for one person, like this:
df = df.loc[df['id']== id]
This leaves me all the entries for a particular person.
Now, for the difference between the last entry time and the first entry time, I calculate it like this:
df_gp = df.groupby('date')['time']
difference = df_gp.max() - df_gp.min()
Now, the "difference" comes out as a pandas series.
date
2020-09-12 00:02:39
2020-09-13 00:00:00
When I try to plot the graph using the pandas.Series.plot() method with kind = 'line', like this:
difference.plot(kind = 'line')
I don't see a graph being made at all. There is no error of any kind; it simply does not show anything.
When I print,
print(difference.plot(kind = 'line'))
It prints this in the terminal:
AxesSubplot(0.125,0.2;0.775x0.68)
So I thought the function might be exiting too quickly and destroying the graph, and that a time.sleep() would help, but that is not the case; I have tried so many things and it simply doesn't show.
I need help with:
I don't know if this is the correct way to get a graph of the time difference for each day. Please suggest a more elegant way to do the same if you have one.
What is the reason it doesn't show at all? (See the sketch below.)
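A minimal sketch of one way to get the plot to display from a plain script (assuming a standard matplotlib backend: plt.show() is what actually opens the window, which would explain why only the AxesSubplot repr gets printed). Converting the timedeltas to hours also tends to make the y-axis easier to read:
import matplotlib.pyplot as plt
# difference is a Series of timedeltas indexed by date
hours = difference.dt.total_seconds() / 3600  # plot plain numbers instead of timedeltas
ax = hours.plot(kind='line')
ax.set_ylabel('hours present')
plt.show()  # without this, a script only creates the Axes and then exits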
Complete code
def main():
    emp_id = "12345"
    db = os.path.join(constants.BASE_DIR.format("db"), "db_all.db")
    with closing(sqlite3.connect(db)) as conn:
        df = pd.read_sql_query("select * from attendance where id = {} order by date ASC".format(emp_id),
                               conn, parse_dates={'date': '%Y-%m-%d', 'time': '%H:%M:%S'})
    print(df.head())
    #df = df.loc[df['id']== id]
    is_empty = df.empty
    if is_empty:
        messagebox.showerror("Error", "There are not enough records of employee")
        return
    # Add the latest row
    emp_name = df.loc[(df['id'] == emp_id).idxmax(), 'name']
    # dt_time = datetime.datetime.now().replace(microsecond=0)
    # _date, _time = dt_time.date(), dt_time.time()
    # print(type(_date))
    # print(type(_time))
    # df.loc[-1] = [emp_id, emp_name, _date, _time]
    # df.index += 1
    # df = df.sort_index()
    # print(df.dtypes)
    df_gp = df.groupby('date')['time']
    print("Here")
    difference = df_gp.max() - df_gp.min()
    print(difference)
    print(difference.plot(kind='line'))

if __name__ == '__main__':
    main()
Thanks

Related

Python: iterate through the rows of a csv and calculate date difference if there is a change in a column

I only have basic knowledge of Python, so I'm not even sure if this is possible.
I have a csv that looks like this:
[1]: https://i.stack.imgur.com/8clYM.png
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
import datetime
from datetime import *
data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv",encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: the dates are the data points taken. I need to calculate back from the most recent of those until the job title was something else. So in my example, for employee 5000, that's from 04/07/2021 back to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd
import datetime
# load data
data = pd.read_csv('data.csv')
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)
# find the latest point in time where there is a change in job title
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
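One caveat: employees with only a single record (like 7000 above) get NaT, because there is no earlier row to diff against. If you want a number there as well, one possible convention (an assumption on my part, not something the question specifies) is to count from that single record up to today:
# assumption: with no earlier record to compare against, count from the only record to today
latest['days'] = latest['days'].fillna(pd.Timestamp.today().normalize() - latest['date'])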
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
Have a map (dict) of employee to (date, title).
For every row, check if you already have an entry for the employee. If you don't, just put the information in the map; otherwise compare the date of the row with that of the entry. If the row has a more recent date, replace the entry.
Once you've gone through all the rows, you can just go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your format string is not correct: the sample data uses either %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year); the sample is not sufficient to say which, but it is certainly not year/month/day.
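A rough sketch of the dict approach described above (assuming the %d/%m/%Y reading of the dates and a file called jts.csv with the columns id, jtitle, date; adjust to your actual header):
import csv
from datetime import datetime, date

latest = {}  # employee id -> (most recent date, job title)
with open('jts.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        d = datetime.strptime(row['date'], '%d/%m/%Y').date()
        emp = row['id']
        if emp not in latest or d > latest[emp][0]:
            latest[emp] = (d, row['jtitle'])

# days from the most recent record up to today
result = [(emp, jt, (date.today() - d).days) for emp, (d, jt) in latest.items()]
print(result)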
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch last job title resp. date
    # Search for first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in a list with datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information:
Example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
positon_timestamp_measure: Unix timestamp of when the record was created.
deviceID mileage positon_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by calculating the vehicle's implied speed from the timestamps and the mileage and comparing it to the vehicle's maximum speed (which is 80 km/h). The result should then be written into the original dataset.
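For example, for device 54672 in the sample above, the gap between its two messages is 1600702102 - 1600696079 = 6023 s ≈ 1.67 h, so the implied speed for the second message is 3 km / 1.67 h ≈ 1.8 km/h, well below 80 km/h, and that row should be marked valid.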
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1

df_ori.validPosition.value_counts()
It definitely works the way I want it to; however, its performance is very poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first value in each group,
# so fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, i.e. row-by-row iteration, which can be insanely slow.
Since I can't see the full context of your code, please double-check the logic and make sure it works as desired.

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column that means 1-the customer complained about the purchase, 0-customer didn't complain)
claim_value (for claim = 1 it is how much the claim cost the company; for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count towards each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I have never used them in a similar way before.
Expected outcome:
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
new_temp_columns = ['claim_s', 'claim_value_s']
# shift within each customer so a row only sees that customer's previous purchase
shifted = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['claim_s'] = shifted['claim']
df['claim_value_s'] = shifted['claim_value']
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# set the min value for the groups, so purchases on the same date don't count each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform(min)
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
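For completeness, here is a self-contained sketch of the same cumulative idea that also fills past_purchases and keeps same-date purchases from counting towards each other (column names are taken from the question; treat it as a starting point and double-check it against your data):
import pandas as pd

df = df.sort_values(['customer_ID', 'date'])
grp = df.groupby('customer_ID')

# running totals up to and including the current row
purchases_incl = grp.cumcount() + 1
claims_incl = grp['claim'].cumsum()
value_incl = df['claim_value'].fillna(0).groupby(df['customer_ID']).cumsum()

# totals excluding the current row itself
df['past_purchases'] = purchases_incl - 1
df['past_claims'] = claims_incl - df['claim']
df['past_claims_value'] = value_incl - df['claim_value'].fillna(0)

# purchases on the same date must not count towards each other, so broadcast the
# first value within each (customer, date) block, which only reflects earlier dates
block = df.groupby(['customer_ID', 'date'])
for col in ['past_purchases', 'past_claims', 'past_claims_value']:
    df[col] = block[col].transform('first')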

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns), and then the next five days are recorded under them. To make things more complicated, the day of the week, date, and billing day are shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal was to create a simple python script that would turn the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN data for the KW, KVAR, and KVA columns after the first five days (which correlates with a new iteration of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
It could be pd.DataFrame.append requiring matched row indices for numerical values.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(5,2), columns=['a','b']) # fake data
output['c'] = list('abcde') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2] # index 3..4 does not match tmp's index 0..1, so this generates NaN
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)
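If index alignment does turn out to be the culprit, a common workaround (my addition, not part of the original answer) is to drop the index before assigning, for example:
tmp['b'] = output.iloc[3:5, 2].to_numpy()  # assign by position, ignoring the index
# or equivalently:
tmp['b'] = output.iloc[3:5, 2].reset_index(drop=True)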
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object, which may then make its way into temp_df. iloc does not report out-of-range row slices, though a similar column index would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with the indexing in a Pandas DataFrame, as the original format is quite complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like. Alternatively, try a MultiIndex if you stick with Pandas I/O.

Python: using pandas.pivot_table to flatten activity logs and display time spent doing activities

I'm looking at using python and pandas to flatten our VLE (Blackboard inc.) activity table. I'm trying to sum the total time spent per day on accessing courses as opposed to doing other non course activity in the activity logs/table.
I've created some fake data and code below (python) to simulate the question and where I'm struggling. It's the flattened_v2 part I'm struggling with as that's close to my actual case.
The log data typically looks like this and I've created it below in the code example: (activity dataframe in the code below)
DAY event somethingelse timespent logtime
0 2013-01-02 null foo 0.274139 2013-01-02 00:00:00
0 2013-01-02 course1 foo 1.791061 2013-01-02 01:00:00
1 2013-01-02 course1 foo 0.824152 2013-01-02 02:00:00
2 2013-01-02 course1 foo 1.626477 2013-01-02 03:00:00
I've got a field called logtime in the real data. This is an actual datetime rather than a time-spent field (also included in my fake data as I was experimenting).
How do I record the total time spent (using logtime) on event = course (many courses)?
Each record's logtime shows the datetime at which a page was accessed.
The next record's logtime shows the datetime at which a new page was accessed, and therefore when the old page was left (close enough). How can I get the total time where the event is not null? If I just use the max/min values, this leads to an overestimate, as the gaps in course access (where event = null) are also included. I've simplified the data so that each record increments by 1 hour, which isn't the real case.
Thanks for any tips
Jason
The code is:
# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
#
#
import pandas as pd
import numpy as np
import datetime
# Creating fake data with string null and course1, course2
df = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course1",
    'somethingelse': 'foo'})
df2 = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course2",
    'somethingelse': 'foo'})
dfN = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(1)),
    'event': "null",
    'somethingelse': 'foo'})
dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times
# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
#            not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data.
#
print(activity)  # This is quite close to real data.
# Fake activity data created above to demo question.
# *********************************************
# Actual code to extract time spent on courses
# *********************************************
# Function to aggregate data -max and min
# where the time difference is explicit.
def agg_timespent(a, b):
    c = abs(b - a)
    return c
# Where the time difference is not explicit but is
# the time recorded when accessing a page (course event)
def agg_logtime(a, b):
    # In real data b and a are strings
    # b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
    # a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
    c = abs(b - a).seconds
    return c
# Remove 'null' data as that's not of interest here.
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[(activity.event != 'null')]
print(activity)  # This shows *just* course activity info
# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=["timespent"], aggfunc=[min, max], fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row[0], row[1]), axis=1)
# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times
flattened_v2 = activity.pivot_table(index=['DAY'], values=["logtime"], aggfunc=[min, max], fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row[0], row[1]), axis=1)
print()
print('*****Wrong!**********')
print('This is not what I have but just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as nulls/gaps are also included too')
print(flattened_v2)
You're right (in your comment): you'll need dataframe.shift.
If I'm understanding your question correctly, you want to record the time elapsed since the last timestamp, so timestamps signify the beginning of an activity, and when the last activity was null we should not record any elapsed time. Assuming that's all correct, use shift to add a column for time differences:
activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()
Now the first row will show the special "not a time" value NaT, but that's fine as we can't calculate elapsed time there. Next we can fill in some more NaT values for any elapsed time where a null event has just occurred:
mask = activity.event == 'null'
activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT
When we want to find out how much time was spent on course1, we have to shift again, because indexing for the course1 rows will produce rows where course1 is beginning. We need those where course1 is finishing/has finished:
activity[(activity.event == 'course1').shift().fillna(False)]['timelog_diff'].sum()
That returns 15 hours for course1 and 20 for course2 in your example.
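If you want the per-day total across all courses rather than one course at a time, one possibility (my addition; it assumes the DAY column identifies the day you want to group on, and it attributes a session to the day of the record that closed it) is:
# after the NaT step above, every remaining timelog_diff is time spent on some course
total_per_day = activity.groupby('DAY')['timelog_diff'].sum()
print(total_per_day)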
