I have a data analysis task in which I want to analyze real-time service logs. Could you please help me figure out how to do this in Pandas?
My initial dataframe looks like this:
I want to generate a time series for each service name and run a correlation analysis based on them.
How can I divide this dataframe into separate dataframes (indexed by time slot), one per service name, aggregating their respective data as shown below?
PS: I have seen similar questions, but I believe mine is different because I want to generate many time series from one dataframe. Sorry in advance if this is an easy one; I am new to Pandas :)
Here is my dataframe:
ERRORCODE ERRORTEXT SERVICENAME REQTDURATION RESPTDURATION HOSTDURATION
10:00:27:000 NaN NaN serviceA 0 1 4612
10:00:27:822 NaN NaN serviceB 0 1 14994
10:01:27:622 -1 'Timeout' serviceA 1 0 7695
10:01:27:323 NaN NaN serviceD 0 1 2612
10:01:27:755 NaN NaN serviceA 0 1 1612
10:02:27:666 -5 'Timeout' serviceA 0 1 11612
10:02:27:111 NaN NaN serviceB 0 1 111112
10:02:27:333 NaN NaN serviceC 0 1 412
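For completeness, the printed table above can be rebuilt as an actual DataFrame; a sketch (the HH:MM:SS:fff timestamps stay as plain strings in the index until they are parsed with pd.to_datetime later on):

```python
import numpy as np
import pandas as pd

# The question's sample data, one row per logged request.
df = pd.DataFrame(
    {"ERRORCODE": [np.nan, np.nan, -1, np.nan, np.nan, -5, np.nan, np.nan],
     "ERRORTEXT": [np.nan, np.nan, "Timeout", np.nan, np.nan,
                   "Timeout", np.nan, np.nan],
     "SERVICENAME": ["serviceA", "serviceB", "serviceA", "serviceD",
                     "serviceA", "serviceA", "serviceB", "serviceC"],
     "REQTDURATION": [0, 0, 1, 0, 0, 0, 0, 0],
     "RESPTDURATION": [1, 1, 0, 1, 1, 1, 1, 1],
     "HOSTDURATION": [4612, 14994, 7695, 2612, 1612, 11612, 111112, 412]},
    index=["10:00:27:000", "10:00:27:822", "10:01:27:622", "10:01:27:323",
           "10:01:27:755", "10:02:27:666", "10:02:27:111", "10:02:27:333"],
)
```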
Starting with:
             ERRORCODE ERRORTEXT SERVICENAME  REQTDURATION  RESPTDURATION  HOSTDURATION
10:00:27:000       NaN       NaN    serviceA             0              1          4612
10:00:27:822       NaN       NaN    serviceB             0              1         14994
10:01:27:622        -1 'Timeout'    serviceA             1              0          7695
10:01:27:323       NaN       NaN    serviceD             0              1          2612
10:01:27:755       NaN       NaN    serviceA             0              1          1612
10:02:27:666        -5 'Timeout'    serviceA             0              1         11612
10:02:27:111       NaN       NaN    serviceB             0              1        111112
10:02:27:333       NaN       NaN    serviceC             0              1           412
Converting index to DateTimeIndex:
df.index = pd.to_datetime(df.index, format='%H:%M:%S:%f')
And then looping over SERVICENAME groups:
for service, data in df.groupby('SERVICENAME'):
    # pd.TimeGrouper has been removed; pd.Grouper is the current spelling
    by_min = data.groupby(pd.Grouper(freq='min'))
    service_result = pd.concat(
        [by_min.size(),
         by_min[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()],
        axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTURATION',
                              'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time
yields:
serviceA
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0.0 1.0 4612.0
10:01:00 2 0.5 0.5 4653.5
10:02:00 1 0.0 1.0 11612.0
serviceB
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0 1 14994
10:01:00 0 NaN NaN NaN
10:02:00 1 0 1 111112
serviceC
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:02:00 1 0 1 412
serviceD
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:01:00 1 0 1 2612
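To line the per-service results up for the correlation analysis mentioned at the top of the question, the per-minute series can be collected into one frame and correlated; a minimal sketch on a cut-down version of the sample data (only two services and one metric, to stay short):

```python
import pandas as pd

# Cut-down version of the question's frame: timestamp index, one row per
# request, one metric.
df = pd.DataFrame(
    {"SERVICENAME": ["serviceA", "serviceB", "serviceA", "serviceB"],
     "HOSTDURATION": [4612, 14994, 7695, 111112]},
    index=pd.to_datetime(
        ["10:00:27", "10:00:27", "10:01:27", "10:02:27"], format="%H:%M:%S"),
)

# One per-minute mean series per service, aligned into a single frame:
# columns are services, rows are minutes, missing minutes become NaN.
per_service = {
    service: grp.groupby(pd.Grouper(freq="min"))["HOSTDURATION"].mean()
    for service, grp in df.groupby("SERVICENAME")
}
aligned = pd.DataFrame(per_service)
corr = aligned.corr()  # pairwise Pearson correlation between services
```

With this little data most off-diagonal correlations are NaN (fewer than two overlapping minutes); on real logs the shared index makes `corr()` meaningful.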
I need to pivot a column in pandas and would greatly appreciate any help.
Input:
ID  Status   Date
1   Online   2022-06-31
1   Offline  2022-07-28
2   Online   2022-08-01
3   Online   2022-07-03
3   None     2022-07-05
4   Offline  2022-05-02
5   Online   2022-04-04
5   Online   2022-04-06
Output: Pivot on Status
ID  Date        Online  Offline  None
1   2022-06-31  1       0        0
1   2022-07-28  0       1        0
2   2022-08-01  1       0        0
3   2022-07-03  1       0        0
3   2022-07-05  1       0        0
4   2022-05-02  0       0        1
5   2022-04-04  1       0        0
5   2022-04-06  1       0        0
Or, even better, an output where the counts are merged per ID, for example:
Output: Pivot on Status & merge
ID  Online  Offline  None
1   1       1        0
2   1       0        0
3   2       0        0
4   0       0        1
5   2       0        0
The main issue here is that I won't know the status values (i.e. Offline, Online, None) in advance.
I believe doing this in pandas might be easier because of that dynamic nature: the values of the column I want to pivot on are not known up front.
df.assign(seq=1).pivot_table(index='ID', columns='Status', values='seq', aggfunc='sum').fillna(0)
Status None Offline Online
ID
1 0.0 1.0 1.0
2 0.0 0.0 1.0
3 1.0 0.0 1.0
4 0.0 1.0 0.0
5 0.0 0.0 2.0
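Since all that is being counted is row occurrences, pd.crosstab gives the merged table directly, without the helper column; a sketch, assuming Status holds the literal string 'None' rather than a missing value:

```python
import pandas as pd

# The sample input, with 'None' as a literal string status.
df = pd.DataFrame(
    {"ID": [1, 1, 2, 3, 3, 4, 5, 5],
     "Status": ["Online", "Offline", "Online", "Online", "None",
                "Offline", "Online", "Online"]}
)

# crosstab counts occurrences of each (ID, Status) pair; the column set
# adapts to whatever status values actually occur, so none need be known
# ahead of time.
counts = pd.crosstab(df["ID"], df["Status"])
```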
I have a Dataframe of the form
date_time uids
2018-10-16 23:00:00 1000,1321,7654,1321
2018-10-16 23:10:00 7654
2018-10-16 23:20:00 NaN
2018-10-16 23:30:00 7654,1000,7654,1321,1000
2018-10-16 23:40:00 691,3974,3974,323
2018-10-16 23:50:00 NaN
2018-10-17 00:00:00 NaN
2018-10-17 00:10:00 NaN
2018-10-17 00:20:00 27,33,3974,3974,7665,27
This is a very big dataframe containing 10-minute time intervals and, for each interval, the ids that appeared during it (an id can appear more than once per interval).
I want to iterate over this DataFrame six rows at a time (corresponding to one hour) and create a DataFrame containing each id and the number of times it appears during that hour.
The expected output is one dataframe per hour of information. For example, in the above case, the dataframe for the hour from 23:00 to 00:00 would have this form:
uid 1 2 3 4 5 6
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
and so on
How can I do this efficiently?
I don't have an exact solution, but you could create a pivot table: ids on the index and datetimes on the columns. Then you just have to select the columns you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"date_time": [
"2018-10-16 23:00:00",
"2018-10-16 23:10:00",
"2018-10-16 23:20:00",
"2018-10-16 23:30:00",
"2018-10-16 23:40:00",
"2018-10-16 23:50:00",
"2018-10-17 00:00:00",
"2018-10-17 00:10:00",
"2018-10-17 00:20:00",
],
"uids": [
"1000,1321,7654,1321",
"7654",
np.nan,
"7654,1000,7654,1321,1000",
"691,3974,3974,323",
np.nan,
np.nan,
np.nan,
"27,33,3974,3974,7665,27",
],
}
)
df["date_time"] = pd.to_datetime(df["date_time"])
df = (
    df.set_index("date_time")  # skip set_index if date_time is already the index
    .loc[:, "uids"]
    .str.extractall(r"(?P<uids>\d+)")
    .droplevel(level=1)
)  # one row per extracted id, still indexed by date_time
df["number"] = df.index.minute // 10 + 1  # slot number 1 to 6 within the hour
df_pivot = df.pivot_table(
    values="number",
    index="uids",
    columns=["date_time"],
)  # dataframe with all the uids on the index and all the datetimes in columns
You can apply this to the whole dataframe or just a subset containing 6 rows. Then you rename your columns.
You can use the function crosstab:
df['date_time'] = pd.to_datetime(df['date_time'])  # needed for .dt below
df['uids'] = df['uids'].str.split(',')
df = df.explode('uids')
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)
pd.crosstab(df['uids'], df['date_time'], dropna=False)
Output:
date_time 1 2 3 4 5 6
uids
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
27 0 0 2 0 0 0
323 0 0 0 0 1 0
33 0 0 1 0 0 0
3974 0 0 2 0 2 0
691 0 0 0 0 1 0
7654 1 1 0 2 0 0
7665 0 0 1 0 0 0
We can achieve this by extracting the minute from your datetime column, then using pivot_table to get your wide format:
df['date_time'] = pd.to_datetime(df['date_time'])
df['minute'] = df['date_time'].dt.minute // 10
piv = (df.assign(uids=df['uids'].str.split(','))
.explode('uids')
.pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
)
minute 0 1 2 3 4
uids
1000 1.0 NaN NaN 2.0 NaN
1321 2.0 NaN NaN 1.0 NaN
27 NaN NaN 2.0 NaN NaN
323 NaN NaN NaN NaN 1.0
33 NaN NaN 1.0 NaN NaN
3974 NaN NaN 2.0 NaN 2.0
691 NaN NaN NaN NaN 1.0
7654 1.0 1.0 NaN 2.0 NaN
7665 NaN NaN 1.0 NaN NaN
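The snippets above tabulate all rows at once; to get the one-frame-per-hour output the question asks for, the same counting can be done inside a per-hour groupby. A sketch over a cut-down version of the sample data:

```python
import pandas as pd

# Cut-down sample: three 10-minute rows spanning two different hours.
df = pd.DataFrame(
    {"date_time": pd.to_datetime(
         ["2018-10-16 23:00:00", "2018-10-16 23:30:00", "2018-10-17 00:20:00"]),
     "uids": ["1000,1321,7654,1321", "7654,1000,7654,1321,1000",
              "27,33,3974,3974,7665,27"]}
)

# Explode the comma-separated ids into one row per appearance.
long = df.assign(uids=df["uids"].str.split(",")).explode("uids")
long["slot"] = long["date_time"].dt.minute // 10 + 1  # slot 1..6 within the hour

# One count table per hour: rows are ids, columns are 10-minute slots.
hourly = {
    hour: pd.crosstab(grp["uids"], grp["slot"])
    for hour, grp in long.groupby(long["date_time"].dt.floor("h"))
}
```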
I'm very new to Python. I've tried to reshape a data set using pd.wide_to_long. The original dataframe looks like this:
chk1 chk2 chk3 ... chf1 chf2 chf3 id var1 var2
0 3 4 2 ... nan nan nan 1 1 0
1 4 4 4 ... nan nan nan 2 1 0
2 2 nan nan ... 3 4 3 3 0 1
3 3 3 3 ... 3 2 2 4 1 0
I used the following code:
df2 = pd.wide_to_long(df,
stubnames=['chk', 'chf'],
i=['id', 'var1', 'var2'],
j='type')
When checking the data after running this code, it looks like this:
chk chf
id var1 var2 egenskap
1 1 0 1 3 nan
2 4 nan
3 2 nan
4 nan nan
5 4 nan
6 nan nan
7 4 nan
8 4 nan
2 1 0 1 4 nan
2 4 nan
3 4 nan
4 5 nan
But when I check the columns in the new data set, it seems that all columns except 'chk' and 'chf' are gone!
df2.columns
Out[47]: Index(['chk', 'chf'], dtype='object')
df2.columns
for col in df2.columns:
print(col)
chk
chf
From the dataview it looks like 'id', 'var1', 'var2' have been merged into one common index:
Screenprint dataview here
Can someone please help me? :)
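For context, wide_to_long places the i columns into a MultiIndex rather than dropping them, which is why only 'chk' and 'chf' remain as regular columns. A minimal sketch on toy data (not the asker's full set) showing that reset_index brings them back:

```python
import pandas as pd

# Toy version of the asker's frame.
df = pd.DataFrame({
    "chk1": [3, 4], "chk2": [4, 4],
    "chf1": [float("nan"), 3.0], "chf2": [float("nan"), 2.0],
    "id": [1, 2], "var1": [1, 1], "var2": [0, 0],
})

df2 = pd.wide_to_long(df, stubnames=["chk", "chf"],
                      i=["id", "var1", "var2"], j="type")

# The i columns are now levels of a MultiIndex, not regular columns ...
assert set(df2.columns) == {"chk", "chf"}

# ... and reset_index turns them back into ordinary columns.
flat = df2.reset_index()
```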
I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of each column:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace the missing values with a new Series built from shift + expanding + mean; the first value of a group is not replaced, because no previous values exist:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
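For intuition, x.shift().expanding().mean() computes, at each position, the mean of all strictly earlier values in the group; a tiny sketch on one category's values in isolation:

```python
import pandas as pd

# One category's 'value' column in isolation.
s = pd.Series([1.0, 2.0, 4.0])

# shift() pushes each value down one row, so the expanding mean at any
# position only sees strictly earlier values: [NaN, 1.0, 1.5]
prior_mean = s.shift().expanding().mean()
```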
Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove this at all. I have tried code such as the following, but it does not do anything:
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv back in, but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
convert_dummy1.columns.name = None  # clears the leftover 'Month' label
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you can check the column names with convert_dummy1.columns.values, which in your case returns:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names returns:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index',axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
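If the goal is to remove the leftover label entirely rather than rename it, passing None to rename_axis drops it; a sketch with the same toy data as the reproduction snippet above:

```python
import pandas as pd

# Same toy data as in the reproduction snippet.
df1 = pd.DataFrame({"Product_Code": [10133, 10245, 10234, 10987, 10345],
                    "Month": [1, 2, 3, 4, 5],
                    "Sales": [0, 0, 0, 1, 0]})

# rename_axis(None, axis=1) removes the columns-axis label that
# pivot_table leaves behind, instead of renaming it.
df2 = (df1.pivot_table(index="Product_Code", columns="Month", values="Sales")
          .reset_index()
          .rename_axis(None, axis=1))
```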