Pivot just one column without knowing the values of that column - python

I need to pivot a column in pandas and would greatly appreciate any help.
Input:
ID  Status   Date
1   Online   2022-06-31
1   Offline  2022-07-28
2   Online   2022-08-01
3   Online   2022-07-03
3   None     2022-07-05
4   Offline  2022-05-02
5   Online   2022-04-04
5   Online   2022-04-06
Output: Pivot on Status
ID  Date        Online  Offline  None
1   2022-06-31  1       0        0
1   2022-07-28  0       1        0
2   2022-08-01  1       0        0
3   2022-07-03  1       0        0
3   2022-07-05  1       0        0
4   2022-05-02  0       0        1
5   2022-04-04  1       0        0
5   2022-04-06  1       0        0
Or, even better, an output where the counts are merged per ID, for example:
Output: Pivot on Status & merge
ID  Online  Offline  None
1   1       1        0
2   1       0        0
3   2       0        0
4   0       0        1
5   2       0        0
The main issue here is that I won't know the status values in advance, i.e. Offline, Online, None.
I believe pandas is a good fit because of this dynamic nature: the values of the column I want to pivot on are not known up front.

df.assign(seq=1).pivot_table(index='ID', columns='Status', values='seq', aggfunc='sum').fillna(0)
Status None Offline Online
ID
1 0.0 1.0 1.0
2 0.0 0.0 1.0
3 1.0 0.0 1.0
4 0.0 1.0 0.0
5 0.0 0.0 2.0
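A slightly shorter alternative (a sketch of the same idea, assuming the None status is the literal string 'None' rather than a missing value, since crosstab silently drops NaN keys) is pd.crosstab, which counts occurrences directly and returns integers, so no fillna is needed:
pd.crosstab(df['ID'], df['Status'])
For the row-level version that keeps Date, the index argument can be a list of columns: pd.crosstab([df['ID'], df['Date']], df['Status']).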


Replace columns with the same value between two dataframes

I have two dataframes
df1
Date RPM
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
and df2
Date RPM
0 0 0
1 2 2
2 4 4
3 6 6
I want to replace the RPM in df1 with the RPM from df2 where they have the same Date.
I tried replace but it didn't work out.
Use Series.map with a Series created from df2, then replace the missing values with the original column using Series.fillna:
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])
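For reference, a minimal self-contained sketch of this map-and-fillna approach with the two frames above:
import pandas as pd

df1 = pd.DataFrame({'Date': range(8), 'RPM': [0] * 8})
df2 = pd.DataFrame({'Date': [0, 2, 4, 6], 'RPM': [0, 2, 4, 6]})

# look up each Date of df1 in df2; keep df1's own RPM where df2 has no entry
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])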
You could merge() the two frames on the Date column to get the new RPM against the corresponding date row:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
Date RPM RPM new
0 1 0 NaN
1 2 0 2.0
2 3 0 NaN
3 4 0 4.0
4 5 0 NaN
5 6 0 6.0
6 7 0 NaN
You can then fill in the NaNs in RPM new using .fillna() to update the RPM column:
df['RPM'] = df['RPM new'].fillna(df['RPM'])
Date RPM RPM new
0 1 0.0 NaN
1 2 2.0 2.0
2 3 0.0 NaN
3 4 4.0 4.0
4 5 0.0 NaN
5 6 6.0 6.0
6 7 0.0 NaN
Then drop the RPM new column:
df = df.drop('RPM new', axis=1)
Date RPM
0 1 0.0
1 2 2.0
2 3 0.0
3 4 4.0
4 5 0.0
5 6 6.0
6 7 0.0
Full code:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
df['RPM'] = df['RPM new'].fillna(df['RPM'])
df = df.drop('RPM new', axis=1)
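As a side note, a similar result can be obtained with combine_first (a sketch, assuming Date values are unique in both frames): values from df2 win where present and df1 fills the gaps:
df = (df2.set_index('Date')['RPM']
         .combine_first(df1.set_index('Date')['RPM'])
         .reset_index())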

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
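If you also want the pre-holiday days to carry their own label (for example b- for the days shortly before Easter, as described in the question), a small hedged tweak to the loop above is to append a suffix while building the look-back pairs; this is a sketch, not part of the original answer:
holiday_look_back = []
for dt, h in dictDate2Holiday.items():
    holiday_look_back.append((dt, h))              # the holiday itself keeps its code, e.g. "b"
    prev = dt
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h + "-"))  # days shortly before get e.g. "b-"
The rest of the merge with dfHolidayLookBack stays the same.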
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that fall between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to back-fill the group numbers with the holiday code, e.g. groups=0.0 would become b_0, which would make it easier to understand which holiday each group belongs to, but I am not sure how to do that.
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0

How to change a non top 3 values columns in a dataframe in Python [duplicate]

This question already has an answer here: How to find the top column values of each row in a pandas dataframe.
I have a dataframe called df_BOW that was built from bag-of-words (BOW) results.
The dataframe looks like this:
df_BOW
Out[42]:
blue drama this ... book mask
0 3 0 1 ... 1 0
1 0 1 0 ... 0 4
2 0 1 3 ... 6 0
3 6 0 0 ... 1 0
4 7 2 0 ... 0 0
... ... ... ... ... ... ...
81991 0 0 0 ... 0 1
81992 0 0 0 ... 0 1
81993 3 3 5 ... 4 1
81994 4 0 0 ... 0 0
81995 0 1 0 ... 9 2
This data frame has around 12,000 columns and 82,000 rows.
I want to reduce the number of columns by doing this:
for each row, keep only the top 3 columns and set everything else to 0.
So for row number 543 (the original record looks like this):
blue drama this ... book mask
543 1 11 21 ... 7 4
It should become like this
blue drama this ... book mask
543 0 11 21 ... 7 0
Only the top 3 columns are kept (drama, this, book); all other columns become zero. Similarly, this row:
blue drama this ... book mask
929 5 3 2 ... 4 3
will become
blue drama this ... book mask
929 5 3 0 ... 4 0
At the end I should remove all columns that are zero for every row.
I started with this loop over all rows and all columns:
for i in range(0, len(df_BOW.index)):
    Col1No = 0
    Col1Val = 0
    Col2No = 0
    Col2Val = 0
    Col3No = 0
    Col3Val = 0
    for j in range(0, len(df_BOW.columns)):
        if (df_BOW.iloc[i,j] > min(Col1Val, Col2Val, Col3Val)):
            if (Col1Val <= Col2Val) & (Col1Val <= Col3Val):
                df_BOW.iloc[i,Col1No] = 0
                Col1Val = df_BOW.iloc[i,j]
                Col1No = j
            elif (Col2Val <= Col1Val) & (Col2Val <= Col3Val):
                df_BOW.iloc[i,Col2No] = 0
                Col2Val = df_BOW.iloc[i,j]
                Col2No = j
            elif (Col3Val <= Col1Val) & (Col3Val <= Col2Val):
                df_BOW.iloc[i,Col3No] = 0
                Col3Val = df_BOW.iloc[i,j]
                Col3No = j
I don't think this loop is the best way to do this.
Besides, it would become impractical to extend it to the top 50 columns.
Is there a better way to do this?
You can use pandas.Series.nlargest and pass keep='first' to keep only the first occurrence when there are ties among the top 3 largest values. Finally, use fillna(0) to fill all the resulting NaN cells with 0:
df.apply(lambda row: row.nlargest(3, keep='first'), axis=1).fillna(0)
OUTPUT:
blue book drama mask this
0 0.0 1.0 0.0 0.0 1.0
1 1.0 0.0 1.0 4.0 0.0
2 2.0 6.0 0.0 0.0 3.0
3 3.0 1.0 0.0 0.0 0.0
4 4.0 0.0 2.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0
6 0.0 0.0 0.0 1.0 0.0
7 3.0 4.0 0.0 0.0 5.0
8 4.0 0.0 0.0 0.0 0.0
9 0.0 9.0 1.0 2.0 0.0
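The nlargest approach above reshapes the frame to only the columns that ever appear in a top 3. If you instead want to keep the original column layout, zero out everything that is not in a row's top 3, and then drop the all-zero columns (as asked in the question), a hedged sketch using rank and where would be:
import pandas as pd

# rank values within each row; method='first' breaks ties by position
ranks = df_BOW.rank(axis=1, method='first', ascending=False)

# keep a value only if it is among the 3 largest in its row, otherwise set it to 0
df_top3 = df_BOW.where(ranks <= 3, 0)

# finally drop columns that are zero for every row
df_top3 = df_top3.loc[:, (df_top3 != 0).any(axis=0)]
Replacing 3 with 50 extends this to the top 50 columns without changing anything else.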

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace value with a new Series built from shift + expanding + mean; the first value of a group is not replaced because there are no previous values to average:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
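For completeness, a minimal self-contained sketch of this approach with the sample data from the question (dates omitted, since they do not affect the fill):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': [0, 1, 1, 2, 1, 2, 2, 7, 1, 2],
    'value':    [1, np.nan, 1, 2, 2, np.nan, 3, 3, np.nan, np.nan],
})

# expanding mean of the previous values within each category; shift() keeps
# the current row from contributing to its own fill value
prev_mean = df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(prev_mean)
A NaN that is the first observation of its category (row 1 above) stays NaN, because there is nothing earlier in that category to average.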

Generating regular time series from irregular time series in pandas

I have a data analysis task in which I want to analyze real-time service logs. Could you please help me with how to do this in Pandas?
My initial dataframe looks like this:
I want to generate a time series for each service name and run a correlation analysis based on this.
How can I divide this dataframe into different dataframes (indexed by time slot) for each service name, aggregating their respective data as shown below?
PS: I have seen similar questions, but I believe my question is different because I want to generate many time series from one dataframe. And sorry in advance if this is an easy one, I am new to Pandas :)
Here is my Dataframe as code:
ERRORCODE ERRORTEXT SERVICENAME REQTDURATION RESPTDURATION HOSTDURATION
10:00:27:000 NaN NaN serviceA 0 1 4612
10:00:27:822 NaN NaN serviceB 0 1 14994
10:01:27:622 -1 'Timeout' serviceA 1 0 7695
10:01:27:323 NaN NaN serviceD 0 1 2612
10:01:27:755 NaN NaN serviceA 0 1 1612
10:02:27:666 -5 'Timeout' serviceA 0 1 11612
10:02:27:111 NaN NaN serviceB 0 1 111112
10:02:27:333 NaN NaN serviceC 0 1 412
Starting with:
ERRORCODE ERRORTEXT SERVICENAME REQTDURATION RESPTDURATION \
10:00:27:000 NaN NaN serviceA 0 1
10:00:27:822 NaN NaN serviceB 0 1
10:01:27:622 -1 'Timeout' serviceA 1 0
10:01:27:323 NaN NaN serviceD 0 1
10:01:27:755 NaN NaN serviceA 0 1
10:02:27:666 -5 'Timeout' serviceA 0 1
10:02:27:111 NaN NaN serviceB 0 1
10:02:27:333 NaN NaN serviceC 0 1
HOSTDURATION
10:00:27:000 4612
10:00:27:822 14994
10:01:27:622 7695
10:01:27:323 2612
10:01:27:755 1612
10:02:27:666 11612
10:02:27:111 111112
10:02:27:333 412
Converting index to DateTimeIndex:
df.index = pd.to_datetime(df.index, format='%H:%M:%S:%f')
And then looping over SERVICENAME groups:
for service, data in df.groupby('SERVICENAME'):
    service_result = pd.concat([data.groupby(pd.TimeGrouper('Min')).size(),
                                data.groupby(pd.TimeGrouper('Min'))['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION'].mean()],
                               axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTURATION', 'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time
yields:
serviceA
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0.0 1.0 4612.0
10:01:00 2 0.5 0.5 4653.5
10:02:00 1 0.0 1.0 11612.0
serviceB
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0 1 14994
10:01:00 0 NaN NaN NaN
10:02:00 1 0 1 111112
serviceC
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:02:00 1 0 1 412
serviceD
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:01:00 1 0 1 2612
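Note that pd.TimeGrouper has been removed in later pandas releases. A hedged, modernized sketch of the same loop (assuming the frame above with its DatetimeIndex) uses pd.Grouper and list-based column selection instead:
for service, data in df.groupby('SERVICENAME'):
    per_min = data.groupby(pd.Grouper(freq='min'))
    service_result = pd.concat(
        [per_min.size(),
         per_min[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()],
        axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTURATION',
                              'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time
    print(service)
    print(service_result)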
