Construct new columns based on all previous pairs information using Pandas - python

I am trying to create 3 new columns in a dataframe, based on previous pairs' information.
You can think of the dataframe as the results of competitions ('xx' column) between different types ('type' column) at different dates ('date' column).
The idea is to create the following new columns:
(i) numb_comp_past: the number of times each type faced its current competitors in the past.
(ii) win_comp_past: the sum of the wins (+1), ties (0), and losses (-1) from the previous competitions between the types currently competing with each other.
(iii) win_comp_past_difs: the sum of the score differences from the previous competitions between the types currently competing with each other.
The original dataframe (df) is the following:
import numpy as np
import pandas as pd

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18', 'Mar-18', 'Mar-18',
                 'May-18', 'Jun-18', 'Jun-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Aug-18', 'Sep-18',
                 'Sep-18', 'Oct-18', 'Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C', 'A', 'B', 'C',
                 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 1}, {'xx': 6},
        {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3}, {'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2},
        {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])
df['xx'] = df.xx.astype('float')
And it looks like this:
xx
date type
2018-01-01 A 1.0
B 5.0
2018-02-01 B 3.0
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
2018-05-01 B 3.0
2018-06-01 A 5.0
B 2.0
C 3.0
2018-07-01 A 1.0
2018-08-01 B 9.0
C 3.0
2018-09-01 A 2.0
B 7.0
2018-10-01 C 3.0
A 6.0
B 8.0
2018-11-01 A 2.0
2018-12-01 B 7.0
C 9.0
The 3 new columns I need to add to the dataframe are shown below (expected output of the Pandas code):
xx numb_comp_past win_comp_past win_comp_past_difs
date type
2018-01-01 A 1.0 0.0 0.0 0.0
B 5.0 0.0 0.0 0.0
2018-02-01 B 3.0 0.0 0.0 0.0
2018-03-01 A 2.0 1.0 -1.0 -4.0
B 7.0 1.0 1.0 4.0
C 3.0 0.0 0.0 0.0
D 1.0 0.0 0.0 0.0
E 6.0 0.0 0.0 0.0
2018-05-01 B 3.0 0.0 0.0 0.0
2018-06-01 A 5.0 3.0 -3.0 -10.0
B 2.0 3.0 3.0 13.0
C 3.0 2.0 0.0 -3.0
2018-07-01 A 1.0 0.0 0.0 0.0
2018-08-01 B 9.0 2.0 0.0 3.0
C 3.0 2.0 0.0 -3.0
2018-09-01 A 2.0 3.0 -1.0 -6.0
B 7.0 3.0 1.0 6.0
2018-10-01 C 3.0 5.0 -1.0 -10.0
A 6.0 6.0 -2.0 -10.0
B 8.0 7.0 3.0 20.0
2018-11-01 A 2.0 0.0 0.0 0.0
2018-12-01 B 7.0 4.0 2.0 14.0
C 9.0 4.0 -2.0 -14.0
Note that:
(i) for numb_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of 3 given that it previously competed with type B on 2018-01-01 and 2018-03-01 and with type C on 2018-03-01.
(ii) for win_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -3 given that it previously lost to type B on 2018-01-01 (-1) and 2018-03-01 (-1) and to type C on 2018-03-01 (-1). Adding -1-1-1 = -3.
(iii) for win_comp_past_difs, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -10 given that it previously lost to type B on 2018-01-01 by a difference of -4 (=1-5) and on 2018-03-01 by a difference of -5 (=2-7), and to type C on 2018-03-01 by -1 (=2-3). Adding -4-5-1 = -10.
I really don't know how to start solving this problem. Any ideas on how to compute the new columns described in (i), (ii) and (iii) are very welcome.
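For clarity, here is a slow brute-force reference of what the three columns mean, just to pin down the rules (a sketch only; it assumes df as built above and is far too slow for real data):
import numpy as np

out = df.copy()
for col in ['numb_comp_past', 'win_comp_past', 'win_comp_past_difs']:
    out[col] = 0.0

for (d, t) in df.index:
    # competitors of t on the current date d
    rivals = [u for (dd, u) in df.index if dd == d and u != t]
    n = w = s = 0.0
    for u in rivals:
        # past dates on which both t and u competed
        past = [dd for (dd, tt) in df.index if dd < d and tt == t and (dd, u) in df.index]
        for p in past:
            diff = df.loc[(p, t), 'xx'] - df.loc[(p, u), 'xx']
            n += 1
            w += np.sign(diff)   # +1 win, 0 tie, -1 loss
            s += diff
    out.loc[(d, t), ['numb_comp_past', 'win_comp_past', 'win_comp_past_difs']] = [n, w, s]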

Here's my take:
# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    tmp = pd.DataFrame(x[:, None] - x[None, :],
                       columns=teams.values,
                       index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# win matches
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)

# group by players
groups = new_df.groupby(level=[1, 2])

# sum function
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
df['num_comp_past'] = groups.xx.cumcount().sum(level=[0, 1])
df['win_comp_past'] = groups.win.apply(cumsum_shift).sum(level=[0, 1])
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0, 1])
Output:
+------------+------+-----+---------------+---------------+--------------------+
| | | xx | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date | type | | | | |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A | 1.0 | 0.0 | 0.0 | 0.0 |
| | B | 5.0 | 0.0 | 0.0 | 0.0 |
| 2018-02-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-03-01 | A | 2.0 | 1.0 | -1.0 | -4.0 |
| | B | 7.0 | 1.0 | 1.0 | 4.0 |
| | C | 3.0 | 0.0 | 0.0 | 0.0 |
| | D | 1.0 | 0.0 | 0.0 | 0.0 |
| | E | 6.0 | 0.0 | 0.0 | 0.0 |
| 2018-05-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-06-01 | A | 5.0 | 3.0 | -3.0 | -10.0 |
| | B | 2.0 | 3.0 | 3.0 | 13.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-07-01 | A | 1.0 | NaN | NaN | NaN |
| 2018-08-01 | B | 9.0 | 2.0 | 0.0 | 3.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-09-01 | A | 2.0 | 3.0 | -1.0 | -6.0 |
| | B | 7.0 | 3.0 | 1.0 | 6.0 |
| 2018-10-01 | C | 3.0 | 5.0 | -1.0 | -10.0 |
| | A | 6.0 | 6.0 | -2.0 | -10.0 |
| | B | 8.0 | 7.0 | 3.0 | 20.0 |
| 2018-11-01 | A | 2.0 | NaN | NaN | NaN |
| 2018-12-01 | B | 7.0 | 4.0 | 2.0 | 14.0 |
| | C | 9.0 | 4.0 | -2.0 | -14.0 |
+------------+------+-----+---------------+---------------+--------------------+
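On current pandas this answer needs two small adjustments: Series no longer accepts multi-dimensional keys such as x[:, None] (go through NumPy instead), and Series.sum(level=...) was removed in pandas 2.0, so the final aggregation has to be an explicit groupby on the index levels. A sketch of the adjusted version, which also fills the lone-competitor rows (the NaN rows above) with 0 as the question expects:
import numpy as np
import pandas as pd

def get_diff(x):
    teams = x.index.get_level_values(1)
    vals = x.to_numpy()
    tmp = pd.DataFrame(vals[:, None] - vals[None, :],
                       columns=teams.values,
                       index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

new_df = df.groupby('date').xx.apply(get_diff).to_frame()
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)
groups = new_df.groupby(level=[1, 2])

def cumsum_shift(x):
    return x.cumsum().shift()

df['num_comp_past'] = groups.xx.cumcount().groupby(level=[0, 1]).sum()
df['win_comp_past'] = groups.win.apply(cumsum_shift).groupby(level=[0, 1]).sum()
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).groupby(level=[0, 1]).sum()

# rows where a type competed alone have no pair information; make them 0
cols = ['num_comp_past', 'win_comp_past', 'win_comp_past_difs']
df[cols] = df[cols].fillna(0)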

Related

Accessing column of a list of nested json in a python dataframe

Identifier Properties
1 [{"$id":"2","SMName":"pia.redoabs.com","Type":"sms"},{"$id":"3","Name":"_18_Lucene41_0.doc","Type":"file"}]
2 [{"$id":"2","SMName":"pred.redocad.com","Type":"sms"},{"$id":"3","Name":"_18_Nil41_0.doc","Type":"file"}]
3 [{"$id":"2","SMName":"promomaster.com","Type":"sms"},{"$id":"3","Name":"_17_Litre41_0.doc","Type":"file"}]
4 [{"$id":"2","SMName":"admaster.com","Type":"sms"},{"$id":"3","Name":"_k.pos","Type":"file"}]
5 [{"$id":"2","SMName":"plan.com.com","Type":"sms"},{"$id":"3","Name":"_3_Lucene41_0.doc","Type":"file"}]
6 [{"$id":"2","Name":"jm460","ETNDomain":"ent.ad.ent","Sid":"S-1-5-21-117609710-2111687655-839522115-432193","AadUserId":"7133971dffgh5r-b9b8-4af3-bbfd-85b1b56d1f6f","IsDomainJoined":true,"Type":"account","UserPrincipalName":"jmjklo460#ent.com"},{"$id":"3","Directory":"C:\\CR\\new_cbest_malware","Name":"ent_Survey.zip","hash":[{"$id":"4","Algorithm":"hsa1","Value":"cecce931f21697876efc80f5897a31481c396795","Type":"hash"},{"$id":"5","Algorithm":"MI5","Value":"12c216630a5f24faab06d463c9ce72a5","Type":"hash"},{"$id":"6","Algorithm":"TM345","Value":"cbb327b70a83fefeaf744458f2ed62021e529ce0ece36566761779f17d4c07a6","Type":"hash"}],"CreatedTimeUtc":"2022-08-22T17:42:02.4272869Z","Type":"file"},{"$ref":"4"},{"$ref":"5"},{"$ref":"6"},{"$id":"7","ProcessId":"54884","CommandLine":"\"7zG.exe\" a -i#7zMap23807:40278:7zEvent24942 -ad -saa -- \"C:\\CR\\CR_2\"","ElevationToken":"Default","CreationTimeUtc":"2022-10-03T17:59:35.2339055Z","ImageFile":{"$id":"8","Directory":"C:\\Program Files\\7-Zip","Name":"9zG.exe","FileHashes":[{"$id":"9","Algorithm":"HSA2","Value":"df22612647e9404a515d48ebad490349685250de","Type":"hash"},{"$id":"10","Algorithm":"MI5","Value":"04fb3ae7f05c8bc333125972ba907398","Type":"hash"},{"$id":"11","Algorithm":"hsa1","Value":"2fb898bacb587f2484c9c4aa6da2729079d93d1f923a017bb84beef87bf74fef","Type":"hash"}],"CreatedTimeUtc":"2020-09-21T16:34:33.1299959Z","Type":"file"},"ParentProcess":{"$id":"12","ProcessId":"13516","CreationTimeUtc":"2022-09-21T12:41:32.4609401Z","CreatedTimeUtc":"2022-09-21T12:41:32.4609401Z","Type":"process"},"CreatedTimeUtc":"2022-10-03T17:59:35.2339055Z","Type":"process"},{"$ref":"12"},{"$ref":"8"},{"$ref":"9"},{"$ref":"10"},{"$ref":"11"},{"$id":"13","DnsDomain":"ent.ad.ent.com","HostName":"ilch-l788441","OSFamily":"Windows","OSVersion":"20H2","Tags":[{"ProviderName":"tmdp","TagId":"VwanPov","TagName":"VwanPov","TagType":"UserDefined"},{"ProviderName":"dmpt","TagId":"Proxy Allow Personal Storage","TagName":"Proxy Allow Personal Storage","TagType":"UserDefined"},{"ProviderName":"dmpt","TagId":"Proxy Allow Webmail","TagName":"Proxy Allow Webmail","TagType":"UserDefined"},{"ProviderName":"dmpt","TagId":"proxy-allow-social-media","TagName":"proxy-allow-social-media","TagType":"UserDefined"}],"Type":"host","dmptDeviceId":"fa52ff90ab60ee6eac86ec60ed2ac748a33e29fa","FQDN":"ilch-567.ent.ad.ent.com","AadDeviceId":"e1d59b69-dd3f-4f33-96b5-db9233654c16","RiskScore":"Medium","HealthStatus":"Active","LastSeen":"2022-10-03T18:09:32.7812655","LastExternalIpAddress":"208.95.144.39","LastIpAddress":"10.14.126.52","AvStatus":"Updated","OnboardingStatus":"Onboarded","LoggedOnUsers":[{"AccountName":"jmjklo460","DomainName":"ENT"}]}]
This is a dataframe with 2 columns, "Identifier" and "Properties". The "Properties" column contains a list of JSON records. The aim is to create separate columns for "sms", "file", "ETNDomain", "UserPrincipalName", etc.
Not all rows carry the same information, as seen above: the first 5 rows are similar, while the 6th row is different.
Can we make the code dynamic so that it can extract any of the values?
I previously used KQL to parse this data and it was relatively straightforward, but since I have little to no experience with JSON in Python, any help would be greatly appreciated.
You can use something like this. However, since there are many different keys in the JSON data, most cells will be NaN.
import ast
import numpy as np
import pandas as pd

# parse the string representation of each record list (skip missing values);
# if Properties holds raw JSON strings (with true/false/null), use json.loads instead of ast.literal_eval
df['Properties'] = df['Properties'].astype(str)
df['Properties'] = df['Properties'].apply(lambda x: ast.literal_eval(x) if x != 'nan' else np.nan)

# one row per record, then flatten each record's keys into columns
df = df.explode('Properties')  # df is your dataframe
final = df[['Identifier']].join(pd.json_normalize(df['Properties']))
'''
| | Identifier | $id | SMName | Type | Name | ETNDomain | Sid | AadUserId | IsDomainJoined | UserPrincipalName | Directory | hash | CreatedTimeUtc | $ref | ProcessId | CommandLine | ElevationToken | CreationTimeUtc | ImageFile.$id | ImageFile.Directory | ImageFile.Name | ImageFile.FileHashes | ImageFile.CreatedTimeUtc | ImageFile.Type | ParentProcess.$id | ParentProcess.ProcessId | ParentProcess.CreationTimeUtc | ParentProcess.CreatedTimeUtc | ParentProcess.Type | DnsDomain | HostName | OSFamily | OSVersion | Tags | dmptDeviceId | FQDN | AadDeviceId | RiskScore | HealthStatus | LastSeen | LastExternalIpAddress | LastIpAddress | AvStatus | OnboardingStatus | LoggedOnUsers |
|---:|-------------:|------:|:-----------------|:-------|:-------------------|------------:|------:|------------:|-----------------:|--------------------:|------------:|-------:|-----------------:|-------:|------------:|--------------:|-----------------:|------------------:|----------------:|----------------------:|-----------------:|-----------------------:|---------------------------:|-----------------:|--------------------:|--------------------------:|--------------------------------:|-------------------------------:|---------------------:|------------:|-----------:|-----------:|------------:|-------:|---------------:|-------:|--------------:|------------:|---------------:|-----------:|------------------------:|----------------:|-----------:|-------------------:|----------------:|
| 0 | 1 | 2 | pia.redoabs.com | sms | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | 2 | 3 | nan | file | _18_Lucene41_0.doc | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 2 | 3 | 2 | pred.redocad.com | sms | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 3 | 4 | 3 | nan | file | _18_Nil41_0.doc | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | 5 | 2 | promomaster.com | sms | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 5 | 6 | 3 | nan | file | _17_Litre41_0.doc | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
'''
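If you only need a handful of specific keys as columns, with one row per Identifier, the normalized frame can be collapsed with a groupby. A sketch building on final from above; the key list is only an illustration and assumes those keys exist in your data:
keys = ['SMName', 'Name', 'ETNDomain', 'UserPrincipalName']
wide = (
    final.groupby('Identifier')[keys]
         .first()            # first non-null value of each key per Identifier
         .reset_index()
)
print(wide)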

maximum variation within one second for each row of a DataFrame

I'm having a calculation problem with pandas and I'd like to know if anyone could help me.
I have this df, created using this code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
                  index=[pd.Timestamp('20130101 09:31:23.999'),
                         pd.Timestamp('20130101 09:31:24.200'),
                         pd.Timestamp('20130101 09:31:24.250'),
                         pd.Timestamp('20130101 09:31:25.000'),
                         pd.Timestamp('20130101 09:31:25.375'),
                         pd.Timestamp('20130101 09:31:25.850'),
                         pd.Timestamp('20130101 09:31:26.100'),
                         pd.Timestamp('20130101 09:31:27.150'),
                         pd.Timestamp('20130101 09:31:28.050'),
                         pd.Timestamp('20130101 09:31:28.850'),
                         pd.Timestamp('20130101 09:31:29.200')])
df
| | B |
|-------------------------|------|
| 2013-01-01 09:31:23.999 | 0.0 |
| 2013-01-01 09:31:24.200 | 2.0 |
| 2013-01-01 09:31:24.250 | 1.0 |
| 2013-01-01 09:31:25.000 | NaN |
| 2013-01-01 09:31:25.375 | 4.0 |
| 2013-01-01 09:31:25.850 | 1.0 |
| 2013-01-01 09:31:26.100 | 3.0 |
| 2013-01-01 09:31:27.150 | 10.0 |
| 2013-01-01 09:31:28.050 | NaN |
| 2013-01-01 09:31:28.850 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 |
I would like to calculate, for each row, the maximum variation of B over the following second.
For example, for the first row you would look at the second and third rows, which are the ones within one second of it, and compute the difference between the maximum value in that window and the current value.
In this case the maximum value is in the second row ("09:31:24.200"), so the maximum variation is 2 - 0.
Then, we will create a new column with all these maximum variations for each of the rows.
df
| | B | Maximum Variation |
|-------------------------|------|--------------------|
| 2013-01-01 09:31:23.999 | 0.0 | 2.0 |
| 2013-01-01 09:31:24.200 | 2.0 | 1.0 |
| 2013-01-01 09:31:24.250 | 1.0 | 0.0 |
| 2013-01-01 09:31:25.000 | NaN | 4.0 |
| 2013-01-01 09:31:25.375 | 4.0 |-3.0 |
| 2013-01-01 09:31:25.850 | 1.0 | 2.0 |
| 2013-01-01 09:31:26.100 | 3.0 | 0.0 |
| 2013-01-01 09:31:27.150 | 10.0 | 0.0 |
| 2013-01-01 09:31:28.050 | NaN | 3.0 |
| 2013-01-01 09:31:28.850 | 3.0 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 | 0.0 |
I hope it's clear enough
A solution has been found and shared in the answers, but an efficiency improvement that avoids looping over each row of the df would be more than welcome.
I've finally found the solution:
df = pd.DataFrame({'B': [0, 1, 2, 8, 6, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
for index, row in df.iterrows():
    start = row['start_date']
    end = row['end_date']
    maxi = df[(df['start_date'] >= start) & (df['start_date'] <= end)]['B'].max()
    df.iloc[index, df.columns.get_loc('max')] = maxi
df['Maximum Variation'] = df['max'] - df['B']
df
| | start_date | B | duration_in_seconds | end_date | max | Maximum Variation |
|----|-------------------------|------|---------------------|-------------------------|------|-------------------|
| 0 | 2013-01-01 09:31:23.999 | 0.0 | 1 | 2013-01-01 09:31:24.999 | 2.0 | 2.0 |
| 1 | 2013-01-01 09:31:24.200 | 1.0 | 1 | 2013-01-01 09:31:25.200 | 8.0 | 7.0 |
| 2 | 2013-01-01 09:31:24.250 | 2.0 | 1 | 2013-01-01 09:31:25.250 | 8.0 | 6.0 |
| 3 | 2013-01-01 09:31:25.000 | 8.0 | 1 | 2013-01-01 09:31:26.000 | 8.0 | 0.0 |
| 4 | 2013-01-01 09:31:25.375 | 6.0 | 1 | 2013-01-01 09:31:26.375 | 6.0 | 0.0 |
| 5 | 2013-01-01 09:31:25.850 | 1.0 | 1 | 2013-01-01 09:31:26.850 | 3.0 | 2.0 |
| 6 | 2013-01-01 09:31:26.100 | 3.0 | 1 | 2013-01-01 09:31:27.100 | 3.0 | 0.0 |
| 7 | 2013-01-01 09:31:27.150 | 10.0 | 1 | 2013-01-01 09:31:28.150 | 10.0 | 0.0 |
| 8 | 2013-01-01 09:31:28.050 | NaN | 1 | 2013-01-01 09:31:29.050 | 3.0 | NaN |
| 9 | 2013-01-01 09:31:28.850 | 3.0 | 1 | 2013-01-01 09:31:29.850 | 6.0 | 3.0 |
| 10 | 2013-01-01 09:31:29.200 | 6.0 | 1 | 2013-01-01 09:31:30.200 | 6.0 | 0.0 |
More time-efficient solutions are still welcome.
A more efficient solution:
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
df["max"] = df.apply(lambda row : df.loc[(df["start_date"] >= row['start_date']) & (df["start_date"] <=row['end_date'])]["B"].max(), axis = 1)
df['Maximum Variation'] = df['max'] - df['B']
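A fully vectorized alternative is to fake a forward-looking time window: reverse the series, measure time backwards from the last timestamp, and use an ordinary backward-looking rolling('1s') window on that reversed clock. A sketch, using the same max-in-the-next-second-minus-current definition as the loop above (rows where B is NaN stay NaN, as they do there); it assumes df is still the original time-indexed frame from the question, before reset_index:
import pandas as pd

def forward_window_max(s, window='1s'):
    rev = s.iloc[::-1].copy()
    # map each timestamp t to (t_max - t): a backward window on this reversed
    # clock corresponds to a forward window on the original timestamps
    rev.index = pd.Timestamp(0) + (s.index.max() - rev.index)
    out = rev.rolling(window, closed='both').max()
    out = out.iloc[::-1]
    out.index = s.index
    return out

df['Maximum Variation'] = forward_window_max(df['B']) - df['B']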
import numpy as np
import pandas as pd
df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
print(df)
B
2013-01-01 09:31:23.999 0.0
2013-01-01 09:31:24.200 2.0
2013-01-01 09:31:24.250 1.0
2013-01-01 09:31:25.000 NaN
2013-01-01 09:31:25.375 4.0
2013-01-01 09:31:25.850 1.0
2013-01-01 09:31:26.100 3.0
2013-01-01 09:31:27.150 10.0
2013-01-01 09:31:28.050 NaN
2013-01-01 09:31:28.850 3.0
2013-01-01 09:31:29.200 6.0
df_min = df.resample('1S').min()
print(df_min)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 1.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_max = df.resample('1S').max()
print(df_max)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 2.0
2013-01-01 09:31:25 4.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_diff = df_max - df_min
print(df_diff)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 3.0
2013-01-01 09:31:26 0.0
2013-01-01 09:31:27 0.0
2013-01-01 09:31:28 0.0
2013-01-01 09:31:29 0.0
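Note that resample measures the spread inside fixed calendar-second buckets, which is not quite the same as a one-second window starting at each row. If the bucket semantics are acceptable, the per-second values can be broadcast back onto the original rows; a small sketch reusing df_diff from above:
df['bucket_variation'] = df_diff['B'].reindex(df.index.floor('1s')).to_numpy()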

Efficiently apply several different operations to a dataset

I have to apply several different operations to many columns of a dataset. I did it, but not in a very efficient way...
As an example, I have this table:
| A | B | C | D | E |
|------|------|------|------|------|
| 1.0 | 1.0 | 1.0 | 2.0 | a |
| 2.0 | 1.0 | 1.5 | 5.0 | a |
| 3.0 | 1.0 | 2.0 | 3.0 | b |
| 1.0 | 2.0 | 2.0 | 6.0 | a |
| 2.0 | 2.0 | 3.0 | 4.0 | b |
| 3.0 | 2.0 | 4.0 | 2.0 | b |
| 1.0 | 3.0 | 5.0 | 5.0 | b |
| 2.0 | 3.0 | 6.0 | 1.0 | a |
| 3.0 | 3.0 | 10.0 | 2.0 | a |
And I need to get the following result:
# I don't need the A column; the grouping key is the B column. Apply the mean
# to C, the sum to D, and the most frequent value to E
| B | C | D | E |
|------|------|------|------|
| 1.0 | 1.5 | 10.0 | a |
| 2.0 | 3.0 | 12.0 | b |
| 3.0 | 7.0 | 8.0 | a |
Here is my attempt, but it is extremely slow. My original dataset has 2,000,000 rows; reducing it to 130,000 takes more than 30 minutes, and I have to apply this three times, which is why I need something more efficient.
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                   "B": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
                   "C": [1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
                   "D": [2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
                   "E": ['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})
print(df)

dict_ds = {'B': [], 'C': [], 'D': [], 'E': []}
df2 = pd.DataFrame(dict_ds)

df = df.groupby('B')
for n in df.first().index:
    data = df.get_group(n)
    partial = data.mean()
    new_C = partial['C']
    partial = data.sum()
    new_D = partial['D']
    new_E = data['E'].mode()[0]
    df2.loc[len(df2)] = (n, new_C, new_D, new_E)
print(df2)
This part was added after getting the solution. If I add the unique operation for A to the agg:
df.groupby('B').agg({
'A': 'unique',
'C': 'mean',
'D': 'sum',
'E': lambda x: x.mode()
}).reset_index()
I get the following result:
B A C D E
0 1.0 [1.0, 2.0, 3.0] 1.5 10.0 a
1 2.0 [1.0, 2.0, 3.0] 3.0 12.0 b
2 3.0 [1.0, 2.0, 3.0] 7.0 8.0 a
But I need to have it in this other way:
A B C D E
0 1.0 1.0 1.5 10.0 a
1 2.0 1.0 1.5 10.0 a
2 3.0 1.0 1.5 10.0 a
3 1.0 2.0 3.0 12.0 b
4 2.0 2.0 3.0 12.0 b
5 3.0 2.0 3.0 12.0 b
6 1.0 3.0 7.0 8.0 a
7 2.0 3.0 7.0 8.0 a
8 3.0 3.0 7.0 8.0 a
Is it possible to get something like this in a very efficient way?
new_df = df.groupby('B').agg({
'C': 'mean',
'D': 'sum',
'E': lambda x: x.mode()
})
>>> new_df
B C D E
1.0 1.5 10.0 a
2.0 3.0 12.0 b
3.0 7.0 8.0 a
EDIT: For your 2nd question...
I can't guarantee that this will be efficient but it gets what you want done:
# assumes new_df was built with 'A': 'unique' included in the agg, as in the question's edit
df_1 = new_df['A'].apply(pd.Series).unstack().reset_index(level = 0, drop = True)
df_1.name = 'A'
df_2 = new_df[[col for col in df.columns if col != 'A']]
df_2.name = 'others'
pd.merge(df_1, df_2, left_index = True, right_index = True).reset_index(drop = True)
>>> output
A B C D E
0 1.0 1.0 1.5 10.0 a
0 2.0 1.0 1.5 10.0 a
0 3.0 1.0 1.5 10.0 a
1 1.0 2.0 3.0 12.0 b
1 2.0 2.0 3.0 12.0 b
1 3.0 2.0 3.0 12.0 b
2 1.0 3.0 7.0 8.0 a
2 2.0 3.0 7.0 8.0 a
2 3.0 3.0 7.0 8.0 a
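For the second output it may be simpler to aggregate once and then merge the group results back onto the original rows, which keeps A and avoids the apply(pd.Series)/unstack step. A sketch using named aggregation (it assumes df is the original, ungrouped frame from the top of the question):
agg = (
    df.groupby('B', as_index=False)
      .agg(C=('C', 'mean'), D=('D', 'sum'), E=('E', lambda x: x.mode().iat[0]))
)
out = df[['A', 'B']].merge(agg, on='B', how='left')
print(out)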

How to unpivot pandas dataframe to create datetime column

I have a pivoted pandas dataframe that looks like the one below.
I need to unpivot it into a dataframe indexed by datetime, and the variables (columns) reduced to only one of each.
I tried using melt but I am struggling to reshape it because of the hour row.
What would be the best option to reshape such a dataframe?
The dataframe I have
+----------+------+------+------+------+------+
| nan | var1 | var1 | var2 | var2 | var3 |
+----------+------+------+------+------+------+
| Hour | 2 | 3 | 0 | 2 | 0 |
| 1/1/2019 | 0.8 | 0.4 | 0.6 | 0.9 | 0.7 |
| 1/2/2019 | 0.2 | 0.2 | 0.7 | 0.3 | 0.1 |
| 1/3/2019 | 0.1 | 0.0 | 0.3 | 0.4 | 1.0 |
+----------+------+------+------+------+------+
The dataframe I need to get
+---------------+------+------+------+
| Datetime | var1 | var2 | var3 |
+---------------+------+------+------+
| 1/1/2019 0:00 | NaN | 0.6 | 0.7 |
| 1/1/2019 1:00 | NaN | NaN | NaN |
| 1/1/2019 2:00 | 0.8 | 0.9 | NaN |
| 1/1/2019 3:00 | 0.4 | NaN | NaN |
| 1/2/2019 0:00 | NaN | 0.7 | 0.1 |
| 1/2/2019 1:00 | NaN | NaN | NaN |
| 1/2/2019 2:00 | 0.2 | 0.3 | NaN |
| 1/2/2019 3:00 | 0.2 | NaN | NaN |
| 1/3/2019 0:00 | NaN | 0.3 | 1.0 |
| 1/3/2019 1:00 | NaN | NaN | NaN |
| 1/3/2019 2:00 | 0.1 | 0.4 | NaN |
| 1/3/2019 3:00 | 0.0 | NaN | NaN |
+---------------+------+------+------+
Here's a rough answer with unidiomatic pandas, but it gets the job done given the data you presented, in the format you presented it. If you have a massive amount of data, I highly recommend finding a more optimized approach.
dff = df.copy()
mn, mx = df.loc['Hour'].agg([min, max]).astype(int)
df = df.loc[df.index.repeat(mx-mn+1)]
df = df.loc[df.index != 'Hour']
df = df.assign(time=list(range(mn, mx+1))*(mx-mn))
df = df.set_index('time', append=True).iloc[:, :0]
for i, v in enumerate(dff.columns):
    d = dff.iloc[:, i].to_frame()
    hour = d.at['Hour', v]
    for idx, row in d.iloc[1:].iterrows():
        df.loc[(idx, hour), v] = row[v]
df = df.reset_index().rename(columns={0: 'date'})
df['datetime'] = df[['date', 'time']].apply(lambda x: f"{x['date']} {x['time']}:00", axis=1)
df = df.drop(columns=['date', 'time']).set_index('datetime').reset_index()
print(df)
print(df)
datetime v1 v2 v3
0 1/1/2019 0:00 NaN 0.6 0.7
1 1/1/2019 1:00 NaN NaN NaN
2 1/1/2019 2:00 0.8 0.9 NaN
3 1/1/2019 3:00 0.4 NaN NaN
4 1/2/2019 0:00 NaN 0.7 0.1
5 1/2/2019 1:00 NaN NaN NaN
6 1/2/2019 2:00 0.2 0.3 NaN
7 1/2/2019 3:00 0.2 NaN NaN
8 1/3/2019 0:00 NaN 0.3 1.0
9 1/3/2019 1:00 NaN NaN NaN
10 1/3/2019 2:00 0.1 0.4 NaN
11 1/3/2019 3:00 0.0 NaN NaN
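A more idiomatic route is to move the Hour row into a second column level and stack it. A sketch that rebuilds the frame shown in the question (how your real dataframe is constructed is an assumption here):
import pandas as pd

df = pd.DataFrame(
    [[2, 3, 0, 2, 0],
     [0.8, 0.4, 0.6, 0.9, 0.7],
     [0.2, 0.2, 0.7, 0.3, 0.1],
     [0.1, 0.0, 0.3, 0.4, 1.0]],
    index=['Hour', '1/1/2019', '1/2/2019', '1/3/2019'],
    columns=['var1', 'var1', 'var2', 'var2', 'var3'])

hours = df.loc['Hour'].astype(int)
data = df.drop('Hour')
data.index.name = 'date'
# pair each column with its hour, then stack the hour level into the rows
data.columns = pd.MultiIndex.from_arrays([df.columns, hours], names=[None, 'hour'])
long = data.stack('hour').reset_index()
long['Datetime'] = pd.to_datetime(long['date']) + pd.to_timedelta(long['hour'], unit='h')
long = long.drop(columns=['date', 'hour']).set_index('Datetime')

# reindex to a full hour grid per day so the missing hours show up as NaN
full = pd.DatetimeIndex([pd.to_datetime(d) + pd.Timedelta(hours=h)
                         for d in data.index for h in range(hours.min(), hours.max() + 1)])
print(long.reindex(full).rename_axis('Datetime').reset_index())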

Rate of change over last n hours using pandas timeseries

I would like to add columns to a time-indexed pandas DataFrame which contain the rate of change over the last n hours for each of the existing columns. I have accomplished this with the following code, however, it is too slow for my needs (probably due to looping over every index of each column?).
Is there a (computationally) faster way to do this?
roc_hours = 12
tol = 1e-10
for c in ts.columns:
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        X = (df.index.values - df.index.values.min()).astype('Int64')*2.77778e-13  # hours back
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X, Y, 1)
            ts[c_roc][i] = fit[0]
        else:
            ts[c_roc][i] = 0
Edit: the input dataframe ts is irregularly sampled and can contain NaNs. The first few lines of input ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT | a | b | c | d | e | f |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 | | | | | | |
| 2011-09-04 21:00:00 | | | | | | |
| 2011-09-04 22:00:00 | | | | | | |
| 2011-09-04 23:00:00 | | | | | | |
| 2011-09-05 02:00:00 | 93.0 | 97.0 | 20.0 | 209.0 | 85.0 | 98.0 |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0 |
| 2011-09-05 04:00:00 | 67.5 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 05:00:00 | 72.0 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 07:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 08:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 09:00:00 | 78.5 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 10:00:00 | 73.0 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 11:00:00 | 77.0 | 98.0 | 18.0 | 175.0 | 87.0 | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is in the slicing step: df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out observations time-stamped between now-roc_hours and now. It's very handy syntax, but is taking up the bulk of the compute time.
Works on a dataset of mine, haven't checked on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')

# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10
ax = water.value.plot()

# note: pd.rolling_mean / pd.rolling_apply are from old pandas (< 0.18) and have been
# removed; on current versions use water.value.rolling(WINDOW).mean() / .apply(...)
roll = pd.rolling_mean(water.value, WINDOW)
roll.plot(ax=ax)

def lintrend(df):
    df = df.tolist()
    m, b = polyfit(range(len(df)), df, 1)
    return m

linny = pd.rolling_apply(water.value, WINDOW, lintrend)
linny.plot(ax=ax)
Casting the window back to a list, after rolling_apply has already cast it to a numpy.ndarray, seems inelegant. Suggestions?
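For reference, current pandas can compute the time-window slope directly with a rolling window on the DatetimeIndex; with raw=False each window is passed to the function as a Series that still carries its timestamps, so the irregular sampling is handled naturally. A sketch of that idea (unlike the original loop, it fills every row, including ones where the column value itself is NaN):
import numpy as np
import pandas as pd

ROC_HOURS = 12
TOL = 1e-10

def window_slope(s):
    # s is one rolling window, delivered as a Series with its DatetimeIndex
    s = s.dropna()
    if len(s) < 2 or s.std() < TOL:
        return 0.0
    x = (s.index - s.index.min()) / pd.Timedelta(hours=1)   # hours since the window start
    return np.polyfit(x, s.values, 1)[0]

for c in ts.columns:
    ts[c + ' +++ RoC ' + str(ROC_HOURS) + 'h'] = (
        ts[c].rolling(str(ROC_HOURS) + 'h', min_periods=1).apply(window_slope, raw=False)
    )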
