Time between status reports with pyspark - python

I have a dataset that contains the status changes of one of our company's systems.
I am only able to use PySpark to process this data.
Each row in the dataset is a status change; there is a status column and an update timestamp.
Status   timestamp
red      2023-01-02T01:05:32.113Z
yellow   2023-01-02T01:15:47.329Z
red      2023-01-02T01:25:11.257Z
green    2023-01-02T01:33:12.187Z
red      2023-01-05T15:10:12.854Z
green    2023-01-05T15:26:24.131Z
For the sake of what I need to do, we are going to say a degradation runs from the first time the system reports anything other than green to the time it reports green again. What I am trying to do is create a table of degradations with the duration of each one, e.g.:
degradation    duration  start                     end
degradation 1  27.65     2023-01-02T01:05:32.113Z  2023-01-02T01:33:12.187Z
degradation 2  16.2      2023-01-05T15:10:12.854Z  2023-01-05T15:26:24.131Z
I can get PySpark to return the duration between two timestamps without an issue; what I am struggling with is getting PySpark to take the timestamp from the first red through to the following green and log that as a row in a new dataframe.
Any help is appreciated. Thank you.

I have one solution. I don't know if this is the easiest and fastest way to calculate what you want. For me the problem is that the data are only valid when they are not partitioned and are in the correct order, which forces me to move all of them to one partition, at least in the first stage.
What I am doing here:
I am using one big window with the lag function and a running sum to calculate new partitions. In this case a new partition starts after every record with status = 'green'.
Then I am using a group by to find the first/last event within each partition and calculate the difference.
import datetime

import pyspark.sql.functions as F
from pyspark.sql import Window

# Note: the last positional argument of datetime.datetime is microseconds,
# so 113 below means 113 microseconds, which is why the output shows .000113.
data = [
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 5, 32, 113)},
    {"Status": "yellow", "timestamp": datetime.datetime(2023, 1, 2, 1, 15, 47, 329)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 25, 11, 257)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 2, 1, 33, 12, 187)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 5, 15, 10, 12, 854)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 5, 15, 26, 24, 131)},
]
df = spark.createDataFrame(data)

# One global window ordered by time (everything ends up in a single partition).
windowSpec = Window.partitionBy().orderBy("timestamp")

df.withColumn(
    "partition_number",
    # Running count of rows whose previous status was 'green': every green
    # closes a degradation, so the row after it starts a new group.
    F.sum(
        (F.coalesce(F.lag("Status").over(windowSpec), F.lit(0)) == F.lit("green")).cast("int")
    ).over(windowSpec),
).groupBy("partition_number").agg(
    F.first("timestamp", ignorenulls=True).alias("start"),
    F.last("timestamp", ignorenulls=True).alias("end"),
).withColumn(
    "duration",
    F.round((F.col("end").cast("long") - F.col("start").cast("long")) / 60, 2),
).withColumn(
    "degradation", F.concat(F.lit("Degradation"), F.col("partition_number"))
).select(
    "degradation", "duration", "start", "end"
).show(truncate=False)
The output is:
+------------+--------+--------------------------+--------------------------+
|degradation |duration|start |end |
+------------+--------+--------------------------+--------------------------+
|Degradation0|27.67 |2023-01-02 01:05:32.000113|2023-01-02 01:33:12.000187|
|Degradation1|16.2 |2023-01-05 15:10:12.000854|2023-01-05 15:26:24.000131|
+------------+--------+--------------------------+--------------------------+
If needed, you can change the precision of the duration, or adjust the code to start counting degradations from 1 instead of 0.
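For example, one minimal tweak (a sketch, assuming the same partition_number column as above) that builds the label from partition_number + 1 so the numbering starts at 1:

import pyspark.sql.functions as F

# Hypothetical replacement for the "degradation" column expression above:
# shift the partition number by one before turning it into the label.
degradation_from_one = F.concat(
    F.lit("Degradation"), (F.col("partition_number") + 1).cast("string")
)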

You can first flag whether the status is green, then use lag to shift that flag down one row. Summing the flag over a window then separates each degradation into its own group.
from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy().orderBy("timestamp")

df.withColumn('status', f.lag((f.col('status') == f.lit('green')).cast('int'), 1, 0).over(w)) \
    .withColumn('status', f.sum('status').over(w) + 1) \
    .groupBy('status') \
    .agg(
        f.first('timestamp').alias('start'),
        f.last('timestamp').alias('end')
    ) \
    .select(
        f.concat(f.lit('Degradation'), f.col('status')).alias('degradation'),
        f.round((f.col('end') - f.col('start')).cast('long') / 60, 2).alias('duration'),
        'start',
        'end'
    ) \
    .show(truncate=False)
+------------+--------+-----------------------+-----------------------+
|degradation |duration|start |end |
+------------+--------+-----------------------+-----------------------+
|Degradation1|27.67 |2023-01-02 01:05:32.113|2023-01-02 01:33:12.187|
|Degradation2|16.18 |2023-01-05 15:10:12.854|2023-01-05 15:26:24.131|
+------------+--------+-----------------------+-----------------------+

Related

Most efficient way to append list in loop

I am new to Python.
I have the following code that requests data from an API:
histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
print(histdata)
The data returned is the following price information without the contract symbol:
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
The first thing I would like to know is whether this output is a list, a list of lists, a dictionary, a dataframe, or something else in Python.
I would like to add a "column" with the contract symbol at the start of each price row.
The data should look like this:
Symbol  time                       tickAttribLast                                     price  size  exchange  specialConditions
XYZ     2021-03-03 14:30:00+00:00  TickAttribLast(pastLimit=False, unreported=True)   0.95   1     ISE       f
XYZ     2021-03-03 14:30:00+00:00  TickAttribLast(pastLimit=False, unreported=True)   0.94   1     ISE       f
Moreover, I would like to loop through multiple contracts, get the price information, add the contract symbol, and merge each contract's prices with the previous contract's price information.
Here is my failed attempt. Could you guide me on the most efficient way to add the contract symbol to each row in histdata and then append this information to a single list or dataframe?
Thanks in advance for your help!
i = 0
# The variable contracts is a list of contracts, here I loop the first 2 items
for t in contracts[0:1]:
    print("processing contract: ", i)
    # histdata gets the price information of the contract
    # (multiple price rows per contract as shown above)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
    # failed attempt to add contracts[i].localSymbol at the start of each row
    histdata.insert(0, contracts[i].localSymbol)
    # failed attempt to append this table with the new contract information
    histdata.append(histdata)
    i = i + 1
Edit # 1 :
I will try and break down what I am trying to accomplish.
Here is the result of histdata :
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
What code is needed to add the attribute "Symbol" with the value "XYZ" to each HistoricalTickLast entry, like this:
[HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
EDIT #2
I got a little confused with the map function, so I went ahead and transformed my HistoricalTickLast instances into dataframes. Now, in addition to adding the attribute 'Symbol' to my first dataframe, I also merge another dataframe that contains the BID/ASK on the key 'time'. I am sure this must be the least efficient way to do it.
Would anyone like to help me write more efficient code?
histdf = pd.DataFrame()
print("CONTRACTS LENGTH :", len(contracts))
for t in contracts:
    print("processing contract: ", i)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
    histbidask = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'BID_ASK', 1, True, [])
    tempdf = pd.DataFrame(histdata)
    tempdf2 = pd.DataFrame(histbidask)
    try:
        tempdf3 = pd.merge(tempdf, tempdf2, how='inner', on='time')
        tempdf3.insert(0, 'localSymbol', contracts[i].localSymbol)
        histdf = pd.concat([histdf, tempdf3])
    except:
        myerror["ErrorContracts"].append(format(contracts[i].localSymbol))
    i = i + 1
Use type() to verify that your variable is a list (indicated by the []).
Each entry is an instance of HistoricalTickLast. When you say you want to add a "column", that either means adding an attribute to the class, or, more likely, that you want to process this as plain old data (POD), for instance as a list of lists or a list of dicts.
Are you sure histdata is a list?
If it is not a list but an iterator, you could use list() to convert it to a list.
Also, to add an element at the beginning of each interior list you could use map.
I think this code example could help you:
all_histdata = []
for contract in contracts:
    histdata = list(ib.reqHistoricalTicks(
        contract, start, "", 1000, 'TRADES', 1, True, []))
    new_histdata = list(
        # each tick is assumed to be iterable (e.g. a namedtuple), so convert it
        # to a list before prepending the symbol
        map(lambda e: [contract.localSymbol] + list(e), histdata)
    )
    all_histdata.append(new_histdata)
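If the end goal is a single dataframe, a common pattern is to collect one frame per contract in a list and concatenate once at the end, instead of growing a dataframe inside the loop. A sketch, reusing the ib, contracts and start objects from the question (assumed to already exist):

import pandas as pd

frames = []
for contract in contracts:
    ticks = ib.reqHistoricalTicks(contract, start, "", 1000, 'TRADES', 1, True, [])
    tempdf = pd.DataFrame(ticks)
    # add the symbol as the first column of this contract's rows
    tempdf.insert(0, 'Symbol', contract.localSymbol)
    frames.append(tempdf)

# a single concat at the end is cheaper than concatenating inside the loop
histdf = pd.concat(frames, ignore_index=True)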

How to convert pandas dataframe into multi level JSON with headers?

I have a pandas dataframe which I would like to convert to JSON for my source system to utilize; the system requires a very specific JSON format.
I can't seem to get to the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a csv/pd.DataFrame to nested JSON?
Is there any Python package built specifically for this?
Input Dataframe:
import pandas as pd

# Create input dataframe
data = {
    'col6': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col7': [1, 1, 2, 1, 2, 2],
    'col8': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col10': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col14': [1, 1, 1, 1, 1, 2],
    'col15': [1, 2, 1, 1, 1, 1],
    'col16': [9, 10, 26, 9, 12, 4],
    'col18': [1, 1, 2, 1, 2, 3],
    'col1': ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx'],
    'col2': [2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13],
    'col3': ['xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012'],
    'col4': ['yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy'],
    'col5': [0, 0, 0, 0, 0, 0],
    'col9': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col11': [0, 0, 0, 0, 0, 0],
    'col12': [0, 0, 0, 0, 0, 0],
    'col13': [0, 0, 0, 0, 0, 0],
    'col17': [51, 63, 47, 59, 53, 56]
}
df = pd.DataFrame(data)
Expected Output:
{
  "header1": {
    "col1": "xxxx",
    "col2": "20201107023012",
    "col3": "xxxx20201107023012",
    "col4": "yyyy",
    "col5": "0"
  },
  "header2": {
    "header3": [
      {
        "col6": "A",
        "col7": 1,
        "header4": [
          {
            "col8": "A",
            "col9": 1,
            "col10": "A",
            "col11": 0,
            "col12": 0,
            "col13": 0,
            "header5": [
              {
                "col14": "1",
                "col15": 1,
                "col16": 1,
                "col17": 51,
                "col18": 1
              },
              {
                "col14": "1",
                "col15": 1,
                "col16": 2,
                "col17": 63,
                "col18": 2
              }
            ]
          },
          {
            "col8": "A",
            "col9": 1,
            "col10": "A",
            "col11": 0,
            "col12": 0,
            "col13": 0,
            "header5": [
              {
                "col14": "1",
                "col15": 1,
                "col16": 1,
                "col17": 51,
                "col18": 1
              },
              {
                "col14": "1",
                "col15": 1,
                "col16": 2,
                "col17": 63,
                "col18": 2
              }
            ]
          }
        ]
      }
    ]
  }
}
Maybe this will get you started. I'm not aware of a current Python module that will do exactly what you want, but this is the basis of how I'd start it, making assumptions based on what you've provided.
As each successive nesting level is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same. Also, you'll have to create your key/value pairs correctly; this just creates the data to support what you are building.
# assume header1 is constant, so take the first row and use .T to transpose to create dictionaries
header1 = dict(df.iloc[0].T[['col1', 'col2', 'col3', 'col4', 'col5']])
print('header1', header1)

# for header3, it looks like you need the unique combinations, so create a dataframe
# and then iterate through it to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i].T[['col6', 'col7']]))
print('header3', header3_dicts)

# iterate over header3 to get header4
for i in range(dfh3.shape[0]):
    # print(dfh3.iat[i, 0], dfh3.iat[i, 1])
    dfh4 = df.loc[(df['col6'] == dfh3.iat[i, 0]) & (df['col7'] == dfh3.iat[i, 1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        header4_dicts.append(dict(dfh4.iloc[j].T[['col8', 'col9', 'col10', 'col11', 'col12', 'col13']]))
    print('header4', header4_dicts)

# the next level repeats a similar pattern to the above
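Since groupby is mentioned above as a possibly better option, here is a rough sketch of that route using the df from the question; the header names are placeholders taken from the expected output:

import json

# header1 is constant, so take it from the first row as above
header1 = df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']].to_dict()

header3 = []
for (c6, c7), g67 in df.groupby(['col6', 'col7']):
    header4 = []
    for (c8, c9), g89 in g67.groupby(['col8', 'col9']):
        entry = g89.iloc[0][['col8', 'col9', 'col10', 'col11', 'col12', 'col13']].to_dict()
        # the innermost level becomes a list of records
        entry['header5'] = g89[['col14', 'col15', 'col16', 'col17', 'col18']].to_dict('records')
        header4.append(entry)
    header3.append({'col6': c6, 'col7': c7, 'header4': header4})

result = {'header1': header1, 'header2': {'header3': header3}}
# default=str handles numpy scalar types that json cannot serialize directly
print(json.dumps(result, indent=2, default=str))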

Create values in column after groupby

I have a data frame that is obtained by grouping an initial data frame by the 'hour' and 'site' columns. So the current data frame has details of 'value' grouped per 'hour' and 'site'. What I want is to fill the hours which have no 'value' with zero. The 'hour' range is 0-23. How can I do this?
Left is input, right is expected output
You can try this:
import numpy as np
import pandas as pd

raw_df = pd.DataFrame(
    {
        "Hour": [1, 2, 4, 12, 0, 2, 7, 13],
        "Site": ["x", "x", "x", "x", "y", "y", "y", "y"],
        "Value": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
full_hour = pd.DataFrame(
    {
        "Hour": np.concatenate(
            [range(24) for site_name in raw_df["Site"].unique()]
        ),
        "Site": np.concatenate(
            [[site_name] * 24 for site_name in raw_df["Site"].unique()]
        ),
    }
)
result = full_hour.merge(raw_df, on=["Hour", "Site"], how="left").fillna(0)
Then you can get what you want. But I suggest you paste your test data into your question instead of an image; others are not obliged to recreate your data, and you should think about how to make your question comfortable to answer.
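If you would rather skip building the helper dataframe, a reindex over the full Hour x Site grid is a sketch of the same idea (continuing from the raw_df above; it assumes at most one row per Hour/Site pair):

full_index = pd.MultiIndex.from_product(
    [range(24), raw_df["Site"].unique()], names=["Hour", "Site"]
)
result = (
    raw_df.set_index(["Hour", "Site"])
    .reindex(full_index, fill_value=0)  # missing Hour/Site rows get Value 0
    .reset_index()
)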
So if you want to change the values in the hours column to zero where the value is not in the range 0-23, here is what to do. I didn't actually get your question clearly, so I assume this must be what you want. I have taken a dummy example as you have not provided your own data.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011',
                            '13/2/2011', '14/2/2011'],
                   'Product': ['Umbrella', 'Matress', 'Badminton', 'Shuttle', 'ewf'],
                   'Last_Price': [1200, 1500, 1600, 352, 'ee'],
                   'Updated_Price': [12, 24, 0, 1, np.nan],
                   'Discount': [10, 10, 10, 10, 11]})
df['Updated_Price'] = df['Updated_Price'].fillna(0)
df.loc[df['Updated_Price'] > 23, 'Updated_Price'] = 0
This replaces all NaN values with 0 and also replaces values greater than 23 with 0.

Python 3.x: Perform analysis on dictionary of dataframes in loops

I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. Now what I'm trying to do is perform calculations on these dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what I have done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]

for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
Here is the dictionary of dataframes, named "series", produced as the output of the loop.
What I expected from the last loop was a dictionary of dataframes with a new column, "Corr", holding the correlation between "Temp" and "Consumption" for each dataframe; instead it gives a single dataframe for the last home in the iteration, i.e. 23.
Put simply, I want to add a sixth column named "Corr" to every dataframe in the dictionary, containing the correlation between "Temp" and "Consumption". Can you help me with the above? I'm somehow missing the use of the keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks #jezrael for the correction)
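If you only want the single Temp/Consumption correlation per group, rather than the full 2x2 matrix that corr() returns, one sketch (same column names as the question) is:

# one correlation value per (Home, Season) group
corr_per_group = (
    df.groupby(['Home', 'Season'])
    .apply(lambda g: g['Temp'].corr(g['Consumption']))
    .rename('Corr')
)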
One of the answers on How to find the correlation between a group of values in a pandas dataframe column helped, avoiding all unnecessary loops. Thanks #jezrael and #JoshFriedlander for suggesting the groupby method. Upvote (y).
Posting the solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]
         .groupby(["Home", "Season"])
         .corrwith(df["Consumption"])
         .rename(columns={"Temp": "Corr"})
         .reset_index())
df = pd.merge(df, corrs, how="left", on=["Home", "Season"])

Slicing MultiIndex data with Pandas

I have imported a csv as a multi-indexed Dataframe. Here's a mockup of the data:
df = pd.read_csv("coursedata2.csv", index_col=[0,2])
print (df)
                                           COURSE
ID    Course List
12345 Interior Environments           DESN10000
      Rendering & Present Skills      DESN20065
      Lighting                        DESN20025
22345 Drawing Techniques              DESN10016
      Colour Theory                   DESN14049
      Finishes & Sustainable Issues   DESN12758
      Lighting                        DESN20025
32345 Window Treatments&Soft Furnish  DESN27370
42345 Introduction to CADD            INFO16859
      Principles of Drafting          DESN10065
      Drawing Techniques              DESN10016
      The Fundamentals of Design      DESN15436
      Colour Theory                   DESN14049
      Interior Environments           DESN10000
      Drafting                        DESN10123
      Textiles and Applications       DESN10199
      Finishes & Sustainable Issues   DESN12758

[17 rows x 1 columns]
I can easily slice it by label using .xs -- eg:
selected = df.xs(12345, level='ID')
print(selected)

                                COURSE
Course List
Interior Environments        DESN10000
Rendering & Present Skills   DESN20065
Lighting                     DESN20025

[3 rows x 1 columns]
But what I want to do is step through the dataframe and perform an operation on each block of courses, by ID. The ID values in the real data are fairly random integers, sorted in ascending order.
df.index shows:
df.index
MultiIndex(levels=[[12345, 22345, 32345, 42345], [u'Colour Theory', u'Colour Theory ', u'Drafting', u'Drawing Techniques', u'Finishes & Sustainable Issues', u'Interior Environments', u'Introduction to CADD', u'Lighting', u'Principles of Drafting', u'Rendering & Present Skills', u'Textiles and Applications', u'The Fundamentals of Design', u'Window Treatments&Soft Furnish']],
labels=[[0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5, 9, 7, 3, 1, 4, 7, 12, 6, 8, 3, 11, 0, 5, 2, 10, 4]],
names=[u'ID', u'Course List'])
It seems to me that I should be able to use the first index level's labels to step through the DataFrame, i.e. get all the courses for label 0, then 1, then 2, then 3, ... but it looks like .xs will not slice by label position.
Am I missing something?
There may be more efficient ways to do this, depending on what you're trying to do with the data. However, two approaches immediately come to mind:
for id_label in df.index.levels[0]:
    some_func(df.xs(id_label, level='ID'))

and

for id_label in df.index.levels[0]:
    df.xs(id_label, level='ID').apply(some_func, axis=1)

depending on whether you want to operate on the group as a whole or on each row within it.
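A groupby over the index level is another common way to express the same iteration (a sketch using the same df; some_func is a placeholder):

# each iteration yields one ID value and the sub-DataFrame of its courses
for id_label, group in df.groupby(level='ID'):
    some_func(group)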
