I have a dataset that contains the status changes of one of our company's systems. I am only able to use PySpark to process this data.
Each row in the dataset is a status change: there is a status column and an update timestamp.
| Status | timestamp                |
|--------|--------------------------|
| red    | 2023-01-02T01:05:32.113Z |
| yellow | 2023-01-02T01:15:47.329Z |
| red    | 2023-01-02T01:25:11.257Z |
| green  | 2023-01-02T01:33:12.187Z |
| red    | 2023-01-05T15:10:12.854Z |
| green  | 2023-01-05T15:26:24.131Z |
For the sake of what I need to do, we are going to say a degradation runs from the first time the system reports anything not green to the time it reports green again. What I am trying to do is create a table of degradations with the duration of each one, e.g.:
| degradation   | duration | start                    | end                      |
|---------------|----------|--------------------------|--------------------------|
| degradation 1 | 27.65    | 2023-01-02T01:05:32.113Z | 2023-01-02T01:33:12.187Z |
| degradation 2 | 16.2     | 2023-01-05T15:10:12.854Z | 2023-01-05T15:26:24.131Z |
I can get PySpark to return durations between two timestamps without an issue; what I am struggling with is getting PySpark to use the timestamp from the first red through the following green, and then log that as a row in a new DataFrame.
Any help is appreciated. Thank you.
I have one solution. I don't know if this is the easiest or fastest way to calculate what you want. For me the problem is that the data are only valid when they are unpartitioned and in the correct order, which forces me to move all of them to one partition, at least in the first stage.
What I am doing here:
I am using one big window with the lag function and a running sum to derive new partitions. In this case, a new partition starts after each occurrence of a record with status = 'green'.
Then I am using a group by to find the first/last event within each partition and calculate the difference.
import datetime

import pyspark.sql.functions as F
from pyspark.sql import Window

# Note: datetime takes microseconds, so .113 s must be written as 113000
data = [
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 5, 32, 113000)},
    {"Status": "yellow", "timestamp": datetime.datetime(2023, 1, 2, 1, 15, 47, 329000)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 25, 11, 257000)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 2, 1, 33, 12, 187000)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 5, 15, 10, 12, 854000)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 5, 15, 26, 24, 131000)},
]
df = spark.createDataFrame(data)
# Empty partitionBy(): all rows end up in one partition so the running
# sum sees them in global timestamp order
windowSpec = Window.partitionBy().orderBy("timestamp")

df.withColumn(
    "partition_number",
    # a degradation starts right after a green row, so a running sum of
    # "previous row was green" yields one group id per degradation
    F.sum(
        (F.coalesce(F.lag("Status").over(windowSpec), F.lit("")) == F.lit("green")).cast(
            "int"
        )
    ).over(windowSpec),
).groupBy("partition_number").agg(
    F.first("timestamp", ignorenulls=True).alias("start"),
    F.last("timestamp", ignorenulls=True).alias("end"),
).withColumn(
    "duration",
    # whole-second difference converted to minutes
    F.round((F.col("end").cast("long") - F.col("start").cast("long")) / 60, 2),
).withColumn(
    "degradation", F.concat(F.lit("Degradation"), F.col("partition_number"))
).select(
    "degradation", "duration", "start", "end"
).show(
    truncate=False
)
Output:
+------------+--------+-----------------------+-----------------------+
|degradation |duration|start                  |end                    |
+------------+--------+-----------------------+-----------------------+
|Degradation0|27.67   |2023-01-02 01:05:32.113|2023-01-02 01:33:12.187|
|Degradation1|16.2    |2023-01-05 15:10:12.854|2023-01-05 15:26:24.131|
+------------+--------+-----------------------+-----------------------+
If needed, you can change the precision of duration, or adjust this code to start counting degradations from 1 instead of 0.
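For example, here is a minimal sketch of both tweaks, assuming the chained transformations above are assigned to a variable such as result while partition_number is still available (i.e. before the final select):

# result is assumed to hold the aggregated frame with partition_number, start, end
result = result.withColumn(
    "duration",
    # three decimal places instead of two
    F.round((F.col("end").cast("long") - F.col("start").cast("long")) / 60, 3),
).withColumn(
    # shift the group id so numbering starts at 1
    "degradation", F.concat(F.lit("Degradation"), F.col("partition_number") + 1),
)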
You can mark the rows whose previous status was green: flag the green rows first, then use lag to shift that flag down one row. Then you can separate the degradations by summing that flag over a window.
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy().orderBy("timestamp")

# 1 marks a row that directly follows a 'green' row; the running sum of
# these flags numbers the degradations, starting from 1
df.withColumn('status', f.lag((f.col('status') == f.lit('green')).cast('int'), 1, 0).over(w)) \
  .withColumn('status', f.sum('status').over(w) + 1) \
  .groupBy('status') \
  .agg(
      f.first('timestamp').alias('start'),
      f.last('timestamp').alias('end')
  ) \
  .select(
      f.concat(f.lit('Degradation'), f.col('status')).alias('degradation'),
      f.round((f.col('end') - f.col('start')).cast('long') / 60, 2).alias('duration'),
      'start',
      'end'
  ) \
  .show(truncate=False)
+------------+--------+-----------------------+-----------------------+
|degradation |duration|start |end |
+------------+--------+-----------------------+-----------------------+
|Degradation1|27.67 |2023-01-02 01:05:32.113|2023-01-02 01:33:12.187|
|Degradation2|16.18 |2023-01-05 15:10:12.854|2023-01-05 15:26:24.131|
+------------+--------+-----------------------+-----------------------+
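The 16.18 here versus the 16.2 above comes from where the truncation happens: this version subtracts the timestamps first and casts the resulting interval to whole seconds (971 s, giving 16.18 min), while the first answer casts each timestamp to whole seconds before subtracting (972 s, giving 16.2 min).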
I've got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains names of NBA players, their universities, etc. Now, when the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have a rough code laid out, but I am unsure how to implement the where method. Any ideas?
import csv
import pandas as pd
import io
class MySelectQuery:
    def __init__(self, table, columns, where):
        self.table = table
        self.columns = columns
        self.where = where

    def __str__(self):
        return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"
csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is being transformed into a pandas DataFrame, which should then find the values based on the input; however, I get the error "name 'where' not defined". I believe everything until the df = ... part is correct; now I need help implementing the where method. (I've seen one other solution on SO but wasn't able to understand or figure that out.)
# importing pandas
import pandas as pd
record = {
    'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya'],
    'Age': [21, 19, 20, 18, 17, 21],
    'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
    'Percentage': [88, 92, 95, 70, 65, 78]}

# create a dataframe
dataframe = pd.DataFrame(record, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)

options = ['Math', 'Science']

# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
                    dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:

Result dataframe :
      Name  Age   Stream  Percentage
0    Ankit   21     Math          88
5  Shaurya   21  Science          78
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
You need the double equals there, and pandas query syntax expects Python's lowercase and. So it should be:
where = "name == 'Alaa Abdelnaby' and year_start == 1991"
I am new to Python.
I have the following code that requests data from an API:
histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
print(histdata)
The data returned is the following price information without the contract symbol:
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
First thing I would like to know is whether this value is a list, a list of lists, a dictionary, a DataFrame, or something else in Python?
I would like to add a "column" with the contract symbol at the start of each price row.
The data should look like this:
| Symbol | time | tickAttribLast | price | size | exchange | specialConditions |
|--------|------|----------------|-------|------|----------|-------------------|
| XYZ | 2021-03-03 14:30:00+00:00 | TickAttribLast(pastLimit=False, unreported=True) | 0.95 | 1 | ISE | f |
| XYZ | 2021-03-03 14:30:00+00:00 | TickAttribLast(pastLimit=False, unreported=True) | 0.94 | 1 | ISE | f |
Moreover, I would like to loop through multiple contracts, get the price information, add the contract symbol, and merge each contract's prices with the previous contracts' price information.
Here is my failed attempt. Could you guide me on the most efficient way to add the contract symbol to each row in histdata and then append this information to a single list or DataFrame?
Thanks in advance for your help!
i = 0
# The variable contracts is a list of contracts, here I loop the first 2 items
for t in contracts[0:1]:
    print("processing contract: ", i)
    # histdata gets the price information of the contract
    # (multiple price rows per contract as shown above)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
    # failed attempt to add contracts[i].localSymbol at the start of each row
    histdata.insert(0, contracts[i].localSymbol)
    # failed attempt to append this table with the new contract information
    histdata.append(histdata)
    i = i + 1
Edit #1:
I will try to break down what I am trying to accomplish.
Here is the result of histdata:
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
What code is needed to add the attribute "Symbol" with the value "XYZ" to each HistoricalTickLast entry, like this:
[HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
Edit #2:
I got a little confused with the map function, so I went ahead and transformed my HistoricalTickLast instances into a DataFrame. Now, in addition to adding the 'Symbol' attribute to my first DataFrame, I also merge another DataFrame that contains the BID/ASK on the key 'time'. I am sure this must be the least efficient way to do it.
Does anyone want to help me write more efficient code?
histdf = pd.DataFrame()
print("CONTRACTS LENGTH :", len(contracts))
for t in contracts:
    print("processing contract: ", i)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1,
                                     True, [])
    histbidask = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'BID_ASK', 1,
                                       True, [])
    tempdf = pd.DataFrame(histdata)
    tempdf2 = pd.DataFrame(histbidask)
    try:
        tempdf3 = pd.merge(tempdf, tempdf2, how='inner', on='time')
        tempdf3.insert(0, 'localSymbol', contracts[i].localSymbol)
        histdf = pd.concat([histdf, tempdf3])
    except:
        myerror["ErrorContracts"].append(format(contracts[i].localSymbol))
    i = i + 1
Use type() to verify that your variable is a list (indicated by the []).
Each entry is an instance of HistoricalTickLast. When you say you want to add a "column", that either means adding an attribute to the class, or, more likely, that you want to process this as plain old data (POD), for instance as a list of lists or a list of dicts.
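If you are on ib_insync, its util.df helper turns such a list into a DataFrame directly; here is a minimal sketch, where 'XYZ' stands in for the contract symbol:

from ib_insync import util

# histdata as returned by ib.reqHistoricalTicks(...)
tick_df = util.df(histdata)
# prepend the symbol as the first column ('XYZ' is a placeholder)
tick_df.insert(0, 'Symbol', 'XYZ')
print(tick_df.head())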
Are you sure histdata is a list?
If it is not a list but an iterator, you could use list() to convert it to a list.
Also, to add an element at the beginning of each interior list you could use map. I think this code example could help you:
all_hisdata = []
for contract in contracts:
    histdata = list(ib.reqHistoricalTicks(
        contract, start, "", 1000, 'TRADES', 1, True, []))
    # each tick is tuple-like rather than a list, so convert it with
    # list() before concatenating the symbol onto the front
    new_histdata = list(
        map(lambda e: [contract.localSymbol] + list(e), histdata)
    )
    all_hisdata.append(new_histdata)
I am trying to save specific data from my weather station to a DataFrame. The code I have retrieves hourly log data as lists with sublists, and simply calling pd.DataFrame does not work due to the multiple logs and sublists.
I am trying to write code that retrieves specific parameters, e.g. tempHigh, for each hourly log entry and puts them in a DataFrame.
I am able to isolate the 'tempHigh' for the first hour by:
df = wu.hourly()["observations"][0]
x = df["metric"]
x["tempHigh"]
I am afraid I have to deal with my nemesis, Mr. For Loop, to retrieve each hourly log's data. I was hoping to get some help on how to attack this problem most efficiently.
The screenshots show the output data structure, which continues in this structure for all hours for the past 7 days. Below I have pasted the output data for the top two log entries.
{
"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
I might have a solution that suits your case. The way I've tackled this challenge is to flatten the entries of the single hourly logs so as not to have a nested dictionary. With 1-dimensional dictionaries (one for each hour), a DataFrame can easily be created, with all the measures as columns and the date and time as the index. From there you can select whatever columns you'd like ;)
How do we get there and what do I mean by 'flatten the entries'?
The hourly logs come as single dictionaries with plain key-value pairs, except for 'metric', which is another dictionary. What I want is to get rid of the key 'metric' but not its values. Let's look at an example:
# nested dictionary
original = {'a':1, 'b':2, 'foo':{'c':3}}
# flatten original to
flattened = {'a':1, 'b':2, 'c':3} # got rid of key 'foo' but not its value
The function below achieves exactly that: a 1-dimensional, flat dictionary:
def flatten(dic):
    # find the first value that is itself a dict
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    # pull its entries up one level, drop the old key,
    # and repeat until no nested dict is left
    if update:
        dic.update(val)
        dic.pop(key)
        flatten(dic)
    return dic
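Note that flatten modifies the dictionary in place (and also returns it), so the original nested structure is gone after the call; pass in a copy if you need to keep it.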
# With data from your weather station
hourly_log = {'epoch': 1607554798, 'humidityAvg': 39, 'humidityHigh': 44, 'humidityLow': 37, 'lat': 27.389829, 'lon': 33.67048, 'metric': {'dewptAvg': 4, 'dewptHigh': 5, 'dewptLow': 4, 'heatindexAvg': 19, 'heatindexHigh': 19, 'heatindexLow': 18, 'precipRate': 0.0, 'precipTotal': 0.0, 'pressureMax': 1017.03, 'pressureMin': 1016.53, 'pressureTrend': 0.0, 'tempAvg': 19, 'tempHigh': 19, 'tempLow': 18, 'windchillAvg': 19, 'windchillHigh': 19, 'windchillLow': 18, 'windgustAvg': 8, 'windgustHigh': 13, 'windgustLow': 2, 'windspeedAvg': 6, 'windspeedHigh': 10, 'windspeedLow': 2}, 'obsTimeLocal': '2020-12-10 00:59:58', 'obsTimeUtc': '2020-12-09T22:59:58Z', 'qcStatus': -1, 'solarRadiationHigh': 0.0, 'stationID': 'IHURGH2', 'tz': 'Africa/Cairo', 'uvHigh': 0.0, 'winddirAvg': 324}
# Flatten with function
flatten(hourly_log)
>>> {'epoch': 1607554798,
'humidityAvg': 39,
'humidityHigh': 44,
'humidityLow': 37,
'lat': 27.389829,
'lon': 33.67048,
'obsTimeLocal': '2020-12-10 00:59:58',
'obsTimeUtc': '2020-12-09T22:59:58Z',
'qcStatus': -1,
'solarRadiationHigh': 0.0,
'stationID': 'IHURGH2',
'tz': 'Africa/Cairo',
'uvHigh': 0.0,
'winddirAvg': 324,
'dewptAvg': 4,
'dewptHigh': 5,
'dewptLow': 4,
'heatindexAvg': 19,
'heatindexHigh': 19,
'heatindexLow': 18,
'precipRate': 0.0,
'precipTotal': 0.0,
'pressureMax': 1017.03,
'pressureMin': 1016.53,
'pressureTrend': 0.0,
...
Notice: 'metric' is gone but not its values!
Now a DataFrame can easily be created for each hourly log, and these can be concatenated into a single one:
import pandas as pd

hourly_logs = wu.hourly()['observations']

# list of DataFrames, one per hour
frames = [pd.DataFrame(flatten(dic), index=[0]).set_index('epoch') for dic in hourly_logs]

# concatenated to a single one
df = pd.concat(frames)

# epoch seconds -> nanoseconds, then split into Date and Time index levels
dti = pd.DatetimeIndex(df.index * 10**9)
df.index = pd.MultiIndex.from_arrays([dti.date, dti.time])
# All measures
df.columns
>>> Index(['humidityAvg', 'humidityHigh', 'humidityLow', 'lat', 'lon',
'obsTimeLocal', 'obsTimeUtc', 'qcStatus', 'solarRadiationHigh',
'stationID', 'tz', 'uvHigh', 'winddirAvg', 'dewptAvg', 'dewptHigh',
'dewptLow', 'heatindexAvg', 'heatindexHigh', 'heatindexLow',
'precipRate', 'precipTotal', 'pressureMax', 'pressureMin',
'pressureTrend', 'tempAvg', 'tempHigh', 'tempLow', 'windchillAvg',
'windchillHigh', 'windchillLow', 'windgustAvg', 'windgustHigh',
'windgustLow', 'windspeedAvg', 'windspeedHigh', 'windspeedLow'],
dtype='object')
# Read out specific measures
df[['tempHigh','tempLow','tempAvg']]
>>>
                     tempHigh  tempLow  tempAvg
2020-12-09 22:59:58        19       18       19
           23:59:58        19       17       18
Hopefully this is what you've been looking for!
Pandas accepts a list of dictionaries as input to create a dataframe:
import pandas as pd
input_dict = {"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
observations = input_dict["observations"]
df = pd.DataFrame(observations)
If you now want a list of values for a single metric, you need to "flatten" the column of dictionaries. This does use your "nemesis", but in a Pythonic way:
temperature_high = [d.get("tempHigh") for d in df["metric"].to_list()]
If you want all the metrics in a DataFrame, it is even simpler: just get the list of dictionaries from the specific column:
metrics = pd.DataFrame(df["metric"].to_list())
As you would probably like a timestamp as the index to denote your entries (your rows), you can pick the column epoch, or the more human-readable obsTimeLocal:
metrics = pd.DataFrame(df["metric"].to_list(), index=df["obsTimeLocal"].to_list())
From here you can read specific metrics of your interest:
metrics[["tempHigh", "tempLow"]]
For the following array, [[[11, 22, 33]]], [[[32, 12, 3]]], I wanted to extract the first row, and it should output 11, 22, 33. However, using the following code, I got [[11, 22, 33]]. How can I remove the double brackets?
df = pd.DataFrame([
    [[[11, 22, 33]]],
    [[[32, 12, 3]]]
], index=[1, 2], columns=['ColA'])

df[df.index == 1].ColA.item()
Expected output should be in the form of 11, 22, 33, without the brackets.
Use .astype(str) and str.replace with a regex alternation (|) to drop the brackets, then iat to get the first value:
df['ColA'].astype(str).str.replace(r'\[|\]', '', regex=True).iat[0]
Output
'11, 22, 33'
Notice that the type of your value changed from list to string.
Or, using the native Python functions str() and str.replace():
str(df['ColA'].iat[0]).replace('[', '').replace(']', '')
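Since the stored value is just a nested list, plain indexing is an alternative that keeps the numbers as numbers (a sketch against the same df):

# .iat[0] returns [[11, 22, 33]]; indexing [0] unwraps the outer list
inner = df['ColA'].iat[0][0]
print(*inner, sep=', ')  # 11, 22, 33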