Most efficient way to append list in loop - python

I am new to python.
I have the following code that requests data from an API:
histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
print(histdata)
The data returned is the following price information without the contract symbol:
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
The first thing I would like to know is whether this kind of output is a list, a list of lists, a dictionary, a dataframe or something else in Python.
I would like to add a "column" with the contract symbol at the start of each price row.
The data should look like this:
Symbol | time | tickAttribLast | price | size | exchange | specialConditions
XYZ | 2021-03-03 14:30:00+00:00 | TickAttribLast(pastLimit=False, unreported=True) | 0.95 | 1 | ISE | f
XYZ | 2021-03-03 14:30:00+00:00 | TickAttribLast(pastLimit=False, unreported=True) | 0.94 | 1 | ISE | f
Moreover, I would like to loop through multiple contracts, get the price information, add the contract symbol and merge the contract price with the previous contract price information.
Here is my failed attempt. Could you guide me on the most efficient way to add the contract symbol to each row in histdata and then append this information to a single list or dataframe?
Thanks in advance for your help!
i = 0
# The variable contracts is a list of contracts, here I loop the first 2 items
for t in contracts[0:1]:
    print("processing contract: ", i)
    # histdata gets the price information of the contract (multiple price rows per contract as shown above)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1, True, [])
    # failed attempt to add contracts[i].localSymbol at the start of each row
    histdata.insert(0, contracts[i].localSymbol)
    # failed attempt to append this table with the new contract information
    histdata.append(histdata)
    i = i + 1
Edit # 1 :
I will try and break down what I am trying to accomplish.
Here is the result of histdata :
[HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
What code is needed to add the attribute "Symbol", with the value "XYZ", to each HistoricalTickLast entry, like this:
[HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.95, size=1, exchange='ISE', specialConditions='f'), HistoricalTickLast(Symbol='XYZ', time=datetime.datetime(2021, 3, 3, 14, 30, tzinfo=datetime.timezone.utc), tickAttribLast=TickAttribLast(pastLimit=False, unreported=True), price=0.94, size=1, exchange='ISE', specialConditions='f')]
EDIT #2
I got a little confused with the map function, so I converted my HistoricalTickLast instances to a dataframe instead. Now, in addition to adding the attribute 'Symbol' to my first dataframe, I also merge in another dataframe that contains the BID/ASK, using the key 'time'. I am sure this must be the least efficient way to do it.
Can anyone help me make this code more efficient?
histdf = pd.DataFrame()
print("CONTRACTS LENGTH :", len(contracts))
for t in contracts:
    print("processing contract: ", i)
    histdata = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'TRADES', 1,
                                     True, [])
    histbidask = ib.reqHistoricalTicks(contracts[i], start, "", 1000, 'BID_ASK', 1,
                                       True, [])
    tempdf = pd.DataFrame(histdata)
    tempdf2 = pd.DataFrame(histbidask)
    try:
        tempdf3 = pd.merge(tempdf, tempdf2, how='inner', on='time')
        tempdf3.insert(0, 'localSymbol', contracts[i].localSymbol)
        histdf = pd.concat([histdf, tempdf3])
    except:
        myerror["ErrorContracts"].append(format(contracts[i].localSymbol))
    i = i + 1

Use type() to verify that your variable is a list (as the [] in the printed output suggests).
Each entry is an instance of HistoricalTickLast. When you say you want to add a "column", that either means adding an attribute to the class or, more likely, that you want to process the entries as plain old data (POD), for instance as a list of lists or a list of dicts.
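To make the "list of dicts" option concrete, here is a minimal sketch. The Tick class is a hypothetical, shortened stand-in for HistoricalTickLast, and the sketch assumes the real objects behave like named tuples (so that _asdict() is available); check that against your own histdata before relying on it.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Tick(NamedTuple):
    """Hypothetical stand-in for HistoricalTickLast, with fewer fields."""
    time: datetime
    price: float
    size: int
    exchange: str


histdata = [
    Tick(datetime(2021, 3, 3, 14, 30, tzinfo=timezone.utc), 0.95, 1, "ISE"),
    Tick(datetime(2021, 3, 3, 14, 30, tzinfo=timezone.utc), 0.94, 1, "ISE"),
]

print(type(histdata))     # type of the container
print(type(histdata[0]))  # type of each entry

# One plain dict per tick, with the symbol as the first key.
rows = [{"Symbol": "XYZ", **tick._asdict()} for tick in histdata]
print(rows[0])
```

A list of dicts like rows can be passed straight to pd.DataFrame, with the dict keys becoming the column names.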

Are you sure histdata is a list?
If it is not a list but is an iterator, you could use list() to convert it to a list.
Also, to add an element at the beginning of each interior list you could use map:
I think this code example could help you:
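all_hisdata = []
for contract in contracts:
    histdata = list(ib.reqHistoricalTicks(
        contract, start, "", 1000, 'TRADES', 1, True, []))
    # Unpack each tick into a new list that starts with the symbol
    # (a list cannot be concatenated with a non-list object using +)
    new_histdata = list(
        map(lambda e: [contract.localSymbol, *e], histdata)
    )
    # extend keeps all_hisdata as a single flat list of rows
    all_hisdata.extend(new_histdata)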
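If the end goal is a single dataframe, one common pattern is to collect one small frame per contract in a list and call pd.concat once after the loop, rather than concatenating inside the loop, which copies the accumulated data on every iteration. The sketch below is only an illustration: Tick, FAKE_TICKS and fetch_ticks are hypothetical stand-ins for the real tick objects and for ib.reqHistoricalTicks.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-ins for the real API objects and call.
Tick = namedtuple("Tick", ["time", "price", "size"])
FAKE_TICKS = {
    "XYZ": [Tick(1, 0.95, 1), Tick(2, 0.94, 1)],
    "ABC": [Tick(1, 1.10, 3)],
}


def fetch_ticks(symbol):
    """Placeholder for ib.reqHistoricalTicks(...); returns canned data."""
    return FAKE_TICKS[symbol]


frames = []
for symbol in ["XYZ", "ABC"]:
    # Named tuples become rows; their field names become the columns.
    frame = pd.DataFrame(fetch_ticks(symbol))
    frame.insert(0, "localSymbol", symbol)
    frames.append(frame)

# Concatenate once, after the loop.
histdf = pd.concat(frames, ignore_index=True)
print(histdf)
```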

Related

IndexingError: Too many indexers while using iloc

I have a dataframe from which I am trying to add attributes to my graph edges.
dataframe having mean_travel_time which is going to be the attribute for my edge
Plus, I have a data list which consists of source nodes and destination nodes as a tuple, like this.
[(1160, 2399),
(47, 1005)]
Now, while using set_edge_attribute to add attributes, I need my data into a dictionary:
{(1160, 2399):1434.67,
(47, 1005):2286.10,
}
I did something like this:
data_dict = {}  # empty dictionary
for i in data:
    data_dict[i] = df1['mean_travel_time'].iloc[i]  # adding values
But I am getting an error saying "Too many indexers".
Can anyone help me out with the error?
Please provide your data in a format easy to copy:
df = pd.DataFrame({
    'index': [1, 9, 12, 18, 26],
    'sourceid': [1160, 70, 1190, 620, 1791],
    'dstid': [2399, 1005, 4, 103, 1944],
    'month': [1] * 5,
    'mean_travel_time': [1434.67, 2286.10, 532.69, 593.20, 779.05]
})
If you are trying to iterate through a list of edges such as (1160, 2399), you need to set an index on your DataFrame first:
df.set_index(['sourceid', 'dstid'])
You could then access specific edges:
df.set_index(['sourceid', 'dstid']).loc[(1160, 2399)]
Or use a list of edges:
edges = list(zip(df['sourceid'], df['dstid']))
df.set_index(['sourceid', 'dstid']).loc[edges]
But you don't need to do any of this because, in fact, you can get your entire dict all in one go:
df.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
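For context on the error itself: .iloc is purely positional and expects integer positions, so passing a tuple such as (1160, 2399) to a Series is read as one indexer per axis, which is more axes than a Series has. Below is a minimal sketch of the index-based lookup, using a small made-up frame and an edge list named data as in the question; the sample values are illustrative only.

```python
import pandas as pd

# Made-up sample with the column names used in the question.
df = pd.DataFrame({
    "sourceid": [1160, 70, 1190],
    "dstid": [2399, 1005, 4],
    "mean_travel_time": [1434.67, 2286.10, 532.69],
})
data = [(1160, 2399), (70, 1005)]  # edges of interest

# Series keyed by (sourceid, dstid) pairs.
travel_time = df.set_index(["sourceid", "dstid"])["mean_travel_time"]

# Full mapping, then keep only the requested edges.
all_edges = travel_time.to_dict()
data_dict = {edge: all_edges[edge] for edge in data}
print(data_dict)
```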

PyMongo, fastest way to return and concatenate results?

I'm new to mongoDB and was looking for a very fast way to find data and concatenate that data into one array. Right now, I have:
search_query = {"index": {"$lt": datetime.now(), "$gt": datetime.now() - timedelta(minutes=10)}}
filter_query = {"MyData" : 1, "_id" : 0}
search_result = myCollection.find(search_query,filter_query).sort("index",-1).allow_disk_use(True)
The purpose of the above is to find the last 10 minutes of data and return only the MyData entries. A cursor is returned, and each cursor element is a dictionary such as {"MyData" : [15, 0.5, 16]}. I would like to concatenate these into one continuous batch of data.
all_data = list()
for data in search_result:
    all_data.extend(data["MyData"])
However, iterating over each element this way can be very time consuming.
I have tried the following as well:
all_data = list(search_result)
Although it appears to be faster, it is a list of dictionaries [{"MyData" : [15,12,13]},{"MyData" : [5,6,1]}, ...]
I would still need a list comprehension or a loop to properly format the data.
In MongoDB, is there a fast way to do the following?
Access all cursor elements very fast
Concatenate the data in all the cursor elements
I was looking into aggregation but wasn't sure.
record1 = {"index" : datetime, "MyData" : [10, 11, 12]}
record2 = {"index" : datetime, "MyData" : [13, 14, 15]}
record3 = {"index" : datetime, "MyData" : [16, 17, 18]}
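On the client side, the concatenation step can also be written with itertools.chain.from_iterable. This is only a sketch: a plain list of dicts stands in for the PyMongo cursor, and whether it is measurably faster than the extend loop depends on the data, since fetching documents from the server may well dominate the run time.

```python
from itertools import chain

# A plain list stands in for the cursor returned by find(); this is an assumption
# made so the sketch runs without a database.
search_result = [
    {"MyData": [10, 11, 12]},
    {"MyData": [13, 14, 15]},
    {"MyData": [16, 17, 18]},
]

# Lazily walk every document's MyData list and collect the values in one list.
all_data = list(chain.from_iterable(doc["MyData"] for doc in search_result))
print(all_data)
```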

How to concatenate key values of JSON object stored in pandas dataframe cell into a string per row?

My question is:
how to concatenate key values of JSON object stored in pandas dataframe cell into a string per row? Sorry, I feel my problem is pretty straight-forward but I cannot find a good way to phrase it.
My context is:
Let's say I have a pandas dataframe, df, that contains a column named "participants". The cell values are JSON objects, like this for instance:
df['participants'][0] == df.participants[0] ==
[{'participantId': 1,
'championId': 7 },
{'participantId': 2,
'championId': 350 },
{'participantId': 3,
'championId': 266 },
{'participantId': 4,
'championId': 517 },
{'participantId': 5,
'championId': 110, },
...
...
{'participantId': 10,
'championId': 10 }]
df.participants[1] would include totally different information, with the same structure. If anybody's interested, this is part of what the League of Legends RiotWatcher python API spits out for per-game data.
My goal is to, for each participantId, concatenate that into a single string per row in our df, such that we have a new column 'x' that contains a string '7, 350, 266, 517, 110' for each row depending on whatever is in the participants column.
My working solutions are:
for i in range(0, 20):  # range of however many rows we have in dataframe, assume 20
    y = ''
    for j in range(0, 10):  # there are always ten participants
        this_champion_id = str(df_d1['participants'][i][j].get('championId'))
        y += ' ' + this_champion_id
    df_d1['x'] = y
(Sidenote: I am avoiding using lists, because I've read lists are not vectorized in pandas, which means they are slower. That's why I am using a string here.)
However, as my data is about 100k rows long, this feels like it's not the fastest solution, especially since I think nested for loops are slower right?
Would it be possible to do something like
df['x'] = [str(df_d1['participants'][key][value].get('championId')) for key, value in df['participants']] ?
I am thinking a way of using a single for loop would be by leveraging the json library, like:
for i in range(0, 20):
    x = str(pd.json_normalize(df_d1.participants[i])['championId'].values)
    df['x'] = x
Has anybody run into something similar? Did you find a painless solution to this problem? My solutions are taking some time to run.
Thank you!
In [16]: df['x'] = df['participants'].map(lambda x: ', '.join(str(i['participantId']) for i in x))
...: print(df['participants'][0])
...: print(df['x'][0])
...:
[{'participantId': 1, 'championId': 7}, {'participantId': 2, 'championId': 350}, {'participantId': 3, 'championId': 266}]
1, 2, 3
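The example string in the question ('7, 350, 266, 517, 110') lists champion IDs, whereas the snippet above joins participantId. A minimal sketch of the same idea keyed on championId, using a made-up two-row frame with the same structure as the participants column:

```python
import pandas as pd

# Made-up sample data; each cell holds a list of participant dicts.
df = pd.DataFrame({
    "participants": [
        [{"participantId": 1, "championId": 7},
         {"participantId": 2, "championId": 350},
         {"participantId": 3, "championId": 266}],
        [{"participantId": 1, "championId": 517},
         {"participantId": 2, "championId": 110}],
    ]
})

# One joined string per row, built without index-based nested loops.
df["x"] = [
    ", ".join(str(p["championId"]) for p in participants)
    for participants in df["participants"]
]
print(df["x"].tolist())
```

The work is still done in Python for every row, because the cells hold arbitrary objects, but it avoids repeated df[...][i][j] lookups and assigns the column once.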

Python: Logging specific values to a single pandas dataframe from multiple lists with sublists

I am trying to save specific data from my weather station to a dataframe. The code I have retrieves hourly log data as lists with sublists, and simply calling pd.DataFrame on it does not work because of the multiple logs and sublists.
I am trying to write code that retrieves specific parameters, e.g. tempHigh, for each hourly log entry and puts them in a dataframe.
I am able to isolate the 'tempHigh' for the first hour by:
df = wu.hourly()["observations"][0]
x = df["metric"]
x["tempHigh"]
I am afraid I have to deal with my nemesis, Mr. For Loop, to retrieve each hourly log data. I was hoping to get some help on how to attack this problem most efficiently.
The screenshots show the output data structure, which continues in this structure for all hours for the past 7 days. Below I have pasted the output data for the top two log entries.
{
"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
I might have a solution that suits your case. The way I've tackled this challenge is to flatten the entries of the individual hourly logs so that there is no nested dictionary. With 1-dimensional dictionaries (one for each hour), a dataframe can easily be created with all the measures as columns and the date and time as index. From there on you can select whatever columns you'd like ;)
How do we get there and what do I mean by 'flatten the entries'?
The hourly logs come as single dictionaries with single key, value pairs except 'metric' which is another dictionary. What I want is to get rid of the key 'metric' but not its values. Let's look at an example:
# nested dictionary
original = {'a':1, 'b':2, 'foo':{'c':3}}
# flatten original to
flattened = {'a':1, 'b':2, 'c':3} # got rid of key 'foo' but not its value
The below function achieves exactly that, a 1-dimensional or flat dictionary:
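def flatten(dic):
    # Look for the first value that is itself a dictionary, if any
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    # Merge the nested dictionary into its parent, drop its key, then repeat
    if update:
        dic.update(val)
        dic.pop(key)
        flatten(dic)
    return dic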
# With data from your weather station
hourly_log = {'epoch': 1607554798, 'humidityAvg': 39, 'humidityHigh': 44, 'humidityLow': 37, 'lat': 27.389829, 'lon': 33.67048, 'metric': {'dewptAvg': 4, 'dewptHigh': 5, 'dewptLow': 4, 'heatindexAvg': 19, 'heatindexHigh': 19, 'heatindexLow': 18, 'precipRate': 0.0, 'precipTotal': 0.0, 'pressureMax': 1017.03, 'pressureMin': 1016.53, 'pressureTrend': 0.0, 'tempAvg': 19, 'tempHigh': 19, 'tempLow': 18, 'windchillAvg': 19, 'windchillHigh': 19, 'windchillLow': 18, 'windgustAvg': 8, 'windgustHigh': 13, 'windgustLow': 2, 'windspeedAvg': 6, 'windspeedHigh': 10, 'windspeedLow': 2}, 'obsTimeLocal': '2020-12-10 00:59:58', 'obsTimeUtc': '2020-12-09T22:59:58Z', 'qcStatus': -1, 'solarRadiationHigh': 0.0, 'stationID': 'IHURGH2', 'tz': 'Africa/Cairo', 'uvHigh': 0.0, 'winddirAvg': 324}
# Flatten with function
flatten(hourly_log)
>>> {'epoch': 1607554798,
'humidityAvg': 39,
'humidityHigh': 44,
'humidityLow': 37,
'lat': 27.389829,
'lon': 33.67048,
'obsTimeLocal': '2020-12-10 00:59:58',
'obsTimeUtc': '2020-12-09T22:59:58Z',
'qcStatus': -1,
'solarRadiationHigh': 0.0,
'stationID': 'IHURGH2',
'tz': 'Africa/Cairo',
'uvHigh': 0.0,
'winddirAvg': 324,
'dewptAvg': 4,
'dewptHigh': 5,
'dewptLow': 4,
'heatindexAvg': 19,
'heatindexHigh': 19,
'heatindexLow': 18,
'precipRate': 0.0,
'precipTotal': 0.0,
'pressureMax': 1017.03,
'pressureMin': 1016.53,
'pressureTrend': 0.0,
...
Notice: 'metric' is gone but not its values!
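One thing to keep in mind is that the flatten function above modifies the dictionary it is given. If the original nested structure is still needed afterwards, a variant that builds a new dictionary might look like the sketch below. The name flatten_copy is made up for this illustration, and like the original it lets a nested key overwrite a top-level key of the same name.

```python
def flatten_copy(dic):
    """Return a new flat dict; nested dict values are merged into the top level."""
    flat = {}
    for key, val in dic.items():
        if isinstance(val, dict):
            # Recurse so that deeper levels are flattened as well.
            flat.update(flatten_copy(val))
        else:
            flat[key] = val
    return flat


original = {'a': 1, 'b': 2, 'foo': {'c': 3}}
flattened = flatten_copy(original)
print(flattened)  # {'a': 1, 'b': 2, 'c': 3}
print(original)   # still contains the nested 'foo' dictionary
```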
Now, a DataFrame can be easily created for each hourly log which can be concatenated to a single DataFrame:
import pandas as pd
hourly_logs = wu.hourly()['observations']
# List of DataFrames for each hour
frames = [pd.DataFrame(flatten(dic), index=[0]).set_index('epoch') for dic in hourly_logs]
# Concatenated to a single one
df = pd.concat(frames)
# With adjusted index as Date and Time
dti = pd.DatetimeIndex(df.index * 10**9)
df.index = pd.MultiIndex.from_arrays([dti.date, dti.time])
# All measures
df.columns
>>> Index(['humidityAvg', 'humidityHigh', 'humidityLow', 'lat', 'lon',
'obsTimeLocal', 'obsTimeUtc', 'qcStatus', 'solarRadiationHigh',
'stationID', 'tz', 'uvHigh', 'winddirAvg', 'dewptAvg', 'dewptHigh',
'dewptLow', 'heatindexAvg', 'heatindexHigh', 'heatindexLow',
'precipRate', 'precipTotal', 'pressureMax', 'pressureMin',
'pressureTrend', 'tempAvg', 'tempHigh', 'tempLow', 'windchillAvg',
'windchillHigh', 'windchillLow', 'windgustAvg', 'windgustHigh',
'windgustLow', 'windspeedAvg', 'windspeedHigh', 'windspeedLow'],
dtype='object')
# Read out specific measures
df[['tempHigh','tempLow','tempAvg']]
>>>
Hopefully this is what you've been looking for!
Pandas accepts a list of dictionaries as input to create a dataframe:
import pandas as pd
input_dict = {"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
observations = input_dict["observations"]
df = pd.DataFrame(observations)
If you now want a list of single "metrics" you need to "flatten" your list of dictionaries column. This does use your "Nemesis" but in a Pythonic way:
temperature_high = [d.get("tempHigh") for d in df["metric"].to_list()]
If you want all the metrics in a dataframe, even simpler, just get the list of dictionaries from the specific column:
metrics = pd.DataFrame(df["metric"].to_list())
As you would probably like the timestamp as an index to denote your entries (your rows), you can pick your column epoch, or the more human obsTimeLocal:
metrics = pd.DataFrame(df["metric"].to_list(), index=df["obsTimeLocal"].to_list())
From here you can read specific metrics of your interest:
metrics[["tempHigh", "tempLow"]]
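Another option worth knowing is pd.json_normalize, which expands nested dictionaries into dotted column names in one call. A minimal sketch on a trimmed-down version of the two observations above (only a few keys are kept here for brevity):

```python
import pandas as pd

# Trimmed-down observations; the real entries carry many more keys.
observations = [
    {"epoch": 1607554798, "obsTimeLocal": "2020-12-10 00:59:58",
     "metric": {"tempHigh": 19, "tempLow": 18}},
    {"epoch": 1607558398, "obsTimeLocal": "2020-12-10 01:59:58",
     "metric": {"tempHigh": 19, "tempLow": 17}},
]

# Nested keys become columns such as 'metric.tempHigh'.
flat = pd.json_normalize(observations)
flat = flat.set_index("obsTimeLocal")
print(flat[["metric.tempHigh", "metric.tempLow"]])
```

This keeps the top-level fields and the metrics in a single frame, at the cost of the metric. prefix on the nested column names.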

Python 3.x: Perform analysis on dictionary of dataframes in loops

I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. What I'm trying to do is perform calculations on this dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]
for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
Here is the dictionary of dataframes named "Series" as an output of loop.
What I expected from the last loop is a dictionary of dataframes, each with a new column "Corr" holding the correlation between "Temp" and "Consumption". Instead, it gives a single dataframe for the last home in the iteration, i.e. 23.
In short, I want to add a sixth column named "Corr" to every dataframe in the dictionary, containing the correlation between "Temp" and "Consumption". Can you help me with this? I'm somehow missing the use of keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks @jezrael for the correction)
One of the answers to "How to find the correlation between a group of values in a pandas dataframe column" helped, and it avoids all the unnecessary loops. Thanks @jezrael and @JoshFriedlander for suggesting the groupby method. Upvoted.
Posting the solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]
         .groupby(["Home", "Season"])
         .corrwith(df["Consumption"])
         .rename(columns={"Temp": "Corr"})
         .reset_index())
df = pd.merge(df, corrs, how = "left", on = ["Home", "Season"])
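As a cross-check, the same per-group correlation can be computed with groupby followed by apply and then merged back onto the rows. The sketch below runs on a small made-up frame (two homes, one season) and is meant as an illustration of the pattern, not as a replacement for the solution above.

```python
import pandas as pd

# Made-up data: Home 1 is perfectly anti-correlated, Home 2 perfectly correlated.
df = pd.DataFrame({
    "Home": [1, 1, 1, 2, 2, 2],
    "Season": ["Winter"] * 6,
    "Temp": [0.0, 5.0, 10.0, 0.0, 5.0, 10.0],
    "Consumption": [30.0, 20.0, 10.0, 10.0, 20.0, 30.0],
})
df = df[df["Consumption"] >= 0]

# One correlation value per (Home, Season) group.
corrs = (
    df.groupby(["Home", "Season"])[["Temp", "Consumption"]]
      .apply(lambda g: g["Temp"].corr(g["Consumption"]))
      .rename("Corr")
      .reset_index()
)

# Attach the group-level value to every row of that group.
df = df.merge(corrs, how="left", on=["Home", "Season"])
print(df)
```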
