I have a dataframe from which I am trying to add attributes to my graph edges.
The dataframe has a mean_travel_time column, which is going to be the attribute for my edges.
I also have a list, data, which consists of (source node, destination node) tuples, like this:
[(1160, 2399),
(47, 1005)]
Now, to add the attributes with set_edge_attributes, I need my data in a dictionary:
{(1160, 2399):1434.67,
(47, 1005):2286.10,
}
I did something like this:
data_dict = {}  # empty dictionary
for i in data:
    data_dict[i] = df1['mean_travel_time'].iloc[i]  # adding values
But I am getting an error saying "Too many indexers".
Can anyone help me out with the error?
Please provide your data in a format easy to copy:
df = pd.DataFrame({
    'index': [1, 9, 12, 18, 26],
    'sourceid': [1160, 70, 1190, 620, 1791],
    'dstid': [2399, 1005, 4, 103, 1944],
    'month': [1] * 5,
    'mean_travel_time': [1434.67, 2286.10, 532.69, 593.20, 779.05]
})
If you are trying to iterate through a list of edges such as (1, 2), you need to set an index for your DataFrame first:
df.set_index(['sourceid', 'dstid'])
You could then access specific edges:
df.set_index(['sourceid', 'dstid']).loc[(1160, 2399)]
Or use a list of edges:
edges = list(zip(df['sourceid'], df['dstid']))  # .loc needs a list, not a zip iterator
df.set_index(['sourceid', 'dstid']).loc[edges]
But you don't need to do any of this because, in fact, you can get your entire dict in one go:
df.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
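If only the edges in the data list are needed rather than every row, the full dictionary can be filtered afterwards. This is a minimal sketch, assuming the column is called mean_travel_time and that every edge in data appears exactly once in the dataframe; the sample values are illustrative only.

```python
import pandas as pd

# Illustrative sample data (assumed column names)
df = pd.DataFrame({
    "sourceid": [1160, 70, 1190],
    "dstid": [2399, 1005, 4],
    "mean_travel_time": [1434.67, 2286.10, 532.69],
})

# Edges we actually want attributes for
data = [(1160, 2399), (70, 1005)]

# Build {(sourceid, dstid): mean_travel_time} for every row
full = df.set_index(["sourceid", "dstid"])["mean_travel_time"].to_dict()

# Keep only the requested edges (raises KeyError if an edge is missing)
data_dict = {edge: full[edge] for edge in data}
print(data_dict)
```

The resulting data_dict has the {(source, destination): value} shape that networkx's set_edge_attributes accepts as its values argument.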
My question is:
How can I concatenate the values of one key across the JSON objects stored in a pandas dataframe cell into a single string per row? Sorry, I feel my problem is pretty straightforward, but I cannot find a good way to phrase it.
My context is:
Let's say I have a pandas dataframe, df, that contains a column named "participants". The cell values are JSON objects, like this for instance:
df['participants'][0] == df.participants[0] ==
[{'participantId': 1,
'championId': 7 },
{'participantId': 2,
'championId': 350 },
{'participantId': 3,
'championId': 266 },
{'participantId': 4,
'championId': 517 },
{'participantId': 5,
'championId': 110, },
...
...
{'participantId': 10,
'championId': 10 }]
df.participants[1] would include totally different information, with the same structure. If anybody's interested, this is part of what the League of Legends RiotWatcher python API spits out for per-game data.
My goal is to concatenate the championId of each participant into a single string per row of df, so that a new column 'x' contains a string such as '7, 350, 266, 517, 110', depending on whatever is in that row's participants column.
My working solutions are:
for i in range(0, 20):  # range of however many rows we have in dataframe, assume 20
    y = ''
    for j in range(0, 10):  # there are always ten participants
        this_champion_id = str(df_d1['participants'][i][j].get('championId'))
        y += ' ' + this_champion_id
    df_d1.loc[i, 'x'] = y  # assign to row i only, not the whole column
(Sidenote: I am avoiding using lists, because I've read lists are not vectorized in pandas, which means they are slower. That's why I am using a string here.)
However, as my data is about 100k rows long, this does not feel like the fastest solution, especially since I think nested for loops are slower, right?
Would it be possible to do something like
df['x'] = [str(df_d1['participants'][key][value].get('championId')) for key, value in df['participants']] ?
I am thinking a way of using a single for loop would be by leveraging the json library, like:
for i in range(0, 20):
    x = str(pd.json_normalize(df_d1.participants[i])['championId'].values)
    df_d1.loc[i, 'x'] = x  # assign to row i only
Has anybody run into something similar? Did you find a painless solution to this problem? My solutions are taking some time to run.
Thank you!
In [16]: df['x'] = df['participants'].map(lambda x: ', '.join(str(i['participantId']) for i in x))
...: print(df['participants'][0])
...: print(df['x'][0])
...:
[{'participantId': 1, 'championId': 7}, {'participantId': 2, 'championId': 350}, {'participantId': 3, 'championId': 266}]
1, 2, 3
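The expected output in the question ('7, 350, 266, ...') is built from championId rather than participantId, so the same map approach can be pointed at that key instead. This is a minimal sketch, assuming every entry has a 'championId' key; the sample data and the helper name join_champion_ids are made up for illustration.

```python
import pandas as pd

# Two illustrative rows, each holding a list of participant dictionaries
df = pd.DataFrame({"participants": [
    [{"participantId": 1, "championId": 7},
     {"participantId": 2, "championId": 350}],
    [{"participantId": 1, "championId": 266},
     {"participantId": 2, "championId": 517}],
]})


def join_champion_ids(participants):
    """Join the championId of every participant into one string."""
    return ", ".join(str(p["championId"]) for p in participants)


# One pass over the column; no explicit nested loops over row positions
df["x"] = df["participants"].map(join_champion_ids)
print(df["x"].tolist())
```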
I am trying to save specific data from my weather station to a dataframe. The code I have retrieves hourly log data as nested lists and dictionaries, and simply calling pd.DataFrame on it does not work because of the multiple logs and the nesting.
I am trying to write code that retrieves specific parameters, e.g. tempHigh, for each hourly log entry and puts them in a dataframe.
I am able to isolate the 'tempHigh' for the first hour by:
df = wu.hourly()["observations"][0]
x = df["metric"]
x["tempHigh"]
I am afraid I have to deal with my nemesis, Mr. For Loop, to retrieve each hourly log data. I was hoping to get some help on how to attack this problem most efficiently.
The output continues in this structure for all hours of the past 7 days. Below I have pasted the output data for the first two log entries.
{
"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
I might have a solution that suits your case. The way I've tackled this challenge is to flatten the entries of the single hourly logs so that there is no nested dictionary. With 1-dimensional dictionaries (one for each hour), a dataframe can easily be created with all the measures as columns and the date and time as index. From there on you can select whatever columns you'd like ;)
How do we get there and what do I mean by 'flatten the entries'?
The hourly logs come as single dictionaries with single key, value pairs except 'metric' which is another dictionary. What I want is to get rid of the key 'metric' but not its values. Let's look at an example:
# nested dictionary
original = {'a':1, 'b':2, 'foo':{'c':3}}
# flatten original to
flattened = {'a':1, 'b':2, 'c':3} # got rid of key 'foo' but not its value
The below function achieves exactly that, a 1-dimensional or flat dictionary:
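def flatten(dic):
    # Lift the entries of nested dictionaries to the top level (modifies dic in place)
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    if update:
        dic.update(val)  # copy the nested entries to the top level
        dic.pop(key)     # drop the key that held the nested dictionary
        flatten(dic)     # repeat until no nested dictionaries remain
    return dic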
# With data from your weather station
hourly_log = {'epoch': 1607554798, 'humidityAvg': 39, 'humidityHigh': 44, 'humidityLow': 37, 'lat': 27.389829, 'lon': 33.67048, 'metric': {'dewptAvg': 4, 'dewptHigh': 5, 'dewptLow': 4, 'heatindexAvg': 19, 'heatindexHigh': 19, 'heatindexLow': 18, 'precipRate': 0.0, 'precipTotal': 0.0, 'pressureMax': 1017.03, 'pressureMin': 1016.53, 'pressureTrend': 0.0, 'tempAvg': 19, 'tempHigh': 19, 'tempLow': 18, 'windchillAvg': 19, 'windchillHigh': 19, 'windchillLow': 18, 'windgustAvg': 8, 'windgustHigh': 13, 'windgustLow': 2, 'windspeedAvg': 6, 'windspeedHigh': 10, 'windspeedLow': 2}, 'obsTimeLocal': '2020-12-10 00:59:58', 'obsTimeUtc': '2020-12-09T22:59:58Z', 'qcStatus': -1, 'solarRadiationHigh': 0.0, 'stationID': 'IHURGH2', 'tz': 'Africa/Cairo', 'uvHigh': 0.0, 'winddirAvg': 324}
# Flatten with function
flatten(hourly_log)
>>> {'epoch': 1607554798,
'humidityAvg': 39,
'humidityHigh': 44,
'humidityLow': 37,
'lat': 27.389829,
'lon': 33.67048,
'obsTimeLocal': '2020-12-10 00:59:58',
'obsTimeUtc': '2020-12-09T22:59:58Z',
'qcStatus': -1,
'solarRadiationHigh': 0.0,
'stationID': 'IHURGH2',
'tz': 'Africa/Cairo',
'uvHigh': 0.0,
'winddirAvg': 324,
'dewptAvg': 4,
'dewptHigh': 5,
'dewptLow': 4,
'heatindexAvg': 19,
'heatindexHigh': 19,
'heatindexLow': 18,
'precipRate': 0.0,
'precipTotal': 0.0,
'pressureMax': 1017.03,
'pressureMin': 1016.53,
'pressureTrend': 0.0,
...
Notice: 'metric' is gone but not its values!
Now, a DataFrame can be easily created for each hourly log which can be concatenated to a single DataFrame:
import pandas as pd
hourly_logs = wu.hourly()['observations']
# List of DataFrames for each hour
frames = [pd.DataFrame(flatten(dic), index=[0]).set_index('epoch') for dic in hourly_logs]
# Concatenated to a single one
df = pd.concat(frames)
# With adjusted index as Date and Time
dti = pd.DatetimeIndex(df.index * 10**9)
df.index = pd.MultiIndex.from_arrays([dti.date, dti.time])
# All measures
df.columns
>>> Index(['humidityAvg', 'humidityHigh', 'humidityLow', 'lat', 'lon',
'obsTimeLocal', 'obsTimeUtc', 'qcStatus', 'solarRadiationHigh',
'stationID', 'tz', 'uvHigh', 'winddirAvg', 'dewptAvg', 'dewptHigh',
'dewptLow', 'heatindexAvg', 'heatindexHigh', 'heatindexLow',
'precipRate', 'precipTotal', 'pressureMax', 'pressureMin',
'pressureTrend', 'tempAvg', 'tempHigh', 'tempLow', 'windchillAvg',
'windchillHigh', 'windchillLow', 'windgustAvg', 'windgustHigh',
'windgustLow', 'windspeedAvg', 'windspeedHigh', 'windspeedLow'],
dtype='object')
# Read out specific measures
df[['tempHigh','tempLow','tempAvg']]
Hopefully this is what you've been looking for!
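As an alternative to a hand-written flatten function, pandas ships pd.json_normalize, which expands nested dictionaries into dotted column names such as metric.tempHigh. This is a minimal sketch on a trimmed-down copy of the two log entries above; stripping the "metric." prefix is an assumption about how the columns should be named and only works while the leaf names are unique.

```python
import pandas as pd

# Trimmed-down copies of the two hourly log entries
observations = [
    {"epoch": 1607554798, "humidityAvg": 39,
     "metric": {"tempHigh": 19, "tempLow": 18}},
    {"epoch": 1607558398, "humidityAvg": 48,
     "metric": {"tempHigh": 19, "tempLow": 17}},
]

# Nested keys become columns named "metric.tempHigh", "metric.tempLow", ...
flat = pd.json_normalize(observations)

# Keep only the last part of each dotted name (assumes no name clashes)
flat.columns = [name.split(".")[-1] for name in flat.columns]

flat = flat.set_index("epoch")
print(flat[["tempHigh", "tempLow"]])
```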
Pandas accepts a list of dictionaries as input to create a dataframe:
import pandas as pd
input_dict = {"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
observations = input_dict["observations"]
df = pd.DataFrame(observations)
If you now want a list of values for a single metric, you need to "flatten" the column of dictionaries. This does use your "nemesis", but in a Pythonic way:
temperature_high = [d.get("tempHigh") for d in df["metric"].to_list()]
If you want all the metrics in a dataframe, even simpler, just get the list of dictionaries from the specific column:
metrics = pd.DataFrame(df["metric"].to_list())
As you would probably like the timestamp as an index to denote your entries (your rows), you can pick your column epoch, or the more human obsTimeLocal:
metrics = pd.DataFrame(df["metric"].to_list(), index=df["obsTimeLocal"].to_list())
From here you can read specific metrics of your interest:
metrics[["tempHigh", "tempLow"]]
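If the epoch column is preferred as the index, it can be converted to timestamps first. This is a minimal sketch on a trimmed-down copy of the two observations, assuming epoch holds seconds since 1970-01-01 UTC (which matches the obsTimeUtc values shown above).

```python
import pandas as pd

# Trimmed-down copies of the two observations
observations = [
    {"epoch": 1607554798, "metric": {"tempHigh": 19, "tempLow": 18}},
    {"epoch": 1607558398, "metric": {"tempHigh": 19, "tempLow": 17}},
]

df = pd.DataFrame(observations)

# Interpret epoch as seconds and use the resulting UTC timestamps as the index
timestamps = pd.to_datetime(df["epoch"], unit="s")
metrics = pd.DataFrame(df["metric"].to_list(), index=timestamps)

print(metrics[["tempHigh", "tempLow"]])
```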
I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. What I'm trying to do is perform calculations on this dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]
for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
The first loop outputs a dictionary of dataframes named "series".
What I expected from the last loop was a dictionary of dataframes, each with a new column "Corr" holding the correlation between "Temp" and "Consumption". Instead, it gives a single dataframe for the last home in the iteration, i.e. 23.
In short, I want to add a sixth column named "Corr" to every dataframe in the dictionary, containing the correlation between "Temp" and "Consumption". Can you help me with this? I'm somehow missing the use of keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks @jezrael for the correction)
One of the answers to "How to find the correlation between a group of values in a pandas dataframe column" helped, avoiding all unnecessary loops. Thanks @jezrael and @JoshFriedlander for suggesting the groupby method. Upvote (y).
Posting the solution here:
df = df[df["Consumption"] >= 0]
corrs = (
    df[["Home", "Season", "Temp"]]
    .groupby(["Home", "Season"])
    .corrwith(df["Consumption"])
    .rename(columns={"Temp": "Corr"})
    .reset_index()
)
df = pd.merge(df, corrs, how="left", on=["Home", "Season"])
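The same per-group correlation can also be computed with groupby followed by apply, which may be easier to read than corrwith. This is a minimal sketch on made-up data (two homes, one season), not the original dataset; the column names follow the question.

```python
import pandas as pd

# Made-up data: Home 1 is perfectly correlated, Home 2 perfectly anti-correlated
df = pd.DataFrame({
    "Home": [1, 1, 1, 2, 2, 2],
    "Season": ["Winter"] * 6,
    "Temp": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Consumption": [10.0, 20.0, 30.0, 30.0, 20.0, 10.0],
})

# Drop negative consumption readings, as in the question
df = df[df["Consumption"] >= 0]

# One correlation value per (Home, Season) group
corrs = (
    df.groupby(["Home", "Season"])[["Temp", "Consumption"]]
    .apply(lambda g: g["Temp"].corr(g["Consumption"]))
    .rename("Corr")
    .reset_index()
)

# Attach the group-level value to every row of that group
df = df.merge(corrs, how="left", on=["Home", "Season"])
print(df)
```

Every row of a given Home and Season ends up with the same "Corr" value, which is the sixth column the question asks for.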