Related
Ive got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains Names of NBA players, their Universities etc. Now when the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have a rough code laid out - but I am unsure on how to implement the where method. Any ideas?
import csv
import pandas as pd
import io
class MySelectQuery:
def __init__(self, table, columns, where):
self.table = table
self.columns = columns
self.where = where
def __str__(self):
return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"
csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is being transformed into a pandas Dataframe, which should then find the values based on the input - however I get the error "name 'where' not defined". I believe everything until the df = etc. part is correct, now I need help implementing the where method. (Ive seen one other solution on SO but wasnt able to understand or figure that out)
# importing pandas
import pandas as pd
record = {
'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya' ],
'Age': [21, 19, 20, 18, 17, 21],
'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
'Percentage': [88, 92, 95, 70, 65, 78]}
# create a dataframe
dataframe = pd.DataFrame(record, columns = ['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)
options = ['Math', 'Science']
# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
You need the double = there. So should be:
where = "name == 'Alaa Abdelnaby' AND year_start == 1991"
I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data'] are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB. assuming test the input list
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, first I explain how you can do this by details and then I'll show you the exact steps to follow and the final code. This way you understand everything for any further situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df
(You can see the df.index=... part. This is because that the index column of the desired dataframe is started at 1 in your question)
So if you want to do so you just have to extract these data from the data you provided and convert them to the exact dictionary mentioned above (my_data dictionary)
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: Of course you can do all of this in one complex line of code but to avoid confusion I decided to do this in couple of lines of code
This is probably an easy one for the python pros. So, please forgive my naivety.
Here is my data:
0 xyz#tim.com 13239169023 jane bo
1 tim#tim.com 13239169023 lane ko
2 jim#jim.com 13239169023 john do
Here is what I get as output:
[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]
My Code:
df = pd.read_csv('profiles.csv')
print(df)
data = df.to_json(orient="records")
print(data)
Output I want:
{"profiles":[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]}
Adding below does NOT work.
output = {"profiles": data}
It adds single quotes on the data and profiles in NOT in double quotes (basically NOT a valid JSON), Like so:
{'profiles': '[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]'}
You can use df.to_dict to output to a dictionary instead of a json-formatted string:
import pandas as pd
df = pd.read_csv('data.csv')
data = df.to_dict(orient='records')
output = {'profiles': data}
print(output)
Returns:
{'profiles': [{'0': 1, 'xyz#tim.com': 'tim#tim.com', '13239169023': 13239169023, 'jane': 'lane', 'bo': 'ko'}, {'0': 2, 'xyz#tim.com': 'jim#jim.com', '13239169023': 13239169023, 'jane': 'john', 'bo': 'do'}]}
I think I found a solution.
Changes:
data = df.to_dict(orient="records")
output = {}
output["profiles"] = data
I am trying to save specific data from my weather station to a dataframe. The code I have retrieves hourly log data as lists with sublists, and simply putting pd.DataFrame does not work due to multiple logs and sublists.
I am trying to make a code that retrieves specific parameters, e.g. tempHigh for each hourly log entry and puts it in a dataframe.
I am able to isolate the 'tempHigh' for the first hour by:
df = wu.hourly()["observations"][0]
x = df["metric"]
x["tempHigh"]
I am afraid I have to deal with my nemesis, Mr. For Loop, to retrieve each hourly log data. I was hoping to get some help on how to attack this problem most efficiently.
The screenshots show the output data structure, which continues in this structure for all hours for the past 7 days. Below I have pasted the output data for the top two log entries.
{
"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
I might have a solution that suits your case. The way I've tackled this challenge is to flatten the entries of the single hourly logs, so not to have a nested dictionary. With 1-dimensional dictionaries (one for each hour), easily a dataframe can be created with all the measures as columns and the date and time as index. From there on you can select whatever columns you'd like ;)
How do we get there and what do I mean by 'flatten the entries'?
The hourly logs come as single dictionaries with single key, value pairs except 'metric' which is another dictionary. What I want is to get rid of the key 'metric' but not its values. Let's look at an example:
# nested dictionary
original = {'a':1, 'b':2, 'foo':{'c':3}}
# flatten original to
flattened = {'a':1, 'b':2, 'c':3} # got rid of key 'foo' but not its value
The below function achieves exactly that, a 1-dimensional or flat dictionary:
def flatten(dic):
#
update = False
for key, val in dic.items():
if isinstance(val, dict):
update = True
break
if update: dic.update(val); dic.pop(key); flatten(dic)
return dic
# With data from your weather station
hourly_log = {'epoch': 1607554798, 'humidityAvg': 39, 'humidityHigh': 44, 'humidityLow': 37, 'lat': 27.389829, 'lon': 33.67048, 'metric': {'dewptAvg': 4, 'dewptHigh': 5, 'dewptLow': 4, 'heatindexAvg': 19, 'heatindexHigh': 19, 'heatindexLow': 18, 'precipRate': 0.0, 'precipTotal': 0.0, 'pressureMax': 1017.03, 'pressureMin': 1016.53, 'pressureTrend': 0.0, 'tempAvg': 19, 'tempHigh': 19, 'tempLow': 18, 'windchillAvg': 19, 'windchillHigh': 19, 'windchillLow': 18, 'windgustAvg': 8, 'windgustHigh': 13, 'windgustLow': 2, 'windspeedAvg': 6, 'windspeedHigh': 10, 'windspeedLow': 2}, 'obsTimeLocal': '2020-12-10 00:59:58', 'obsTimeUtc': '2020-12-09T22:59:58Z', 'qcStatus': -1, 'solarRadiationHigh': 0.0, 'stationID': 'IHURGH2', 'tz': 'Africa/Cairo', 'uvHigh': 0.0, 'winddirAvg': 324}
# Flatten with function
flatten(hourly_log)
>>> {'epoch': 1607554798,
'humidityAvg': 39,
'humidityHigh': 44,
'humidityLow': 37,
'lat': 27.389829,
'lon': 33.67048,
'obsTimeLocal': '2020-12-10 00:59:58',
'obsTimeUtc': '2020-12-09T22:59:58Z',
'qcStatus': -1,
'solarRadiationHigh': 0.0,
'stationID': 'IHURGH2',
'tz': 'Africa/Cairo',
'uvHigh': 0.0,
'winddirAvg': 324,
'dewptAvg': 4,
'dewptHigh': 5,
'dewptLow': 4,
'heatindexAvg': 19,
'heatindexHigh': 19,
'heatindexLow': 18,
'precipRate': 0.0,
'precipTotal': 0.0,
'pressureMax': 1017.03,
'pressureMin': 1016.53,
'pressureTrend': 0.0,
...
Notice: 'metric' is gone but not its values!
Now, a DataFrame can be easily created for each hourly log which can be concatenated to a single DataFrame:
import pandas as pd
hourly_logs = wu.hourly()['observations']
# List of DataFrames for each hour
frames = [pd.DataFrame(flatten(dic), index=[0]).set_index('epoch') for dic in hourly_logs]
# Concatenated to a single one
df = pd.concat(frames)
# With adjusted index as Date and Time
dti = pd.DatetimeIndex(df.index * 10**9)
df.index = pd.MultiIndex.from_arrays([dti.date, dti.time])
# All measures
df.columns
>>> Index(['humidityAvg', 'humidityHigh', 'humidityLow', 'lat', 'lon',
'obsTimeLocal', 'obsTimeUtc', 'qcStatus', 'solarRadiationHigh',
'stationID', 'tz', 'uvHigh', 'winddirAvg', 'dewptAvg', 'dewptHigh',
'dewptLow', 'heatindexAvg', 'heatindexHigh', 'heatindexLow',
'precipRate', 'precipTotal', 'pressureMax', 'pressureMin',
'pressureTrend', 'tempAvg', 'tempHigh', 'tempLow', 'windchillAvg',
'windchillHigh', 'windchillLow', 'windgustAvg', 'windgustHigh',
'windgustLow', 'windspeedAvg', 'windspeedHigh', 'windspeedLow'],
dtype='object')
# Read out specific measures
df[['tempHigh','tempLow','tempAvg']]
>>>
Hopefully this is what you've been looking for!
Pandas accepts a list of dictionaries as input to create a dataframe:
import pandas as pd
input_dict = {"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
observations = input_dict["observations"]
df = pd.DataFrame(observations)
If you now want a list of single "metrics" you need to "flatten" your list of dictionaries column. This does use your "Nemesis" but in a Pythonic way:
temperature_high = [d.get("tempHigh") for d in df["metric"].to_list()]
If you want all the metrics in a dataframe, even simpler, just get the list of dictionaries from the specific column:
metrics = pd.DataFrame(df["metric"].to_list())
As you would probably like the timestamp as an index to denote your entries (your rows), you can pick your column epoch, or the more human obsTimeLocal:
metrics = pd.DataFrame(df["metric"].to_list(), index=df["obsTimeLocal"].to_list())
From here you can read specific metrics of your interest:
metrics[["tempHigh", "tempLow"]]
I have a dataframe df
df
Object Action Cost1 Cost2
0 123 renovate 10000 2000
1 456 do something 0 10
2 789 review 1000 50
and a dictionary (called dictionary)
dictionary
{'Object_new': ['Object'],
'Action_new': ['Action'],
'Total_Cost': ['Cost1', 'Cost2']}
Further, I have a (at the beginning empty) dataframe df_new that should contain almost the identicall information as df, except that the column names need to be different (naming according to the dictionary) and that some columns from df should be consolidated (e.g. a sum-operation) based on the dictionary.
The result should look like this:
df_new
Object_new Action_new Total_Cost
0 123 renovate 12000
1 456 do something 10
2 789 review 1050
How can I achieve this result using only the dictionary? I tried to use the .map() function but could not figure out how to perform the sum-operation with it.
The code to reproduce both dataframes and the dictionary are attached:
# import libraries
import pandas as pd
### create df
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
### create dictionary
dictionary = {'Object_new':['Object'],
'Action_new':['Action'],
'Total_Cost' : ['Cost1', 'Cost2']}
### create df_new
# data_df_new = pd.DataFrame(columns=['Object_new', 'Action_new', 'Total_Cost' ])
data_df_new = {'Object_new': [123, 456, 789],
'Action_new': ['renovate', 'do something', 'review'],
'Total_Cost': [12000, 10, 1050],
}
df_new = pd.DataFrame(data_df_new)
A play with groupby:
inv_dict = {x:k for k,v in dictionary.items() for x in v}
df_new = df.groupby(df.columns.map(inv_dict),
axis=1).sum()
Output:
Action_new Object_new Total_Cost
0 renovate 123 12000
1 do something 456 10
2 review 789 1050
Given the complexity of your algorithm, I would suggest performing a Series addition operation to solve this problem.
Why? In Pandas, every column in a DataFrame works as a Series under the hood.
data_df_new = {
'Object_new': df['Object'],
'Action_new': df['Action'],
'Total_Cost': (df['Cost1'] + df['Cost2']) # Addition of two series
}
df_new = pd.DataFrame(data_df_new)
Running this code will map every value contained in your dataset, which will be stored in our dictionary.
You can use an empty data frame to copy the new column and use the to_dict to convert it to a dictionary.
import pandas as pd
import numpy as np
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
print(df)
MyEmptydf = pd.DataFrame()
MyEmptydf['Object_new']=df['Object']
MyEmptydf['Action_new']=df['Action']
MyEmptydf['Total_Cost'] = df['Cost1'] + df['Cost2']
print(MyEmptydf)
dictionary = MyEmptydf.to_dict(orient="index")
print(dictionary)
you can run the code here:https://repl.it/repls/RealisticVillainousGlueware
If you trying to entirely avoid pandas and only use the dictionary this should solve it
Object = []
totalcost = []
action = []
for i in range(0,3):
Object.append(data_df['Object'][i])
totalcost.append(data_df['Cost1'][i]+data_df['Cost2'][i])
action.append(data_df['Action'][i])
dict2 = {'Object':Object, 'Action':action, 'TotalCost':totalcost}