I have a pandas DataFrame, and I want to convert each row to a JSON message to upload. I figured this would be a great use case for the apply method, but I'm having a slight problem: it's not sending the column names.
Here's my data:
               Industry CountryName  periodDate      predicted
0  Advertising Agencies         USA      1995.0  144565.060000
1  Advertising Agencies         USA      1996.0  165903.120000
2  Advertising Agencies         USA      2001.0  326320.740300
When I use apply, I lose the column names (Industry, CountryName, periodDate, etc.):
import json

def sendAggData(row):
    uploadDataJson = row.to_json(orient='records')
    print(json.loads(uploadDataJson))

aggValue.apply(sendAggData, axis=1)
I get this result:
['Advertising Agencies', 'USA', 1995.0, 144565.06]
['Advertising Agencies', 'USA', 1996.0, 165903.12]
['Advertising Agencies', 'USA', 2001.0, 326320.7403]
I want this as a JSON message, so I'd like the column names included, something like {'Industry': 'Advertising Agencies', 'CountryName': 'USA', ...}. Previously I got this to work using a for loop over the rows, but I was told that apply is the more pandas way :-) Any suggestions on how to use apply correctly?
You can just do:
df.apply(lambda x: x.to_json(), axis=1)
Which gives you:
0    {"Industry":"Advertising Agencies","CountryNam...
1    {"Industry":"Advertising Agencies","CountryNam...
2    {"Industry":"Advertising Agencies","CountryNam...
dtype: object
However, what about df.to_dict('records'), which gives:
[{'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1995.0, 'predicted': 144565.06},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1996.0, 'predicted': 165903.12},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 2001.0, 'predicted': 326320.7403}]
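Putting the two suggestions together, a minimal sketch of sending one JSON message per row might look like the following. The `upload` function is a hypothetical stand-in for the real upload call (here it only collects the payloads), and the frame is rebuilt by hand from the sample data in the question. The reason the original attempt lost the column names is that, for a single row (a Series), `orient='records'` emits only the values.

```python
import json

import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "Industry": ["Advertising Agencies"] * 3,
    "CountryName": ["USA"] * 3,
    "periodDate": [1995.0, 1996.0, 2001.0],
    "predicted": [144565.06, 165903.12, 326320.7403],
})

sent = []

def upload(message):
    # Hypothetical stand-in for the real upload call.
    sent.append(message)

# One dict per row, keyed by column name, then serialized to JSON.
for record in df.to_dict("records"):
    upload(json.dumps(record))

# Per-row form with apply: each row is a Series, and the default
# orient for a Series keeps the column names as keys.
messages = df.apply(lambda row: row.to_json(), axis=1)
```

Either form keeps the column names; the `to_dict("records")` loop is the simpler choice when each message has to be handed to another function anyway.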
I have a large DataFrame with lots of columns, each with its own header. I'd like to group them under sub-headers to make the output easier to read. For example, here are my column headers:
df1 = df1[['Date', 'Time', 'USA', 'Canada', 'SD', 'HD']]
I'd like to have headers above certain columns, so the output looks like this:
     When          Country       Channel
Date    Time    USA    Canada    SD    HD
However, I'm not sure how to go about this. Any assistance or direction is much appreciated.
Thanks!
You can use:
df.columns = pd.MultiIndex.from_tuples([('When', 'Date'), ('When', 'Time'),
                                        ('Country', 'USA'), ('Country', 'Canada'),
                                        ('Channel', 'SD'), ('Channel', 'HD')])
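As a hedged variation on the same idea, the sketch below builds the two header levels with `pd.MultiIndex.from_arrays` and then selects one group. The single row of values is a placeholder invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Placeholder row; only the column names are taken from the question.
df1 = pd.DataFrame(
    [["2020-01-01", "10:00", 1, 2, 3, 4]],
    columns=["Date", "Time", "USA", "Canada", "SD", "HD"],
)

# Top-level group for each existing column, in the same order as df1.columns.
groups = ["When", "When", "Country", "Country", "Channel", "Channel"]
df1.columns = pd.MultiIndex.from_arrays([groups, df1.columns])

# Selecting a top-level label returns just that group's columns.
country = df1["Country"]
```

Individual columns are then addressed with a tuple, for example `df1[("When", "Date")]`.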
I pulled a list of historical option prices for AAPL using the Robinhood function robin_stocks.get_option_historicals(). The data was returned as a dictionary containing a list of dictionaries, as shown below.
I am having difficulty converting the object below (named historicalData) into a DataFrame. Can someone please help?
historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
'open_price': '1.430000',
'close_price': '1.430000',
'high_price': '1.430000',
'low_price': '1.430000',
'volume': 0,
'session': 'reg',
'interpolated': False},
{'begins_at': '2020-10-05T13:40:00Z',
'open_price': '1.430000',
'close_price': '1.340000',
'high_price': '1.440000',
'low_price': '1.320000',
'volume': 0,
'session': 'reg',
'interpolated': False}],
'open_time': '0001-01-01T00:00:00Z',
'open_price': '0.000000',
'previous_close_time': '0001-01-01T00:00:00Z',
'previous_close_price': '0.000000',
'interval': '10minute',
'span': 'week',
'bounds': 'regular',
'id': '22b49380-8c50-4c76-8fb1-a4d06058f91e',
'instrument': 'https://api.robinhood.com/options/instruments/22b49380-8c50-4c76-8fb1-a4d06058f91e/'}
I tried the code below, but that didn't help:
import pandas as pd
df = pd.DataFrame(historicalData)
df
You didn't write that you want only data_points (as in the
other answer), so I assume that you want your whole dictionary
converted to a DataFrame.
To do it, start with your code:
df = pd.DataFrame(historicalData)
It creates a DataFrame, with data_points "exploded" to
consecutive rows, but they are still dictionaries.
Then rename open_price column to open_price_all:
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
The reason is to avoid duplicated column names after the join
performed below: each entry in data_points also has an open_price
attribute, and I want the column that comes from data_points
to keep the name open_price.
The next step is to create a temporary DataFrame - a split of
dictionaries in data_points to individual columns:
wrk = df.data_points.apply(pd.Series)
Print wrk to see the result.
And the last step is to join df with wrk and drop
data_points column (not needed any more, since it was
split into separate columns):
result = df.join(wrk).drop(columns=['data_points'])
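As an alternative sketch (not the method described above), `pd.json_normalize` can produce a similar flat table in one call. The dictionary here is a trimmed copy of the question's data, kept short for illustration, and the `all_` prefix is an arbitrary choice to keep the top-level `open_price` from clashing with the per-point one.

```python
import pandas as pd

# Trimmed copy of the question's data, shortened for illustration.
historicalData = {
    "data_points": [
        {"begins_at": "2020-10-05T13:30:00Z",
         "open_price": "1.430000", "close_price": "1.430000"},
        {"begins_at": "2020-10-05T13:40:00Z",
         "open_price": "1.430000", "close_price": "1.340000"},
    ],
    "open_price": "0.000000",
    "interval": "10minute",
    "span": "week",
}

# One row per entry in data_points; the listed top-level keys are
# repeated on every row, with a prefix to avoid the name clash.
result = pd.json_normalize(
    historicalData,
    record_path="data_points",
    meta=["open_price", "interval", "span"],
    meta_prefix="all_",
)
```

The trade-off is that the top-level keys to carry along must be listed explicitly in `meta`, whereas the join approach keeps all of them.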
This is pretty easy to solve with the code below. I build a list of DataFrames, one per entry in data_points, using a list comprehension:
import pandas as pd
df_list = [pd.DataFrame(dic.items(), columns=['Parameters', 'Value']) for dic in historicalData['data_points']]
You then could do:
df_list[0]
which will yield
     Parameters                 Value
0     begins_at  2020-10-05T13:30:00Z
1    open_price              1.430000
2   close_price              1.430000
3    high_price              1.430000
4     low_price              1.430000
5        volume                     0
6       session                   reg
7  interpolated                 False
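If a single table with one row per data point is more convenient than a list of small frames, a sketch along these lines may help. It assumes only the `data_points` list matters, uses a shortened copy of the sample data, and converts the string prices to numbers as an optional extra step.

```python
import pandas as pd

# Shortened copy of historicalData['data_points'] from the question.
data_points = [
    {"begins_at": "2020-10-05T13:30:00Z", "open_price": "1.430000",
     "close_price": "1.430000", "volume": 0},
    {"begins_at": "2020-10-05T13:40:00Z", "open_price": "1.430000",
     "close_price": "1.340000", "volume": 0},
]

# A list of dicts maps directly to rows, with the keys as column names.
df = pd.DataFrame(data_points)

# The prices arrive as strings; convert them to floats.
price_cols = ["open_price", "close_price"]
df[price_cols] = df[price_cols].apply(pd.to_numeric)

# Parse the ISO 8601 timestamps into datetimes.
df["begins_at"] = pd.to_datetime(df["begins_at"])
```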
I have a DataFrame with GDP data. The first few columns contain important data about the countries (I have renamed these the way I wanted), but they are followed by a long list of columns, one per year from 1960 to 2015, each holding that year's GDP. In addition, these columns' names got messed up: they are named sequentially with the word 'Unnamed', i.e. 'Unnamed: 4', 'Unnamed: 5', etc.
My idea was to rename all the 'Unnamed' columns to the years (from 1960 to 2015), for example {'Unnamed: 4': 1960, 'Unnamed: 5': 1961, etc.}. So I tried to write the code below:
GDP = pd.read_csv('world_bank.csv')
GDP = GDP.rename(columns={"Data Source": "Country", "World Development Indicators": "Country Code", "Unnamed: 2": "Indicator name", "Unnamed: 3": "Indicator Code"})
GDP = GDP.replace({'Data Source': {'Korea, Rep.': 'South Korea', 'Iran, Islamic Rep.': 'Iran', 'Hong Kong SAR, China': 'Hong Kong'}})
#Below is what I wrote to try to iterate through
GDP = GDP.rename(columns={["Unnamed: "+str(i)+": "+str(j) for i in range(4, 60) for j in range(1960, 2016)]})
But when I use that code, it gives this error:
TypeError: unhashable type: 'list'
Any thoughts how to do this?
You can use a dict comprehension directly:
GDP = GDP.rename(columns={"Unnamed: " + str(i): str(1956 + i) for i in range(4, 60)})
rename expects a dictionary with the existing column names as keys and the new names as values, and it returns a new DataFrame, so assign the result back. You can see an example in the documentation.
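To check the mapping without the real CSV, here is a small sketch on a stand-in frame. The frame's contents are invented for illustration; only the column naming pattern follows the question.

```python
import pandas as pd

# Stand-in frame: four already-renamed columns followed by the
# 56 'Unnamed' ones ('Unnamed: 4' through 'Unnamed: 59').
columns = ["Country", "Country Code", "Indicator name", "Indicator Code"]
columns += ["Unnamed: " + str(i) for i in range(4, 60)]
GDP = pd.DataFrame([[0] * len(columns)], columns=columns)

# 'Unnamed: 4' -> '1960', ..., 'Unnamed: 59' -> '2015'
year_names = {"Unnamed: " + str(i): str(1956 + i) for i in range(4, 60)}
GDP = GDP.rename(columns=year_names)
```

Using `1956 + i` instead of `str(1956 + i)` as the value would give integer column labels, which is closer to what the question sketched.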
I have JSON output from m3inference package in python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
   gender.male  gender.female  age.<=18  ...  age.>=40  org.non-org  org.is-org
0       0.0009         0.9991    0.0000  ...    0.9998       0.9968      0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas DataFrame, the dtype is object and the value changes to
0    0.0032
Name: org.is-org, dtype: float64
not 0.0032.
How can I fix this?
"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is the index of your DataFrame. Unless you specify an index, pandas creates one automatically. You can set a different index instead.
Code example:
org.set_index('gender.male', inplace=True)
The index is like an address for your data: it is how any data point in a DataFrame or Series can be accessed.
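Since selecting a column returns a Series that carries that index along, one way to get the bare number is to pull out its single element by position. This is a minimal sketch using only the 'output' part of the question's data; it uses `pd.json_normalize` directly rather than the question's exact call.

```python
import pandas as pd

# The 'output' part of the m3inference result from the question.
output = {
    "age": {"19-29": 0.0, "30-39": 0.0001, "<=18": 0.0001, ">=40": 0.9998},
    "gender": {"female": 0.9991, "male": 0.0009},
    "org": {"is-org": 0.0032, "non-org": 0.9968},
}

org = pd.json_normalize(output)  # one row, labelled 0 by the default index
column = org["org.is-org"]       # a one-element Series, index label included
isorg = column.iloc[0]           # the bare value, without index or name
```

Storing `isorg` rather than `column` in another DataFrame keeps the cell a plain float instead of a whole Series.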
I have a list of a list of dictionaries. I managed to access each list-element within the outer list and convert the dictionary via pandas into a data-frame. I then save the DF and later concat it. That's a perfect result. But I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []
    frames.append(frame)
Addendum: Thanks for all the answers. They work on my MWE, which made me notice that there are some strange entries in my dataset. No solution works for my real dataset, since it has an inner list that contains two dictionaries (due to non-unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dict:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
        area  gini         name  population
0   323802.0  25.8       Norway     5223256
1    41284.0  33.7  Switzerland     8341600
2  7692024.0  30.5    Australia    24117360
By bringing the list comprehension into the DataFrame constructor, that list is temporary, and you don't have to worry about cleaning it up. pop also consumes the dictionaries out of mwe, minimizing the number of copies you carry around in memory.
As a note, after doing this, mwe will look like:
mwe
[[], [], []]
because the contents of the sub-lists have been popped out.
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = dict.fromkeys(["name", "population", "area", "gini"])
# the easy way, but it copies data
# (the outer loop over mwe must come first in the comprehension)
df = pd.DataFrame([item for data in mwe
                   for item in data
                   if item.keys() == key_check.keys()])
With that approach you'll still have the data hanging around in mwe, so it might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data:  # when data is empty, the while loop will end
            item = data.pop()  # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item  # will minimize data copying through lazy evaluation

df = pd.DataFrame([x for x in get_filtered_data(mwe)])
        area  gini                      name  population
0   323802.0  25.8                    Norway     5223256
1    41284.0  33.7               Switzerland     8341600
2  7692024.0  30.5                 Australia    24117360
3  9629091.0  48.0  United States of America   323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later
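For the "filter it with pandas later" route, a minimal sketch might flatten everything first and then drop the incomplete rows. It assumes, as above, that unwanted entries are exactly those missing `area` or `gini`; the data is a shortened copy of the question's.

```python
from itertools import chain

import pandas as pd

# Shortened copy of the question's data, including the problem entry.
mwe = [
    [{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
    [{"name": "United States Minor Outlying Islands", "population": 300},
     {"name": "United States of America", "population": 323947000,
      "area": 9629091.0, "gini": 48.0}],
]

# Flatten the nested lists into one list of dicts; keys missing
# from a dict show up as NaN in the resulting frame.
flat = pd.DataFrame(list(chain.from_iterable(mwe)))

# Keep only rows that have both 'area' and 'gini', then renumber them.
clean = flat.dropna(subset=["area", "gini"]).reset_index(drop=True)
```

Unlike the `pop`-based generator, this leaves `mwe` untouched, so it trades memory for simplicity.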
Create an empty DataFrame and loop over the list, calling df.append on each iteration:
>>> import pandas as pd
>>> mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
...        [{'name': 'Switzerland',
...          'population': 8341600,
...          'area': 41284.0,
...          'gini': 33.7}],
...        [{'name': 'Australia',
...          'population': 24117360,
...          'area': 7692024.0,
...          'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
...     df = df.append(country)
...
>>> df
        area  gini         name  population
0   323802.0  25.8       Norway     5223256
0    41284.0  33.7  Switzerland     8341600
0  7692024.0  30.5    Australia    24117360
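DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the loop above raises an AttributeError. A hedged equivalent is to build the list of frames first and concatenate once with `pd.concat`; `ignore_index=True` is an optional addition that renumbers the rows instead of repeating 0.

```python
import pandas as pd

mwe = [
    [{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
    [{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
    [{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]

# Build one small frame per inner list, then concatenate them all at once.
frames = [pd.DataFrame(country) for country in mwe]
df = pd.concat(frames, ignore_index=True)
```

Concatenating once also avoids copying the growing frame on every iteration, which matters for larger inputs.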
Try this:
df = pd.DataFrame(columns=['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    # relies on each dict listing its keys in the same order as the columns
    df.loc[i] = list(mwe[i][0].values())
Output:
          name population       area  gini
0       Norway    5223256   323802.0  25.8
1  Switzerland    8341600    41284.0  33.7
2    Australia   24117360  7692024.0  30.5