Pandas DataFrame automatically takes the wrong values as index - python

I tried to create DataFrames from a JSON file.
I have a list named "Series_participants" containing part of this JSON file. My list looks like this when I print it:
participantId 1
championId 76
stats {'item0': 3265, 'item2': 3143, 'totalUnitsHeal...
teamId 100
timeline {'participantId': 1, 'csDiffPerMinDeltas': {'1...
spell1Id 4
spell2Id 12
highestAchievedSeasonTier SILVER
dtype: object
<class 'list'>
Then I try to convert this list to a DataFrame like this:
pd.DataFrame(Series_participants)
But pandas uses the values of "stats" and "timeline" as the index for the DataFrame. I expected an automatic index range (0, ..., n).
EDIT 1:
participantId championId stats teamId timeline spell1Id spell2Id highestAchievedSeasonTier
0 1 76 3265 100 NaN 4 12 SILVER
I want to have a dataframe with "stats" & "timeline" columns containing dicts of their values, as in the Series display above.
What is my error?
EDIT 2:
I have tried to create the DataFrame manually, but pandas didn't take my choices into consideration and ended up using the indexes of the "stats" key of the Series.
Here is my code:
for j in range(0, len(df.participants[0])):
    for i in range(0, len(df.participants[0][0])):
        Series_participants = pd.Series(df.participants[0][i])
        test = {'participantId': Series_participants.values[0], 'championId': Series_participants.values[1], 'stats': Series_participants.values[2], 'teamId': Series_participants.values[3], 'timeline': Series_participants.values[4], 'spell1Id': Series_participants.values[5], 'spell2Id': Series_participants.values[6], 'highestAchievedSeasonTier': Series_participants.values[7]}
        if j == 0:
            df_participants = pd.DataFrame(test)
        else:
            df_participants.append(test, ignore_index=True)
The double loop is there to parse all "participants" of my JSON file.
LAST EDIT:
I achieved what I wanted with the following code:
for i in range(0, len(df.participants[0])):
    Series_participants = pd.Series(df.participants[0][i])
    df_test = pd.DataFrame(data=[Series_participants.values], columns=['participantId', 'championId', 'stats', 'teamId', 'timeline', 'spell1Id', 'spell2Id', 'highestAchievedSeasonTier'])
    if i == 0:
        df_participants = pd.DataFrame(df_test)
    else:
        df_participants = df_participants.append(df_test, ignore_index=True)
print(df_participants)
Thanks to all for your help!

For efficiency, you should try and manipulate your data as you construct your dataframe rather than as a separate step.
However, to split apart your dictionary keys and values you can use a combination of numpy.repeat and itertools.chain. Here's a minimal example:
df = pd.DataFrame({'A': [1, 2],
                   'B': [{'key1': 'val0', 'key2': 'val9'},
                         {'key1': 'val1', 'key2': 'val2'}],
                   'C': [{'key3': 'val10', 'key4': 'val8'},
                         {'key3': 'val3', 'key4': 'val4'}]})

import numpy as np
from itertools import chain

chainer = chain.from_iterable

lens = df['B'].map(len)
res = pd.DataFrame({'A': np.repeat(df['A'], lens),
                    'B': list(chainer(df['B'].map(lambda x: x.values())))})
res.index = chainer(df['B'].map(lambda x: x.keys()))
print(res)
A B
key1 1 val0
key2 1 val9
key1 2 val1
key2 2 val2

If you try to pass lists, Series or arrays containing dicts to the object constructor, it doesn't recognise what you're trying to do. One way around this is setting the value manually:
df.at['a', 'b'] = {'x':value}
Note, the above will only work if the columns and indexes are already created in your DataFrame.

Updated per comments: Pandas DataFrames can hold dictionaries, but it is not recommended.
Pandas is interpreting that you want one index for each of your dictionary keys and then broadcasting the single-item columns across them.
So, to help with what you are trying to do, I would recommend reading in your dictionary's items as columns, which is what data frames are typically used for and are very good at.
Example error, caused by pandas trying to read in the dictionary as key-value pairs:
df = pd.DataFrame(columns= ['a', 'b'], index=['a', 'b'])
df.loc['a','a'] = {'apple': 2}
returns
ValueError: Incompatible indexer with Series
Per jpp in the comments below (When using the constructor method):
"They can hold arbitrary types, e.g.
df.iat[0, 0] = {'apple': 2}
However, it's not recommended to use Pandas in this way."
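For completeness, here is a minimal sketch (not from the original answers; the field names and values are invented to echo the question) of a constructor pattern that does keep dicts in cells: passing a list of per-record dicts gives one row per record, and any nested dicts are stored as plain objects in their cells.
import pandas as pd

participants = [
    {'participantId': 1, 'championId': 76,
     'stats': {'item0': 3265, 'item2': 3143},
     'timeline': {'participantId': 1}},
    {'participantId': 2, 'championId': 10,
     'stats': {'item0': 1056},
     'timeline': {'participantId': 2}},
]

# One row per dict; the nested 'stats' and 'timeline' dicts stay in their cells.
df_participants = pd.DataFrame(participants)
print(df_participants)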


Creating a dataframe from dictionary with arbitrary length values (using recycled keys as column value)

I am struggling with converting a dictionary into a dataframe.
There are already a lot of answers showing how to do it in the "wide format" like https://stackoverflow.com/a/52819186/6912069 but I would like to do something different, preferably not using loops.
Consider the following example:
I have a dictionary like this one
d_test = {'A': [1, 2], 'B': [3]}
and I'd like to get a dataframe like
index id values
0 A 1
1 A 2
2 B 3
The index can be a normal consecutive integer column. By recycling I mean turning 'A'=[1, 2] into two rows having A in the id column and the values in the values column. This way I would have a "long format" dataframe of the dictionary items.
It seems to be a very basic thing to do, but I was wondering if there is an elegant pythonic way to achieve this. Many thanks for your help.
I would create 2 lists: one from the keys, and the other one from the values of the dictionary. Once you have built the lists, you can pass them into the DataFrame constructor.
import pandas as pd
dic = {'A': [1, 2], 'B': [3], 'D': [4, 5, 6]}
keys = []
values = []
for key, value in dic.items():
    for v in value:
        keys.append(key)
        values.append(v)

df = pd.DataFrame(
    {'id': keys,
     'values': values,
     })
print(df)
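Since the question asks for a way that avoids explicit loops, here is a minimal loop-free sketch (assuming pandas >= 0.25, where Series.explode is available):
import pandas as pd

d_test = {'A': [1, 2], 'B': [3]}

# Turn the dict into a Series of lists, explode each list into its own rows,
# then move the former keys out of the index into an 'id' column.
df = (pd.Series(d_test, name='values')
        .explode()
        .rename_axis('id')
        .reset_index())
print(df)
#   id values
# 0  A      1
# 1  A      2
# 2  B      3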

How to generate multiple pandas dataframe from ordereddict?

I have an Ordered Dictionary, where the keys are the worksheet names, and the values contain the worksheet items. Thus, the question: how do I use each of the keys and convert it to an individual dataframe?
import pandas as pd
powerbipath = 'PowerBI_Ingestion.xlsx'
dfs = pd.read_excel(powerbipath, None)

values = []
for idx, eachdf in enumerate(dfs):
    eachdf = dfs[eachdf]
    new_list1.append(eachdf)
    eachdf = pd.DataFrame(new_list1[idx])
Examples I have seen only show how to convert from an ordered dictionary to 1 pandas dataframe. I want to convert to multiple dataframes. Thus, if there are 5 keys, there will be 5 dataframes.
You may want to do something like this (assuming your dictionary looks like d):
d = {'first': [1, 2], 'second': [3, 4]}
for i in d:
    df = pd.DataFrame(d.get(i), columns=[i])
    print(df)
Output looks like :
first
0 1
1 2
second
0 3
1 4
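As a side note on the original Excel use case: pd.read_excel with sheet_name=None (which is what the question's pd.read_excel(powerbipath, None) call does) already returns a dict mapping each sheet name to its own DataFrame, so a sketch of getting one dataframe per worksheet could simply be:
import pandas as pd

# One DataFrame per worksheet, keyed by sheet name.
dfs = pd.read_excel('PowerBI_Ingestion.xlsx', sheet_name=None)
for sheet_name, sheet_df in dfs.items():
    print(sheet_name)
    print(sheet_df.head())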
Here is a basic answer using one of these ideas:
keys = df["key_column"].unique()
df_array = {}
for k in keys:
    df_array[k] = df[df['key_column'] == k]
There might be a more efficient way to do it, though.
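A more idiomatic sketch of the same split-by-key idea (assuming the frame really has a 'key_column' column) is a dict comprehension over groupby:
# One sub-DataFrame per distinct key, built in a single pass.
df_array = {key: group for key, group in df.groupby('key_column')}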

Is there any way to remove column and rows numbers from DataFrame.from_dict?

So, I have a problem with my dataframe from dictionary - python actually "names" my rows and columns with numbers.
Here's my code:
a = dict()
dfList = [x for x in df['Marka'].tolist() if str(x) != 'nan']
dfSet = set(dfList)
dfList123 = list(dfSet)
for i in range(len(dfList123)):
    number = dfList.count(dfList123[i])
    a[dfList123[i]] = number
sorted_by_value = sorted(a.items(), key=lambda kv: kv[1], reverse=True)
dataframe=pd.DataFrame.from_dict(sorted_by_value)
print(dataframe)
I've tried to rename columns like this:
dataframe=pd.DataFrame.from_dict(sorted_by_value, orient='index', columns=['A', 'B', 'C']), but it gives me an error:
AttributeError: 'list' object has no attribute 'values'
Is there any way to fix it?
Edit:
Here's the first part of my data frame:
0 1
0 VW 1383
1 AUDI 1053
2 VOLVO 789
3 BMW 749
4 OPEL 621
5 MERCEDES BENZ 593
...
The first row (the 0/1 column labels) and the first column (the row numbers) are exactly what I need to remove or rename.
index and columns are properties of your dataframe
As long as len(df.index) > 0 and len(df.columns) > 0, i.e. your dataframe has nonzero rows and nonzero columns, you cannot get rid of the labels from your pd.DataFrame object. Whether the dataframe is constructed from a dictionary, or otherwise, is irrelevant.
What you can do is remove them from a representation of your dataframe, with output either as a Python str object or a CSV file. Here's a minimal example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
print(df)
# 0 1 2
# 0 1 2 3
# 1 4 5 6
# output to string without index or headers
print(df.to_string(index=False, header=False))
# 1 2 3
# 4 5 6
# output to csv without index or headers
df.to_csv('file.csv', index=False, header=False)
By sorting the dict_items object (a.items()), you have created a list.
You can check this with type(sorted_by_value). Then, when you try to use the pd.DataFrame.from_dict() method, it fails because it is expecting a dictionary, which has 'values', but instead receives a list.
Probably the smallest fix you can make to the code is to replace the line:
dataframe=pd.DataFrame.from_dict(sorted_by_value)
with:
dataframe = pd.DataFrame(dict(sorted_by_value), index=[0]).
(The index=[0] argument is required here because pd.DataFrame expects a dictionary to be in the form {'key1': [list1, of, values], 'key2': [list2, of, values]} but instead sorted_by_value is converted to the form {'key1': value1, 'key2': value2}.)
Another option is to use pd.DataFrame(sorted_by_value) to generate a dataframe directly from the sorted items, although you may need to tweak sorted_by_value or the result to get the desired dataframe format.
Alternatively, look at collections.OrderedDict (the documentation for which is here) to avoid sorting to a list and then converting back to a dictionary.
Edit
Regarding naming of columns and the index, without seeing the data/desired result it's difficult to give specific advice. The options above will remove the error and allow you to create a dataframe, the columns of which can then be renamed using dataframe.columns = [list, of, column, headings]. For the index, look at pd.DataFrame.set_index(drop=True) (docs) and pd.DataFrame.reset_index() (docs).
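Putting the pieces above together, a minimal sketch for the edited example (the column names 'Marka' and 'Count' are only illustrative):
# Each (key, count) tuple in sorted_by_value becomes one row, with explicit column labels.
dataframe = pd.DataFrame(sorted_by_value, columns=['Marka', 'Count'])

# Display without the integer row labels.
print(dataframe.to_string(index=False))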

Looping through Pandas dict within dataframe

I have a dataframe with a column whose rows each contain a dict.
I would like to extract those dict's and turn them into dataframes so I can merge them together.
What's the best way to do this?
Something like:
for row in dataframe.column:
    dataframe_loop = pd.DataFrame(dataframe['column'].iloc(row), columns=['A', 'B'])
    dataframe_result = dataframe_result.append(dataframe_loop)
import pandas as pd
d = {'col': pd.Series([{'a':1}, {'b':2}, {'c':3}])}
df = pd.DataFrame(d)
>>>print(df)
col
0 {'a': 1}
1 {'b': 2}
2 {'c': 3}
res = {}
for row in df.iterrows():
    res.update(row[1]['col'])
>>>print(res)
{'b': 2, 'a': 1, 'c': 3}
If your column contains dicts and you want to make a dataframe out of those dicts, you can just convert the column to a list of dicts and make that into a dataframe directly:
pd.DataFrame(dataframe['column'].tolist())
The dictionary keys will become columns. If you want other behavior, you'll need to specify that.
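A minimal sketch of that one-liner, with a made-up column of dicts:
import pandas as pd

dataframe = pd.DataFrame({'column': [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}]})

# Each dict becomes one row; the keys become the columns.
expanded = pd.DataFrame(dataframe['column'].tolist())
print(expanded)
#    A  B
# 0  1  2
# 1  3  4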
I don't know what your dict in dataframe.column looks like. If it looks like the dictionary below, I think you can use pandas.concat to concatenate the dictionaries together.
import pandas as pd
# create a dummy dataframe
dataframe = pd.DataFrame({'column': [{'A': [1, 2, 3], 'B': [4, 5, 6]},
                                     {'A': [7, 8, 9], 'B': [10, 11, 12]},
                                     {'A': [13, 14, 15], 'B': [16, 17, 18]}]})
#print(dataframe)
res = pd.concat([pd.DataFrame(row, columns=['A', 'B']) for row in dataframe.column], ignore_index=True)
print(res)

Pandas DataFrame to List of Lists

It's easy to turn a list of lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
But how do I turn df back into a list of lists?
lol = df.what_to_do_now?
print lol
# [[1,2,3],[3,4,5]]
You could access the underlying array and call its tolist method:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]
If the data has column and index labels that you want to preserve, there are a few options.
Example data:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]],
...                   columns=('first', 'second', 'third'),
...                   index=('alpha', 'beta'))
>>> df
       first  second  third
alpha      1       2      3
beta       3       4      5
The tolist() method described in other answers is useful but yields only the core data - which may not be enough, depending on your needs.
>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]
One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.
>>> df.to_json()
{
"first":{"alpha":1,"beta":3},
"second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}
>>> df.to_json(orient='split')
{
"columns":["first","second","third"],
"index":["alpha","beta"],
"data":[[1,2,3],[3,4,5]]
}
Cumbersome but may be useful.
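To actually get lists back out of that JSON string, here is a small sketch completing the round trip:
import json

# Parse the JSON string back into plain Python lists.
parsed = json.loads(df.to_json(orient='split'))
print(parsed['columns'])  # ['first', 'second', 'third']
print(parsed['index'])    # ['alpha', 'beta']
print(parsed['data'])     # [[1, 2, 3], [3, 4, 5]]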
The good news is that it's pretty straightforward to build lists for the columns and rows:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]
This yields:
>>> print(f"columns: {columns}\nrows: {rows}")
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
If the None as the name of the index is bothersome, rename it:
df = df.rename_axis('stage')
Then:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}")
columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
I wanted to preserve the index, so I adapted the original answer to this solution:
list_df = df.reset_index().values.tolist()
Now you can paste it somewhere else (e.g. into a Stack Overflow question) and later recreate it:
df = pd.DataFrame(list_df, columns=['name1', ...])
df.set_index(['name1'], inplace=True)
I don't know if it will fit your needs, but you can also do:
>>> lol = df.values
>>> lol
array([[1, 2, 3],
[3, 4, 5]])
This is just a NumPy ndarray, which lets you do all the usual numpy array things.
I had this problem: how do I get the headers of the df into row 0 so I can write them to row 1 of the Excel file (using xlsxwriter)? None of the proposed solutions worked, but they pointed me in the right direction. I just needed one more line of code:
# get csv data
df = pd.read_csv(filename)
# combine column headers and list of lists of values
lol = [df.columns.tolist()] + df.values.tolist()
Maybe something changed, but this gave back a list of ndarrays, which did what I needed:
list(df.values)
Not quite related to the issue, but here is another flavor with the same expectation: converting a DataFrame column into a list of lists to plot a chart using create_distplot in Plotly:
hist_data=[]
hist_data.append(map_data['Population'].to_numpy().tolist())
"df.values" returns a numpy array. This does not preserve the data types. An integer might be converted to a float.
df.iterrows() returns a series which also does not guarantee to preserve the data types. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
The code below converts to a list of list and preserves the data types:
rows = [list(row) for row in df.itertuples()]
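Note that itertuples() includes the index as the first element of each row; if that is not wanted, a small variant (still preserving the data types) is:
# Plain tuples without the index, converted to lists.
rows = [list(row) for row in df.itertuples(index=False, name=None)]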
If you wish to convert a Pandas DataFrame to a table (list of lists) and include the header column this should work:
import pandas as pd
def dfToTable(df: pd.DataFrame) -> list:
    return [list(df.columns)] + df.values.tolist()
Usage (in REPL):
>>> df = pd.DataFrame(
...     [["r1c1", "r1c2", "r1c3"], ["r2c1", "r2c2", "r3c3"]],
...     columns=["c1", "c2", "c3"])
>>> df
c1 c2 c3
0 r1c1 r1c2 r1c3
1 r2c1 r2c2 r3c3
>>> dfToTable(df)
[['c1', 'c2', 'c3'], ['r1c1', 'r1c2', 'r1c3'], ['r2c1', 'r2c2', 'r3c3']]
The solutions presented so far suffer from a "reinventing the wheel" approach. Quoting #AMC:
If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
If you convert a dataframe to a list of lists you will lose information - namely the index and columns names.
My solution: use to_dict()
dict_of_lists = df.to_dict(orient='split')
This will give you a dictionary with three lists: index, columns, data. If you decide you really don't need the columns and index names, you get the data with
dict_of_lists['data']
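For example, with the alpha/beta frame used earlier in the thread, the result looks roughly like this (exact key order may vary by pandas version):
>>> df.to_dict(orient='split')
{'index': ['alpha', 'beta'], 'columns': ['first', 'second', 'third'], 'data': [[1, 2, 3], [3, 4, 5]]}
>>> df.to_dict(orient='split')['data']
[[1, 2, 3], [3, 4, 5]]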
We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:
# Empty list
row_list =[]
# Iterate over each row
# Iterate over each row
for index, rows in df.iterrows():
    # Create list for the current row
    my_list = [rows.Date, rows.Event, rows.Cost]
    # append the list to the final list
    row_list.append(my_list)
# Print
print(row_list)
This successfully extracts each row of the given data frame into a list.
This is very simple:
import numpy as np
list_of_lists = np.array(df)
Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unnecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
To quote a comment by #jpp:
In practice, there's often no need to convert the NumPy array into a list of lists.
If a Pandas DataFrame/Series won't work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.
A function I wrote that allows including the index column or the header row:
def df_to_list_of_lists(df, index=False, header=False):
    rows = []
    if header:
        rows.append(([df.index.name] if index else []) + [e for e in df.columns])
    for row in df.itertuples():
        rows.append([e for e in row] if index else [e for e in row][1:])
    return rows
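A usage sketch, assuming the alpha/beta example frame defined earlier in the thread:
>>> df_to_list_of_lists(df, index=True, header=True)
[[None, 'first', 'second', 'third'], ['alpha', 1, 2, 3], ['beta', 3, 4, 5]]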
