Pandas DataFrame to List of Lists - python

It's easy to turn a list of lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
But how do I turn df back into a list of lists?
lol = df.what_to_do_now?
print lol
# [[1,2,3],[3,4,5]]

You could access the underlying array and call its tolist method:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]

If the data has column and index labels that you want to preserve, there are a few options.
Example data:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
columns=('first', 'second', 'third'), \
index=('alpha', 'beta'))
>>> df
first second third
alpha 1 2 3
beta 3 4 5
The tolist() method described in other answers is useful but yields only the core data - which may not be enough, depending on your needs.
>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]
One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.
>>> df.to_json()
{
"first":{"alpha":1,"beta":3},
"second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}
>>> df.to_json(orient='split')
{
"columns":["first","second","third"],
"index":["alpha","beta"],
"data":[[1,2,3],[3,4,5]]
}
Cumbersome but may be useful.
The good news is that it's pretty straightforward to build lists for the columns and rows:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]
This yields:
>>> print(f"columns: {columns}\nrows: {rows}")
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
If the None as the name of the index is bothersome, rename it:
df = df.rename_axis('stage')
Then:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}")
columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

I wanted to preserve the index, so I adapted the original answer to this solution:
list_df = df.reset_index().values.tolist()
Now you can paste it somewhere else (e.g. to paste into a Stack Overflow question) and latter recreate it:
pd.Dataframe(list_df, columns=['name1', ...])
pd.set_index(['name1'], inplace=True)

I don't know if it will fit your needs, but you can also do:
>>> lol = df.values
>>> lol
array([[1, 2, 3],
[3, 4, 5]])
This is just a numpy array from the ndarray module, which lets you do all the usual numpy array things.

I had this problem: how do I get the headers of the df to be in row 0 for writing them to row 1 in the excel (using xlsxwriter)? None of the proposed solutions worked, but they pointed me in the right direction. I just needed one line more of code
# get csv data
df = pd.read_csv(filename)
# combine column headers and list of lists of values
lol = [df.columns.tolist()] + df.values.tolist()

Maybe something changed but this gave back a list of ndarrays which did what I needed.
list(df.values)

Not quite relate to the issue but another flavor with same expectation
converting data frame series into list of lists to plot the chart using create_distplot in Plotly
hist_data=[]
hist_data.append(map_data['Population'].to_numpy().tolist())

"df.values" returns a numpy array. This does not preserve the data types. An integer might be converted to a float.
df.iterrows() returns a series which also does not guarantee to preserve the data types. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
The code below converts to a list of list and preserves the data types:
rows = [list(row) for row in df.itertuples()]

If you wish to convert a Pandas DataFrame to a table (list of lists) and include the header column this should work:
import pandas as pd
def dfToTable(df:pd.DataFrame) -> list:
return [list(df.columns)] + df.values.tolist()
Usage (in REPL):
>>> df = pd.DataFrame(
[["r1c1","r1c2","r1c3"],["r2c1","r2c2","r3c3"]]
, columns=["c1", "c2", "c3"])
>>> df
c1 c2 c3
0 r1c1 r1c2 r1c3
1 r2c1 r2c2 r3c3
>>> dfToTable(df)
[['c1', 'c2', 'c3'], ['r1c1', 'r1c2', 'r1c3'], ['r2c1', 'r2c2', 'r3c3']]

The solutions presented so far suffer from a "reinventing the wheel" approach. Quoting #AMC:
If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
If you convert a dataframe to a list of lists you will lose information - namely the index and columns names.
My solution: use to_dict()
dict_of_lists = df.to_dict(orient='split')
This will give you a dictionary with three lists: index, columns, data. If you decide you really don't need the columns and index names, you get the data with
dict_of_lists['data']

We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:
# Empty list
row_list =[]
# Iterate over each row
for index, rows in df.iterrows():
# Create list for the current row
my_list =[rows.Date, rows.Event, rows.Cost]
# append the list to the final list
row_list.append(my_list)
# Print
print(row_list)
We can successfully extract each row of the given data frame into a list

This is very simple:
import numpy as np
list_of_lists = np.array(df)

Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
To quote a comment by #jpp:
In practice, there's often no need to convert the NumPy array into a list of lists.
If a Pandas DataFrame/Series won't work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.

A function I wrote that allows including the index column or the header row:
def df_to_list_of_lists(df, index=False, header=False):
rows = []
if header:
rows.append(([df.index.name] if index else []) + [e for e in df.columns])
for row in df.itertuples():
rows.append([e for e in row] if index else [e for e in row][1:])
return rows

Related

How to rename the first column of a pandas dataframe?

I have come across this question many a times over internet however not many answers are there except for few of the likes of the following:
Cannot rename the first column in pandas DataFrame
I approached the same using following:
df = df.rename(columns={df.columns[0]: 'Column1'})
Is there a better or cleaner way of doing the rename of the first column of a pandas dataframe? Or any specific column number?
You're already using a cleaner way in pandas.
It is sad that:
df.columns[0] = 'Column1'
Is impossible because Index objects do not support mutable assignments. It would give an TypeError.
You still could do iterable unpacking:
df.columns = ['Column1', *df.columns[1:]]
Or:
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
Not sure if cleaner, but possible idea is convert to list and set by indexing new value:
df = pd.DataFrame(columns=[4,7,0,2])
arr = df.columns.tolist()
arr[0] = 'Column1'
df.columns = arr
print (df)
Empty DataFrame
Columns: [Column1, 7, 0, 2]
Index: []

Select values from different Dataframe column based on a list of index

Here is my issue, I have a dataframe, let's say:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
I also have a list of index:
idx = [1, 2]
I would like to store in a list, the corresponding value in each column.
Meaning I want the first value of the col1 and the second value of the col2.
I'm sure there is a simple answer to my issue however I'm mixing everything up with iloc and cannot find a way of developing a optimized method in my case (I have 1000 rows and 4 columns).
IIUC, you can try:
you can extract the complete rows and then pick the diagonal elements
result = np.diag(df.values[idx])
Alternative:
convert the dataframe to numpy array.
use numpy indexing to access the required values.
result = df.values[idx, range(len(df.columns))]
OUTPUT:
array([6, 3])
Use:
list(df.values[idx, range(len(idx))])
Output:
[6, 3]
Here is a different way:
df.stack().loc[list(zip(idx,df.columns[:(len(idx))]))].to_numpy()

Get Non Empty Columnin Pandas Dataframe

In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing is somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
** Edit
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan
# set row number
row_number = 0
# get dataframe
df.loc[row_number, ~np.isnan(df.values)[row_number]]

pandas dataframe, how to add new row efficiently

I would like to know how to add a new row efficiently to the dataframe.
Assuming I have a empty dataframe
"A" "B"
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A=3, B=4} to the dataframe, how to do that in most efficient way?
columns = ['A', 'B']
user_list = pd.DataFrame(np.zeros((1000, 2)) + np.nan, columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore nan's pretty well. You'll have to manage your own resizing, though :/

Filter numpy ndarray (matrix) according to column values

This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to values in some specific columns. I have ndtypes so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
List comprehension like:
list = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(list)
wouldn't work because the dtypes I originally had are no longer accessible.
Are there any better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B','C'})
Thanks!
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as PD
Create some sample data as python dictionary, whose keys are the column names and whose values are the column values as a python list; one key/value pair per column
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
'value':[4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)
To return just the rows in which the category_code is either B or C, two steps conceptually, but can easily be done in a single line:
>>> # step 1: create the index
>>> idx = (D.category_code== 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.ix[idx]
category_code value
2 B 6
3 C 3
6 C 3
Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would just place the index inside the brackets, indicating which dimension you are indexing with a ",", and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> D[idx,:]
In Pandas, you call the the data frame's ix method, and place only the index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.

Categories

Resources