Suppose I have the following dataframe:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
baz one -2.616540 -1.368694
two -1.989319 1.627848
foo one -0.404563 -1.099314
two -2.006166 0.867398
qux one -0.843150 -1.045291
two 2.129620 -2.697217
I now select a sub-dataframe by indexing:
temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
However, if I look at the index, the values of the original index still appear:
temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
This does not happen with ordinary dataframes: when you index, the resulting copy (or even the view) contains only the selected index/columns. This is annoying because I often do lots of filtering on big dataframes, and at the end I would like to know the index of what's left by just doing
df.index
df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:
temp = s.loc[(['bar', 'foo'], slice(None)), :]
but the index levels still contain 'baz' and 'qux':
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
names=[u'first', u'second'])
To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:
import pandas as pd

def update_multiindex(df):
    if isinstance(df.columns, pd.MultiIndex):
        new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
        new_df = pd.DataFrame(new_df)
    else:
        new_df = df.copy()
    if isinstance(df.index, pd.MultiIndex):
        new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
        new_df = pd.DataFrame(new_df).T
    return new_df
temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.
That said, I suspect (but do not know) that MultiIndex used this way is closely related to CategoricalIndex, which has the method remove_unused_categories(); the MultiIndex analogue is remove_unused_levels(). Whether it covers your case I cannot tell, because...
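If your pandas version has it (0.20.0 or later), a minimal sketch of what that would look like on the filtered frame from the question:
temp.index = temp.index.remove_unused_levels()
temp.index.levels[0]  # now contains only the first-level labels that survived the filter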
Second, MultiIndex is notably missing from the pandas API documentation. I do not use MultiIndex, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to dig through the source code if you want exact information on the features MultiIndex offers.
If I understand your usage pattern correctly, you may be able to get the best of both worlds. I'm focusing on:
This is annoying because I might often do lots of filtering on big
dataframes and at the end I would like to know the index of what's
left by just doing
df.index
df
This also happens for multiindex columns. Is there a
proper way to update the index/columns and drop the empty entries?
Consideration (1) is that you want to know the index of what's left. Consideration (2) is that, as mentioned above, if you trim the MultiIndex you can't merge any data back into your original, and it's also a bunch of non-obvious steps that aren't really encouraged.
The fundamental point is that .index does NOT return updated contents for a MultiIndex after rows or columns have been deleted, and this is not considered a bug because that's not the approved use of MultiIndex (read more: github.com/pydata/pandas/issues/3686). The approved API access for the current contents of a MultiIndex is get_level_values.
So would it fit your needs to adjust your practice to use this?
df.index.get_level_values(<your level name or number>)
For MultiIndexes this is the approved API access technique, and there are some good reasons for it. If you use get_level_values instead of just .index, you'll be able to get the current contents while ALSO preserving all the information in case you want to re-merge modified data or otherwise match against the original indices for comparisons, grouping, etc.
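For instance, with the temp frame filtered down to 'bar' earlier in the question, a quick sketch of the difference:
temp.index.get_level_values('first')  # Index(['bar', 'bar'], dtype='object', name='first')
temp.index.levels[0]                  # still contains 'bar', 'baz', 'foo', 'qux'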
Does that fit your needs?
There is a difference between the index of s and the index of temp:
In [25]: s.index
Out[25]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
In [26]: temp.index
Out[26]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
Notice that the labels in the MultiIndex are different.
Try using droplevel.
temp.index = temp.index.droplevel()
>>> temp
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
>>> temp.index
Index([u'one', u'two'], dtype='object')
When dealing with columns, it's the same thing:
df.columns = df.columns.droplevel()
You can also use xs, whose drop_level parameter defaults to True, so the selected level is dropped:
>>> s.xs('bar', drop_level=True)
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
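If you instead want to keep the outer level (and hence the full MultiIndex), pass drop_level=False; a quick sketch:
>>> s.xs('bar', drop_level=False)
                     0         1
first second
bar   one     0.450819 -1.071271
      two    -0.371563  0.411808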
I want to find a way in Python to merge the files on 'seq' but keep all rows that share an id with a match; in this example only the lines with id 2 would be removed.
File one:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3
File two:
seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY
Output:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3
I have tried:
df3 = df1.merge(df2.groupby('seq',as_index=False)[['seq']].agg(','.join),how='right')
output:
seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0
Does anyone have any advice how to solve this?
Do you want to merge the two dataframes, or just take a subset of the first dataframe according to which id is included in the second dataframe (by seq)? Either way, this gives the required result.
df1 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CTVGPPNNEQFF',
'CTVGPPNNERFF',
'CASRGEAAGFYEQYF',
'RASRGEAAGFYEQYF',
'CASRGGAAGFYEQYF',
'CASSDLILYYEQYF',
'CASSDLILYYTQYF',
'CASSGSYEQYF',
'CASSGSYEQYY'
],
'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})
df2 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CASRGEAAGFYEQYF',
'CASSGSYEQYY'
]
})
df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]
Explanation: df1['id'][df1['seq'].isin(df2['seq'])] takes those values of id from df1 that contain at least one seq that is included in df2. Then all rows with those values of id are taken from df1.
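Broken into steps (purely a restatement of the one-liner above), the logic reads:

mask = df1['seq'].isin(df2['seq'])  # rows of df1 whose seq appears in df2
ids = df1.loc[mask, 'id']           # the id values attached to those rows
df3 = df1[df1['id'].isin(ids)]      # every df1 row carrying one of those ids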
You can use the isin() pandas method; the code should look as follows:
df1.loc[df1['seq'].isin(df2['seq'])]
This assumes both objects are pandas DataFrames and 'seq' is a column.
I have a Dataframe with a pandas MultiIndex:
In [1]: import pandas as pd
In [2]: multi_index = pd.MultiIndex.from_product([['CAN','USA'],['total']],names=['country','sex'])
In [3]: df = pd.DataFrame({'pop':[35,318]},index=multi_index)
In [4]: df
Out[4]:
pop
country sex
CAN total 35
USA total 318
Then I remove some rows from that DataFrame:
In [5]: df = df.query('pop > 100')
In [6]: df
Out[6]:
pop
country sex
USA total 318
But when I consult the MultiIndex, it still has both countries in its levels.
In [7]: df.index.levels[0]
Out[7]: Index([u'CAN', u'USA'], dtype='object')
I can fix this myself in a rather strange way:
In [8]: idx_names = df.index.names
In [9]: df = df.reset_index(drop=False)
In [10]: df = df.set_index(idx_names)
In [11]: df
Out[11]:
pop
country sex
USA total 318
In [12]: df.index.levels[0]
Out[12]: Index([u'USA'], dtype='object')
But this seems rather messy. Is there a better way I'm missing?
From pandas 0.20.0 onwards, use MultiIndex.remove_unused_levels:
print (df.index)
MultiIndex(levels=[['CAN', 'USA'], ['total']],
labels=[[1], [0]],
names=['country', 'sex'])
df.index = df.index.remove_unused_levels()
print (df.index)
MultiIndex(levels=[['USA'], ['total']],
labels=[[0], [0]],
names=['country', 'sex'])
This is something that has bitten me before. Dropping columns or rows does NOT change the underlying MultiIndex, for performance and philosophical reasons, and this is officially not considered a bug (read more at github.com/pydata/pandas/issues/3686). The short answer is that the developers say "that's not what the MultiIndex is for". If you need a list of the contents of a MultiIndex level after modification, for example for iteration or to check whether something is included, you can use:
df.index.get_level_values(<levelname>)
This returns the current active values within that index level.
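For the example above, that looks like (a quick sketch):

df.index.get_level_values('country')
# Index(['USA'], dtype='object', name='country')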
So I guess the "trick" here is that the API native way to do it is to use get_level_values instead of just .index or .columns
I will be surprised if there is a more "built-in" way to eliminate the unused country than to re-create the index in the way you're doing (or some similar way). If you look at your index before and after the slice:
In [165]: df.index
Out[165]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[0, 1], [0, 0]],
names=[u'country', u'sex'])
In [166]: df = df.query('pop > 100')
In [167]: df.index
Out[167]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[1], [0]],
names=[u'country', u'sex'])
you can see that the labels - which are indexes into the level values - have updated but not the level values. This may be an imperfect analogy, but it strikes me that the level values are analogous to an enumerated column in a database table, while the labels are analogous to the actual values of rows in the table. If you delete all the rows in a table with a value of "CAN", it doesn't change the fact that "CAN" is still a valid choice based on the column definition. To remove "CAN" from the enumeration, you have to alter the column definition; this is the equivalent of reindexing the dataframe in pandas.
I am getting really strange results with a pandas DataFrame grouping operation. What I want to do is group by index (my index is non-unique), and then fill null values appropriately. This works in many cases but in some instances I am getting a strange behavior where an empty DataFrame is all that is returned:
df = pd.DataFrame(columns=['sample', 'cooling_rate'],
index=['SYd', 'SYd', 'XNa', 'Xna', 'Qza_new', 'Qza_new'],
data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
#Empty DataFrame
#Columns: []
#Index: []
However, if I change the DataFrame ever so slightly, by renaming the index item 'Qza_new' to 'qza_new':
df = pd.DataFrame(columns=['sample', 'cooling_rate'],
index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
# sample cooling_rate
#SYd SYd 3
#SYd SYd 3
#XNa XNa 3
#Xna XNa 3
#qza_new val1 val3
#qza_new val1 1
The result is a properly grouped, filled DataFrame as expected. I can't make any sense of this behavior, and I'm not getting any sort of "error".
With more experimentation, it appears that the key is definitely in my DataFrame index line:
index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
It appears that the second to last value has to be earlier in the alphabet than the last value. In other words,
index=['SYd', 'SYd', 'XNa', 'XNa', 'a', 'b']
works and returns a filled in DataFrame, but:
index=['SYd', 'SYd', 'XNa', 'XNa', 'c', 'b']
returns an empty DataFrame. But why?
I suspect I must be missing something obvious, but I have no idea why I'm seeing this behavior.
Update:
This issue appears to be known: https://github.com/pandas-dev/pandas/issues/14955 Hopefully will be fixed next release.
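Until a fix lands, one possible workaround, suggested by the ordering observation above (a sketch only, not verified against that exact pandas version), is to sort the index so the groups come out in order before grouping:

res = df.sort_index().groupby(level=0).fillna('1')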
I have a MultiIndexed DataFrame df containing the explanatory variables and a DataFrame df_Y containing the response variables:
import numpy as np
import pandas as pd
from sklearn import linear_model

# Create DataFrame for explanatory variables
# (note: the original `np.arrays = ...` overwrote an attribute on the numpy module)
arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])
# Create DataFrame for response variables
df_Y = pd.DataFrame([1, 2, 3], columns=['Y'])
I am able to perform regression on just the single level DataFrame with index foo
df_X = df.loc['foo']  # using only 'foo' (.ix is deprecated)
reg = linear_model.Ridge().fit(df_X, df_Y)
reg.coef_
Problem: since the Y variables are the same for both levels foo and bar, we could have twice as many regression samples if we also include bar.
What is the best way to reshape/collapse/unstack the multilevel DataFrame so we can make use of all the data for the regression? Other levels may have fewer rows than df_Y.
Sorry for the confusing wording, I am unsure of the correct terms/phrasing
The first index level can be dropped, and then a join will work:
df.index = df.index.droplevel()
df = df.join(df_Y)
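Note that join aligns on index labels: after dropping the outer level, df is indexed by 1, 2, 3 (twice over), while df_Y above carries the default index 0, 1, 2. A sketch of the alignment step, assuming each row of df_Y corresponds to the inner keys 1, 2, 3:

df_Y.index = [1, 2, 3]  # line df_Y up with the inner level labels
df_all = df.join(df_Y)  # each Y value now pairs with both the 'foo' and 'bar' rows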
It's easy to turn a list of lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
But how do I turn df back into a list of lists?
lol = df.what_to_do_now?
print(lol)
# [[1,2,3],[3,4,5]]
You could access the underlying array and call its tolist method:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]
If the data has column and index labels that you want to preserve, there are a few options.
Example data:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
columns=('first', 'second', 'third'), \
index=('alpha', 'beta'))
>>> df
first second third
alpha 1 2 3
beta 3 4 5
The tolist() method described in other answers is useful but yields only the core data - which may not be enough, depending on your needs.
>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]
One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.
>>> df.to_json()
{
"first":{"alpha":1,"beta":3},
"second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}
>>> df.to_json(orient='split')
{
"columns":["first","second","third"],
"index":["alpha","beta"],
"data":[[1,2,3],[3,4,5]]
}
Cumbersome but may be useful.
The good news is that it's pretty straightforward to build lists for the columns and rows:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]
This yields:
>>> print(f"columns: {columns}\nrows: {rows}")
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
If the None as the name of the index is bothersome, rename it:
df = df.rename_axis('stage')
Then:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}")
columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
I wanted to preserve the index, so I adapted the original answer to this solution:
list_df = df.reset_index().values.tolist()
Now you can paste it somewhere else (e.g. into a Stack Overflow question) and later recreate it:
df = pd.DataFrame(list_df, columns=['name1', ...])
df.set_index(['name1'], inplace=True)
I don't know if it will fit your needs, but you can also do:
>>> lol = df.values
>>> lol
array([[1, 2, 3],
[3, 4, 5]])
This is just a NumPy ndarray, which lets you do all the usual NumPy array things.
I had this problem: how do I get the headers of the df into row 0 of a list, for writing them to row 1 of an Excel sheet (using xlsxwriter)? None of the proposed solutions worked, but they pointed me in the right direction. I just needed one more line of code:
# get csv data
df = pd.read_csv(filename)
# combine column headers and list of lists of values
lol = [df.columns.tolist()] + df.values.tolist()
Maybe something changed, but this gave back a list of ndarrays, which did what I needed.
list(df.values)
Not quite related to the issue, but another flavor with the same expectation: converting a DataFrame column into a list of lists to plot a chart using create_distplot in Plotly.
hist_data=[]
hist_data.append(map_data['Population'].to_numpy().tolist())
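For context, a fuller sketch of that Plotly call (the group label is just illustrative, and a map_data frame with a 'Population' column is assumed from the snippet above):

import plotly.figure_factory as ff

hist_data = [map_data['Population'].to_numpy().tolist()]
fig = ff.create_distplot(hist_data, group_labels=['Population'])
fig.show()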
"df.values" returns a numpy array. This does not preserve the data types. An integer might be converted to a float.
df.iterrows() returns a series which also does not guarantee to preserve the data types. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
The code below converts to a list of list and preserves the data types:
rows = [list(row) for row in df.itertuples()]
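Note that itertuples() includes the index as the first element of each tuple; pass index=False if you only want the column values:

rows = [list(row) for row in df.itertuples(index=False)]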
If you wish to convert a Pandas DataFrame to a table (list of lists) and include the header column this should work:
import pandas as pd
def dfToTable(df: pd.DataFrame) -> list:
    return [list(df.columns)] + df.values.tolist()
Usage (in REPL):
>>> df = pd.DataFrame(
[["r1c1","r1c2","r1c3"],["r2c1","r2c2","r3c3"]]
, columns=["c1", "c2", "c3"])
>>> df
c1 c2 c3
0 r1c1 r1c2 r1c3
1 r2c1 r2c2 r3c3
>>> dfToTable(df)
[['c1', 'c2', 'c3'], ['r1c1', 'r1c2', 'r1c3'], ['r2c1', 'r2c2', 'r3c3']]
The solutions presented so far suffer from a "reinventing the wheel" approach. Quoting @AMC:
If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
If you convert a dataframe to a list of lists you will lose information - namely the index and columns names.
My solution: use to_dict()
dict_of_lists = df.to_dict(orient='split')
This will give you a dictionary with three lists: index, columns, data. If you decide you really don't need the columns and index names, you get the data with
dict_of_lists['data']
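A handy side effect: because the keys of the 'split' dict match the DataFrame constructor's arguments, it round-trips cleanly (a small sketch):

d = df.to_dict(orient='split')
df2 = pd.DataFrame(data=d['data'], index=d['index'], columns=d['columns'])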
We can use the DataFrame.iterrows() function to iterate over the rows of the given DataFrame and construct a list out of the data of each row:
# Empty list
row_list = []

# Iterate over each row
for index, rows in df.iterrows():
    # Create a list for the current row
    my_list = [rows.Date, rows.Event, rows.Cost]
    # Append the list to the final list
    row_list.append(my_list)

# Print
print(row_list)
This way we can successfully extract each row of the given data frame into a list.
This is very simple:
import numpy as np
list_of_lists = np.array(df).tolist()
Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unnecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
To quote a comment by @jpp:
In practice, there's often no need to convert the NumPy array into a list of lists.
If a Pandas DataFrame/Series won't work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.
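For the list-of-lists case that means, in modern pandas (0.24+):

lol = df.to_numpy().tolist()  # preferred over df.values.tolist()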
A function I wrote that allows including the index column or the header row:
def df_to_list_of_lists(df, index=False, header=False):
    rows = []
    if header:
        rows.append(([df.index.name] if index else []) + [e for e in df.columns])
    for row in df.itertuples():
        rows.append([e for e in row] if index else [e for e in row][1:])
    return rows
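Usage with the labelled alpha/beta frame from the earlier answer (a quick sketch):

>>> df_to_list_of_lists(df, index=True, header=True)
[[None, 'first', 'second', 'third'], ['alpha', 1, 2, 3], ['beta', 3, 4, 5]]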