Pandas Column Names of MultiIndex DataFrame - strange behaviour - python

I observed some strange pandas behavior with the columns of a MultiIndex DataFrame.
Constructing a MultiIndex DataFrame:
import pandas as pd
a = [0, .25, .5, .75]
b = [1, 2, 3, 4]
c = [5, 6, 7, 8]
d = [1, 2, 3, 5]
df = pd.DataFrame(data={('a', 'a'): a, ('b', 'b'): b, ('c', 'c'): c, ('d', 'd'): d})
produces this DataFrame:
      a  b  c  d
      a  b  c  d
0  0.00  1  5  1
1  0.25  2  6  2
2  0.50  3  7  3
3  0.75  4  8  5
Creating a new variable with a subset of the original dataFrame
df1=df.copy().loc[:,[('a', 'a'), ('b', 'b')]]
produces, as expected:
      a  b
      a  b
0  0.00  1
1  0.25  2
2  0.50  3
3  0.75  4
but accessing the column names of this new DataFrame produces some unexpected output:
print df1.columns
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [u'a', u'b', u'c', u'd']],
           labels=[[0, 1], [0, 1]])
so 'c' and 'd' are still contained in the levels.
In contrast
print df1.columns.tolist()
returns, as expected:
[('a', 'a'), ('b', 'b')]
Can anybody explain the reason for this behavior?

I think you need MultiIndex.remove_unused_levels, which is a new function in version 0.20.0. Slicing a MultiIndex only filters the labels (the integer codes); the levels (the unique values) are left untouched, which is why 'c' and 'd' still show up in df1.columns but not in df1.columns.tolist().
Docs.
print (df1.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']],
           labels=[[0, 1], [0, 1]])
print (df1.columns.remove_unused_levels())
MultiIndex(levels=[['a', 'b'], ['a', 'b']],
           labels=[[0, 1], [0, 1]])
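To make the DataFrame itself carry the pruned index, you can assign the result back (a small addition, not part of the original answer, since remove_unused_levels returns a new MultiIndex rather than modifying in place):
df1.columns = df1.columns.remove_unused_levels()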


column is not getting dropped

Why is column A not getting dropped in the train, valid, and test data frames?
import pandas as pd
train = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
for df in [train, valid, test]:
    df = df.drop(['A'], axis=1)
print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
#True
#True
#True
You can use the inplace=True parameter, because the DataFrame.drop function also works in place:
for df in [train, valid, test]:
    df.drop(['A'], axis=1, inplace=True)
print('A' in train.columns)
False
print('A' in test.columns)
False
print('A' in valid.columns)
False
The reason the column is not removed is that df is not assigned back: the loop variable df is simply rebound to the new DataFrame returned by drop, while the original DataFrames stay unchanged.
Another idea is to create a list of DataFrames and assign each changed DataFrame back:
L = [train, valid, test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'], axis=1)
print (L)
[   B  C
 0  5  a
 1  6  b
 2  7  c
 3  8  d
 4  9  e,    B  C
 0  5  a
 1  6  b
 2  7  c
 3  8  d
 4  9  e,    B  C
 0  5  a
 1  6  b
 2  7  c
 3  8  d
 4  9  e]
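Note that after this loop the names train, valid and test still refer to the original, unchanged DataFrames. If you want those names rebound to the reduced versions, you can unpack the list afterwards (a small extra step, not in the original answer):
train, valid, test = L
print('A' in train.columns)
False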

Combining 3 Arrays into 1 Matrix (Python 3)

I have 3 arrays of equal length (e.g.):
[a, b, c]
[1, 2, 3]
[i, ii, iii]
I would like to combine them into a matrix:
|a, 1, i |
|b, 2, ii |
|c, 3, iii|
The problem I have is that when I use functions such as dstack, hstack, or concatenate, I get them numerically added or stacked in a fashion that I cannot work with.
You could use zip(), which pairs up the elements at the same index of multiple containers so that they can be used as a single entity.
a1 = ['a', 'b', 'c']
b1 = ['1', '2', '3']
c1 = ['i', 'ii', 'iii']
print(list(zip(a1,b1,c1)))
OUTPUT:
[('a', '1', 'i'), ('b', '2', 'ii'), ('c', '3', 'iii')]
EDIT:
Taking this a step further: how about flattening the list afterwards and then using numpy.reshape?
res = list(zip(a1, b1, c1))

flattened_list = []
# flatten the list
for x in res:
    for y in x:
        flattened_list.append(y)
# print(flattened_list)
import numpy as np
data = np.array(flattened_list)
shape = (3, 3)
print(data.reshape(shape))
OUTPUT:
[['a' '1' 'i']
 ['b' '2' 'ii']
 ['c' '3' 'iii']]
OR
for the one-liner fans out there:
# flatten the list (reset it first, since the block above already filled it)
flattened_list = []
for x in res:
    for y in x:
        flattened_list.append(y)
# print(flattened_list)
print([flattened_list[i:i+3] for i in range(0, len(flattened_list), 3)])
OUTPUT:
[['a', '1', 'i'], ['b', '2', 'ii'], ['c', '3', 'iii']]
OR
As suggested by @norok2:
print(list(zip(*zip(a1, b1, c1))))
OUTPUT:
[('a', 'b', 'c'), ('1', '2', '3'), ('i', 'ii', 'iii')]
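As a side note on the reshape approach above: the flattening loop isn't strictly required, because NumPy treats a list of equal-length tuples as a 2D array, so the zipped result can be passed to np.array directly (a small shortcut, not in the original answer):
import numpy as np
# np.array turns the list of 3-tuples into a 3x3 array in one step
res = list(zip(a1, b1, c1))
print(np.array(res))
OUTPUT:
[['a' '1' 'i']
 ['b' '2' 'ii']
 ['c' '3' 'iii']]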
Assuming that you have 3 numpy arrays:
>>> a, b, c = np.random.randint(0, 9, 9).reshape(3, 3)
>>> print(a, b, c)
[4 1 4] [5 8 5] [3 0 2]
then you can stack them vertically (i.e. along the first dimension), and then transpose the resulting matrix to get the order you need:
>>> np.vstack((a, b, c)).T
array([[4, 5, 3],
       [1, 8, 0],
       [4, 5, 2]])
A slightly more verbose example is to instead stack horizontally, but this requires that your arrays are made into 2D using reshape:
>>> np.hstack((a.reshape(3, 1), b.reshape(3, 1), c.reshape(3, 1)))
array([[4, 5, 3],
       [1, 8, 0],
       [4, 5, 2]])
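As a side note (not in the original answer), NumPy also provides column_stack, which stacks 1D arrays as columns directly and saves the manual reshape:
>>> np.column_stack((a, b, c))
array([[4, 5, 3],
       [1, 8, 0],
       [4, 5, 2]])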
This gives you a list of tuples, which might not be what you want:
>>> list(zip([1,2,3],[4,5,6],[7,8,9]))
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
This gives you a numpy array:
>>> from numpy import array
>>> array([[1,2,3],[4,5,6],[7,8,9]]).transpose()
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
If you have different data types in each array, then it would make sense to use pandas for this:
# Iterative approach, using concat
import pandas as pd
my_arrays = [['a', 'b', 'c'], [1, 2, 3], ['i', 'ii', 'iii']]
df1 = pd.concat([pd.Series(array) for array in my_arrays], axis=1)
# Named arrays
array1 = ['a', 'b', 'c']
array2 = [1, 2, 3]
array3 = ['i', 'ii', 'iii']
df2 = pd.DataFrame({'col1': array1,
                    'col2': array2,
                    'col3': array3})
Now you have the structure you desired, with appropriate data types for each column:
print(df1)
#    0  1    2
# 0  a  1    i
# 1  b  2   ii
# 2  c  3  iii
print(df2)
#   col1  col2 col3
# 0    a     1    i
# 1    b     2   ii
# 2    c     3  iii
print(df1.dtypes)
# 0 object
# 1 int64
# 2 object
# dtype: object
print(df2.dtypes)
# col1 object
# col2 int64
# col3 object
# dtype: object
You can extract the numpy array with the .values attribute:
df1.values
# array([['a', 1, 'i'],
#        ['b', 2, 'ii'],
#        ['c', 3, 'iii']], dtype=object)
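As a small update: in pandas 0.24 and later, DataFrame.to_numpy() is the recommended way to get the same array:
df1.to_numpy()
# array([['a', 1, 'i'],
#        ['b', 2, 'ii'],
#        ['c', 3, 'iii']], dtype=object)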

Dataframe groupby when specific values are encountered on a given row

I have a dataframe and I would like to group (or slice) it. The dataframe is in a form of
A B C
a b 1
a b 0
a b 1
a b 2
a b 0
a e 3
a e 3
f g 6
f g 7
f g 0
I would like to first group the dataframe on columns A and B. Then, each group is further split at a certain value into smaller groups of consecutive rows. For example, after grouping the dataframe by columns A and B, I would like to refine the grouping at a third level each time I encounter a 0 in column C. So the grouped dataframe looks like this (blank lines mark the intended splits):
A B C
a b 1
a b 0

a b 1
a b 2
a b 0

a e 3
a e 3

f g 6
f g 7
f g 0
Grouping a dataframe by column values like columns A and B in the example is simple, but I don't know how to further group the third level into consecutive rows with certain cut points. Thank you in advance if you could help.
To do so, the approach is always the same: create an extra column (or several, sometimes) that represents your specific grouping logic, then group against it:
df['cut_point'] = (df.C==0).cumsum().shift().fillna(0)
df.groupby(['A', 'B', 'cut_point']).groups
Out[141]:
{('a', 'b', 0.0): Int64Index([0, 1], dtype='int64'),
 ('a', 'b', 1.0): Int64Index([2, 3, 4], dtype='int64'),
 ('a', 'e', 2.0): Int64Index([5, 6], dtype='int64'),
 ('f', 'g', 2.0): Int64Index([7, 8, 9], dtype='int64')}
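For reference, here is the whole thing as a runnable sketch, assuming the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'f', 'f', 'f'],
                   'B': ['b', 'b', 'b', 'b', 'b', 'e', 'e', 'g', 'g', 'g'],
                   'C': [1, 0, 1, 2, 0, 3, 3, 6, 7, 0]})

# (df.C == 0).cumsum() numbers the zeros seen so far; shift() moves the
# boundary down one row so that each 0 closes its own group instead of
# opening the next one
df['cut_point'] = (df.C == 0).cumsum().shift().fillna(0)
print(df.groupby(['A', 'B', 'cut_point']).groups)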

python pandas elegant dataframe access rows 2:end

I have a dataframe, dF = pd.DataFrame(X), where X is a numpy array of doubles. I want to remove the last row from the dataframe. I know that for the first row I can do something like dF.ix[1:]. I want to do something similar for the last row. I know in MATLAB you could do something like dF[1:end-1]. What is a good and readable way to do this with pandas?
The end goal is to achieve this:
first matrix
1 2 3
4 5 6
7 8 9
second matrix
a b c
d e f
g h i
now get rid of the first row of the first matrix and the last row of the second matrix and horizontally concatenate them like so:
4 5 6 a b c
7 8 9 d e f
done. In MATLAB: a = firstMatrix; b = secondMatrix; c = [a(2:end,:) b(1:end-1,:)], where c is the resulting matrix.
You can do it this way:
In [129]: df1
Out[129]:
c1 c2 c3
0 1 2 3
1 4 5 6
2 7 8 9
In [130]: df2
Out[130]:
c1 c2 c3
0 a b c
1 d e f
2 g h i
In [131]: df1.iloc[1:].reset_index(drop=1).join(df2.iloc[:-1].reset_index(drop=1), rsuffix='_2')
Out[131]:
  c1 c2 c3 c1_2 c2_2 c3_2
0  4  5  6    a    b    c
1  7  8  9    d    e    f
Or a pure NumPy solution:
In [132]: a1 = df1.values
In [133]: a1
Out[133]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]], dtype=int64)
In [134]: a2 = df2.values
In [135]: a2
Out[135]:
array([['a', 'b', 'c'],
       ['d', 'e', 'f'],
       ['g', 'h', 'i']], dtype=object)
In [136]: a1[1:]
Out[136]:
array([[4, 5, 6],
       [7, 8, 9]], dtype=int64)
In [137]: a2[:-1]
Out[137]:
array([['a', 'b', 'c'],
       ['d', 'e', 'f']], dtype=object)
In [138]: np.concatenate((a1[1:], a2[:-1]), axis=1)
Out[138]:
array([[4, 5, 6, 'a', 'b', 'c'],
       [7, 8, 9, 'd', 'e', 'f']], dtype=object)
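As a footnote on the original question: iloc slices by position with ordinary Python slice semantics, so negative indices count from the end, much like MATLAB's end keyword:
dF.iloc[1:]    # everything but the first row
dF.iloc[:-1]   # everything but the last row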

Pandas duplicated indexes still shows correct elements

I have a pandas DataFrame like this:
test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
  score1 score2
0      a      b
1      b      a
2      c      k
3      d      n
4      e      c
I then split the score1 and score2 columns and concatenate them together:
In [283]: frame1 = test[['score1']]
frame2 = test[['score2']]
frame2.rename(columns={'score2': 'score1'}, inplace=True)
test = pd.concat([frame1, frame2])
test
Out[283]:
score1
0 a
1 b
2 c
3 d
4 e
0 b
1 a
2 k
3 n
4 c
Notice the duplicate indexes. Now if I do a groupby and then retrieve a group using get_group(), pandas is still able to retrieve the elements with the correct index, even though the indexes are duplicated!
In [283]: groups = test.groupby('score1')
groups.get_group('a') # Get group with key a
Out[283]:
score1
0 a
1 a
In [283]: groups.get_group('b') # Get group with key b
Out[283]:
score1
1 b
0 b
I understand that pandas uses an inverted index data structure for storing the groups, which looks like this:
In [284]: groups.groups
Out[284]: {'a': [0, 1], 'b': [1, 0], 'c': [2, 4], 'd': [3], 'e': [4], 'k': [2], 'n': [3]}
If both a and b are stored at index 0, how does pandas show me the elements correctly when I do get_group()?
This gets into the internals (i.e., don't rely on this API!), but the way it works now is that there is a Grouping object which stores the groups in terms of positions rather than index labels.
In [25]: gb = test.groupby('score1')
In [26]: gb.grouper
Out[26]: <pandas.core.groupby.BaseGrouper at 0x4162b70>
In [27]: gb.grouper.groupings
Out[27]: [Grouping(score1)]
In [28]: gb.grouper.groupings[0]
Out[28]: Grouping(score1)
In [29]: gb.grouper.groupings[0].indices
Out[29]:
{'a': array([0, 6], dtype=int64),
 'b': array([1, 5], dtype=int64),
 'c': array([2, 9], dtype=int64),
 'd': array([3], dtype=int64),
 'e': array([4], dtype=int64),
 'k': array([7], dtype=int64),
 'n': array([8], dtype=int64)}
See here for where it's actually implemented.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2091
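Since those indices are positions into the concatenated frame, you can check them yourself with iloc; for example, positions 0 and 6 are exactly the two rows that get_group('a') returned:
In [30]: test.iloc[[0, 6]]
Out[30]:
  score1
0      a
1      a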
