Change some, but not all, pandas multiindex column names - python

Suppose I have a data frame with multiindex column names that looks like this:
         A               B
   '1.5' '2.3' '8.4'    b1
r1     1     2     3     a
r2     4     5     6     b
r3     7     8     9    10
How would I change just the column names under 'A' from strings to floats, without modifying 'b1', to get the following?
       A             B
     1.5  2.3  8.4  b1
r1     1    2    3   a
r2     4    5    6   b
r3     7    8    9  10
In the real use case, under 'A' there would be thousands of columns with names that should be floats (they represent the wavelengths for a spectrometer) and the data in the data frame represents multiple different observations.
Thanks!

import numpy as np
import pandas as pd

# build the DataFrame (sideways at first, then transposed)
arrays = [['A', 'A', 'A', 'B'], ['1.5', '2.3', '8.4', 'b1']]
tuples = list(zip(*arrays))
data1 = np.array([[1, 2, 3, 'a'], [4, 5, 6, 'b'], [7, 8, 9, 10]])
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(data1.T, index=index).T
Printing df.columns gives the existing column names.
Out[84]:
MultiIndex(levels=[[u'A', u'B'], [u'1.5', u'2.3', u'8.4', u'b1']],
           labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
           names=[u'first', u'second'])
Now change the column names
# make new column titles (probably more pythonic ways to do this)
A_cols = [float(i) for i in df['A'].columns]
B_cols = [i for i in df['B'].columns]
cols = A_cols + B_cols
# set levels
levels = [df.columns.levels[0], cols]
df.columns.set_levels(levels, inplace=True)
Printing df.columns now gives the updated names:
Out[86]:
MultiIndex(levels=[[u'A', u'B'], [1.5, 2.3, 8.4, u'b1']],
           labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
           names=[u'first', u'second'])
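Note that newer pandas versions no longer accept inplace=True in set_levels, and the level-based approach relies on the level values staying in the same order as the per-block columns. A sketch of an alternative that sidesteps both issues by rebuilding the columns from tuples (the float conversion for the 'A' block is the same assumption as above):

# convert only the sub-labels under 'A' to floats, leave 'B' alone
new_tuples = [(top, float(sub) if top == 'A' else sub) for top, sub in df.columns]
df.columns = pd.MultiIndex.from_tuples(new_tuples, names=df.columns.names)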

Python Pandas - How to get the (iloc) position of one or more filtered rows in a dataframe

Using this example
import pandas as pd

df = pd.DataFrame({'letters': ['A', 'B', 'C', 'D', 'E', 'F']},
                  index=[10, 20, 30, 40, 50, 30])
With df.iloc[x] I can get the row x in the dataframe. For example.
df.iloc[3]
returns
letters D
Name: 40, dtype: object
When I filter the dataframe like
df2 = df.iloc[1:3]
I get for df2
   letters
20       B
30       C
Now assume that I didn't know how the filter was applied and I need to recover the positions of the filtered rows (1 and 2) in the original dataframe.
What's the best way to get the list of positions that would let me reproduce this filtered result from the original dataframe using df.iloc? How do I get the position numbers?
I am looking for the result
[1, 2]
Note: I had a good suggestion,
df.index.get_indexer_for(df2.index)
but it doesn't work if the index is not unique; because the label 30 appears twice, it returns
Int64Index([1, 2, 5], dtype='int64')
Because we have to incorporate the value as well if we want to handle cases like df.iloc[[1,5]], where you'd need to get 5 from "30 F", I think the easiest way is to leverage a merge:
In [172]: df.reset_index().reset_index().merge(df.iloc[1:3].reset_index())
Out[172]:
   level_0  index letters
0        1     20       B
1        2     30       C
In [173]: df.reset_index().reset_index().merge(df.iloc[1:3].reset_index())["level_0"].values
Out[173]: array([1, 2], dtype=int64)
In [174]: df.reset_index().reset_index().merge(df.iloc[[1,5]].reset_index())
Out[174]:
   level_0  index letters
0        1     20       B
1        5     30       F
In [175]: df.reset_index().reset_index().merge(df.iloc[[1,5]].reset_index())["level_0"].values
Out[175]: array([1, 5], dtype=int64)
In the case where it's not possible to uniquely recover original positions because of duplicate rows, you'll get all of them:
In [179]: df.iloc[-1, 0] = "C"
In [180]: df.reset_index().reset_index().merge(df.iloc[[1,2]].reset_index())
Out[180]:
   level_0  index letters
0        1     20       B
1        2     30       C
2        5     30       C
In [181]: df.reset_index().reset_index().merge(df.iloc[[1,2]].reset_index())["level_0"].values
Out[181]: array([1, 2, 5], dtype=int64)
but you can decide how you want to drop duplicates after the merge.
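If this is needed in more than one place, the same merge trick can be wrapped in a small helper (the name iloc_positions is just for illustration):

def iloc_positions(df, df_filtered):
    # number every row of df: after two reset_index calls, 'level_0' holds the
    # integer position and 'index' the original label
    numbered = df.reset_index().reset_index()
    # merging on the shared columns ('index' and 'letters' here) recovers the positions
    return numbered.merge(df_filtered.reset_index())["level_0"].values

iloc_positions(df, df.iloc[1:3])   # array([1, 2])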

How to sort numpy array by absolute value of a column?

What I have now:
import numpy as np
# 1) Read CSV with headers
data = np.genfromtxt("big.csv", delimiter=',', names=True)
# 2) Get absolute values for column in a new ndarray
new_ndarray = np.absolute(data["target_column_name"])
# 3) Append column in new_ndarray to data
# I'm having trouble here. Can't get hstack, concatenate, append, etc. to work
# 4) Sort by new column and obtain a new ndarray
data.sort(order="target_column_name_abs")
I would like:
- a solution for 3): to be able to add this new "abs" column to the original ndarray, or
- another approach to be able to sort a csv file by the absolute values of a column.
Here is a way to do it.
First, let's create a sample array:
In [39]: a = (np.arange(12).reshape(4, 3) - 6)
In [40]: a
Out[40]:
array([[-6, -5, -4],
       [-3, -2, -1],
       [ 0,  1,  2],
       [ 3,  4,  5]])
OK, let's say
In [41]: col = 1
which is the column we want to sort by,
and here is the sorting code - using Python's sorted:
In [42]: b = sorted(a, key=lambda row: np.abs(row[col]))
Let's convert b from list to array, and we have:
In [43]: np.array(b)
Out[43]:
array([[ 0,  1,  2],
       [-3, -2, -1],
       [ 3,  4,  5],
       [-6, -5, -4]])
Which is the array with the rows sorted according to
the absolute value of column 1.
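A numpy-only alternative, assuming a plain 2-D array like the sample a above: index the array with the argsort of the absolute values of the chosen column.

# rows of a ordered by |a[:, col]|, without leaving numpy
b = a[np.abs(a[:, col]).argsort()]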
Here's a solution using pandas:
In [117]: import pandas as pd
In [118]: df = pd.read_csv('test.csv')
In [119]: df
Out[119]:
   a  b
0  1 -3
1  2  2
2  3 -1
3  4  4
In [120]: df['c'] = abs(df['b'])
In [121]: df
Out[121]:
   a  b  c
0  1 -3  3
1  2  2  2
2  3 -1  1
3  4  4  4
In [122]: df.sort_values(by='c')
Out[122]:
   a  b  c
2  3 -1  1
1  2  2  2
0  1 -3  3
3  4  4  4
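For the original CSV case, the same argsort idea also works directly on the structured array returned by np.genfromtxt, with no extra column needed (the column name below is the question's placeholder):

import numpy as np

data = np.genfromtxt("big.csv", delimiter=',', names=True)
# order rows by the absolute value of one named column
data_sorted = data[np.abs(data["target_column_name"]).argsort()]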

MultiColumns get lost when indexing and re-indexing

Create some data
import pandas as pd

cols = pd.MultiIndex.from_product([['what', 'why'], ['me', 'you']])
df = pd.DataFrame(columns=cols)
df.loc[0, :] = [1, 2, 3, 4]
What do we have?
In[8]: df
Out[8]:
  what     why
    me you  me you
0    1   2   3   4
Set one (or more) columns as index:
In[11]: df.set_index(('what', 'me'))
Out[11]:
           what why
            you  me you
(what, me)
1             2   3   4
Let's reset that index:
In[12]: df.set_index(('what', 'me')).reset_index()
Out[12]:
  (what, me) what why
              you  me you
0          1    2   3   4
And in particular,
In[13]: df.set_index(('what', 'me')).reset_index().columns
Out[13]:
MultiIndex(levels=[['what', 'why', ('what', 'me')], ['me', 'you', '']],
           labels=[[2, 0, 1, 1], [2, 1, 0, 1]])
Is there any way to use these (multi) columns as indices without losing the column structure?
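One possible workaround, not from the original thread: instead of set_index, assign the column's values to the index by hand and drop the column, which leaves the remaining columns as a proper 2-level MultiIndex.

# sketch: keep the column MultiIndex intact while using ('what', 'me') as the index
indexed = df.copy()
indexed.index = df[('what', 'me')]               # use that column's values as the index
indexed = indexed.drop(columns=[('what', 'me')])
# indexed.columns is still a MultiIndex: ('what', 'you'), ('why', 'me'), ('why', 'you')

Reversing it then means re-inserting the column from the index by hand rather than calling reset_index.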

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
         0
0  [1,2,3]
1  [3,4,5]
2  [5,6,7]
3      ...
Thank you for the assistance.
You can assign it by wrapping it in a Series if you're trying to add it to an existing df:
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
          a         b         c
0 -1.675422 -0.696623 -1.025674
1  0.032192  0.582190  0.214029
2 -0.134230  0.991172 -0.177654
3 -1.688784  1.275275  0.029581
4 -0.528649  0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
          a         b         c    new_col
0 -1.675422 -0.696623 -1.025674  [1, 2, 3]
1  0.032192  0.582190  0.214029  [3, 4, 5]
2 -0.134230  0.991172 -0.177654  [5, 6, 7]
3 -1.688784  1.275275  0.029581        NaN
4 -0.528649  0.858710 -0.244512        NaN
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful, but I wanted to add a little bit in case they don't quite do the trick for someone...
pd.Series will not accept an np.ndarray that looks like a list of lists, e.g. a one-hot labels array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
So in this case you can wrap the variable with list():
df['new_col'] = pd.Series(list(one_hot_labels))
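A runnable sketch of that list() wrapping (the one-hot array here is made-up illustration data):

import numpy as np
import pandas as pd

one_hot_labels = np.eye(3, dtype=int)        # array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
df = pd.DataFrame({'id': [0, 1, 2]})
# list() turns the 2-D array into a list of 1-D rows, which pd.Series accepts
df['new_col'] = pd.Series(list(one_hot_labels))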

Pandas div using index

I sometimes struggle a bit to understand pandas data structures, and it seems to be the case again. Basically, I've got:
- a pivot table whose major axis is a serial number
- a Series using the same index
I would like to divide each column of my pivot table by the value in the Series, using the index to match the rows. I've tried plenty of combinations... without success so far :/
import pandas as pd
df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]], columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, rows=['A', 'B'], cols='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
pt.div(serie, axis='index')
But I am only getting NaN. I guess it's because the column names are not matching, but that's why I was using index as the axis. Any ideas on what I am doing wrong?
Thanks
You say "using the same index", but they're not the same: pt has a MultiIndex, while serie has only a plain index:
>>> pt.index
MultiIndex(levels=[[u'123', u'456'], [1, 2, 4]],
           labels=[[0, 0, 1], [0, 2, 1]],
           names=[u'A', u'B'])
And you haven't told the division that you want to align on the A part of the index. You can pass that information using level:
>>> pt.div(serie, level='A', axis='index')
C        1    3    5
A   B
123 1  0.6    0  0.0
    4  0.0    0  1.2
456 2  NaN  NaN  NaN
[3 rows x 3 columns]
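For reference, the question's pivot_table call uses the old rows=/cols= keywords; a sketch with the current index=/columns= spelling, keeping the same level-aware division:

import pandas as pd

df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]],
                  columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
# align on the 'A' level of pt's MultiIndex when dividing
result = pt.div(serie, level='A', axis='index')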
