When creating a MultiIndex using from_tuples, the created index object has a different order than the input tuples
I am trying to add a column level to a dataframe using the pd.MultiIndex.from_tuples method, but the levels are different from what I expected.
df = pd.DataFrame({'x_1':[1, 2], 'x_2':[3, 4], 'x_10':[3, 4], 'y_1':[5, 6], 'y_2':[7, 8], 'y_10':[1, 2]})
df = df.reindex(columns=['x_1', 'x_2', 'x_10', 'y_1', 'y_2', 'y_10'])
index = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
print(index)
MultiIndex(levels=[['x', 'y'], ['1', '10', '2']],
labels=[[0, 0, 0, 1, 1, 1], [0, 2, 1, 0, 2, 1]])
When I add the level to the dataframe and perform stacking, the order is not what I want.
df.columns = index
df.stack()
x y
0 1 1 5
10 3 1
2 3 7
1 1 2 6
10 4 2
2 4 8
I expect the index levels to look like:
MultiIndex(levels=[['x', 'y'], ['1', '2', '10']])
and stacking will look like the following:
df.stack()
x y
0 1 1 5
2 3 7
10 3 1
1 1 2 6
2 4 8
10 4 2
You can reindex at a specific level, passing the level values taken from your columns, after the call to stack:
In[177]:
df.stack().reindex(df.columns.get_level_values(1).unique(), level=1)
Out[177]:
x y
0 1 1 5
2 3 7
10 3 1
1 1 2 6
2 4 8
10 4 2
Note that this has performance implications, because an index is expected to be sorted for fast lookups.
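If fast label-based lookups later matter more than the custom display order, you can restore a sorted index afterwards; a minimal sketch using the same objects as above:
stacked = df.stack().reindex(df.columns.get_level_values(1).unique(), level=1)
# sort_index() restores lexsort order for fast lookups, at the cost of
# reverting to the string ordering '1', '10', '2':
stacked = stacked.sort_index()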
The index you have constructed is actually ordered as specified. When you print(index) you are seeing how Pandas stores the index internally. Using index.values unravels this representation to give an array of indices aligned with your dataframe.
print(index.values)
# array([('x', '1'), ('x', '2'), ('x', '10'), ('y', '1'), ('y', '2'),
# ('y', '10')], dtype=object)
df.columns = index
print(df)
# x y
# 1 2 10 1 2 10
# 0 1 3 3 5 7 1
# 1 2 4 4 6 8 2
The real issue is that pd.DataFrame.stack applies sorting and, since your level values are strings, '10' sorts before '2' lexicographically. To maintain the ordering you want after stack, make sure you use integers:
def splitter(x):
    strng, num = x.split('_')
    return strng, int(num)
index = pd.MultiIndex.from_tuples(df.columns.map(splitter))
df.columns = index
print(df.stack())
# x y
# 0 1 1 5
# 2 3 7
# 10 3 1
# 1 1 2 6
# 2 4 8
# 10 4 2
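As a quick sanity check that the second level is now integer-valued and therefore sorts numerically (the exact repr varies by pandas version):
print(df.stack().index.get_level_values(1).unique())
# e.g. Int64Index([1, 2, 10], dtype='int64')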
I have the following df
list_columns = ['A', 'B', 'C']
list_data = [
[1, '2', 3],
[4, '4', 5],
[1, '2', 3],
[4, '4', 6]
]
df = pd.DataFrame(columns=list_columns, data=list_data)
I want to check if multiple columns exist, and if not to create them.
Example:
If B,C,D do not exist, create them(For the above df it will create only D column)
I know how to do this with one column:
if 'D' not in df:
    df['D'] = 0
Is there a way to test whether all my columns exist, and to create the ones that are missing, without writing an if for each column?
Here a loop is not necessary - use DataFrame.reindex with Index.union:
cols = ['B','C','D']
df = df.reindex(df.columns.union(cols, sort=False), axis=1, fill_value=0)
print (df)
A B C D
0 1 2 3 0
1 4 4 5 0
2 1 2 3 0
3 4 4 6 0
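Just to make the mechanics explicit (my addition, not part of the original answer): with sort=False the union keeps the existing column order and appends the new names at the end.
print(pd.Index(['A', 'B', 'C']).union(['B', 'C', 'D'], sort=False))
# Index(['A', 'B', 'C', 'D'], dtype='object')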
Just to add, you can unpack the set difference between your columns and the list into assign with ** unpacking.
import numpy as np
cols = ['B','C','D','E']
df.assign(**{col: 0 for col in np.setdiff1d(cols, df.columns.values)})
A B C D E
0 1 2 3 0 0
1 4 4 5 0 0
2 1 2 3 0 0
3 4 4 6 0 0
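The same idea works without numpy via a plain set difference, though sets don't guarantee the order in which the new columns are appended; a sketch:
df.assign(**{col: 0 for col in set(cols) - set(df.columns)})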
I have a dataframe df:
A B
first second
bar one 0.0 0.0
two 0.0 0.0
foo one 0.0 0.0
two 0.0 0.0
I transform it to another one where values are tuples:
A B
first second
bar one (6, 1, 0) (0, 9, 3)
two (9, 3, 4) (6, 2, 1)
foo one (1, 9, 0) (4, 0, 0)
two (6, 1, 5) (8, 3, 5)
My question is how can I get it (expanded) to look like the output below, where the tuple values become columns with a MultiIndex? Can I do it during transform, or should I do it as an additional step after transform?
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
Code for the above:
import numpy as np
import pandas as pd
np.random.seed(123)
def expand(s):
    # complex logic of `result` has been replaced with `np.random`
    result = [tuple(np.random.randint(10, size=3)) for i in s]
    return result
index = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.zeros((4, 2)), index=index, columns=['A', 'B'])
print(df)
expanded = df.groupby(['second']).transform(expand)
print(expanded)
Try this:
df_lst = []
for col in df.columns:
    expanded_splt = expanded.apply(lambda x: pd.Series(x[col]), axis=1)
    columns = pd.MultiIndex.from_product([[col], ['m', 'n', 'k']])
    expanded_splt.columns = columns
    df_lst.append(expanded_splt)
pd.concat(df_lst, axis=1)
Output:
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
Finally I found time to find an answer that suits me.
expanded_data = expanded.agg(lambda x: np.concatenate(x), axis=1).to_numpy()
expanded_data = np.stack(expanded_data)
column_index = pd.MultiIndex.from_product([expanded.columns, ['m', 'n', 'k']])
exploded = pd.DataFrame(expanded_data, index=expanded.index, columns=column_index)
print(exploded)
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
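A possibly simpler variant of the same step builds each block straight from the tuples with the DataFrame constructor; a sketch that assumes every tuple has length 3, as in the example:
parts = [pd.DataFrame(expanded[col].tolist(),
                      index=expanded.index,
                      columns=pd.MultiIndex.from_product([[col], ['m', 'n', 'k']]))
         for col in expanded.columns]
exploded = pd.concat(parts, axis=1)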
I have identified specific columns I want to select as predictors for my model based on some analysis. I have captured those column numbers and stored them in a list. I have roughly 80 columns and want to loop through and drop the columns not in this specific list. X_train is the dataframe in which I want to do this. Here is my code:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        X_train.drop([x])
When running this, I am faced with the following error, which points at the line X_train.drop([x]):
KeyError: '[3] not found in axis'
I am sure it is something very simple that I am missing. I tried including the inplace=True or axis=1 arguments along with this, and all of them gave the same error message (while the value inside the [] changed between runs).
Any help would be great!
Edit: Here is the addition to get this working:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        cols_drop.append(x)
X_train = X_train.drop(X_train.columns[cols_drop], axis=1)
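A shorter equivalent selects the wanted positions directly with iloc instead of building a drop list; a sketch (note it also reorders the columns to match cols_selected):
X_train = X_train.iloc[:, cols_selected]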
According to the documentation of drop:
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.
You cannot drop columns by simply using the positional index of the column; you need the names of the columns. Also, the axis parameter has to be set to 1 or 'columns'. Replace X_train.drop([x]) with X_train = X_train.drop(X_train.columns[x], axis='columns') to make your example work.
I am just assuming as per the question title:
Example DataFrame:
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Dropping Specific columns B & C:
>>> df.drop(['B', 'C'], axis=1)
# df.drop(['B', 'C'], axis=1, inplace=True) <-- to change df itself, use inplace=True
A D
0 0 3
1 4 7
2 8 11
If you are trying to drop them by column number (dropping by position), then try like below:
>>> df.drop(df.columns[[1, 2]], axis=1)
A D
0 0 3
1 4 7
2 8 11
OR
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Also, in addition to @pygo pointing out that df.drop takes a keyword arg to designate the axis, try this:
X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
Here is an example:
>>> import numpy as np
>>> import pandas as pd
>>> cols_selected = ['a', 'c', 'e']
>>> X_train = pd.DataFrame(np.random.randint(low=0, high=10, size=(20, 5)), columns=['a', 'b', 'c', 'd', 'e'])
>>> X_train
a b c d e
0 4 0 3 5 9
1 8 8 6 7 2
2 1 0 2 0 2
3 3 8 0 5 9
4 5 9 7 8 0
5 1 9 3 5 9 ...
>>> X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
>>> X_train
a c e
0 4 3 9
1 8 6 2
2 1 2 2
3 3 0 9
4 5 7 0
5 1 3 9 ...
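If every name in cols_selected is guaranteed to exist in X_train, plain label indexing is shorter; the comprehension above has the advantage of silently skipping missing names and preserving the original column order:
X_train = X_train[cols_selected]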
I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
'x2': [2, 2, 2, 6, 7],
'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
'x2': [3, 1, 1, 7, 5],
'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
'x2': [2, 0, 0, 3, 4],
'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df, and the same with dfqty and dfprices (so the 3x5 matrices essentially go to 1x15 vectors and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, from the names of the columns of the old df.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. But my desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
Thanks so much. My last question wasn't well received, so I have tried to make this one better.
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove the second level of the MultiIndex and move the index to a column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract everything after the first character with [1:] and the first letter with [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
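As an aside (my addition, not from the original answer), the two string slices can be replaced with a single str.extract call; the pattern assumes one letter followed by digits, as in 'x1':
df[['Letter', 'Number']] = df['col'].str.extract(r'([a-z])(\d+)')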
I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
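The same descending result can be written in one expression; a sketch where [::-1] reverses the row order after the ascending sort:
df1 = pd.DataFrame(np.sort(df.values, axis=0)[::-1],
                   index=df.index, columns=df.columns)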
sort_values will sort the entire data frame by the column order you pass to it. In your first example you are sorting the entire data frame by ['a', 'b', 'c']. This sorts first by 'a', then by 'b' to break ties, and finally by 'c'.
Notice how, after sorting by 'a', each row stays intact: the values of a row move together. This is the expected result.
With the lambda you are passing each column to sort_values separately, so it applies to a single column at a time. That is why the second approach sorts the columns as you would expect; in this case, the rows no longer stay intact.
If you don't want to use lambda or numpy, you can get around it using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
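Note that this rebuilds the frame from the sorted arrays, so the result gets a fresh default RangeIndex; to keep the original index, pass it explicitly. A sketch:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns},
             index=df.index)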