Related
Consider a dataframe like pivoted below, where replicates of some data are given as lists:
import pandas as pd

d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
     'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
              [20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
              [30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that the replicate data end up in side-by-side cells in the Excel file?
You need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')                                    # one row per replicate value
   .assign(n=lambda d: d.groupby(level=0).cumcount())  # replicate number within each original row
   .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
   .droplevel('n', axis=1).rename_axis(columns=None)   # keep only the Compound level
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
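The result above can be written to Excel with the same ExcelWriter pattern as in the question; a minimal sketch (the output file name is just a placeholder, and it assumes to_excel is happy writing the duplicated column labels, which it normally is):
side = (df.explode('Data')
          .assign(n=lambda d: d.groupby(level=0).cumcount())
          .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
          .droplevel('n', axis=1).rename_axis(columns=None))

with pd.ExcelWriter('output_wide.xlsx') as writer:
    # one column per replicate, grouped side by side per compound
    side.to_excel(writer, sheet_name='Sheet1', index_label='Conc')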
Besides @mozway's answer, just for the formatting you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
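This way each compound name appears only once in the Excel header, above its first replicate column, with the repeated labels left blank.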
Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following I want to take the values of columns a and b where c == 'e' and use those values to select the respective rows of df (which would select rows 2, 3, 6 and 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df_new.loc[list_tup]
This results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() and boolean indexing in this use case:
valids = (df['c'].eq('e')               # check if `c` is 'e'
          .groupby([df['a'], df['b']])  # group by `a` and `b`
          .transform('any')             # check if True occurs anywhere in the group;
                                        # the same label is used for all rows in the group
         )
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
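Both approaches return the same rows (2, 3, 6 and 7); filter calls the Python function once per group, which is why it tends to be a bit slower than the vectorized transform when there are many groups.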
You could try an inner join.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
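Note that this selects from df_new, which is indexed by a and b; if you want the original flat layout back, you could reset the index afterwards, for example:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)].reset_index()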
Let us try:
m = df[['a', 'b']].apply(tuple, axis=1).isin(
        df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False).tolist())
out = df[m].copy()
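Here m is a boolean mask over the (a, b) pairs, so out contains the same four rows as in the answers above.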
Suppose I have the following data frame
from pandas import DataFrame
Cars = {'value':   [10, 31, 661, 1, 51, 61, 551],
        'action1': [1, 1, 1, 1, 1, 1, 1],
        'price1':  [12, 0, 15, 3, 0, 12, 0],
        'action2': [2, 2, 2, 2, 2, 2, 2],
        'price2':  [0, 16, 19, 0, 1, 10, 0],
        'action3': [3, 3, 3, 3, 3, 3, 3],
        'price3':  [14, 36, 9, 0, 0, 0, 0]}
df = DataFrame(Cars, columns=['value', 'action1', 'price1', 'action2', 'price2', 'action3', 'price3'])
print(df)
How can I randomly select a value (action and price) from among the 3 column pairs? As a result I want a dataframe that looks something like this one:
RandCars = {'value':  [10, 31, 661, 1, 51, 61, 551],
            'action': [1, 3, 1, 3, 1, 2, 2],
            'price':  [12, 36, 15, 0, 3, 10, 0]}
df2 = DataFrame(RandCars, columns=['value', 'action', 'price'])
print(df2)
Use:
import numpy as np
import pandas as pd

# get the column names that do not start with action or price
cols = df.columns[~df.columns.str.startswith(('action', 'price'))]
print(cols)
Index(['value'], dtype='object')

# convert the filtered columns to 2 numpy arrays
arr1 = df.filter(regex='^action').values
arr2 = df.filter(regex='^price').values
# pandas 0.24+
# arr1 = df.filter(regex='^action').to_numpy()
# arr2 = df.filter(regex='^price').to_numpy()
i, c = arr1.shape
# pick one random column position per row
idx = np.random.choice(np.arange(c), i)
df3 = pd.DataFrame({'action': arr1[np.arange(len(df)), idx],
                    'price':  arr2[np.arange(len(df)), idx]},
                   index=df.index)
print(df3)
action price
0 2 0
1 3 36
2 3 9
3 1 3
4 3 0
5 1 12
6 1 0
# add the remaining columns back with a join
df4 = df[cols].join(df3)
print(df4)
value action price
0 10 2 0
1 31 3 36
2 661 3 9
3 1 1 3
4 51 3 0
5 61 1 12
6 551 1 0
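If you want the random column picks to be reproducible between runs, you can seed NumPy's random generator before drawing the positions; a minimal sketch reusing i and c from above (the seed value is arbitrary):
np.random.seed(42)                       # any fixed seed makes the choice deterministic
idx = np.random.choice(np.arange(c), i)  # same positions on every run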
I would like to reorder the columns in a dataframe, and keep the underlying values in the right columns.
For example, this is the dataframe I have:
import numpy as np
import pandas as pd

cols = [['Three', 'Two'], ['A', 'D', 'C', 'B']]
header = pd.MultiIndex.from_product(cols)
df = pd.DataFrame([[1, 4, 3, 2, 5, 8, 7, 6]]*4, index=np.arange(1, 5), columns=header)
df.loc[:, ('One', 'E')] = 9
df.loc[:, ('One', 'F')] = 10
>>> df
And I would like to change it as follows:
header2 = pd.MultiIndex(levels=[['One', 'Two', 'Three'], ['E', 'F', 'A', 'B', 'C', 'D']],
                        codes=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2], [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]])
df2 = pd.DataFrame([[9, 10, 1, 2, 3, 4, 5, 6, 7, 8]]*4, index=np.arange(1, 5), columns=header2)
>>> df2
First, define a categorical ordering on both column levels. Then call sort_index on the column axis with both levels.
v = pd.Categorical(df.columns.get_level_values(0),
categories=['One', 'Two', 'Three'],
ordered=True)
v2 = pd.Categorical(df.columns.get_level_values(1),
categories=['E', 'F', 'C', 'B', 'A', 'D'],
ordered=True)
df.columns = pd.MultiIndex.from_arrays([v, v2])
df = df.sort_index(axis=1, level=[0, 1])
df
One Two Three
E F C B A D C B A D
1 9 10 7 6 5 8 3 2 1 4
2 9 10 7 6 5 8 3 2 1 4
3 9 10 7 6 5 8 3 2 1 4
4 9 10 7 6 5 8 3 2 1 4
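If you would rather not keep the categorical dtype on the column levels after sorting, you can rebuild a plain MultiIndex from the now-sorted columns, for example:
df.columns = pd.MultiIndex.from_tuples(df.columns)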
What would be a workaround (or a tidier way) to insert a column into a pandas dataframe where some index values are duplicated?
For example, having the following dataframe:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df1 = df1.set_index([0])
df1
1 2
0
1 51 R
2 51 R
3 74 R
4 29 R
1 39 F
2 3 F
3 14 F
4 16 F
How can I insert the column foo from df2 (below) into df1?
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df2 = df2.set_index([0])
df2
foo 2
0
1 5 R
2 5 R
3 7 R
4 2 R
1 3 F
3 1 F
4 1 F
Note that the index 2 is missing from category F.
I would like the result to be something like:
1 foo 2
0
1 51 5 R
2 51 5 R
3 74 7 R
4 29 2 R
1 39 3 F
2 3 NaN F
3 14 1 F
4 16 1 F
I tried the DataFrame.insert method but am getting
df1.insert(2, 'FOO', df2['foo'])
ValueError: cannot reindex from a duplicate axis
Since the index and column 2 uniquely define a row in both data frames, you can do a merge on the two columns (after resetting the index):
df1.reset_index().merge(df2.reset_index(), how='left', on=[0,2]).set_index([0])
# 1 2 foo
#0
#1 51 R 5.0
#2 51 R 5.0
#3 74 R 7.0
#4 29 R 2.0
#1 39 F 3.0
#2 3 F NaN
#3 14 F 1.0
#4 16 F 1.0
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df1 = df1.set_index([0, 2])
df2 = df2.set_index([0, 2])
df1.join(df2, how='left').reset_index(level=2)
2 1 foo
0
1 R 51 5.0
2 R 51 5.0
3 R 74 7.0
4 R 29 2.0
1 F 39 3.0
2 F 3 NaN
3 F 14 1.0
4 F 16 1.0
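If you also want the column order from the desired output (1, foo, 2), you can reorder after the join; a small sketch based on the frames above:
res = df1.join(df2, how='left').reset_index(level=2)
res = res[[1, 'foo', 2]]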
You're very close...
As you already know from your question, you can't do this, for the reason clearly stated in the error: you have a repeated index. If you must have column 0 as the index, then don't set it as the index before your merge; set it after:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df = df1.merge(df2, how='left')
df.set_index([0])
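Note that with no on= argument, merge joins on the columns common to both frames, here 0 and 2, which is exactly the pairing needed; set_index([0]) then restores 0 as the (duplicated) index.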