Goal
Drop a row if, in any of the columns (ao, hia, cyp1a2s, cyp3a4s in this case), the min sub-column does not equal the max sub-column, treating NaN as equal to NaN. Equivalently, keep only the rows where min equals max in every column.
Example
import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['', ''])
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0],
                            [1, 1, 0, 0, float('nan'), 1, 0, 0],
                            [0, 2, 0, 0, float('nan'), float('nan'), 1, 1]]),
                  index=['1', '2', '3'], columns=index)
df
    ao        hia       cyp1a2s   cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
2  1.0  1.0  0.0  0.0  NaN  1.0  0.0  0.0
3  0.0  2.0  0.0  0.0  NaN  NaN  1.0  1.0
Want
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0]]), index=['1'], columns=index)
df
    ao        hia       cyp1a2s   cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
Attempt
df.apply(lambda x: x['min'].map(str) == x['max'].map(str), axis=1)
KeyError: ('min', 'occurred at index 1')
Note
The actual dataframe has 50+ columns.
Use DataFrame.xs to select the min and max sub-columns (the second level of the column MultiIndex), then replace NaNs so that missing values compare equal:
df1 = df.xs('min', axis=1, level=1).fillna('nan')
df2 = df.xs('max', axis=1, level=1).fillna('nan')
Or convert the data to strings:
df1 = df.xs('min', axis=1, level=1).astype('str')
df2 = df.xs('max', axis=1, level=1).astype('str')
Compare the two DataFrames with DataFrame.eq, test whether each row is all True with DataFrame.all, and finally filter by boolean indexing:
df = df[df1.eq(df2).all(axis=1)]
print (df)
    ao        hia       cyp1a2s   cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
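If you'd rather not round-trip through strings, a NaN-aware boolean comparison works too. A sketch of the same filter (the names mins, maxs, keep, and out are just illustrative):

```python
import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['', ''])
df = pd.DataFrame([[1, 1, 0, 0, np.nan, np.nan, 0, 0],
                   [1, 1, 0, 0, np.nan, 1, 0, 0],
                   [0, 2, 0, 0, np.nan, np.nan, 1, 1]],
                  index=['1', '2', '3'], columns=index, dtype=float)

mins = df.xs('min', axis=1, level=1)
maxs = df.xs('max', axis=1, level=1)
# a pair counts as equal if the values match or both are NaN
keep = (mins.eq(maxs) | (mins.isna() & maxs.isna())).all(axis=1)
out = df[keep]
print(out.index.tolist())  # ['1']
```

This avoids any surprises from string formatting of floats and scales to the 50+ columns mentioned in the note.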
The reason df.apply() didn't work is that you needed to reference two levels of columns. Also, .map(str) is invalid on a float64 value; use .astype(str) (or plain str()) instead.
The following works for more than one column:
eqCols = ['cyp1a2s', 'hia']
neqCols = list(set(df.xs('min', level=1, axis=1).columns) - set(eqCols))
# compare as strings so that NaN == NaN
EQ = lambda r, c: str(r[c]['min']) == str(r[c]['max'])
df[df.apply(lambda r: all(EQ(r, c) for c in eqCols)
                      and all(not EQ(r, c) for c in neqCols), axis=1)]
Related
I have two dataframes like below,
import numpy as np
import pandas as pd
df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a','b','c','d','e'])
and
df2 = pd.DataFrame({'category': [1,1,2,2], 'value':[85,46, 39, 22]}, index=[0, 1, 3, 4])
The values from the second DataFrame need to be assigned into the first so that the index and column relationship is maintained. The second DataFrame's index is positional (iloc-based), and its category column actually contains column names of the first DataFrame; value holds the values to assign.
Following is my solution, with the expected output:
for _category in df2['category'].unique():
    df1.loc[df1.iloc[df2[df2['category'] == _category].index.tolist()].index, _category] = \
        df2[df2['category'] == _category]['value'].values
Is there a pythonic way of doing so without the for loop?
One option is to pivot and update:
df3 = df1.reset_index()
df3.update(df2.pivot(columns='category', values='value'))
df3 = df3.set_index('index').rename_axis(None)
Alternatively, pivot df2, reindex it in two steps (numerically, then by label), and combine_first with df1:
df3 = (df2
       .pivot(columns='category', values='value')
       .reindex(range(max(df2.index) + 1))
       .set_axis(df1.index)
       .combine_first(df1)
       )
output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
Here's one way: replace the 0s in df1 with NaN, pivot df2, and fill the NaNs in df1 from df2:
out = (df1.replace(0, pd.NA).reset_index()
          .fillna(df2.pivot(columns='category', values='value'))
          .set_index('index').rename_axis(None).fillna(0))
Output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
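Since df2's index is already positional, you can also skip pivoting entirely and do one NumPy fancy-indexing assignment. A sketch (df3 is just the name used here for the result):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame({'category': [1, 1, 2, 2], 'value': [85, 46, 39, 22]}, index=[0, 1, 3, 4])

arr = df1.to_numpy(copy=True)                    # work on a copy of df1's values
rows = df2.index.to_numpy()                      # positional row indices, per the question
cols = df1.columns.get_indexer(df2['category'])  # map category -> column position
arr[rows, cols] = df2['value'].to_numpy()
df3 = pd.DataFrame(arr, index=df1.index, columns=df1.columns)
```

This is a single vectorized write, which tends to be faster than per-category .loc assignments when df2 is large.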
I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: Find the non-null value, and append the corresponding index to the column name and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the resulting Series to a DataFrame, and then flatten the MultiIndex; to get the correct column order, use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
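Both answers rely on stack() dropping the NaN cells. On recent pandas versions (where stack uses the new implementation), NaNs may no longer be dropped by default, so adding an explicit dropna() makes the second approach version-safe; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])

s = df.T.stack().dropna()          # one (column, row) pair per column
s.index = s.index.map('_'.join)    # flatten the MultiIndex into 'Tom_Min', ...
out = s.to_frame().T
print(out.columns.tolist())  # ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
```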
I'm trying to update locations in a DataFrame based on an idxmax series from a different DataFrame. So for each row in df0, change the value of the column with the highest value in df1 to 1. Is this possible without having to loop through each row?
Here's the code I'm trying to improve:
df0 = pd.DataFrame.from_dict({
'2020-05-04': [0, 0, 0],
'2020-05-05': [0, 0, 0],
'2020-05-06': [0, 0, 0]},
orient='index', columns=['a','b','c'])
df1 = pd.DataFrame.from_dict({
'2020-05-04': [1.1, 1.0, 1.0],
'2020-05-05': [1.0, 1.2, 1.0],
'2020-05-06': [1.0, 1.0, 1.4]},
orient='index', columns=['a','b','c'])
for date in df0.index:
    df0.loc[date, df1.idxmax(axis=1)[date]] = 1
Thanks for your suggestions!
Yes, that's possible using unstack and fillna:
# get 'coordinates' of max value
df0 = df1.idxmax(axis=1).rename('col').to_frame()
# set your 'marker' value
df0['val'] = 1
# col val
# 2020-05-04 a 1
# 2020-05-05 b 1
# 2020-05-06 c 1
df0 = df0.set_index('col', append=True).unstack().fillna(0)
# val
# col a b c
# 2020-05-04 1.0 0.0 0.0
# 2020-05-05 0.0 1.0 0.0
# 2020-05-06 0.0 0.0 1.0
The DataFrame now has a MultiIndex for columns. To drop the val level, run:
df0.droplevel(0, axis=1)
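An alternative that skips building the marker frame by hand: one-hot encode the idxmax result with pd.get_dummies. The final reindex is a guard (an assumption for the general case, not needed in this example) against columns that never hold the row maximum:

```python
import pandas as pd

df1 = pd.DataFrame.from_dict({
    '2020-05-04': [1.1, 1.0, 1.0],
    '2020-05-05': [1.0, 1.2, 1.0],
    '2020-05-06': [1.0, 1.0, 1.4]},
    orient='index', columns=['a', 'b', 'c'])

# one-hot encode the per-row argmax; the index carries over unchanged
df0 = pd.get_dummies(df1.idxmax(axis=1)).astype(int)
# ensure every column of df1 appears, filling absent ones with 0
df0 = df0.reindex(columns=df1.columns, fill_value=0)
```

Note that, like the loop, this places the 1 in the first column in case of ties, since idxmax returns the first maximal label.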
Consider the following example (the two elements of interest are final_df and pivot_df. The rest of the code is just to construct these two df's):
import numpy
import pandas

numpy.random.seed(0)
input_df = pandas.concat([pandas.Series(numpy.round(numpy.random.random_sample(10), 2)),
                          pandas.Series(numpy.random.randint(0, 2, 10))], axis=1)
input_df.columns = ['key', 'val']
pivot_df = input_df.pivot(columns='key', values='val')\
                   .ffill()\
                   .cumsum()
index_df = pivot_df.notnull()\
                   .multiply(pivot_df.columns, axis=1)\
                   .replace({0.0: numpy.nan})\
                   .values
final_df = numpy.delete(numpy.partition(index_df, 3, axis=1),
                        numpy.s_[3:index_df.shape[1]], axis=1)
final_df.sort(axis=1)
final_df = pandas.DataFrame(final_df)
final_df contains as many rows as pivot_df. I want to use these two to construct a third df: bingo_df.
bingo_df should have the same dimensions as final_df. Then, the cells of bingo_df should contain:
Whenever the entry (row = i, col = j) of final_df is numpy.nan,
the entry (i,j) of bingo_df should be numpy.nan as well.
Otherwise (whenever the entry (i, j) of final_df is not numpy.nan), the entry (i, j) of bingo_df should be the value of pivot_df at row i and the column named final_df[i, j] (each non-NaN entry of final_df is either the name of a column of pivot_df or numpy.nan)
Expected output:
so the first row of final_df is
0.55, nan, nan.
So I'm expecting the first row of bingo_df to be:
0.0, nan, nan
because the value in cell (row = 0, col = 0.55) of pivot_df is 0
(and the two subsequent numpy.nan in the first row of final_df should also be numpy.nan in bingo_df)
so the second row of final_df is
0.55, 0.72, nan
So I'm expecting the second row of bingo_df to be:
0.0, 1.0, nan
because the value in cell (row = 1, col = 0.55) of pivot_df is 0.0
and the value in cell (row = 1, col = 0.72) of pivot_df is 1.0
IIUC, you can use lookup:
s = final_df.stack()
pd.Series(pivot_df.lookup(s.index.get_level_values(0), s), index=s.index).unstack()
Out[87]:
0 1 2
0 0.0 NaN NaN
1 0.0 1.0 NaN
2 0.0 1.0 2.0
3 0.0 0.0 2.0
4 0.0 0.0 0.0
5 0.0 0.0 0.0
6 0.0 1.0 0.0
7 0.0 2.0 0.0
8 0.0 3.0 0.0
9 0.0 0.0 4.0
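A caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The documented replacement uses get_indexer plus NumPy fancy indexing; a sketch on small stand-ins for pivot_df and final_df (bingo_df is the name used for the result):

```python
import numpy as np
import pandas as pd

# small stand-ins for pivot_df and final_df from the question
pivot_df = pd.DataFrame([[0.0, np.nan], [0.0, 1.0]], columns=[0.55, 0.72])
final_df = pd.DataFrame([[0.55, np.nan], [0.55, 0.72]])

s = final_df.stack().dropna()  # one entry per non-NaN cell
rows = pivot_df.index.get_indexer(s.index.get_level_values(0))
cols = pivot_df.columns.get_indexer(s)  # values of s are column labels
bingo_df = pd.Series(pivot_df.to_numpy()[rows, cols], index=s.index).unstack()
```

unstack() restores the original row/column layout, leaving NaN wherever final_df had NaN, exactly as the lookup-based answer does.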
I have 2 pandas dataframes. The second one is contained in the first one. How can I replace the values in the first one with the ones in the second?
consider this example:
df1 = pd.DataFrame(0, index=[1,2,3], columns=['a','b','c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])
ris = [[1, 0, 1],
       [1, 0, 1],
       [0, 0, 0]]
and ris has the same index and columns as df1
A possible solution is:
for i in df2.index:
    for j in df2.columns:
        df1.loc[i, j] = df2.loc[i, j]
But this is ugly
I think you can use copy with combine_first:
df3 = df1.copy()
df1[df2.columns] = df2[df2.columns]
print(df1.combine_first(df3))
a b c
1 1.0 0 1.0
2 1.0 0 1.0
3 0.0 0 0.0
The next solution creates an empty DataFrame df4 with the index and columns of df1, then fills it by applying combine_first twice:
df4 = pd.DataFrame(index=df1.index, columns=df1.columns)
df4 = df4.combine_first(df2).combine_first(df1)
print(df4)
a b c
1 1.0 0.0 1.0
2 1.0 0.0 1.0
3 0.0 0.0 0.0
Try
In [7]: df1['a'] = df2['a']
In [8]: df1['c'] = df2['c']
In [14]: df1[['a','c']] = df2[['a','c']]
If the column names are not known:
In [25]: for col in df2.columns:
   ....:     df1[col] = df2[col]
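For this alignment-based overwrite there is also a single built-in call, DataFrame.update, which aligns on both index and columns and modifies df1 in place, leaving non-overlapping cells untouched:

```python
import pandas as pd

df1 = pd.DataFrame(0, index=[1, 2, 3], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])

# aligns on index and columns, overwrites only the overlapping cells
df1.update(df2)
```

One thing to be aware of: update may upcast the affected columns to float (NaN appears during alignment), so cast back with astype if integer dtypes matter.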