I am trying to flatten a pandas DataFrame MultiIndex so that there is only a single-level index. The usual solution, based on any number of Stack Exchange posts, is to use df.reset_index, but that is just not fixing the problem.
I started out with an Xarray DataArray and converted it to a dataframe. The original dataframe looked like this.
             results
simdata      a_ss_yr  attr  attr1  attr2  attr3
run year
0   0              0     0      0      0      0
    1              1     6      2      0      4
    2              2     4      2      2      0
    3              3     1      0      0      1
    4              4     2      0      2      0
To flatten the index I used
df.reset_index(drop=True)
This only accomplished the following:
        run  year  results
simdata                    a_ss_yr  attr  attr1  attr2
0          0     0               0     0      0      0
1          0     1               1     6      2      0
2          0     2               2     4      2      2
3          0     3               3     1      0      0
4          0     4               4     2      0      2
I tried applying df.reset_index() more than once, but this still does not flatten the index, and I want to get it down to a single-level index.
More specifically, I need the "run" and "year" variables to move into the level-0 set of column names, and I need to remove the "results" heading entirely.
I have been reading the pandas documentation, but this kind of surgery on the index does not really seem to be described. Does anyone have a sense of how to do this?
First use droplevel to remove the first level of the column MultiIndex, and then reset_index:
df.columns = df.columns.droplevel(0)  # drop the 'results' level from the columns
df = df.reset_index()                 # move 'run' and 'year' out of the row index
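For reference, a minimal self-contained sketch of the whole round trip; the column names come from the example above, but the values are just illustrative zeros:
import pandas as pd

# rebuild a frame shaped like the one above: a single-value outer column
# level ('results') over the real column names, and a (run, year) row index
columns = pd.MultiIndex.from_product(
    [['results'], ['a_ss_yr', 'attr', 'attr1', 'attr2', 'attr3']])
index = pd.MultiIndex.from_product([[0], range(5)], names=['run', 'year'])
df = pd.DataFrame(0, index=index, columns=columns)

df.columns = df.columns.droplevel(0)  # flatten the columns
df = df.reset_index()                 # 'run' and 'year' become ordinary columns
print(df.columns.tolist())
# ['run', 'year', 'a_ss_yr', 'attr', 'attr1', 'attr2', 'attr3']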
My apologies beforehand! I have done this before a few times, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 that match a specific value in df1, while not changing the other values in df2. I can do this pretty easily with np.where on the columns of a single dataframe; the brain fog is about how I did this previously with two dataframes!
Goal: set values in df2 to 0 if they are 0 in df1; otherwise keep the df2 value.
Example
df1

   A  B  C
0  4  0  1
1  0  2  0
2  1  4  0
df2

   A  B  C
0  1  8  1
1  9  2  7
2  1  4  6
Expected df2 after our element swap:

   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
brain fog is bad! thank you for the assistance!
Using fillna
>>> df2[df1 != 0].fillna(0)
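Note that this returns a new frame rather than modifying df2 in place, and the intermediate NaNs upcast the result to float. A sketch of assigning it back and restoring integer dtype (assuming the data really is integer-valued):
# the NaN step upcasts to float, so cast back to int afterwards
df2 = df2[df1 != 0].fillna(0).astype(int)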
You can try
df2[df1.eq(0)] = 0
print(df2)
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
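For completeness, DataFrame.where expresses the same rule in a single call: it keeps a value where the condition is True and substitutes the second argument elsewhere. A minimal sketch, with the example values reconstructed from the tables above:
import pandas as pd

df1 = pd.DataFrame({'A': [4, 0, 1], 'B': [0, 2, 4], 'C': [1, 0, 0]})
df2 = pd.DataFrame({'A': [1, 9, 1], 'B': [8, 2, 4], 'C': [1, 7, 6]})

# keep df2's value wherever df1 is non-zero, otherwise substitute 0
df2 = df2.where(df1 != 0, 0)
print(df2)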
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random_sample(10), columns=list('A'))
df1['MATCHED'] = 0
df1
A MATCHED
0 0.424651 0
1 0.855567 0
2 0.983395 0
3 0.921866 0
4 0.001827 0
5 0.341491 0
6 0.055578 0
7 0.970564 0
8 0.078751 0
9 0.348055 0
Then I filter df1:
df1_slice = df1[df1['A'] <=0.4]
df1_slice
A MATCHED
4 0.001827 0
5 0.341491 0
6 0.055578 0
8 0.078751 0
9 0.348055 0
Now, I want to change the MATCHED column values for those rows in df1_slice:
df1.loc[df1_slice.index]['MATCHED']=1
I'd expect the MATCHED column changes in df1 from 0 to 1, but it doesn't.
df1.loc[df1_slice.index]
A MATCHED
4 0.001827 0
5 0.341491 0
6 0.055578 0
8 0.078751 0
9 0.348055 0
Why don't they change, and how can I change this script so that they are set to 1 in df1?
I find that using .loc throughout is the best way to (re-)assign to slices. In your example you use .loc only to slice on the index, not on the columns. This should do it (haven't tested):
df1.loc[df1_slice.index, 'MATCHED'] = 1
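A short self-contained sketch of the same idea (seeded so the run is reproducible; the 0.4 threshold comes from the question). Note that you do not even need the intermediate df1_slice, since the boolean condition can go straight into .loc:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility
df1 = pd.DataFrame({'A': rng.random(10)})
df1['MATCHED'] = 0

# one .loc call with both the row selector and the column label writes into
# df1 itself; df1.loc[...]['MATCHED'] = 1 writes into a temporary copy instead
df1.loc[df1['A'] <= 0.4, 'MATCHED'] = 1
print(df1)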
My input:
index frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
I also have two objects, start_frame and end_frame (pandas Series). start_frame looks like this:
index frame
3 3
and end_frame looks like this:
index frame
4 5
My problem is how to apply a function to a specific column (user1) over a specific range of rows, where the row values come from start_frame and end_frame.
I expect output like this:
frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 0
4 4 1 0
5 5 1 0
I tried this, but it either sets the whole column to ones or gives some other output that is not what I want:
def my_func(x):
    x = x + 1
    return x

df['user1'] = df['user1'].between(df['frame'] == 3, df['frame'] == 5, inclusive=False).apply(lambda x: my_func(x))
I tried another approach:
df['user1']=df.apply(lambda row: 1 if row['frame'] in (3,5) else 0, axis=1)
But it returns 1 only in rows 3 and 5; how do I turn (3, 5) into a range here?
So I have two questions: first, and most important, how do I apply my_func exactly to the rows I need; and second, how do I use my objects start_frame and end_frame instead of inserting the values manually?
Thank you
Updated:
arr_rang = range(3,6)
df['user1']=df.apply(lambda row: 1 if row['frame'] in (arr_rang) else 0, axis=1)
Now it returns 1 for frames 3, 4 and 5, which is what I need. But I still do not understand how to use my objects start_frame and end_frame.
Let's append start_frame and end_frame, since they have common columns, then check the values using isin(), and finally change the values using boolean masking and the loc accessor:
s = start_frame.append(end_frame)
# note: DataFrame.append was removed in pandas 2.0; on recent versions use
# s = pd.concat([start_frame, end_frame]) instead
mask = (df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask, 'user1'] = df.loc[mask, 'user1'] + 1
# you can also use np.where() in place of the loc accessor
output of df:
index frame user1 user2
0 0 0 0 0
1 1 1 0 0
2 2 2 0 0
3 3 3 1 0
4 4 4 1 0
5 5 5 1 0
Update:
use:
mask=df['frame'].between(3,5)
df.loc[mask,'user1']=df.loc[mask,'user1']+1
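To tie this back to start_frame and end_frame: assuming each of them holds a single row, as in the example, you can pull out the scalar frame numbers and feed them to between():
start = start_frame['frame'].iloc[0]  # 3 in the example
end = end_frame['frame'].iloc[0]      # 5 in the example

mask = df['frame'].between(start, end)  # inclusive on both ends by default
df.loc[mask, 'user1'] += 1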
Did you try
def putHello(row):
    row["hello"] = "world"
    return row

# apply over rows 5-6 only; note this returns a modified copy of those
# rows rather than changing data in place
data.iloc[5:7].apply(putHello, axis=1)
See the pandas documentation for DataFrame.iloc and DataFrame.apply.
I have a dataset whose features are words, like "see", "saw", "go", "play", etc., and I am doing some preprocessing on the columns, such as stemming. I want to add columns that are identical (or have the same meaning) to each other and then drop the extra column, like below.
For example, I have a dataset like,
   see  go  see
0    0   0    1
1    2   1    3
2    0   1    1
3    0   0    0
and I want to add one "see" to the other "see" and drop one of them, like below:
   see  go
0    1   0
1    5   1
2    1   1
3    0   0
How can I do this?
df.groupby(lambda x: x, axis=1).sum()  # group the columns by name and sum each group
   go  see
0   0    1
1   1    5
2   1    1
3   0    0
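Note that axis=1 in groupby is deprecated in recent pandas releases. An equivalent sketch that avoids it is to transpose, group on the index, and transpose back:
# duplicate column names become duplicate index labels after the
# transpose, which groupby(level=0) can then sum
res = df.T.groupby(level=0).sum().T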
You could use stack, groupby and then unstack:
res = df.stack().groupby(level=[0, 1]).sum().unstack()
print(res)
Output
   go  see
0   0    1
1   1    5
2   1    1
3   0    0
My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and for each row I would like to know the maximum value and the name of the column where that maximum is located, and append both as new columns to the existing dataframe.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
import pandas as pd

# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
   a  b  c  maxval
0  1  0  0       1
1  0  0  0       0
2  0  1  0       1
3  1  0  0       1
4  3  1  0       3
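As an aside, a shorter equivalent sketch for the maxval step:
# same maxval column in a single assignment
df['maxval'] = df[['a', 'b', 'c']].max(axis=1)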
So far so good. Now for the "add another column containing the name of the column where the max value can be found" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
   a  b  c  maxval  maxcol
0  1  0  0       1       a
1  0  0  0       0   a,b,c
2  0  1  0       1       b
3  1  0  0       1       a
4  3  1  0       3       a
Notice that I'd like all column names returned if multiple columns contain the same maximum value. Also notice that the maxval column is not included in maxcol, since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask per row, mask the column names, and join them:
In [183]:
# .ix has been removed from pandas; .loc does the same label slice here
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
   a  b  c  maxval  maxcol
0  1  0  0       1       a
1  0  0  0       0   a,b,c
2  0  1  0       1       b
3  1  0  0       1       a
4  3  1  0       3       a
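For what it's worth, a minimal alternative sketch that avoids the positional :'c' slice by naming the value columns explicitly. For the single best column, idxmax(axis=1) would be the idiomatic one-liner, but it cannot report ties, so the mask-and-join is still needed here:
value_cols = df[['a', 'b', 'c']]

# boolean frame marking, per row, every column equal to the row maximum
is_max = value_cols.eq(value_cols.max(axis=1), axis=0)
df['maxcol'] = is_max.apply(lambda row: ','.join(value_cols.columns[row]), axis=1)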