pandas plotting group by dataframe - python

I have this groupby result (I actually don't know what to call this type of table):
A  B   C
1  1   124284.312500
   2    64472.187500
   4    32048.910156
   8    16527.763672
   16    8841.874023
2  1    61971.035156
   2    31569.882812
   4    16000.071289
   8     7904.339844
   16    4046.967041
4  1    31769.435547
   2    15804.815430
   4     7917.609375
   8     4081.160400
   16    2034.404541
8  1    15738.752930
   2     7907.003418
   4     3972.494385
   8     1983.464478
   16    1032.913574
I want to plot a graph with A on the x-axis, C on the y-axis, and one line per value of B, with a legend.
The pandas documentation shows the kind of plot I'm after, but I've had no luck so far.
========== edited ==========
This is original dataframe
A B C
0 1 1 122747.722000
1 1 2 61839.731000
2 1 2 61839.762000
3 1 4 31736.405000
4 1 4 31736.559000
5 1 4 31787.312000
6 1 4 31787.833000
7 1 8 15872.596000
8 1 8 15865.406000
9 1 8 15891.001000
I have df = df.groupby(['A', 'B']).C.mean()
How can I plot the graph from this stacked table?
Thanks!

Use unstack:
df.unstack().plot()
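A minimal, self-contained sketch of that answer, using a subset of the numbers from the question:

```python
import pandas as pd

# Rebuild a small version of the grouped result from the question.
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 1, 2],
                   'C': [124284.3125, 64472.1875, 61971.035156, 31569.882812]})
s = df.groupby(['A', 'B'])['C'].mean()

# unstack moves the inner index level (B) into columns,
# giving one column (and later one plotted line) per B value.
wide = s.unstack()
print(wide)
# wide.plot() then draws A on the x-axis with one legend entry per B.
```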

How can I split pandas dataframe into groups of peaks

I have a dataset in an excel file I'm trying to analyse.
Example data:
Time in s Displacement in mm Force in N
0 0 Not Relevant
1 1 Not Relevant
2 2 Not Relevant
3 3 Not Relevant
4 2 Not Relevant
5 1 Not Relevant
6 0 Not Relevant
7 2 Not Relevant
8 3 Not Relevant
9 4 Not Relevant
10 5 Not Relevant
11 6 Not Relevant
12 5 Not Relevant
13 4 Not Relevant
14 3 Not Relevant
15 2 Not Relevant
16 1 Not Relevant
17 0 Not Relevant
18 4 Not Relevant
19 5 Not Relevant
20 6 Not Relevant
21 7 Not Relevant
22 6 Not Relevant
23 5 Not Relevant
24 4 Not Relevant
24 0 Not Relevant
Imported from an xls file and then plotting a graph of time vs displacement:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel(
    'DATA.xls',
    engine='xlrd', usecols=['Time in s', 'Displacement in mm', 'Force in N'])

fig, ax = plt.subplots()
ax.plot(df['Time in s'], df['Displacement in mm'])
ax.set(xlabel='Time (s)', ylabel='Disp', title='time disp')
ax.grid()
fig.savefig("time_disp.png")
plt.show()
I'd like to split the data into multiple groups to analyse separately.
So if I plot displacement against time, I get a sawtooth as a sample is being cyclically loaded.
I'd like to split the data so that each "tooth" is its own group or dataset so I can analyse each cycle
Can anyone help?
You can create a group column whose value changes at each local minimum. First get True at each local minimum, using two diff calls, one forward and one backward. Then use cumsum to increase the group number each time a local minimum is reached.
df['gr'] = (~(df['Deplacement'].diff(1) > 0)
            & ~(df['Deplacement'].diff(-1) > 0)).cumsum()
print(df)
Time Deplacement gr
0 0 0 1
1 1 1 1
2 2 2 1
3 3 3 1
4 4 2 1
5 5 1 1
6 6 0 2
7 7 2 2
8 8 3 2
9 9 4 2
10 10 5 2
11 11 6 2
12 12 5 2
13 13 4 2
14 14 3 2
15 15 2 2
16 16 1 2
17 17 0 3
18 18 4 3
19 19 5 3
You can split the data by selecting each group individually, or loop over the groups and do whatever analysis you want inside each iteration:
s = (~(df['Deplacement'].diff(1) > 0)
     & ~(df['Deplacement'].diff(-1) > 0)).cumsum()
for _, dfg in df.groupby(s):
    print(dfg)
    # analyze as needed
Edit: with the data in your question, where every minimum is exactly 0, df['gr'] = df['Deplacement'].eq(0).cumsum() would work as well, but that is specific to the minimum being exactly 0.
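A runnable sketch of the approach on a made-up sawtooth (the column name and values are stand-ins for the Excel data):

```python
import pandas as pd

# Hypothetical sawtooth standing in for the displacement column.
df = pd.DataFrame({'Displacement': [0, 1, 2, 3, 2, 1, 0, 2, 3, 4, 5, 6, 5]})

# A point is a local minimum when it is not above its previous
# neighbour and not above its next neighbour.
is_min = (~(df['Displacement'].diff(1) > 0)
          & ~(df['Displacement'].diff(-1) > 0))
df['gr'] = is_min.cumsum()   # group number grows at each minimum

for label, tooth in df.groupby('gr'):
    print(label, list(tooth['Displacement']))
```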

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
1. Divides two DataFrames with 3 columns in each df.
2. Calculates the mean of each row from the output of step 1.
3. Sums the averages from step 2.
This could be done with df.iterrows(), looping through each row. However, that would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling function that does this much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values) - 1) * 100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be slow when working with large datasets. Therefore, I've tried to create a function to apply to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop="True")
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSum = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
rolling(...).apply calls the function once per column, passing each window as a 1-D Series, so you get one value per column rather than a single scalar per row. After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) / 3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change:
df1.iloc[4:].rdiv(...
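For reference, the per-window scalar from the iterrows version can also be computed with a plain loop over window positions. This sketch mirrors the question's indexing (3-row window, first result at index 4) on the first 5 rows of df1; it is not a claim about the fastest method:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column1': [7, 2, 3, 1, 3],
                    'column2': [1, 5, 2, 4, 1],
                    'column3': [3, 6, 3, 9, 7]})
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

window = 3
out = []
for i in range(4, len(df1)):                      # mirrors "if index > 3"
    win = df1.iloc[i - window + 1:i + 1].to_numpy()
    div = np.abs((df2.to_numpy() / win - 1) * 100)  # whole 3x3 window at once
    out.append(div.mean(axis=0).sum())

print(out)   # first value matches the question's 330.42328...
```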

Melting values of multiple columns to single column based on another column value in pandas

I have a dataframe which looks like :
A      B  1  4
alpha  1  2  3
beta   4  5  6
gamma  4  8  9
df= pd.DataFrame([['alpha',1,2,3], ['beta', 4,5,6], ['gamma',4,8,9]], columns=['A','B', 1, 4])
I am now trying to map the value of column 'B' onto columns 1 and 4. The result dataframe should look like:
A B value
alpha 1 2
beta 4 6
gamma 4 9
I tried melt and stack but couldn't figure it out.
Let us try lookup:
df['value'] = df.lookup(df.index, df.B)
df
A B 1 4 value
0 alpha 1 2 3 2
1 beta 4 5 6 6
2 gamma 4 8 9 9
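Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent row-wise lookup with NumPy fancy indexing, using the dataframe from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['alpha', 1, 2, 3], ['beta', 4, 5, 6], ['gamma', 4, 8, 9]],
                  columns=['A', 'B', 1, 4])

# For each row, find the position of the column whose label equals B,
# then pick that cell with (row, column) integer indexing.
col_pos = df.columns.get_indexer(df['B'])
df['value'] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```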

How to correctly sort a multi-indexed pandas DataFrame

I have a multi-indexed pandas dataframe that looks like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
5 1 1.143599
2 1.151358
3 1.272172
10 1 1.765615
2 1.779330
3 1.752246
20 1 1.685807
2 1.688354
3 1.614013
..... ....
0 4 2.111466
5 1.933589
6 1.336527
5 4 2.006936
5 2.040884
6 1.430818
10 4 1.398334
5 1.594028
6 1.684037
20 4 1.529750
5 1.721385
6 1.608393
(Note that I've only posted one antibody; there are many analogous entries under the Antibody index, all with the same format.) Although I've left out the entries in the middle for the sake of space, you can see that I have 6 experimental repeats, but they are not organized properly. My question is: how would I get the DataFrame to aggregate all the repeats, so that the output looks something like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
..... ....
Thanks in advance
I think you need sort_index:
df = df.sort_index(level=[0,1,2])
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
Or you can omit the level parameter:
df = df.sort_index()
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
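A tiny reproduction with made-up numbers, showing that sort_index orders all index levels at once:

```python
import pandas as pd

# Deliberately out-of-order Repeats within each Time, as in the question.
idx = pd.MultiIndex.from_tuples(
    [('Akt', 0, 2), ('Akt', 0, 1), ('Akt', 5, 3), ('Akt', 5, 1)],
    names=['Antibody', 'Time', 'Repeats'])
s = pd.Series([1.85, 1.98, 1.27, 1.14], index=idx, name='col')

print(s.sort_index())   # Repeats now run in order within each Time
```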

How to remove ugly row in pandas.dataframe

So I am filling dataframes from 2 different files. While those 2 files should have the same structure (though the values should differ), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row, which doesn't make any sense to me and causes trouble in my following code. I neither wrote the code that fills the dataframes nor created those files, so I am interested in how to check whether such a row exists and how I might remove it. Does anyone have an idea?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is just the first column - by default it numbers the rows, but you can change it in a number of ways (e.g. filling it with values from one of the columns).
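To check for the name programmatically before stripping it, a short sketch (the condition simply guards the same assignment the answer uses):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df.index.name = 'index'          # simulate the file with the extra label

# Only touch the index if it actually carries a name.
if df.index.name is not None:
    df.index.name = None

print(df.index.name)   # None
```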
