I have a dataset in an Excel file that I'm trying to analyse.
Example data:
Time in s Displacement in mm Force in N
0 0 Not Relevant
1 1 Not Relevant
2 2 Not Relevant
3 3 Not Relevant
4 2 Not Relevant
5 1 Not Relevant
6 0 Not Relevant
7 2 Not Relevant
8 3 Not Relevant
9 4 Not Relevant
10 5 Not Relevant
11 6 Not Relevant
12 5 Not Relevant
13 4 Not Relevant
14 3 Not Relevant
15 2 Not Relevant
16 1 Not Relevant
17 0 Not Relevant
18 4 Not Relevant
19 5 Not Relevant
20 6 Not Relevant
21 7 Not Relevant
22 6 Not Relevant
23 5 Not Relevant
24 4 Not Relevant
24 0 Not Relevant
I import it from an xls file and then plot a graph of time vs. displacement:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('DATA.xls', engine='xlrd',
                   usecols=['Time in s', 'Displacement in mm', 'Force in N'])

fig, ax = plt.subplots()
ax.plot(df['Time in s'], df['Displacement in mm'])
ax.set(xlabel='Time (s)', ylabel='Disp', title='time disp')
ax.grid()
fig.savefig("time_disp.png")
plt.show()
I'd like to split the data into multiple groups to analyse separately.
When I plot displacement against time, I get a sawtooth, as the sample is being cyclically loaded.
I'd like to split the data so that each "tooth" is its own group or dataset, so I can analyse each cycle.
Can anyone help?
You can create a group column whose value changes at each local minimum. First, get True at each local minimum by combining two diff calls, one forward and one backward. Then use cumsum to increase the group number each time a local minimum occurs:
df['gr'] = (~(df['Displacement'].diff(1) > 0)
            & ~(df['Displacement'].diff(-1) > 0)).cumsum()
print(df)
Time Displacement gr
0 0 0 1
1 1 1 1
2 2 2 1
3 3 3 1
4 4 2 1
5 5 1 1
6 6 0 2
7 7 2 2
8 8 3 2
9 9 4 2
10 10 5 2
11 11 6 2
12 12 5 2
13 13 4 2
14 14 3 2
15 15 2 2
16 16 1 2
17 17 0 3
18 18 4 3
19 19 5 3
You can then split the data by selecting each group individually, or you can loop over the groups and do whatever analysis you want inside the loop:
s = (~(df['Displacement'].diff(1) > 0)
     & ~(df['Displacement'].diff(-1) > 0)).cumsum()
for _, dfg in df.groupby(s):
    print(dfg)
    # analyse each cycle as needed
Edit: for the data in your question, where every minimum is 0, df['gr'] = df['Displacement'].eq(0).cumsum() would work as well, but it is specific to the minimum being exactly 0.
I have a dataframe that looks like this:
x | time | zone
1 10 a
3 11 a
5 12 b
7 13 b
8 14 a
9 18 a
10 20 a
11 22 c
12 24 c
Imagine that zone is a state that changes over time. I would like to process each state individually so I can calculate some metrics for each state.
Basically, I want to divide the data frame into blocks, like this:
1st block:
x | time | zone
1 10 a
3 11 a
2nd block:
5 12 b
7 13 b
3rd block:
8 14 a
9 18 a
10 20 a
and so on. With this I can calculate metrics like the time spent in each state, the difference in x, etc.
How can I accomplish this using pandas?
Thanks!
The classical approach is to use this formula for generating groups of consecutive values.
It works by flagging True whenever the value changes from the previous row, then incrementing a counter for each change using cumsum:
group = df['zone'].ne(df['zone'].shift()).cumsum()
output:
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
Name: zone, dtype: int64
Then you can use it to group your data (here shown as a dictionary for the example):
dict(list(df.groupby(group)))
output:
{1: x time zone
0 1 10 a
1 3 11 a,
2: x time zone
2 5 12 b
3 7 13 b,
3: x time zone
4 8 14 a
5 9 18 a
6 10 20 a,
4: x time zone
7 11 22 c
8 12 24 c}
I have a dataframe that currently looks somewhat like this.
import numpy as np
import pandas as pd

# s and t are two 5x2 arrays (not shown) holding the values below
In [161]: pd.DataFrame(np.c_[s, t], columns=["M1", "M2", "M1", "M2"])
Out[161]:
M1 M2 M1 M2
6/7 1 2 3 5
6/8 2 4 7 8
6/9 3 6 9 9
6/10 4 8 8 10
6/11 5 10 20 40
Except, instead of just four columns, there are approximately 1000 columns, from M1 up to ~M340 (there are multiple columns with the same headers). I want to sum the values of columns with matching headers. Ideally, the result dataframe would look like:
M1_sum M2_sum
6/7 4 7
6/8 9 12
6/9 12 15
6/10 12 18
6/11 25 50
I wanted to somehow apply the groupby and sum functions, but was unsure how to do that with a dataframe where one header may match three other columns while another matches only one (or even none).
You probably want to group by the first level of the column index, along the second axis (axis=1), and then take the .sum():
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
M1_sum M2_sum
0 4 7
1 9 12
2 12 15
3 12 18
4 25 50
If we rename the last column to M1 instead, it will again group this correctly:
>>> df
M1 M2 M1 M1
0 1 2 3 5
1 2 4 7 8
2 3 6 9 9
3 4 8 8 10
4 5 10 20 40
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
M1_sum M2_sum
0 9 2
1 17 4
2 21 6
3 22 8
4 65 10
I have this groupby dataframe (I actually don't know what to call this type of table):
A B C
1 1 124284.312500
2 64472.187500
4 32048.910156
8 16527.763672
16 8841.874023
2 1 61971.035156
2 31569.882812
4 16000.071289
8 7904.339844
16 4046.967041
4 1 31769.435547
2 15804.815430
4 7917.609375
8 4081.160400
16 2034.404541
8 1 15738.752930
2 7907.003418
4 3972.494385
8 1983.464478
16 1032.913574
I want to plot a graph with A as the x-axis, C as the y-axis, and the B values as separate lines with a legend.
In the pandas documentation I found the kind of graph I'm trying to produce, but no luck yet.
Edit: this is the original dataframe:
A B C
0 1 1 122747.722000
1 1 2 61839.731000
2 1 2 61839.762000
3 1 4 31736.405000
4 1 4 31736.559000
5 1 4 31787.312000
6 1 4 31787.833000
7 1 8 15872.596000
8 1 8 15865.406000
9 1 8 15891.001000
I have df = df.groupby(['A', 'B']).C.mean()
How can I plot the graph from this stacked result?
Thanks!
Use unstack:
df.unstack().plot()
I have a dataframe like the one created below. The goal is to replace certain values with the previous value in the column.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li - 1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill:
import numpy as np

test.replace(1, np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
I want to make a new column with the 5-day return for a stock, let's say. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows the way I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index referencing and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 I want to find this ((day 6 price) - (day 1 price)), i.e. the current price minus the price 5 days earlier
7 20 then continue this down the list
8 19
9 21
10 22
Is this what you want:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the value at the given offset; we use it to subtract the shifted value from the current row. fillna fills the NaN values that occur before the first valid calculation.