Process dataframe as blocks with same column [duplicate] - python

This question already has answers here: How to groupby consecutive values in pandas DataFrame (4 answers). Closed last year.
I have a dataframe that looks like this:
x   time  zone
1   10    a
3   11    a
5   12    b
7   13    b
8   14    a
9   18    a
10  20    a
11  22    c
12  24    c
Imagine that zone is a state that changes over time. I would like to process each state individually so I can calculate some metrics per state.
Basically, I want to divide the data frame into blocks, like this:
1st block:
x   time  zone
1   10    a
3   11    a
2nd block:
5   12    b
7   13    b
3rd block:
8   14    a
9   18    a
10  20    a
and so on. With this I can calculate metrics like the time spent in each state, the difference in x, etc.
How can I accomplish this using pandas?
Thanks!

The classical approach is to use this formula for generating groups of consecutive values.
It works by setting a boolean flag (True) whenever the value changes, then taking the cumulative sum of the flags so the group number increments at each change:
group = df['zone'].ne(df['zone'].shift()).cumsum()
output:
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
Name: zone, dtype: int64
Then you can use it to groupby your data (shown here as a dictionary for the example):
dict(list(df.groupby(group)))
output:
{1: x time zone
0 1 10 a
1 3 11 a,
2: x time zone
2 5 12 b
3 7 13 b,
3: x time zone
4 8 14 a
5 9 18 a
6 10 20 a,
4: x time zone
7 11 22 c
8 12 24 c}
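From there, computing the metrics the question asks about (time spent in a state, x difference) is one groupby aggregation away. A minimal self-contained sketch, where the metric definitions are just one plausible reading of the question:
import pandas as pd

df = pd.DataFrame({'x': [1, 3, 5, 7, 8, 9, 10, 11, 12],
                   'time': [10, 11, 12, 13, 14, 18, 20, 22, 24],
                   'zone': list('aabbaaacc')})

# same group id for consecutive rows sharing a zone
group = df['zone'].ne(df['zone'].shift()).cumsum()

# per-block metrics: which zone, time spent in it, and change in x
metrics = df.groupby(group).agg(
    zone=('zone', 'first'),
    time_spent=('time', lambda s: s.max() - s.min()),
    x_diff=('x', lambda s: s.iloc[-1] - s.iloc[0]),
)
print(metrics)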

Related

Compare even and odd rows in a Pandas Data Frame

I have a data frame like this:
Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair: row 0 with row 1, row 2 with row 3, and so on. In other words, I want to compare even rows with odd rows and keep the pairs whose Ids match.
My ideal output would be:
Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[df[::2]["id"] == df[1::2]["id"]]
You can use a GroupBy.transform approach:
import numpy as np

# for each pair of rows, is there only one distinct Id?
out = df[df.groupby(np.arange(len(df)) // 2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying NumPy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
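One caveat for the NumPy version: it assumes an even number of rows, otherwise the repeated mask is one element short of the frame. A sketch that guards against that by leaving a trailing unpaired row unselected:
import numpy as np

a = df['Id'].to_numpy()
n = len(a) // 2 * 2                  # largest even prefix
mask = np.zeros(len(a), dtype=bool)  # a trailing unpaired row stays False
mask[:n] = np.repeat(a[:n:2] == a[1:n:2], 2)
out = df[mask]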

How can I split pandas dataframe into groups of peaks

I have a dataset in an Excel file that I'm trying to analyse.
Example data:
Time in s Displacement in mm Force in N
0 0 Not Relevant
1 1 Not Relevant
2 2 Not Relevant
3 3 Not Relevant
4 2 Not Relevant
5 1 Not Relevant
6 0 Not Relevant
7 2 Not Relevant
8 3 Not Relevant
9 4 Not Relevant
10 5 Not Relevant
11 6 Not Relevant
12 5 Not Relevant
13 4 Not Relevant
14 3 Not Relevant
15 2 Not Relevant
16 1 Not Relevant
17 0 Not Relevant
18 4 Not Relevant
19 5 Not Relevant
20 6 Not Relevant
21 7 Not Relevant
22 6 Not Relevant
23 5 Not Relevant
24 4 Not Relevant
24 0 Not Relevant
I import it from an xls file and then plot a graph of time vs displacement:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel(
    'DATA.xls',
    engine='xlrd',
    usecols=['Time in s', 'Displacement in mm', 'Force in N'])
fig, ax = plt.subplots()
ax.plot(df['Time in s'], df['Displacement in mm'])
ax.set(xlabel='Time (s)', ylabel='Disp', title='time disp')
ax.grid()
fig.savefig("time_disp.png")
plt.show()
I'd like to split the data into multiple groups to analyse separately.
So if I plot displacement against time, I get a sawtooth as a sample is being cyclically loaded.
I'd like to split the data so that each "tooth" is its own group or dataset, so I can analyse each cycle.
Can anyone help?
You can create a group column whose value changes at each local minimum. First flag each local minimum as True, using two diffs, one forward and one backward. Then use cumsum to increase the group number at each local minimum:
df['gr'] = (~(df['Deplacement'].diff(1) > 0)
            & ~(df['Deplacement'].diff(-1) > 0)).cumsum()
print(df)
Time Deplacement gr
0 0 0 1
1 1 1 1
2 2 2 1
3 3 3 1
4 4 2 1
5 5 1 1
6 6 0 2
7 7 2 2
8 8 3 2
9 9 4 2
10 10 5 2
11 11 6 2
12 12 5 2
13 13 4 2
14 14 3 2
15 15 2 2
16 16 1 2
17 17 0 3
18 18 4 3
19 19 5 3
You can split the data by selecting each group individually, or loop over the groups and process each one however you want:
s = (~(df['Deplacement'].diff(1) > 0)
     & ~(df['Deplacement'].diff(-1) > 0)).cumsum()

for _, dfg in df.groupby(s):
    print(dfg)
    # analyze as needed
Edit: for the data in your question, where each minimum is exactly 0, df['gr'] = df['Deplacement'].eq(0).cumsum() would work as well, but that is specific to the minimum being exactly 0.
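If the real measurements are noisy, so the minima are not exact values you can test for, scipy.signal.find_peaks on the negated signal is another way to find the split points. A sketch under that assumption (note find_peaks never flags the first or last sample):
import numpy as np
from scipy.signal import find_peaks

y = df['Displacement in mm'].to_numpy()
minima, _ = find_peaks(-y)     # indices of interior local minima

s = np.zeros(len(y), dtype=int)
s[minima] = 1
groups = s.cumsum()            # group id increments at each minimum

for _, tooth in df.groupby(groups):
    print(tooth)               # analyze each tooth separately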

Replace by previous values

I have a dataframe like the one built below. The goal is to replace a specific value with the previous one.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill:
import numpy as np

test.replace(1, np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
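An equivalent without replace is to mask the 1s (mask turns them into NaN) and forward-fill. A one-line sketch, with the caveat that a leading 1 would have nothing to fill from:
# blank out the 1s, then carry the previous value forward
test['A'] = test['A'].mask(test['A'].eq(1)).ffill().astype(int)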

I want to get the relative index of a column in a pandas dataframe

I want to make a new column with the 5-day return for a stock, say. I am using a pandas DataFrame. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows like I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day  price  5-day-return
1    10     -
2    11     -
3    15     -
4    14     -
5    12     -
6    18     <- I want to find this: (day 6 price) - (day 1 price)
7    20     then continue this down the list
8    19
9    21
10   22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the row at a specific offset; we use it to subtract the price from five rows earlier from the current row's price. fillna fills the NaN values that occur before the first valid calculation.
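Since subtracting a shifted series is exactly what diff does, the same column can be written more compactly; a sketch that produces the same output on this data:
df['5-day-return'] = df['price'].diff(5).fillna(0)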

Efficiently adding calculated rows based on index values to a pandas DataFrame

I have a pandas DataFrame in the following format:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
I want to append a calculated row that performs some math based on a given item's index value, e.g. adding a row that sums the values of all items with an index value < 2, with the new row having an index label of 'Red'. Ultimately, I am trying to add three rows that group the index values into categories:
A row with the sum of item values where index value are < 2, labeled as 'Red'
A row with the sum of item values where index values are 1 < x < 4, labeled as 'Blue'
A row with the sum of item values where index values are > 3, labeled as 'Green'
Ideal output would look like this:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Red 3 5 7
Blue 15 17 19
Green 27 29 31
My current solution involves transposing the DataFrame, applying a map function for each calculated column and then re-transposing, but I would imagine pandas has a more efficient way of doing this, likely using .append().
EDIT:
My inelegant pre-set-list solution (originally used .transpose(), but I improved it using .groupby() and .append()):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(18).reshape((6, 3)), columns=['a', 'b', 'c'])
df['x'] = ['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green']
df2 = df.groupby('x').sum()
df = df.append(df2)
del df['x']
I much prefer the flexibility of BrenBarn's answer (see below).
Here is one way:
def group(ix):
    if ix < 2:
        return "Red"
    elif 2 <= ix < 4:
        return "Blue"
    else:
        return "Green"
>>> print d
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
>>> print d.append(d.groupby(d.index.to_series().map(group)).sum())
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
For the general case, you need to define a function (or dict) to handle the mapping to different groups. Then you can just use groupby and its usual abilities.
For your particular case, it can be done more simply by directly slicing on the index value as Dan Allan showed, but that will break down if you have a more complex case where the groups you want are not simply definable in terms of contiguous blocks of rows. The method above will also easily extend to situations where the groups you want to create are not based on the index but on some other column (i.e., group together all rows whose value in column X is within range 0-10, or whatever).
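For the contiguous-range flavor of the problem, pd.cut can also generate the mapping instead of a hand-written function. A sketch (assuming pandas is imported as pd), with bin edges chosen to match the Red/Blue/Green definitions above:
# (-1, 1] -> Red, (1, 3] -> Blue, (3, 5] -> Green
labels = pd.cut(d.index, bins=[-1, 1, 3, 5],
                labels=['Red', 'Blue', 'Green'])
d.append(d.groupby(labels).sum())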
The role of "transpose," which you say you used in your unshown solution, might be played more naturally by the orient keyword argument, which is available when you construct a DataFrame from a dictionary.
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
In [24]: dict = {'Red': df.loc[:1].sum(),
                 'Blue': df.loc[2:3].sum(),
                 'Green': df.loc[4:].sum()}
In [25]: DataFrame.from_dict(dict, orient='index')
Out[25]:
a b c
Blue 15 17 19
Green 27 29 31
Red 3 5 7
In [26]: df.append(_)
Out[26]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
Based on the numbers in your example, I assume that by "> 4" you actually meant ">= 4".
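A note for anyone on current pandas: DataFrame.append was removed in pandas 2.0, so the append calls above translate to pd.concat. For example, for the groupby-based version:
import pandas as pd

summary = d.groupby(d.index.to_series().map(group)).sum()
d = pd.concat([d, summary])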
