Multiconditional average of values of columns of a categorical variable - python

I have a data frame that looks like this
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 12 11 12 12
C 11 11 11 12
C 12 11 11 14
C 12 11 11 14
C 11 12 12 14
C 11 12 11 16
C 11 12 12 16
D 20 21 21 12
Now I want to replace the values for a particular drug, 'C', with the average of the readings at each temperature, so that the final outcome looks something like this:
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 11.5 11 11.5 12
C 11.5 11 11.5 12
C 11.7 11.3 11.3 14
C 11.7 11.3 11.3 14
C 11.7 11.3 11.3 14
C 11 12 11 16
D 20 21 21 12
Another way to see this is:
DRUG READING1 READING2 READING3 TEMPERATURE
A 1 2 3 12
A 1 2 3 12
A 2 2 1 14
A 2 3 3 16
B 8 9 7 12
B 8 8 8 14
B 8 9 9 14
B 9 8 7 16
C 11.5 11 11.5 12
C 11.7 11.3 11.3 14
C 11 12 11 16
D 20 21 21 12

This is groupby().mean():
df.groupby(['DRUG', 'TEMPERATURE'], as_index=False).mean()
Output:
DRUG TEMPERATURE READING1 READING2 READING3
0 A 12 1.000000 2.000000 3.000000
1 A 14 2.000000 2.000000 1.000000
2 A 16 2.000000 3.000000 3.000000
3 B 12 8.000000 9.000000 7.000000
4 B 14 8.000000 8.000000 8.000000
5 B 16 9.000000 8.000000 7.000000
6 C 12 11.500000 11.000000 11.500000
7 C 14 11.666667 11.333333 11.333333
8 C 16 11.000000 12.000000 11.500000
9 D 12 20.000000 21.000000 21.000000
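If you need the first layout instead (every row kept, but drug 'C' replaced by its per-temperature means), groupby().transform('mean') applied only to the 'C' rows is one way. A minimal sketch, assuming df is the frame shown above:
reading_cols = ['READING1', 'READING2', 'READING3']
is_c = df['DRUG'].eq('C')
# cast to float first so the per-temperature means fit without dtype warnings
df[reading_cols] = df[reading_cols].astype(float)
# per-temperature means, broadcast back to row shape, written only into the 'C' rows
df.loc[is_c, reading_cols] = (
    df[is_c].groupby('TEMPERATURE')[reading_cols].transform('mean')
)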

Related

How do I classify a dataframe in a specific case?

I have a pandas.DataFrame of the following form; I'll show a simple example (in reality it consists of hundreds of millions of rows).
I want to assign a new number each time the letter code in column '2' changes. The numbers in the remaining columns (columns 1, 3, ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create consecutive group numbers: compare each value with the shifted value for inequality, take the cumulative sum, and then subtract 1:
# if the column label is the string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
# if the column label is the number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print (df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
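To see why this works, here is a quick sketch of the intermediate steps on a toy column (the values here are illustrative only):
import pandas as pd

s = pd.Series(['a100', 'a100', 'a105', 'a100', 'a100'])

changed = s.ne(s.shift())      # True at the first row and wherever the value changes
groups = changed.cumsum()      # running count of changes: 1, 1, 2, 3, 3
print(groups.sub(1).tolist())  # [0, 0, 1, 2, 2] - zero-based group ids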

Moving average for the last row of a data frame

I have a data frame with two prices and a moving average (window=3) for each price:
price1 price2 MA3-price1 MA3-price2
18     10
12     9
20     15     16.66      11.33
12     7      14.66      10.33
4      9      12         10.33
6      4      NaN        NaN
I don't have the MA for the last row. How can I calculate the MA for the last row and get:
price1 price2 MA3-price1 MA3-price2
18     10
12     9
20     15     16.66      11.33
12     7      14.66      10.33
4      9      12         10.33
6      4      7.33       6.66
To compute "MA3-price1" and "MA3-price2" columns from "price1" and "price2", try:
df[["MA3-price1", "MA3-price2"]] = df.rolling(3).mean()
print(df)
Prints:
price1 price2 MA3-price1 MA3-price2
0 18 10 NaN NaN
1 12 9 NaN NaN
2 20 15 16.666667 11.333333
3 12 7 14.666667 10.333333
4 4 9 12.000000 10.333333
5 6 4 7.333333 6.666667
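If you only want to fill in the missing last row and leave the already-computed (rounded) MA values untouched, one option is to compute the rolling means separately and use them only for the gaps. A rough sketch under the same column names:
ma = df[["price1", "price2"]].rolling(3).mean()
ma.columns = ["MA3-price1", "MA3-price2"]  # align names with the existing MA columns

# fillna only touches NaN cells, so existing MA values are kept as-is
df = df.fillna(ma)
print(df)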

Operations in dataframe

I have CSV data. This dataset has different latitude locations from 17 to 20, and each location has monthly data, i.e. (1, 2, 3, 4, 5, 6, ...).
I would like to add a new column named N whose value depends on the latitude and the month: for each row, put the number associated with that latitude and month.
Input data
lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
.
.
.
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
.
.
.
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
.
.
.
.
The N values depend on latitude and month:
17 18 19 20 21
1 25 29 13 13 2
2 22 11 1 16 23
3 8 13 10 21 8
4 4 14 16 10 13
5 23 30 8 8 18
6 16 4 7 5 29
7 26 5 10 25 28
8 3 16 2 27 2
9 21 16 23 8 7
10 19 30 10 28 20
11 28 18 12 6 8
12 21 14 26 3 8
EXPECTED OUTPUT
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
.
.
.
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
.
.
.
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
.
.
.
.
Reshape the lookup table df2 with DataFrame.unstack into a long Series, rename its MultiIndex levels so they can be matched against the lan1 and month columns of df, and attach it with DataFrame.join:
df = pd.read_csv(data, sep="/")
# long-form Series: (lan1, month) -> N
s = df2.rename(columns=int).unstack().rename_axis(['lan1','month'])
# latitudes in df are negative, the lookup table uses positive values
df['lan1'] = df['lan'].abs()
df2 = df.join(s.rename('N'), on=['lan1','month']).drop('lan1', axis=1)
print (df2)
lan lon year month prec N
0 -17 18 1990 1 0.40 25
1 -17 18 1990 2 0.02 22
2 -17 18 1990 3 0.12 8
3 -17 18 1990 4 0.06 4
4 -17 18 2020 12 0.35 21
5 -17 20 1990 1 0.20 25
6 -17 20 1990 2 0.20 22
7 -17 20 1990 3 0.20 8
8 -17 20 1990 4 0.20 4
9 -17 20 2020 12 0.08 21
10 -18 20 1990 1 0.11 29
11 -18 20 1990 2 0.11 11
12 -18 20 1990 3 0.11 13
print (df2.to_csv(sep='/', index=False))
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
print (s)
lan1 month
17 1 25
2 22
3 8
4 4
5 23
6 16
7 26
8 3
9 21
10 19
11 28
12 21
18 1 29
2 11
3 13
4 14
5 30
6 4
7 5
8 16
9 16
10 30
11 18
12 14
19 1 13
2 1
3 10
4 16
5 8
6 7
7 10
8 2
9 23
10 10
11 12
12 26
20 1 13
2 16
3 21
4 10
5 8
6 5
7 25
8 27
9 8
10 28
11 6
12 3
21 1 2
2 23
3 8
4 13
5 18
6 29
7 28
8 2
9 7
10 20
11 8
12 8
dtype: int64
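An alternative that some may find easier to read is to stack the lookup table into a long frame and merge. A rough sketch under the same assumptions about df and df2 (months 1-12 as the index of df2, latitudes as its columns):
lookup = (
    df2.rename(columns=int)                        # make the latitude labels integers
       .rename_axis(index='month', columns='lan1')
       .stack()                                    # long form: (month, lan1) -> value
       .rename('N')
       .reset_index()
)
out = (
    df.assign(lan1=df['lan'].abs())                # helper column to match the positive latitudes
      .merge(lookup, on=['lan1', 'month'], how='left')
      .drop(columns='lan1')
)
print(out)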

Merge rows in Pandas dataframe based on content, with multi-column grouping [duplicate]

I have a dataframe that looks like this:
time speaker label_1 label_2
0 0.25 1 10 4
1 0.25 2 10 5
2 0.50 1 10 6
3 0.50 2 10 7
4 0.75 1 10 8
5 0.75 2 10 9
6 1.00 1 10 11
7 1.00 2 10 12
8 1.25 1 11 13
9 1.25 2 11 14
10 1.50 1 11 15
11 1.50 2 11 16
12 1.75 1 11 17
13 1.75 2 11 18
14 2.00 1 11 19
15 2.00 2 11 20
The 'speaker' column holds 1 or 2 to distinguish the two speakers at a given timestamp. I want to make new columns from the 'label_1' and 'label_2' data so that each column is associated with only one speaker. See below for the desired output.
time spk_1_label_1 spk_2_label_1 spk_1_label_2 spk_2_label_2
0.25 10 10 4 5
0.50 10 10 6 7
0.75 10 10 8 9
1.00 10 10 11 12
1.25 11 11 13 14
1.50 11 11 15 16
1.75 11 11 17 18
2.00 11 11 19 20
First we use pivot_table to pivot our rows into columns. Then we build the desired column names with a list comprehension and f-strings:
piv = df.pivot_table(index='time', columns='speaker')
piv.columns = [f'spk_{col[1]}_{col[0]}' for col in piv.columns]
spk_1_label_1 spk_2_label_1 spk_1_label_2 spk_2_label_2
time
0.25 10 10 4 5
0.50 10 10 6 7
0.75 10 10 8 9
1.00 10 10 11 12
1.25 11 11 13 14
1.50 11 11 15 16
1.75 11 11 17 18
2.00 11 11 19 20
If you want to remove the index name:
piv.rename_axis(None, inplace=True)
spk_1_label_1 spk_2_label_1 spk_1_label_2 spk_2_label_2
0.25 10 10 4 5
0.50 10 10 6 7
0.75 10 10 8 9
1.00 10 10 11 12
1.25 11 11 13 14
1.50 11 11 15 16
1.75 11 11 17 18
2.00 11 11 19 20
Extra
If you want, we can make this more general by using the columns' level name as the prefix (apply this instead of the manual 'spk' flattening above):
piv.columns = [f'{piv.columns.names[1]}_{col[1]}_{col[0]}' for col in piv.columns]
speaker_1_label_1 speaker_2_label_1 speaker_1_label_2 speaker_2_label_2
time
0.25 10 10 4 5
0.50 10 10 6 7
0.75 10 10 8 9
1.00 10 10 11 12
1.25 11 11 13 14
1.50 11 11 15 16
1.75 11 11 17 18
2.00 11 11 19 20
Note: f-strings require Python 3.6 or newer; on older versions use .format for the string formatting:
['spk_{}_{}'.format(col[1], col[0]) for col in piv.columns]
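Because each (time, speaker) pair occurs only once, a plain pivot (no aggregation) also works. A small sketch under the same column names:
piv = df.pivot(index='time', columns='speaker', values=['label_1', 'label_2'])
piv.columns = [f'spk_{spk}_{label}' for label, spk in piv.columns]  # flatten the MultiIndex
piv = piv.reset_index()
print(piv)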

Cutting every nth row in dataframe in Python

I have a dataframe with columns like so:
x y z
1 10 20
2 10 18
3 11 16.5
4 11 12
5 12 23
6 11 21
7 10 19
8 10 26
.
.
Every time z_(n+1) is greater than z_n, I want to cut (drop) the row containing z_n.
The output would be:
x y z
1 10 20
2 10 18
3 11 16.5
5 12 23
6 11 21
8 10 26
.
.
It doesn't occur every x-many times - the index of each change from smaller to larger z_n is not 'regular'.
Is there an easy way to do this?
We can use shift(-1) to look one row ahead and take the inverse with ~:
df[~(df['z'].shift(-1) > df['z'])]
x y z
0 1 10 20.0
1 2 10 18.0
2 3 11 16.5
4 5 12 23.0
5 6 11 21.0
7 8 10 26.0
Try:
df[~(df.z.diff(-1) < 0)]
Output:
x y z
0 1 10 20.0
1 2 10 18.0
2 3 11 16.5
4 5 12 23.0
5 6 11 21.0
7 8 10 26.0
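For clarity, here is a quick look at the forward-looking mask itself (a sketch, assuming df is the frame above):
mask = df['z'].shift(-1) > df['z']  # True where the next z is larger
print(mask)                         # rows 3 and 6 are True and get dropped
print(df[~mask])                    # same result as df[~(df.z.diff(-1) < 0)]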
