Here is my code:
>>> import pandas as pd
>>> df = pd.read_csv('Grade.txt',index_col=0,header=None)
>>> print(df)
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
>>> print(df.mean(axis=0))
1 8.666667
2 8.166667
3 8.333333
4 9.333333
5 8.833333
6 19.166667
7 18.833333
8 18.666667
9 44.166667
10 92.500000
Right now they are labeled 1-10 and I want the rows to look like this:
Homework #1 8.67
Homework #2 8.17
Homework #3 8.33
Homework #4 9.33
Homework #5 8.83
Quiz #1 19.17
Quiz #2 18.83
Quiz #3 18.67
Midterm #1 44.17
Final #1 92.50
I'm just looking for the right way to go about the labeling. So instead of 1-10 I'm looking for (Homework#1, Homework#2, Homework#3, etc.) Thanks
Absent logic to derive the column names from the column numbers, you'll probably need to label the columns with a simple list of the names:
cols
['Homework #1', 'Homework #2', 'Homework #3', 'Homework #4', 'Homework #5', 'Quiz #1', 'Quiz #2', 'Quiz #3', 'Midterm #1', 'Final #1']
df
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
df.columns = cols
df.mean()
Homework #1 8.666667
Homework #2 8.166667
Homework #3 8.333333
Homework #4 9.333333
Homework #5 8.833333
Quiz #1 19.166667
Quiz #2 18.833333
Quiz #3 18.666667
Midterm #1 44.166667
Final #1 92.500000
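If typing the list out feels error-prone, the names can also be generated, since they follow a fixed pattern. A minimal sketch, assuming the per-type counts read off the data shown (five homeworks, three quizzes, one midterm, one final):

```python
import pandas as pd

# Generate the labels instead of typing them out; the (type, count) pairs
# are assumptions inferred from the grades shown in the question.
sections = [('Homework', 5), ('Quiz', 3), ('Midterm', 1), ('Final', 1)]
cols = [f'{name} #{i}' for name, n in sections for i in range(1, n + 1)]

# Two sample rows from the question, to show the relabeling end to end.
df = pd.DataFrame([[10, 9, 7, 9, 10, 20, 19, 19, 45, 92],
                   [9, 9, 8, 9, 8, 20, 20, 18, 43, 95]],
                  index=['Sarah K.', 'John M.'])
df.columns = cols
print(df.mean().round(2))   # means labeled 'Homework #1' ... 'Final #1'
```

The `round(2)` matches the two-decimal formatting in the desired output.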
I have a pandas.DataFrame of the form shown below (a simplified example; in reality it has hundreds of millions of rows).
I want a number in column '2' that increments each time the letter in that column changes. The numbers in the remaining columns (1, 3, ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create consecutive group numbers: compare the column with its values shifted by one row, take the cumulative sum of the "not equal" flags, and then subtract 1 so the numbering starts at 0:
#if column is string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
#if column is number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print (df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
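For reference, the same trick end to end on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'2': ['a100', 'a100', 'a105', 'a105', 'a100']})

# True wherever the value differs from the previous row; the cumulative sum
# then numbers each run of equal values, and sub(1) makes it start at 0.
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
print(df['2'].tolist())  # [0, 0, 1, 1, 2]
```

Note the final 'a100' starts a new group (2) rather than reusing group 0: the numbering follows consecutive runs, not unique values, which matches the expected output where a100 maps to 0, 2 and 5.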
I have CSV data; the dataset has different latitude locations from 17 to 20, and each location has monthly data, i.e. (1, 2, 3, 4, 5, 6, ...).
I would like to add a new column named N: for each row, look up the latitude and the month and fill in the associated number from the table below.
Input data
lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
.
.
.
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
.
.
.
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
.
.
.
.
N values depend on lat and month
17 18 19 20 21
1 25 29 13 13 2
2 22 11 1 16 23
3 8 13 10 21 8
4 4 14 16 10 13
5 23 30 8 8 18
6 16 4 7 5 29
7 26 5 10 25 28
8 3 16 2 27 2
9 21 16 23 8 7
10 19 30 10 28 20
11 28 18 12 6 8
12 21 14 26 3 8
EXPECTED OUTPUT
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
.
.
.
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
.
.
.
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
.
.
.
.
Use DataFrame.unstack to reshape the second DataFrame df2 into a Series, rename the MultiIndex levels so they match the lan1 and month columns of df, and then attach the values with DataFrame.join:
df = pd.read_csv(data, sep="/")
s = df2.rename(columns=int).unstack().rename_axis(['lan1','month'])
df['lan1'] = df['lan'].abs()
df2 = df.join(s.rename('N'), on=['lan1','month']).drop('lan1', axis=1)
print (df2)
lan lon year month prec N
0 -17 18 1990 1 0.40 25
1 -17 18 1990 2 0.02 22
2 -17 18 1990 3 0.12 8
3 -17 18 1990 4 0.06 4
4 -17 18 2020 12 0.35 21
5 -17 20 1990 1 0.20 25
6 -17 20 1990 2 0.20 22
7 -17 20 1990 3 0.20 8
8 -17 20 1990 4 0.20 4
9 -17 20 2020 12 0.08 21
10 -18 20 1990 1 0.11 29
11 -18 20 1990 2 0.11 11
12 -18 20 1990 3 0.11 13
print (df2.to_csv(sep='/', index=False))
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
print (s)
lan month
17 1 25
2 22
3 8
4 4
5 23
6 16
7 26
8 3
9 21
10 19
11 28
12 21
18 1 29
2 11
3 13
4 14
5 30
6 4
7 5
8 16
9 16
10 30
11 18
12 14
19 1 13
2 1
3 10
4 16
5 8
6 7
7 10
8 2
9 23
10 10
11 12
12 26
20 1 13
2 16
3 21
4 10
5 8
6 5
7 25
8 27
9 8
10 28
11 6
12 3
21 1 2
2 23
3 8
4 13
5 18
6 29
7 28
8 2
9 7
10 20
11 8
12 8
dtype: int64
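A cut-down, self-contained version of the same join (the toy frames below use only the first months and latitudes from the question):

```python
import pandas as pd

df = pd.DataFrame({'lan': [-17, -17, -18],
                   'lon': [18, 18, 20],
                   'year': [1990, 1990, 1990],
                   'month': [1, 2, 1],
                   'prec': [0.4, 0.02, 0.11]})

# Lookup table: rows are months, columns are (positive) latitudes.
df2 = pd.DataFrame({17: [25, 22], 18: [29, 11]}, index=[1, 2])

# Flatten the table to a Series keyed by (latitude, month).
s = df2.unstack().rename_axis(['lan1', 'month'])

# Latitudes in df are negative, while the table uses positive values.
df['lan1'] = df['lan'].abs()
out = df.join(s.rename('N'), on=['lan1', 'month']).drop('lan1', axis=1)
print(out['N'].tolist())  # [25, 22, 29]
```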
I have created a dataframe from an Excel sheet, then filtered it to rows where the Date_rank column is less than 10; the resulting dataframe is named filtered.
I've then used g = filtered.groupby('Well_name') to segregate the data by well.
Now that the data is grouped by Well_name, how can I find the standard deviation of RandomNumber within each group (giving me the stdev of each well's RandomNumbers)? Or was the groupby not necessary?
df = pd.read_csv('here.csv')
print(df)
filtered = df[df['Date_rank']<10] #filter the datafram to less than 10
print(filtered)
g = filtered.groupby('Well_name') #grouped the data to segregate by well name
Here is my data
Well_name Date_rank RandomNumber
0 Velta 1 4
1 Velta 2 5
2 Velta 3 2
3 Velta 4 4
4 Velta 5 4
5 Velta 6 9
6 Velta 7 0
7 Velta 8 9
8 Velta 9 1
9 Velta 10 3
10 Velta 11 8
11 Velta 12 3
12 Velta 13 10
13 Velta 14 10
14 Velta 15 0
15 Ronnie 1 8
16 Ronnie 2 1
17 Ronnie 3 6
18 Ronnie 4 2
19 Ronnie 5 2
20 Ronnie 6 9
21 Ronnie 7 6
22 Ronnie 8 5
23 Ronnie 9 2
24 Ronnie 10 1
25 Ronnie 11 3
26 Ronnie 12 3
27 Ronnie 13 4
28 Ronnie 14 0
29 Ronnie 15 4
You should be able to solve the problem with groupby() as you stated. The code you should use is the following:
g = filtered.groupby('Well_name')['RandomNumber'].std()
Or using .agg() (note the aggregation is passed as the string 'std', not 'np.std'):
g = filtered.groupby('Well_name').agg({'RandomNumber': 'std'})
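A self-contained sketch on the first few rows of each well (note that pandas computes the sample standard deviation, ddof=1, by default):

```python
import pandas as pd

# First three rows of each well from the question's data.
df = pd.DataFrame({'Well_name': ['Velta'] * 3 + ['Ronnie'] * 3,
                   'Date_rank': [1, 2, 3, 1, 2, 3],
                   'RandomNumber': [4, 5, 2, 8, 1, 6]})

filtered = df[df['Date_rank'] < 10]
g = filtered.groupby('Well_name')['RandomNumber'].std()
print(g)  # one sample standard deviation per well
</```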
I have a pandas dataframe with around 15 columns. All I am trying to do is check, for each partition_num, whether the data in its first row is equal to the data in its last row; if it is not equal, add a new row at the end of that partition with the data from the first row.
Input:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 25 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
Desired output:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 25 26 9
3 4 7333 24 26 9
4 1 8999 26 18 15
5 2 8999 15 17 45
6 3 8999 26 18 15
7 1 3455 12 14 18
8 2 3455 12 14 18
Since the data for partition_num 7333 in row 0 is not equal to the data in row 2, a new row (row 3) is added with the same data as row 0.
Can we also add a new column, something like flag, to identify the new record?
row id partition_num lat long time flag
0 1 7333 24 26 9 old
1 2 7333 15 19 10 old
2 3 7333 25 26 9 old
3 4 7333 24 26 9 new
4 1 8999 26 18 15 old
5 2 8999 15 17 45 old
6 3 8999 26 18 15 old
7 1 3455 12 14 18 old
8 2 3455 12 14 18 old
groupby will easily build sub-dataframes per partition_num. From that point the processing is simple:
for i, x in df.groupby('partition_num'):
    if (x.iloc[0]['partition_num':] != x.iloc[-1]['partition_num':]).any():
        s = x.iloc[0].copy()
        s['id'] = x.iloc[-1]['id'] + 1
        # DataFrame.append was removed in pandas 2.0; concat does the same job
        df = pd.concat([df, s.to_frame().T]).reset_index(drop=True).rename_axis('row')
The following code compares the values of 'partition_num' in the first and last row, and if they don't match, appends the first row onto the end of the data frame:
if df.loc[0, 'partition_num'] != df.loc[len(df) - 1, 'partition_num']:
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    df = pd.concat([df, df.loc[[0]]]).reset_index(drop=True)
df.index.name = 'row'
print(df)
id partition_num lat long time
row
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 26 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
8 1 7333 24 26 9
The index column is set to 'row', and it is reset and renamed to get the correct ordering.
I added this line to the logic above:
s['flag'] = 'new_row'
and it worked!
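Putting the pieces together, here is a self-contained sketch of the per-partition check plus the flag column, written for pandas 2.x where DataFrame.append no longer exists (the data is a shortened version of the question's input):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 1, 2],
                   'partition_num': [7333, 7333, 7333, 3455, 3455],
                   'lat': [24, 15, 24, 12, 12],
                   'long': [26, 19, 25, 14, 14],
                   'time': [9, 10, 9, 18, 18]})

df['flag'] = 'old'
new_rows = []
for _, x in df.groupby('partition_num', sort=False):
    first, last = x.iloc[0], x.iloc[-1]
    if not first.drop('id').equals(last.drop('id')):  # ignore the id column
        s = first.copy()
        s['id'] = last['id'] + 1
        s['flag'] = 'new'
        new_rows.append(s)
if new_rows:
    df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(df)
```

Here partition 7333's first and last rows differ (long 26 vs 25), so its first row is re-appended with the next id and flag 'new'; partition 3455's rows match, so nothing is added. Like the answers above, this appends at the end of the frame rather than directly after the partition.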
I'm working with pandas DataFrames and I want to group my data by category without computing any mean or median of the other features (PriceBucket, success_rate and products_by_number). My DataFrame looks like this:
PriceBucket success_rate products_by_number category
0 0 6.890 149837 10
1 1 7.240 105447 10
2 2 7.710 145295 10
3 3 8.090 181323 10
4 4 8.930 57187 10
5 5 8.110 133449 10
6 6 7.920 142858 10
7 7 8.230 115109 10
8 8 8.510 121930 10
9 9 8.340 122510 10
10 0 10.520 28105 20
11 1 9.770 27494 20
12 2 10.080 26758 20
13 3 10.180 29973 20
14 4 9.860 29175 20
15 5 9.950 23807 20
16 6 9.550 30520 20
17 7 9.550 23653 20
18 8 8.990 27514 20
19 9 6.710 26152 20
20 0 11.060 39538 60
21 1 10.740 34479 60
22 2 10.700 36133 60
23 3 10.900 34220 60
24 4 11.290 46001 60
25 5 11.130 26705 60
26 6 11.040 37258 60
27 7 11.150 34561 60
28 8 10.845 35495 60
29 9 10.220 35434 60
30 0 8.380 34134 90
31 1 7.920 32160 90
32 2 8.170 29500 90
33 3 8.270 31688 90
34 4 8.395 38977 90
35 5 8.620 27130 90
36 6 8.440 31007 90
37 7 8.570 31005 90
38 8 8.170 32659 90
39 9 7.290 30227 90
And this is exactly what I want :
PriceBucket success_rate products_by_number
category
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
What should I do? Many thanks.
Assuming your dataframe is df, you want:
print(df.set_index(['category', 'PriceBucket']))
success_rate products_by_number
category PriceBucket
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
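In other words, no aggregation is needed at all. A cut-down sketch (two categories, two buckets each, values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'PriceBucket': [0, 1, 0, 1],
                   'success_rate': [6.89, 7.24, 10.52, 9.77],
                   'products_by_number': [149837, 105447, 28105, 27494],
                   'category': [10, 10, 20, 20]})

# Moving the two key columns into the index produces the grouped display.
out = df.set_index(['category', 'PriceBucket'])
print(out)

# If the file's rows were not already ordered by category, sorting the
# MultiIndex keeps each category's buckets together:
out = out.sort_index()
```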