Calculating daily, weekly and monthly mean in Python (pandas)

I have a file with different readings in columns. I can compute the daily mean with the following code, which works when the month, day and time are separated by spaces. How can I do the same when the date (day/month/year) is in the first column and the time is in the second column? How can I calculate daily, weekly and monthly averages? Please note that the data is not equally sampled.
import pandas as pd
import numpy as np
df = pd.read_csv("Data_set.csv", sep=r'\s+', names=["month", "day", "time", "Temperature"])
group = df.groupby(["month", "day"])
daily = group.aggregate({"Temperature": "mean"})
daily.to_csv('daily.csv')
Date Time T1 T2 T3
17/12/2013 00:28:38 19 23.1 7.3
17/12/2013 00:58:38 19 22.9 7.3
17/12/2013 01:28:38 18.9 22.8 6.3
17/12/2013 01:58:38 18.9 23.1 6.3
17/12/2013 02:28:38 18.8 23 6.3
.....
.....
24/12/2013 19:58:21 14.7 15.5 7
24/12/2013 20:28:21 14.7 15.5 7
24/12/2013 20:58:21 14.7 15.5 7
24/12/2013 21:28:21 14.7 15.6 6
24/12/2013 21:58:21 14.7 15.5 6
24/12/2013 22:28:21 14.7 15.5 5
24/12/2013 22:58:21 14.7 15.5 4
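One possible approach, sketched here under the assumption that the file is whitespace-separated exactly as in the sample above: combine the Date and Time columns into a single timestamp index and let pandas group by calendar day, week and month. The sample rows are embedded as a string (the name raw is only for illustration) so the snippet runs on its own. Because the averages are taken over whatever timestamps fall into each period, unequal sampling does not break the grouping, although every reading is weighted equally.

```python
import io
import pandas as pd

# A few rows in the layout shown above: date, time, then three readings.
raw = """Date Time T1 T2 T3
17/12/2013 00:28:38 19 23.1 7.3
17/12/2013 00:58:38 19 22.9 7.3
17/12/2013 01:28:38 18.9 22.8 6.3
24/12/2013 21:58:21 14.7 15.5 6
24/12/2013 22:28:21 14.7 15.5 5
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Build one timestamp per row; the dates are day-first (dd/mm/yyyy).
df.index = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S")
df = df.drop(columns=["Date", "Time"])

daily = df.resample("D").mean()    # one row per calendar day (NaN for days without data)
weekly = df.resample("W").mean()   # one row per week, labelled by the Sunday that ends it
monthly = df.groupby(df.index.to_period("M")).mean()  # one row per calendar month

print(daily.dropna())
print(weekly)
print(monthly)
```

Days with no readings appear as NaN rows in daily; dropna() removes them if they are not wanted.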

Related

How to append/concat second row with the first row within the similar dataframe in pandas

I have a dataframe
0 1 2 3 ............ 1041 1042 1043
0 32.5 19.4 66.6 91.4 55.5 10.4 77.2
1 13.3 85.3 22.4 65.8 23.4 90.2 14.5
2 22.4 91.7 57.1 23.5 58.2 81.5 46.7
3 75.7 47.1 91.4 45.2 89.7 38.7 78.3
.
.
.
.
18 32.5 23.4 90.2 14.5 91.7 57.1 23.5
19 56.7 58.2 81.5 46.7 65.5 43.4 76.2
20 76.8 19.4 66.6 91.4 54.9 60.4 96.4
The dataframe has 20 rows and 1044 columns. I want to append/concat the 2nd row to the 1st row, the 4th row to the 3rd row, the 6th row to the 5th row, and so on. In this way, the dataframe will end up with 10 rows and 2088 columns.
After this, the labels of columns 0 to 1043 should be renamed x1, x2, x3, ..., x1043, and the labels of columns 1044 to 2087 should be renamed y1, y2, y3, .... The updated dataframe should look as follows:
X0 X1 X2 .... X1042 X1043 Y0 Y1 Y2 .... Y2086 Y2087
0 32.5 19.4 66.6 ... 10.4 77.2 13.3 85.3 22.4 ... 90.2 14.5
1 22.4 91.7 57.1 ... 81.5 46.7 75.7 47.1 91.4 ... 38.7 78.3
.
.
.
.
10 56.7 58.2 81.5 ... 43.4 76.2 76.8 19.4 66.6 ... 60.4 96.4
How can I achieve these two tasks? I have tried a lot (using the append/concat/join functions) but have not succeeded so far.
This needs some manipulation of the column names and a trick with the index, but it can be done. Pay attention to the behavior when the number of columns is odd. Note that the dummy data below has 1044 rows and 20 columns, so the code pairs up columns rather than rows.
import numpy as np
import pandas as pd
# Create dummy data
df = pd.DataFrame(np.random.randint(0, 100, size=(1044, 20)))
def pairwise(it):
    a = iter(it)
    return zip(a, a)
l = [l for l, r in pairwise(df.columns)]
r = [r for l, r in pairwise(df.columns)]
# Rename columns of one of the DFs otherwise the concat will overlap them
dfr = df[r].rename(columns={r: l for l, r in pairwise(df.columns)})
# Concat vertically, ignore index to have them stacked sequentially
df = pd.concat([df[l], dfr], axis=0, ignore_index=True)
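For the orientation described in the question (20 rows, 1044 columns, pairs of rows joined side by side), a minimal sketch could look like the following. It assumes an even number of rows, and the x1.../y1... labels are one reading of the naming the question asks for; the names left, right and wide are only for illustration.

```python
import numpy as np
import pandas as pd

# Dummy data in the question's orientation: 20 rows, 1044 columns.
df = pd.DataFrame(np.random.randint(0, 100, size=(20, 1044)))
n_cols = df.shape[1]

# 1st, 3rd, 5th, ... rows and 2nd, 4th, 6th, ... rows, re-indexed 0..9
# so that each pair lines up on the same index label.
left = df.iloc[0::2].reset_index(drop=True)
right = df.iloc[1::2].reset_index(drop=True)

# Assumed labelling: x1..x1044 for the first half, y1..y1044 for the second.
left.columns = [f"x{i}" for i in range(1, n_cols + 1)]
right.columns = [f"y{i}" for i in range(1, n_cols + 1)]

# Join the halves side by side.
wide = pd.concat([left, right], axis=1)
print(wide.shape)
```

With an odd number of rows, the last row of left would have no partner and its y columns would be filled with NaN.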

Creating a BMI table

I'm trying to create a BMI table with a column for height from 58 inches to 76 inches in 2-inch increments and a row for weight from 100 pounds to 250 pounds in 10-pound increments. I've got the row and the column, but I can't figure out how to calculate the different BMIs within the table.
This is my code:
header = '\t{}'.format('\t'.join(map(str, range(100, 260, 10))))
rows = []
for i in range(58, 78, 2):
    row = '\t'.join(map(str, (bmi for q in range(1, 17))))
    rows.append('{}\t{}'.format(i, row))
print(header + '\n' + '\n'.join(rows))
This is the output:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58
60
62
64
66
68
70
72
74
76
What I'm trying to do is fill in the chart. For example, a height of 58 inches and 100 pounds is a BMI of 22.4. A height of 58 inches and 110 pounds is 24.7, and so on.
I'm not sure how you got your expected results of 22.4 and 24.7, but if you define BMI to be weight [lb] / (height [in])^2 * 703, you could do something like the following:
In [16]: weights = range(100, 260, 10)
    ...: header = '\t' + '\t'.join(map(str, weights))
    ...: rows = [header]
    ...: for height in range(58, 78, 2):
    ...:     row = '\t'.join(f'{weight/height**2*703:.1f}' for weight in weights)
    ...:     rows.append(f'{height}\t{row}')
    ...: print('\n'.join(rows))
    ...:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58 20.9 23.0 25.1 27.2 29.3 31.3 33.4 35.5 37.6 39.7 41.8 43.9 46.0 48.1 50.2 52.2
60 19.5 21.5 23.4 25.4 27.3 29.3 31.2 33.2 35.1 37.1 39.1 41.0 43.0 44.9 46.9 48.8
62 18.3 20.1 21.9 23.8 25.6 27.4 29.3 31.1 32.9 34.7 36.6 38.4 40.2 42.1 43.9 45.7
64 17.2 18.9 20.6 22.3 24.0 25.7 27.5 29.2 30.9 32.6 34.3 36.0 37.8 39.5 41.2 42.9
66 16.1 17.8 19.4 21.0 22.6 24.2 25.8 27.4 29.0 30.7 32.3 33.9 35.5 37.1 38.7 40.3
68 15.2 16.7 18.2 19.8 21.3 22.8 24.3 25.8 27.4 28.9 30.4 31.9 33.4 35.0 36.5 38.0
70 14.3 15.8 17.2 18.7 20.1 21.5 23.0 24.4 25.8 27.3 28.7 30.1 31.6 33.0 34.4 35.9
72 13.6 14.9 16.3 17.6 19.0 20.3 21.7 23.1 24.4 25.8 27.1 28.5 29.8 31.2 32.5 33.9
74 12.8 14.1 15.4 16.7 18.0 19.3 20.5 21.8 23.1 24.4 25.7 27.0 28.2 29.5 30.8 32.1
76 12.2 13.4 14.6 15.8 17.0 18.3 19.5 20.7 21.9 23.1 24.3 25.6 26.8 28.0 29.2 30.4
What's probably keeping you down in your own code is the for q in range(1, 17) which you'll want to turn into your weights instead; you could just replace it with for q in range(100, 260, 10) and use the formula directly if you liked, but here we just avoid the duplication by introducing weights.
First of all, you should remove the indentation from the print statement at the end. With the indentation, the code prints out one table each time a row is added. Secondly, the snippet of code you will want to change is
(bmi for q in range(1, 17))
Since BMI is a function of mass and height, I would change your iterator i to height, q to mass, and range(1, 17) to range(100, 260, 10). This is to improve readability. Then, replace bmi with an expression using mass and height that returns bmi. For example,
(mass*height for mass in range(100, 260, 10))
I don't believe BMI=mass*height, but replace this with the real formula.
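As a rough sketch of that substitution applied to the original code structure, assuming the pounds-and-inches definition weight / height^2 * 703 (the helper name bmi and its parameter names are only for illustration):

```python
def bmi(mass_lb, height_in):
    """BMI from pounds and inches, assuming the 703 conversion factor."""
    return mass_lb / height_in ** 2 * 703

weights = range(100, 260, 10)
header = '\t' + '\t'.join(map(str, weights))
rows = []
for height in range(58, 78, 2):
    # One formatted BMI value per weight column
    row = '\t'.join(f'{bmi(mass, height):.1f}' for mass in weights)
    rows.append(f'{height}\t{row}')

# Printed once, outside the loop, so only one table appears
print(header + '\n' + '\n'.join(rows))
```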

How to plot a (grouped) bar chart from a dataframe using pandas

I have a dataframe with the following data
T1 SO DR AX NO Overig
SK1 20.2 21.7 27 22.4 22.6 25
PA 20.2 21.7 21.6 20.4 17.7 25.0
T4 30.8 30.0 24.3 28.6 32.3 0.0
XXS 7.7 10.0 10.8 8.2 9.7 25.0
MvM 20.2 16.7 13.5 18.4 14.5 25.0
ACH 1.0 0.0 2.7 2.0 3.2 0.0
With a specified index and columns.
I need a bar chart for just the columns T1, SO, and DR, with the index name on the x-axis, and the values of the index for the three columns on the y-axis. In this case the total of bars will be 6*3 = 18.
I have tried the following:
df.T.plot(kind='bar')
tevr_asp[['T1','SO','DR']].T.plot.bar()
You can use the DataFrame plot function to render specific columns on the y-axis, and with use_index=True the index is rendered on the x-axis.
df.plot(y=["T1", "SO", "DR"],use_index=True, kind="bar")
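A self-contained sketch of the same idea, with the three columns from the question typed in by hand. The Agg backend is an assumption made here so the snippet runs without a display; in an interactive session that line can be dropped and plt.show() used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import pandas as pd

# The T1, SO and DR columns from the question, with its index labels.
df = pd.DataFrame(
    {
        "T1": [20.2, 20.2, 30.8, 7.7, 20.2, 1.0],
        "SO": [21.7, 21.7, 30.0, 10.0, 16.7, 0.0],
        "DR": [27.0, 21.6, 24.3, 10.8, 13.5, 2.7],
    },
    index=["SK1", "PA", "T4", "XXS", "MvM", "ACH"],
)

# One group of bars per index label, one bar per selected column.
ax = df.plot(y=["T1", "SO", "DR"], use_index=True, kind="bar")
ax.set_ylabel("value")
```

Plotting without the transpose keeps the index labels on the x-axis; calling .T first, as in the attempts above, puts the column names there instead.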

correlation matrix between cities

I want to find the correlation between cities based on their rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that explains how to deal with repeated cities that have different data,
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of OP), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall observations as rows (note that you lose information here, because every series is cut down to the length of the shortest rainfall series)
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Doing correlation matrix on the obtained dataset
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
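The same two steps can be sketched without hard-coding the number of cities or observations, using groupby().cumcount() to number the observations within each city. This is only an alternative arrangement of the idea above; the column name obs is made up for illustration, and the data is the Location/Rainfall part of the question's sample (first three Woomera rows only).

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["Albury"] * 3 + ["Brisbane"] * 6 + ["Tuggeranong"] * 5 + ["Woomera"] * 3,
    "Rainfall": [0.6, 0, 0,
                 9.6, 7.8, 12.4, 0, 0.2, 0.6,
                 0.4, 2.8, 0, 0, 0.6,
                 0, 12.8, 0],
})

# Number the observations within each city: 0, 1, 2, ...
df["obs"] = df.groupby("Location").cumcount()

# One column per city, one row per observation number.
wide = df.pivot(index="obs", columns="Location", values="Rainfall")

# Keep only observation numbers present for every city,
# which truncates all series to the length of the shortest one.
wide = wide.dropna()

corr = wide.corr()
print(corr)
```

As with the version above, this pairs observations by position only, not by date.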
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with the mean or median.
But even though you would feed more data into your algorithm, this won't cure the main problem: your data seem to be misaligned. What I mean is that, to do correlation analysis properly, you should make sure that you compare comparable values, e.g. summer rainfall in one city with summer rainfall in another city. To do the analysis this way, you should make sure you have an equal number of comparable rainfall values for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.

How can I fill my dataframe

Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like "..". How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index are the countries.
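It seems you can use mask: compare against the NumPy array returned by values, replace the matches with the row mean, and finally cast all columns to float. On recent pandas versions the row mean needs numeric_only=True, because the 1971 column holds strings:
print(df.mean(axis=1, numeric_only=True))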
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1, numeric_only=True), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
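You should be able to use .set_value:
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia','1971', 50)
Note that set_value has been removed from current pandas versions; the equivalent there is
df_name.at['Estonia', '1971'] = 50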
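Another way to get the row-mean fill, sketched here on the assumption that ".." is the only non-numeric marker in the data: convert everything to numbers first, so the placeholders become NaN, then fill each row's gaps with that row's mean. The names num and filled are only for illustration.

```python
import pandas as pd

# The table from the question; the 1971 column is read as text because of "..".
df = pd.DataFrame(
    {
        "1971": ["..", "61.6", "10.9", ".."],
        "1990": [17.4, 151.2, 25.5, 12.4],
        "1999": [8.3, 205.9, 28.1, 13.3],
        "2000": [8.5, 222.2, 30.8, 13.6],
        "2001": [8.5, 233.2, 31.9, 14.5],
        "2002": [8.6, 241.6, 32.2, 14.6],
    },
    index=["Estonia", "Spain", "SlovakRepublic", "Slovenia"],
)

# Make every column numeric; anything unparseable (here "..") becomes NaN.
num = df.apply(pd.to_numeric, errors="coerce")

# Fill each row's NaN values with the mean of that row's remaining values.
filled = num.apply(lambda row: row.fillna(row.mean()), axis=1)
print(filled)
```

Be aware that errors="coerce" silently turns any other unparseable text into NaN as well, so it is worth checking the raw data first.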
