I have a dataframe with the following data:
T1 SO DR AX NO Overig
SK1 20.2 21.7 27 22.4 22.6 25
PA 20.2 21.7 21.6 20.4 17.7 25.0
T4 30.8 30.0 24.3 28.6 32.3 0.0
XXS 7.7 10.0 10.8 8.2 9.7 25.0
MvM 20.2 16.7 13.5 18.4 14.5 25.0
ACH 1.0 0.0 2.7 2.0 3.2 0.0
with a specified index and columns.
I need a bar chart for just the columns T1, SO, and DR, with the index labels on the x-axis and, for each label, the values of the three columns on the y-axis. In this case the total number of bars will be 6*3 = 18.
I have tried the following:
df.T.plot(kind='bar')
tevr_asp[['T1','SO','DR']].T.plot.bar()
You can use the DataFrame plot function to render specific columns on the y-axis, and with use_index=True you can render the index on the x-axis.
df.plot(y=["T1", "SO", "DR"],use_index=True, kind="bar")
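For completeness, a minimal runnable sketch (assuming matplotlib is available) that rebuilds the example frame and draws the 6*3 = 18 bars:
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the example frame with an explicit index.
df = pd.DataFrame(
    {
        "T1": [20.2, 20.2, 30.8, 7.7, 20.2, 1.0],
        "SO": [21.7, 21.7, 30.0, 10.0, 16.7, 0.0],
        "DR": [27.0, 21.6, 24.3, 10.8, 13.5, 2.7],
    },
    index=["SK1", "PA", "T4", "XXS", "MvM", "ACH"],
)

# One group of three bars per index label.
df.plot(y=["T1", "SO", "DR"], use_index=True, kind="bar")
plt.show()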
I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data, like:
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of the OP), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall as rows (note that you will lose information here, truncating every series to the shortest one):
df2 = df.groupby("Location")[["Location", "Rainfall"]].head(3)  # head(3) keeps the first 3 observations per city
df2.loc[:, "col"] = 4 * ["x1", "x2", "x3"]  # 4 is the number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Compute the correlation matrix on the obtained dataset:
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute missing values with the mean or median.
But even though you would feed more data into your algorithm, it won't cure the main problem: your data seem to be misaligned. What I mean by this is that to do correlation analysis properly, you should make sure you compare comparable values, e.g. rainfall for summer in one city with rainfall for summer in another city. To do the analysis this way, you should make sure you have an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
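A minimal sketch of that alignment idea, correlating monthly mean rainfall instead of raw rows (this assumes the Date column is in month/day/year format, as the sample suggests, and that the cities actually overlap in time, which the small sample above does not):
# Align the series on calendar month before correlating.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
monthly = df.assign(month=df["Date"].dt.to_period("M")).pivot_table(
    index="month", columns="Location", values="Rainfall", aggfunc="mean")
monthly.corr()  # correlations between comparable (same-month) values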
Continuing on my previous question link (things are explained there), I have now obtained an array. However, I don't know how to use this array, but that is a further question. The point of this question is: there are NaN values in the 63 x 2 array that I created, and I want the rows with NaN values deleted so that I can use the data (once I ask another question on how to graph and export it as x, y arrays).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
The sample of the .csv file is located in the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.
Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like AttributeError: 'list' object has no attribute 'dropna'. That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
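For illustration, a tiny self-contained reproduction of the mismatch (hypothetical data, not your csv):
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

wrapped = [df.iloc[:, [0, 1]]]    # a list containing a DataFrame
# wrapped.dropna()                # AttributeError: 'list' object has no attribute 'dropna'

unwrapped = df.iloc[:, [0, 1]]    # a plain DataFrame
print(unwrapped.dropna())         # keeps only the rows without NaN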
Though an answer has already been given, I would like to put some thoughts across.
Importing your DataFrame, taking the example dataset you provided in your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good to clean the data beforehand and then process it as desired, so dropping the NA values during the import itself is significantly useful:
>>> df = pd.read_csv("so.csv").dropna()  # dropping the NA values at import time
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
and lastly slice your DataFrame as you wish (keeping the list wrapper from your original code):
>>> df = [df.iloc[:, [0, 1]]]
# new_df = [df.iloc[:, [0, 1]]]  # use a new name if you don't want to alter the actual DataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better Solution:
Looking at the end result, I see you are only concerned with two particular columns, 'time' and '1mnaoh trial 1'. Ideally, then, you would use the usecols option, which reduces the memory footprint because only the columns you actually need are read, and then use dropna(), which I believe gives you what you wanted.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1
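And since the question mentions wanting x, y arrays for graphing later, a short follow-on sketch (to_numpy() assumes pandas >= 0.24; on older versions .values does the same):
x = df['time'].to_numpy()            # x array
y = df['1mnaoh trial 1'].to_numpy()  # y array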
Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like '..'. How would I go about filling them in with the mean of the row that they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index are the countries.
It seems you can use mask: compare against the NumPy array created by values, replace the matched cells with the row means, and lastly cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
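Note that df.mean(axis=1) quietly skipping the non-numeric '..' cells relies on older pandas behaviour. A sketch of an equivalent approach that is explicit about the conversion: turn '..' into real NaN, cast to float, then fill each row with its own mean:
import numpy as np

df = df.replace('..', np.nan).astype(float)
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)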
You should be able to use .set_value:
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia', '1971', 50)
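Note that set_value was deprecated in pandas 0.21 and later removed; on current versions the equivalent is .at:
df_name.at['Estonia', '1971'] = 50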
I have a file which has different readings in columns. I am able to find the daily mean by using the following code, which works when the month, day, and time are separated by spaces. I am just wondering how I can do the same if I have the date (day, month, year) in the first column and the time in the second column. How can I calculate weekly, daily, and monthly averages? Please note, the data is not equally sampled.
import pandas as pd
import numpy as np
df = pd.read_csv("Data_set.csv", sep='\s*', names=["month", "day", "time", "Temperature"])
group=df.groupby(["month","day"])
daily=group.aggregate({"Temperature":np.mean})
daily.to_csv('daily.csv')
Date Time T1 T2 T3
17/12/2013 00:28:38 19 23.1 7.3
17/12/2013 00:58:38 19 22.9 7.3
17/12/2013 01:28:38 18.9 22.8 6.3
17/12/2013 01:58:38 18.9 23.1 6.3
17/12/2013 02:28:38 18.8 23 6.3
.....
.....
24/12/2013 19:58:21 14.7 15.5 7
24/12/2013 20:28:21 14.7 15.5 7
24/12/2013 20:58:21 14.7 15.5 7
24/12/2013 21:28:21 14.7 15.6 6
24/12/2013 21:58:21 14.7 15.5 6
24/12/2013 22:28:21 14.7 15.5 5
24/12/2013 22:58:21 14.7 15.5 4
I have gridded satellite data stored in a dataframe. Normally, this dataframe gets sliced to make imshow plots on a day-by-day basis, which is trivial. However, I would like to plots annual means of the data, which is where I am currently stuck. The dataframe has a multi-level index (datetime, latitude coordinate) with columns making up the longitude coordinates.
import pandas as pd, numpy as np
dates = pd.date_range('20140101',periods=10,freq='1D')
others = np.arange(0,5)
index = [(d,o) for o in others for d in dates]
index = pd.MultiIndex.from_tuples(index, names=['DATES','LAT'])
data = np.random.randint(0,20,(50,10))
df = pd.DataFrame(data=data,index=index,columns=np.arange(0,10))
df.columns.names = ['LON']
If I were using arrays I would normally stack them along the third dimension and then average on the third dimension. e.g.
mat = np.ones( (5,10,1) )
# stack on day-by-day basis so lat/lon pairs sit on top of each other
# on the third dimension
for heute in df.index.get_level_values(0).unique():
tmp = df.xs(heute, level=0)
mat = np.dstack( (mat,tmp.as_matrix()) )
ave = mat[:,:,1:].mean(axis=2)
While this would work, I suspect there is a method of doing this within Pandas. However, for this I do not know where to start. I have played around with groupby and resample, but I have been unable to make those work. Any help would be appreciated.
Here we go:
import pandas as pd, numpy as np
pd.set_option('display.float_format',lambda x: '{:,.1f}'.format(x))
np.random.seed(1)
dates = pd.date_range('20140101',periods=10,freq='1D')
others = np.arange(0,5)
index = [(d,o) for o in others for d in dates]
index = pd.MultiIndex.from_tuples(index, names=['DATES','LAT'])
data = np.random.randint(0,20,(50,10))
df = pd.DataFrame(data=data,index=index,columns=np.arange(0,10))
df.columns.names = ['LON']
# answer
df = df.stack()
df = df.groupby(level=['LAT', 'LON']).mean()
print(df.unstack(level=['LON']))
which yields:
LON 0 1 2 3 4 5 6 7 8 9
LAT
0 8.8 8.5 10.8 9.2 9.0 10.8 9.3 9.3 7.6 9.1
1 10.6 8.5 10.6 12.2 8.0 8.8 9.5 11.3 10.8 9.5
2 11.0 10.3 8.2 11.2 9.9 8.4 13.5 9.7 7.8 9.0
3 8.1 6.2 8.8 12.6 10.6 7.1 8.8 9.3 11.7 10.2
4 9.1 10.1 7.8 8.7 7.4 7.3 10.2 11.9 8.3 11.9
Whilst your array approach yields:
[[ 8.8 8.5 10.8 9.2 9. 10.8 9.3 9.3 7.6 9.1]
[ 10.6 8.5 10.6 12.2 8. 8.8 9.5 11.3 10.8 9.5]
[ 11. 10.3 8.2 11.2 9.9 8.4 13.5 9.7 7.8 9. ]
[ 8.1 6.2 8.8 12.6 10.6 7.1 8.8 9.3 11.7 10.2]
[ 9.1 10.1 7.8 8.7 7.4 7.3 10.2 11.9 8.3 11.9]]
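As a side note, because LON already forms the columns, the same table of means can be produced in one step, without the stack/unstack round trip (a sketch, equivalent for this layout):
# Average over the DATES level for every LAT row; LON stays on the columns.
ave = df.groupby(level='LAT').mean()
print(ave)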