Using NumPy, I tried creating an array from one of the columns of a dataframe. However, the shape of the array I created is (48,), where 48 is the number of rows, instead of the (48, 1) I expected. Why is this the case? I thought any array created from a pandas dataframe had to have a defined number of rows and columns.
Below is the relevant code, output, and dataset represented by df
y = df.iloc[:, -1]
a = y.shape  # Output is (48,)
00 0 1
0 1 0.0 45.0
1 1 0.0 48.0
2 1 0.5 67.0
3 1 1.5 59.5
4 1 1.5 62.4
5 1 1.5 84.4
6 1 1.5 82.0
7 1 1.5 79.5
8 1 3.0 64.8
9 1 3.0 67.4
10 1 3.0 82.6
11 1 3.0 78.2
12 1 3.0 80.4
13 1 3.5 71.3
14 1 3.5 70.5
15 1 3.5 75.0
16 1 3.5 80.9
17 1 3.5 83.2
18 1 4.0 78.4
19 1 4.0 74.2
20 1 4.0 81.5
21 1 4.0 68.9
22 1 4.5 68.3
23 1 4.5 78.5
24 1 4.5 75.9
25 1 4.5 81.6
26 1 4.5 83.2
27 1 4.5 86.1
28 1 4.5 87.4
29 1 5.0 72.8
30 1 5.0 75.0
31 1 5.0 75.6
32 1 5.0 79.3
33 1 5.0 82.4
34 1 5.0 86.3
35 1 5.0 90.2
36 1 5.0 93.4
37 1 5.5 79.5
38 1 5.5 81.4
39 1 5.5 83.2
40 1 5.5 85.7
41 1 5.5 91.4
42 1 5.5 98.5
43 1 5.5 94.3
44 1 6.0 81.2
45 1 6.0 85.4
46 1 6.0 91.0
47 1 6.0 94.3
The result is a 1D array. If its length is N, it can be represented as an N-by-1 or a 1-by-N vector, as we were taught in linear algebra class, but that representation has some issues we don't want to deal with in code.
Issue 1. We would need to decide whether the answer is an N-by-1 or a 1-by-N vector and stick with it. Sometimes one orientation is preferable to the other, which forces extra conversions.
Issue 2. If the shape is (1, N) or (N, 1), we have to access elements with two indexes, for example arr[0, N-1] or arr[N-1, 0]. That is confusing -- after all, it is conceptually a 1D vector, and a single index should suffice: arr[N-1]. In linear algebra notation that would mean the shape is (N), which looks awkward; a shape is a tuple, and a one-element tuple is written with a trailing comma, as (N,).
This convention resolves both issues: the array works in matrix multiplication from either the right or the left, and its elements are accessed with a single index.
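If you do want the column as an explicit (48, 1) structure, you can either select it with a list of positions, which keeps it a two-dimensional DataFrame, or reshape the underlying array. A minimal sketch reusing the df from the question:
y = df.iloc[:, -1]                   # Series, shape (48,)
y_2d = df.iloc[:, [-1]]              # DataFrame, shape (48, 1)
y_col = y.to_numpy().reshape(-1, 1)  # NumPy ndarray, shape (48, 1)
print(y.shape, y_2d.shape, y_col.shape)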
I used an example dataset which I load into a dataframe. I then use a statsmodels OLS comparing Texture as a result of Mix and then use that model for an ANOVA table.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('contrastExampleData.csv')
mod = ols(formula = 'Texture ~ Mix', data = df).fit()
aov_table = sm.stats.anova_lm(mod, typ = 1)
print(aov_table)
If it's preferred that I upload the csv and link it, please let me know.
The dataframe:
Mix Blend Flour SPI Texture
0 1 0.5 KSS 1.1 107.3
1 1 0.5 KSS 1.1 110.1
2 1 0.5 KSS 1.1 112.6
3 2 0.5 KSS 2.2 97.9
4 2 0.5 KSS 2.2 100.1
5 2 0.5 KSS 2.2 102.0
6 3 0.5 KSS 3.3 86.8
7 3 0.5 KSS 3.3 88.1
8 3 0.5 KSS 3.3 89.1
9 4 0.5 KNC 1.1 108.1
10 4 0.5 KNC 1.1 110.1
11 4 0.5 KNC 1.1 111.8
12 5 0.5 KNC 2.2 108.6
13 5 0.5 KNC 2.2 110.2
14 5 0.5 KNC 2.2 111.2
15 6 0.5 KNC 3.3 95.0
16 6 0.5 KNC 3.3 95.4
17 6 0.5 KNC 3.3 95.5
18 7 1.0 KSS 1.1 97.3
19 7 1.0 KSS 1.1 99.1
20 7 1.0 KSS 1.1 100.6
21 8 1.0 KSS 2.2 92.8
22 8 1.0 KSS 2.2 94.6
23 8 1.0 KSS 2.2 96.7
24 9 1.0 KSS 3.3 86.8
25 9 1.0 KSS 3.3 88.1
26 9 1.0 KSS 3.3 89.1
27 10 1.0 KNC 1.1 94.1
28 10 1.0 KNC 1.1 96.1
29 10 1.0 KNC 1.1 97.8
30 11 1.0 KNC 2.2 95.7
31 11 1.0 KNC 2.2 97.6
32 11 1.0 KNC 2.2 99.8
33 12 1.0 KNC 3.3 90.2
34 12 1.0 KNC 3.3 92.1
35 12 1.0 KNC 3.3 93.7
Resulting in output:
df sum_sq mean_sq F PR(>F)
Mix 1.0 520.080472 520.080472 10.828726 0.002334
Residual 34.0 1632.947028 48.027854 NaN NaN
However, this is entirely incorrect - the correct ANOVA table can be seen here. At first glance, the degrees of freedom should be 11 instead of 1, given that there are 12 levels of Mix, but I cannot figure out why this has happened. I've done similar analyses with simpler datasets of only two columns and haven't had an issue. I've attempted to use sm.OLS and other approaches but haven't had much luck. What is the issue that is resulting in an incorrect ANOVA?
This question is effectively answered by this R question, as statsmodels uses R-style formulas. I found this just after posting and wanted to update for others with similar questions in Python.
The solution is to treat the independent variable as categorical rather than numeric, since "Mix" here is not a continuous numerical variable but 12 discrete labels. This is done by:
mod = ols(formula = 'Texture ~ C(Mix)', data = df).fit()
which results in the correct ANOVA table:
            df     sum_sq     mean_sq          F        PR(>F)
C(Mix)    11.0  2080.2875  189.117045  62.397705  6.550053e-15
Residual  24.0    72.7400    3.030833        NaN           NaN
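An equivalent approach (a sketch, not part of the original answer) is to make Mix a categorical column in the dataframe itself, so that the plain formula picks it up as a factor:
df['Mix'] = df['Mix'].astype('category')  # 12 discrete mix labels, not a numeric predictor
mod = ols(formula = 'Texture ~ Mix', data = df).fit()
print(sm.stats.anova_lm(mod, typ = 1))    # should match the C(Mix) table above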
I have 7 dataframes (df_1, df_2, df_3, ..., df_7), all with the same columns but different lengths, which sometimes contain the same values.
I'd like to concatenate all 7 dataframes under the conditions that:
if df_n.iloc[row_i] != df_n+1.iloc[row_i] and df_n.iloc[row_i][0] < df_n+1.iloc[row_i][0]:
pd.concat([df_n.iloc[row_i], df_n+1.iloc[row_i], df_n+2.iloc[row_i],
...., df_n+6.iloc[row_i]])
Where df_n.iloc[row_i] is the ith row of the nth dataframe and df_n.iloc[row_i][0] is the first column of the ith row.
For example, if we only had 2 dataframes, with len(df_1) < len(df_2), and we used the conditions above, the input would be:
df_1 df_2
index 0 1 2 index 0 1 2
0 12.12 11.0 31 0 12.2 12.6 30
1 12.3 12.1 33 1 12.3 12.1 33
2 10 9.1 33 2 13 12.1 23
3 16 12.1 33 3 13.1 12.1 27
4 14.4 13.1 27
5 15.2 13.2 28
And the output would be:
conditions -> pd.concat([df_1, df_2]):
index 0 1 2 3 4 5
0 12.12 11.0 31 12.2 12.6 30
2 10 9.1 33 13 12.1 23
4 nan 14.4 13.1 27
5 nan 15.2 13.2 28
Is there an easy way to do this?
IIUC, concat first, then groupby over the column labels to get the per-column differences, and apply your condition:
s = pd.concat([df1, df2], axis=1)
s1 = s.groupby(level=0, axis=1).apply(lambda x: x.iloc[:, 0] - x.iloc[:, 1])
yourdf = s[s1.ne(0).any(axis=1) & s1.iloc[:, 0].lt(0) | s1.iloc[:, 0].isnull()]
Out[487]:
0 1 2 0 1 2
index
0 12.12 11.0 31.0 12.2 12.6 30
2 10.00 9.1 33.0 13.0 12.1 23
4 NaN NaN NaN 14.4 13.1 27
5 NaN NaN NaN 15.2 13.2 28
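Note that groupby(level=0, axis=1) is deprecated in recent pandas versions. A rough equivalent sketch of the same idea, assuming both frames share the same column labels, subtracts the frames directly, since arithmetic aligns on both index and columns:
s = pd.concat([df1, df2], axis=1)
diff = df1 - df2  # NaN wherever df1 has no matching row
mask = diff.ne(0).any(axis=1) & diff.iloc[:, 0].lt(0) | diff.iloc[:, 0].isnull()
yourdf = s[mask]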
Given a DataFrame df that looks roughly like this:
TripID time Latitude SectorID sector_leave_time
0 42 7 52.5 5 8
1 42 8 52.6 5 8
2 42 9 52.7 6 10
3 42 10 52.8 6 10
4 5 9 50.1 2 10
5 5 10 50.0 2 10
6 5 11 49.9 1 12
7 5 12 49.8 1 12
I already computed the time at which a trip leaves a sector by getting the maximum timestamp within the sector. Now, I would like to add another column for the latitude at the point of sector_leave_time for each trip and sector, so the DataFrame becomes this:
TripID time Latitude SectorID sector_leave_time sector_leave_lat
0 42 7 52.5 5 8 52.6
1 42 8 52.6 5 8 52.6
2 42 9 52.7 6 10 52.8
3 42 10 52.8 6 10 52.8
4 5 9 50.1 2 10 50.0
5 5 10 50.0 2 10 50.0
6 5 11 49.9 1 12 49.8
7 5 12 49.8 1 12 49.8
So far I've only been able to add the sector_leave_lat to the line where time == sector_leave_time, i.e. when the trip leaves the sector, using the following line of code:
df['sector_leave_lat'] = df.groupby('TripID').apply(lambda x : x.loc[x['time'] == x['sector_leave_time'], 'Latitude']).reset_index().set_index('level_1')['Latitude']
I know this line looks awful and I would like to add sector_leave_lat to all entries of that trip within that sector. I'm kind of running out of ideas, so I hope someone may be able to help.
The problem is not that complicated if you are familiar with SQL :)
The following code should do the trick:
# Given your dataframe:
df
TripID time Latitude SectorID sector_leave_time
0 42.0 7.0 52.5 5.0 8.0
1 42.0 8.0 52.6 5.0 8.0
2 42.0 9.0 52.7 6.0 10.0
3 42.0 10.0 52.8 6.0 10.0
4 5.0 9.0 50.1 2.0 10.0
5 5.0 10.0 50.0 2.0 10.0
6 5.0 11.0 49.9 1.0 12.0
7 5.0 12.0 49.8 1.0 12.0
# Get the Latitude corresponding to time = sector_leave_time
df_max_lat = df.loc[df.time == df.sector_leave_time, ['TripID', 'Latitude', 'SectorID']]
# Then you have:
TripID Latitude SectorID
1 42.0 52.6 5.0
3 42.0 52.8 6.0
5 5.0 50.0 2.0
7 5.0 49.8 1.0
# Add the sector-leave latitude to the original dataframe with a left join
pd.merge(df, df_max_lat, on=['TripID', 'SectorID'], how='left', suffixes=('','_sector_leave'))
# You're getting:
TripID time Latitude SectorID sector_leave_time Latitude_sector_leave
0 42.0 7.0 52.5 5.0 8.0 52.6
1 42.0 8.0 52.6 5.0 8.0 52.6
2 42.0 9.0 52.7 6.0 10.0 52.8
3 42.0 10.0 52.8 6.0 10.0 52.8
4 5.0 9.0 50.1 2.0 10.0 50.0
5 5.0 10.0 50.0 2.0 10.0 50.0
6 5.0 11.0 49.9 1.0 12.0 49.8
7 5.0 12.0 49.8 1.0 12.0 49.8
There you go :)
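One caveat not covered above: the left join assumes df_max_lat has a single row per (TripID, SectorID) pair. If several rows could share the maximum time within a sector, a defensive drop_duplicates before the merge keeps the join from multiplying rows (a sketch, keeping the first match):
df_max_lat = df_max_lat.drop_duplicates(subset=['TripID', 'SectorID'], keep='first')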
For each trip-sector combination you want the last Latitude when sorted by time:
df['sector_leave_lat'] = df.sort_values('time').groupby(
['TripID', 'SectorID']
).transform('last')['Latitude']
outputs:
TripID time Latitude SectorID sector_leave_time sector_leave_lat
0 42 7 52.5 5 8 52.6
1 42 8 52.6 5 8 52.6
2 42 9 52.7 6 10 52.8
3 42 10 52.8 6 10 52.8
4 5 9 50.1 2 10 50.0
5 5 10 50.0 2 10 50.0
6 5 11 49.9 1 12 49.8
7 5 12 49.8 1 12 49.8
Since the sample data already appears to be sorted by time within each trip-sector group, the sorting here may be redundant.
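A slightly leaner variant of the same idea (a sketch) selects the Latitude column before transforming, so only that column is carried through the groupby:
df['sector_leave_lat'] = (
    df.sort_values('time')
      .groupby(['TripID', 'SectorID'])['Latitude']
      .transform('last')
)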
I have a pandas dataframe similar to this.
score avg
date
1/1/2017 0 0
1/2/2017 1 0.5
1/3/2017 2 1
1/4/2017 3 1.5
1/5/2017 4 2
1/6/2017 5 2.5
1/7/2017 6 3
1/8/2017 7 3.5
1/9/2017 8 4
1/10/2017 9 4.5
1/11/2017 10 5
1/12/2017 11 5.5
1/13/2017 12 7.5
1/14/2017 13 6.5
1/15/2017 14 7.5
1/16/2017 15 8.5
1/17/2017 16 9.5
1/18/2017 17 10.5
1/19/2017 18 11.5
1/20/2017 19 12.5
1/21/2017 20 13.5
1/22/2017 21 14.5
1/23/2017 22 15.5
1/24/2017 23 16.5
1/25/2017 24 17.5
1/26/2017 25 18.5
1/27/2017 26 19.5
1/28/2017 27 20.5
1/29/2017 28 21.5
Basically I am looking to create a 14 day rolling average of the data, but instead of showing NaNs for the first 14 days, simply showing the simple averages. For example, the average on day 2 is the average of day 1 and 2, the average on day 10 is the averages of days 1-10, etc. How would I go about doing this without having to manually create averages? Thanks for the help!
What you need is rolling with min_periods=1 as a parameter:
df['avg2'] = df.rolling(14, min_periods=1)['score'].mean()
Output:
date score avg avg2
0 2017-01-01 0 0.0 0.0
1 2017-01-02 1 0.5 0.5
2 2017-01-03 2 1.0 1.0
3 2017-01-04 3 1.5 1.5
4 2017-01-05 4 2.0 2.0
5 2017-01-06 5 2.5 2.5
6 2017-01-07 6 3.0 3.0
7 2017-01-08 7 3.5 3.5
8 2017-01-09 8 4.0 4.0
9 2017-01-10 9 4.5 4.5
10 2017-01-11 10 5.0 5.0
11 2017-01-12 11 5.5 5.5
12 2017-01-13 12 7.5 6.0
13 2017-01-14 13 6.5 6.5
14 2017-01-15 14 7.5 7.5
15 2017-01-16 15 8.5 8.5
16 2017-01-17 16 9.5 9.5
17 2017-01-18 17 10.5 10.5
18 2017-01-19 18 11.5 11.5
19 2017-01-20 19 12.5 12.5
20 2017-01-21 20 13.5 13.5
21 2017-01-22 21 14.5 14.5
22 2017-01-23 22 15.5 15.5
23 2017-01-24 23 16.5 16.5
24 2017-01-25 24 17.5 17.5
25 2017-01-26 25 18.5 18.5
26 2017-01-27 26 19.5 19.5
27 2017-01-28 27 20.5 20.5
28 2017-01-29 28 21.5 21.5
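Until the window fills up, rolling(14, min_periods=1) behaves exactly like an expanding mean; the two only diverge once more than 14 rows are available. A toy sketch with a 3-wide window to illustrate the difference:
import pandas as pd
s = pd.Series(range(5))
print(s.rolling(3, min_periods=1).mean().tolist())  # [0.0, 0.5, 1.0, 2.0, 3.0]
print(s.expanding().mean().tolist())                # [0.0, 0.5, 1.0, 1.5, 2.0]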
I am trying to do some analysis of rainfall data. An example of the data looks like this:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 T 3 12 T 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
The rainfall data contain the specific strings 'TRACE' and 'T' (both meaning a non-measurable rainfall amount). For analysis, I would like to convert these strings to 1.0 (float). My desired data should look like this, so that I can plot the values as a line plot:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 1.0 3.5 17 1.0 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 1.0 3 12 1.0 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
Can someone point me in the right direction?
You can use df.replace and then convert the numeric columns to float using df.astype (the original datatype is object, so any operations on these columns would otherwise suffer from performance issues):
df = df.replace('^T(RACE)?$', 1.0, regex=True)
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float) # converting object columns to floats
This replaces every cell that is exactly T or TRACE with 1.0.
Output:
10 18/05/2016 26.9 40 20.8 34.0 52.2 20.8 46.5 45.0
11 19/05/2016 25.5 32 0.3 41.6 42.0 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9.0 36.0 18.4 28.6 46.0
13 21/05/2016 24.5 18 1 3.5 17.0 1 4.4 40.0
14 22/05/2016 0.6 18 0 6.5 14.0 0 8.6 20.0
15 23/05/2016 3.5 9 0.6 4.3 14.0 0.6 7.0 15.0
16 24/05/2016 3.6 25 1 3.0 12.0 1 14.9 9.0
17 25/05/2016 25.0 21 2.2 25.6 50.0 2.2 25.0 9.0
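The anchors in the pattern matter: '^T(RACE)?$' only matches cells that are exactly T or TRACE, not values that merely start with T. A quick sketch with made-up values:
import pandas as pd
s = pd.Series(['T', 'TRACE', 'TOTAL', '2.2'])
print(s.replace('^T(RACE)?$', 1.0, regex=True).tolist())  # [1.0, 1.0, 'TOTAL', '2.2']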
Use replace with a dict:
df = df.replace({'T':1.0, 'TRACE':1.0})
Then, if necessary, convert the columns to float:
cols = df.columns.difference(['Date', 'other columns that do not need converting'])
df[cols] = df[cols].astype(float)
df = df.replace({'T':1.0, 'TRACE':1.0})
cols = df.columns.difference(['Date','a'])
df[cols] = df[cols].astype(float)
print (df)
a Date 2 3 4 5 6 7 8 9
0 10 18/05/2016 26.9 40.0 20.8 34.0 52.2 20.8 46.5 45.0
1 11 19/05/2016 25.5 32.0 0.3 41.6 42.0 0.3 56.3 65.2
2 12 20/05/2016 8.5 29.0 18.4 9.0 36.0 18.4 28.6 46.0
3 13 21/05/2016 24.5 18.0 1.0 3.5 17.0 1.0 4.4 40.0
4 14 22/05/2016 0.6 18.0 0.0 6.5 14.0 0.0 8.6 20.0
5 15 23/05/2016 3.5 9.0 0.6 4.3 14.0 0.6 7.0 15.0
6 16 24/05/2016 3.6 25.0 1.0 3.0 12.0 1.0 14.9 9.0
7 17 25/05/2016 25.0 21.0 2.2 25.6 50.0 2.2 25.0 9.0
print (df.dtypes)
a int64
Date object
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtype: object
Extending the answer from #jezrael, you can replace and convert to floats in a single statement (assumes the first column is Date and the remaining are the desired numeric columns):
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'T':1.0, 'TRACE':1.0}).astype(float)