I'm new to using pandas and running into an issue when trying to extract values from a pivot table.
pivot = sp.pivot_table(
    index="story_points",
    values=["duration"],
    aggfunc={"min", "mean", "median", "max", "count"},
)
df = pivot.iloc[:, "duration"]
zero = df.iloc[0, "median"]
one = df.iloc[1.0, "median"]
two = df.iloc[2.0, "median"]
three = df.iloc[3.0, "median"]
duration
count max mean median min
story_points_raw
1.0 227 185.125382 7.241453 2.190405 0.000139
2.0 144 68.849965 11.048451 6.106875 0.000058
3.0 45 126.131181 19.560792 13.558588 0.983241
4.0 16 49.043889 11.948981 6.932321 1.878125
5.0 5 128.653403 44.487847 15.489850 2.935891
8.0 8 79.792199 31.325800 17.299543 7.774792
I get an error on the line df = pivot.iloc[:, "duration"]. Is there possibly a better way to create the table to allow easier direct access to the values?
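One way to get at the values (a minimal sketch, assuming the columns look like the printed output above, i.e. a MultiIndex with "duration" on the top level): use label-based .loc indexing instead of .iloc, which only accepts integer positions.
# Select the "duration" column group by label; this assumes the pivot's columns
# are a MultiIndex with "duration" on the top level, as in the printed output.
df = pivot["duration"]
# .loc takes index labels (the story points) and column names directly,
# which is what the .iloc calls above were trying to do.
one = df.loc[1.0, "median"]
two = df.loc[2.0, "median"]
three = df.loc[3.0, "median"]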
I have a large dataset (around 8 million rows x 25 columns) in pandas and I am struggling to find a way to compute a weighted average on this dataframe, which in turn creates another dataframe.
Here is what my dataset looks like (a very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new dataframe that is basically a weighted average of this one. The requirements indicate that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another dataframe) should be weight-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a pure pandas way of solving it, so I went with a for-loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, but the performance and memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can remove the inner loop over mapped_location_ids by using isin:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
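Going a step further, the outer loop can also be dropped by expressing the combined-location mapping as its own dataframe and doing a single merge plus groupby. This is only a sketch under assumptions: it presumes a hypothetical mapping table with columns location_id, combined_location_id and weight, and value columns without NaNs.
# Hypothetical mapping: one row per (combined_location_id, location_id) pair with its weight.
mapping = pd.DataFrame({
    "combined_location_id": ["CL_1", "CL_1", "CL_2"],
    "location_id": [135, 136, 137],
    "weight": [0.4, 0.6, 1.0],
})
value_cols = ["prec", "temp"]
# Attach the mapping to every row, weight the value columns, then sum per group
# and divide by the summed weights to get the weighted averages.
merged = df_data.reset_index().merge(mapping, on="location_id")
weighted = merged[value_cols].multiply(merged["weight"], axis=0)
weighted[["combined_location_id", "hours", "weight"]] = merged[["combined_location_id", "hours", "weight"]]
sums = weighted.groupby(["combined_location_id", "hours"]).sum()
df_combined_location_data = sums[value_cols].div(sums["weight"], axis=0)
A single merge and groupby like this usually scales far better on millions of rows than per-location concatenation inside a loop.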
I have a pandas dataframe containing 16 columns, of which 14 represent variables on which I perform a looped ANOVA test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what I'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('features_total.csv')
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table printed for each variable the ANOVA is performed on. Basically what I want is to print one single table with the summarized results, something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem, because this method always outputs the 'Label' label before the actual values rather than the name of the variable in question (as shown above, I would like to have the variable name above each 'Residual' row). Is this even possible with the statsmodels approach?
I'm fairly new to Python, so excuse me if this has nothing to do with statsmodels; in that case, please point me toward what I should be trying.
You can collect the tables and concatenate them at the end of your loop. This method creates a hierarchical index, but I think that makes it a bit clearer. Something like this:
keys = []
tables = []
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding one, but considering you are performing numerous statistical tests, it makes sense to account for the probability that at least one of them results in a false positive.
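For instance (a sketch, not part of the original answer, assuming the df_anova table built above), the p-values can be adjusted with statsmodels' multipletests:
from statsmodels.stats.multitest import multipletests
# Pull the per-variable p-values for the Label factor out of the stacked table.
pvals = df_anova.xs('Label', level=1)['PR(>F)']
# Benjamini-Hochberg FDR correction; any method supported by multipletests works.
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')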
I have house rent price data as follows:
import pandas as pd
import numpy as np
data = {
    "HouseName": ["A", "A", "B", "B", "B"],
    "Type": ["OneRoom", "TwoRooms", "OneRoom", "TwoRooms", "ThreeRooms"],
    "Jan_S": [1100, 1776, 1228, 1640, np.NaN],
    "Feb_S": [1000, 1805, 1231, 1425, 1800],
    "Mar_S": [1033, 1748, 1315, 1591, 2900],
    "Jan_L": [1005, np.NaN, 1300, np.NaN, 7000]
}
df = pd.DataFrame.from_dict(data)
print(df)
HouseName Type Jan_S Feb_S Mar_S Jan_L
0 A OneRoom 1100.0 1000 1033 1005.0
1 A TwoRooms 1776.0 1805 1748 NaN
2 B OneRoom 1228.0 1231 1315 1300.0
3 B TwoRooms 1640.0 1425 1591 NaN
4 B ThreeRooms NaN 1800 2900 7000.0
I need to do two things. First, I want to find a reasonable rent price for January based on the columns 'Jan_S', 'Feb_S', 'Mar_S' and 'Jan_L'. Here S and L denote two different data sources; both may contain outliers and NaNs, but data from S is taken as the final January price with priority.
Second, for the same HouseName I need to check and make sure that the price of one room is lower than that of two rooms, and the price of two rooms is lower than that of three rooms.
My final results will look like this:
HouseName Type Jan_S Feb_S Mar_S Jan_L Result(Jan)
0 A OneRoom 1100.0 1000 1033 1005.0 1100
1 A TwoRooms 1776.0 1805 1748 NaN 1776
2 B OneRoom 1228.0 1231 1315 1300.0 1228
3 B TwoRooms 1640.0 1425 1591 NaN 1640
4 B ThreeRooms NaN 1800 2900 7000.0 1800
My idea is to check whether Jan_S is within the range 0.95 to 1.05 times Jan_L; if yes, take Jan_S as the final result, otherwise continue by checking the value from Feb_S in the same way.
Please share any ideas you might have for dealing with this problem in Python. Thanks!
Here are some references which may help:
Find nearest value from multiple columns and add to a new column in Python
Compare values under multiple conditions of one column in Python
Check if values in one column is in interval values of another column in Python
You can use fillna for this.
If you want a condition on the selection of columns, then you need to work out the logic for filtering the columns to pick the values from.
I'm showing the logic using the min of all the price columns:
# filter out the price columns
price_cols = df.columns[~df.columns.isin(['HouseName','Type', 'Jan_S'])]
# then figure out the logic to filter the columns you need and use fillna
# here with the min of all columns as example
df['Jan_S'] = df['Jan_S'].fillna(df[price_cols].min(axis=1))
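If you want the 5% band check described in the question instead of the min, here is a rough sketch of that logic with np.where (taking the stated rule literally; note it will not reproduce every row of the expected output above, e.g. rows 0 and 2, where Jan_S falls outside the 5% band but is still the expected result):
import numpy as np
# Take Jan_S when it lies within 5% of Jan_L, or when there is no Jan_L to compare against;
# otherwise fall back to the next available source (Feb_S, then Mar_S, then Jan_L).
within_band = df['Jan_S'].between(0.95 * df['Jan_L'], 1.05 * df['Jan_L'])
keep_jan_s = within_band | (df['Jan_L'].isna() & df['Jan_S'].notna())
fallback = df['Feb_S'].fillna(df['Mar_S']).fillna(df['Jan_L'])
df['Result(Jan)'] = np.where(keep_jan_s, df['Jan_S'], fallback)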
I've got two dataframes df1 and df2 that look like this:
#df1
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
#df2
counts freqs
categories
Straight Engine 18 0.5625
V engine 14 0.4375
Could anyone explain why pd.concat([df1, df2], axis = 0) will not give me this:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.5625
V engine 14 0.4375
Here is what I've tried:
1 - Using pd.concat()
I suspect that the way I've built these dataframes may be the source of the issue.
And here is how I've ended up with these particular dataframes:
# imports
import pandas as pd
from pydataset import data # pip install pydataset to get datasets from R
# load data
df_mtcars = data('mtcars')
# change dummyvariables to more describing variables:
df_mtcars['am'][df_mtcars['am'] == 0] = 'manual'
df_mtcars['am'][df_mtcars['am'] == 1] = 'automatic'
df_mtcars['vs'][df_mtcars['vs'] == 0] = 'Straight Engine'
df_mtcars['vs'][df_mtcars['vs'] == 1] = 'V engine'
# describe categorical variables
df1 = pd.Categorical(df_mtcars['am']).describe()
df2 = pd.Categorical(df_mtcars['vs']).describe()
I understand that 'categories' is what is causing the problems here, since df_con = pd.concat([df1, df2], axis = 0) raises this error:
TypeError: categories must match existing categories when appending
But it confuses me that this is ok:
# code
df_con = pd.concat([df1, df2], axis = 1)
# output:
counts freqs counts freqs
categories
automatic 13.0 0.40625 NaN NaN
manual 19.0 0.59375 NaN NaN
Straight Engine NaN NaN 18.0 0.5625
V engine NaN NaN 14.0 0.4375
2 - Using df.append() raises the same error as pd.concat()
3 - Using pd.merge() sort of works, but I'm losing the indexes:
# Code
df_merge = pd.merge(df1, df2, how = 'outer')
# Output
counts freqs
0 13 0.40625
1 19 0.59375
2 18 0.56250
3 14 0.43750
4 - Using pd.concat() on transposed dataframes
Since pd.concat() did at least run along one axis, I thought I would get there using transposed dataframes.
# df1.T
categories automatic manual
counts 13.00000 19.00000
freqs 0.40625 0.59375
# df2.T
categories Straight Engine V engine
counts 18.0000 14.0000
freqs 0.5625 0.4375
But still no success:
# code
df_con = pd.concat([df1.T, df2.T], axis = 1)
>>> TypeError: categories must match existing categories when appending
By the way, what I was hoping for here is this:
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 18.0000 14.0000
freqs 0.40625 0.59375 0.5625 0.4375
Still kind of works with axis = 0 though:
# code
df_con = pd.concat([df1.T, df2.T], axis = 0)
# Output
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 NaN NaN
freqs 0.40625 0.59375 NaN NaN
counts NaN NaN 18.0000 14.0000
freqs NaN NaN 0.5625 0.4375
But that is still far from what I'm trying to accomplish.
Now I'm thinking that it would be possible to strip the 'category' info from df1 and df2, but I haven't been able to find out how to do that yet.
Thank you for any other suggestions!
Try this:
pd.concat([df1.reset_index(),df2.reset_index()],ignore_index=True)
Output:
categories counts freqs
0 automatic 13 0.40625
1 manual 19 0.59375
2 Straight Engine 18 0.56250
3 V engine 14 0.43750
To get the categories back as the index, do this:
pd.concat([df1.reset_index(),df2.reset_index()],ignore_index=True).set_index('categories')
Output:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.56250
V engine 14 0.43750
For more details, see the pandas.concat documentation.
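Another option, along the lines of the "strip the category info" idea in the question: cast the categorical indexes to plain strings before concatenating. A small sketch, assuming df1 and df2 as built above:
# Converting the CategoricalIndex to a plain string index sidesteps the
# "categories must match" check, so a row-wise concat works directly.
df1.index = df1.index.astype(str)
df2.index = df2.index.astype(str)
df_con = pd.concat([df1, df2], axis=0)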
I have a dataframe containing clinical readings of hospital patients. For example, a similar dataframe could look like this:
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range(n_times_back + 1), we do the following:
Create a new, unique pid [OLD ID | i] (although as long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier).
For every patient (pid), remove the rows whose time value is greater than that patient's final time minus i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], since final time - 2.0 = 0.8.
Iterate.
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            # check if there is any time index left; if not, don't add a useless entry to the dataframe
            if new_data['time'].count() > 0:
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient]  # remove original patient data, now duplicated
    df.reset_index(inplace=True, drop=True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions
I ended up using a groupby and breaking when no more times were available, as well as creating an 'iteration' column that I merge with the 'pid' column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        # check if there is any time index left; if not, don't add a useless entry to the dataframe
        if new_data['time'].count() > 0:
            df = df.append(new_data)
        else:
            break
    return df
new_df = df.groupby('pid').apply(lambda x : expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']
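As a side note (not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea can be written by collecting the slices and concatenating once. A sketch under that assumption:
def expand_df(group, n_times, interval):
    frames = []
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        # stop as soon as no times are left for this patient
        if new_data['time'].count() == 0:
            break
        # .assign avoids writing into a slice of the original group
        frames.append(new_data.assign(iteration=str(i).zfill(2)))
    return pd.concat(frames) if frames else pd.DataFrame()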