I am trying to use the .describe() method on df1 to obtain aggregates. The current index is year. I want to obtain these stats for each statistic over the years in the index. I tried stats_df = df1.groupby('statistics').describe().unstack(1) but I don't get the result that I am looking for.
df1 =
statistics s_values
year
1999 cigarette use 100
1999 cellphone use 310
1999 internet use 101
1999 alcohol use 100
1999 soda use 215
2000 cigarette use 315
2000 cellphone use 317
2000 internet use 325
2000 alcohol use 108
2000 soda use 200
2001 cigarette use 122
2001 cellphone use 311
2001 internet use 112
2001 alcohol use 144
2001 soda use 689
2002 cigarette use 813
2002 cellphone use 954
2002 internet use 548
2002 alcohol use 882
2002 soda use 121
I am trying to achieve an output like the one below. Please keep in mind that these aggregate values are not accurate; I just populated them with random numbers to give you an idea of the format.
result stats_df =
statistics count unique top freq mean std min 20% 40% 50% 60% 80% max
cigarette use 32 335 655 54 45 45 1 23 21 12 55 55 999
cellphone use 92 131 895 49 12 33 6 13 32 55 34 12 933
internet use 32 111 123 44 65 31 2 42 544 15 11 54 111
alcohol use 32 315 611 33 41 53 3 34 22 34 11 33 555
soda use 32 355 655 54 45 45 1 23 21 12 55 55 999
Thank you.
I created a sample dataframe and I could get the result with just groupby().describe(). I am unsure what's wrong with your code; could you also edit your post to show the result you obtained?
Here's mine:
import pandas as pd

df = pd.DataFrame(index=[1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2000],
                  columns=['statistics', 's_values'],
                  data=[['cigarette use', 100], ['cellphone use', 310], ['internet use', 101],
                        ['alcohol use', 100], ['soda use', 215], ['cigarette use', 315],
                        ['cellphone use', 317], ['internet use', 325], ['alcohol use', 108],
                        ['soda use', 200]])
df.groupby("statistics").describe()
output:
s_values
count mean std min 25% 50% 75% max
statistics
alcohol use 2.0 104.0 5.656854 100.0 102.00 104.0 106.00 108.0
cellphone use 2.0 313.5 4.949747 310.0 311.75 313.5 315.25 317.0
cigarette use 2.0 207.5 152.027958 100.0 153.75 207.5 261.25 315.0
internet use 2.0 213.0 158.391919 101.0 157.00 213.0 269.00 325.0
soda use 2.0 207.5 10.606602 200.0 203.75 207.5 211.25 215.0
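If you also want the non-default percentiles from your expected output, describe() accepts a percentiles argument; a minimal sketch using the sample df above (0.5 is always included, and note the unique/top/freq columns in your mock-up only appear when describing object columns, not numeric ones):
# request percentiles beyond the default 25/50/75 explicitly
df.groupby('statistics')['s_values'].describe(percentiles=[0.2, 0.4, 0.6, 0.8])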
I am new to Python and I am trying to understand how to work with data aggregation and manipulation.
I have a dataframe:
df3
Out[122]:
SBK SSC CountRecs
0 99 22 9
1 99 12 10
2 99 121 11
3 99 138 12
4 99 123 8
... ... ...
160247 184 1318 1
160248 394 2659 1
160249 412 757 1
160250 357 1312 1
160251 202 106 1
I want to compute, for each row, what percentage its CountRecs value is of the total CountRecs for its SBK.
For example, the first row has CountRecs = 9 and the CountRecs for SBK 99 sum to 9 + 10 + 11 + 12 + 8 = 50, so its percentage is 9/50 * 100 = 18%. But I want this to be done automatically for all rows. How can I go about this?
You need to:
1. Group by the column you want.
2. Merge on the grouped column (you can change the name of the new column).
3. Add the percentage column.
a = df3.merge(pd.DataFrame(df3.groupby('SBK')['CountRecs'].sum()), on='SBK')
df3['percent'] = (a['CountRecs_x'] / a['CountRecs_y']) * 100
df3
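A self-contained sketch of this approach, using just the SBK 99 rows from the question as sample data:
import pandas as pd

# hypothetical sample: the SBK 99 rows from the question
df3 = pd.DataFrame({'SBK': [99, 99, 99, 99, 99],
                    'SSC': [22, 12, 121, 138, 123],
                    'CountRecs': [9, 10, 11, 12, 8]})

# merge the per-SBK sums back; the duplicate column names get the
# suffixes _x (original value) and _y (group sum)
a = df3.merge(pd.DataFrame(df3.groupby('SBK')['CountRecs'].sum()), on='SBK')
df3['percent'] = (a['CountRecs_x'] / a['CountRecs_y']) * 100
print(df3)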
Use GroupBy.transform to get a Series of per-group sums with the same size as the original DataFrame, so you can divide the original column directly:
df3['percent'] = df3['CountRecs'] / df3.groupby('SBK')['CountRecs'].transform('sum') * 100
print (df3)
SBK SSC CountRecs percent
0 99 22 9 18.0
1 99 12 10 20.0
2 99 121 11 22.0
3 99 138 12 24.0
4 99 123 8 16.0
160247 184 1318 1 100.0
160248 394 2659 1 100.0
160249 412 757 1 100.0
160250 357 1312 1 100.0
160251 202 106 1 100.0
I am trying to figure out how I could plot this data:
column 1 ['genres']: These are the value counts for all the genres in the table
Drama 2453
Comedy 2319
Action 1590
Horror 915
Adventure 586
Thriller 491
Documentary 432
Animation 403
Crime 380
Fantasy 272
Science Fiction 214
Romance 186
Family 144
Mystery 125
Music 100
TV Movie 78
War 59
History 44
Western 42
Foreign 9
Name: genres, dtype: int64
column 2 ['release_year']: These are the value counts for all the release years for different kind of genres
2014 699
2013 656
2015 627
2012 584
2011 540
2009 531
2008 495
2010 487
2007 438
2006 408
2005 363
2004 307
2003 281
2002 266
2001 241
2000 226
1999 224
1998 210
1996 203
1997 192
1994 184
1993 178
1995 174
1988 145
1989 136
1992 133
1991 133
1990 132
1987 125
1986 121
1985 109
1984 105
1981 82
1982 81
1983 80
1980 78
1978 65
1979 57
1977 57
1971 55
1973 55
1976 47
1974 46
1966 46
1975 44
1964 42
1970 40
1967 40
1972 40
1968 39
1965 35
1963 34
1962 32
1960 32
1969 31
1961 31
Name: release_year, dtype: int64
I need to answer questions like: what genre is most popular from year to year? And so on.
What kinds of plots can be used, and what is the best way to do this, since there would be a lot of bins in a single chart?
Is seaborn better for plotting such variables?
Should I divide the year data into two ranges (1900s and 2000s)?
Sample of the table:
id popularity runtime genres vote_count vote_average release_year
0 135397 32.985763 124 Action 5562 6.5 2015
1 76341 28.419936 120 Action 6185 7.1 1995
2 262500 13.112507 119 Adventure 2480 6.3 2015
3 140607 11.173104 136 Thriller 5292 7.5 2013
4 168259 9.335014 137 Action 2947 7.3 2005
You could do something like this:
Plotting histogram using seaborn for a dataframe
Personally, I prefer seaborn for this kind of plot because it's easier, but you can use matplotlib too.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# sample data
samples = 300
ids = range(samples)
gind = np.random.randint(0, 5, samples)  # keys 0-4, matching gkeys below
years = np.random.randint(1990, 2000, samples)
# create sample dataframe
gkeys = {1: 'Drama', 2: 'Comedy', 3: 'Action', 4: 'Adventure', 0: 'Thriller'}
df = pd.DataFrame(list(zip(ids, gind, years)),
                  columns=['ID', 'Genre', 'Year'])
df['Genre'] = df['Genre'].replace(gkeys)
# count the year groups
res = df.groupby(['Year', 'Genre']).count()
res = res.reset_index()
# only the max values
# res_ind = res.groupby(['Year']).idxmax()
# res = res.loc[res_ind['ID'].tolist()]
# viz
sns.set(style="white")
g = sns.catplot(x='Year',
                y='ID',
                hue='Genre',
                data=res,
                kind='bar',
                ci=None)
g.set_axis_labels("Year", "Count")
plt.show()
If that is too many bins in one plot, just split it up.
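To answer "what genre is most popular from year to year" directly, one option (a sketch using the sample df built above) is to cross-tabulate counts and take the per-year maximum:
# count genre occurrences per year, then pick the most frequent genre per row
counts = pd.crosstab(df['Year'], df['Genre'])
most_popular = counts.idxmax(axis=1)  # Series mapping year -> top genre
print(most_popular)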
My aim is to compute, for several columns, each value's percentage of another column. The resulting columns should be kept in the same dataframe.
A B Divisor
2000 8 31 166
2001 39 64 108
2002 68 8 142
2003 28 2 130
2004 55 61 150
result:
A B Divisor perc_A perc_B
2000 8 31 166 4.8 18.7
2001 39 64 108 36.1 59.3
2002 68 8 142 47.9 5.6
2003 28 2 130 21.5 1.5
2004 55 61 150 36.7 40.7
My solution:
def percentage(divisor, columns, heading, dframe):
    for col in columns:
        heading_new = str(heading + col)
        dframe[heading_new] = (dframe.loc[:, col] / dframe.loc[:, divisor]) * 100
    return dframe

df_new = percentage("Divisor", df.columns.values[:2], "perc_", df)
The solution above worked, but is there a more efficient way to get the result?
(I know there are already similar questions, but I couldn't find one where I can save the results in the same dataframe without losing the original columns.)
Thanks
Use DataFrame.join to add new columns: create them with DataFrame.div applied to the first 2 columns selected by DataFrame.iloc, multiply by 100 with mul, and prefix the names with DataFrame.add_prefix:
df = df.join(df.iloc[:, :2].div(df['Divisor'], axis=0).mul(100).add_prefix('perc_'))
print (df)
A B Divisor perc_A perc_B
2000 8 31 166 4.819277 18.674699
2001 39 64 108 36.111111 59.259259
2002 68 8 142 47.887324 5.633803
2003 28 2 130 21.538462 1.538462
2004 55 61 150 36.666667 40.666667
Your function should be changed:
def percentage(divisor, columns, heading, dframe):
    return dframe.join(dframe[columns].div(dframe[divisor], axis=0).mul(100).add_prefix(heading))
df_new = percentage("Divisor",df.columns.values[:2],"perc_",df)
You can reshape the divisor:
df[['perc_A', 'perc_B']] = df[['A', 'B']] / df['Divisor'].values[:,None] * 100
I have a data frame that looks like this:
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
And I would like to transform it into a panel data structure, as in the answer to this question: Fixed effect in Pandas or Statsmodels, so I can use PanelOLS with fixed effects.
My first attempt was to do this transformation:
df1 = df.ix[:,['Permits_13', 'Score_13']].T
df2 = df.ix[:,['Permits_14', 'Score_14']].T
df3 = df.ix[:,['Permits_15', 'Score_15']].T
pf = pandas.Panel({'df1':df1,'df2':df2,'df3':df3})
However, it doesn't seem to be the correct way, since I have no information about time. Here, columns ending with 13, 14 and 15 represent observations for the years 2013, 2014 and 2015, in that order.
Do I have to create a data frame for each one of the rows in the original data?
This is my first trial using Pandas, and any help would be appreciated.
The docstring of DataFrame.to_panel() says:
Transform long (stacked) format (DataFrame) into wide (3D, Panel)
format.
Currently the index of the DataFrame must be a 2-level MultiIndex.
This may be generalized later
So that means you need to:
1. Stack your dataframe (as it's currently "wide", not "long").
2. Pick two columns that can uniquely define the index of your dataframe.
3. Set those columns as your index.
4. Call to_panel().
So that's (note stack() returns a Series, so reset_index() is needed to get back to a DataFrame before setting the index):
df.stack().reset_index().set_index(['first_col', 'other_col']).to_panel()
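As an alternative reshape that makes the time dimension explicit, here is a sketch using pd.wide_to_long, assuming the Permits_13/Score_13 column naming and the df from the question:
import pandas as pd

# reshape the wide, year-suffixed columns into a long frame indexed by
# (Name, year); the suffixes 13/14/15 become the values of the 'year' level
long_df = pd.wide_to_long(df, stubnames=['Permits', 'Score'],
                          i='Name', j='year', sep='_')
print(long_df.head())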
I have a dataframe that is the result of a pivot table that has columns:
(best buy, count) 753 non-null values
(best buy, mean) 753 non-null values
(best buy, min) 753 non-null values
(best buy, max) 753 non-null values
(best buy, std) 750 non-null values
(amazon, count) 662 non-null values
(amazon, mean) 662 non-null values
(amazon, min) 662 non-null values
(amazon, max) 662 non-null values
(amazon, std) 661 non-null values
If I send this to a csv file I end up with something that looks like this (truncated)
(best buy, count) (best buy, mean) (best buy, max)
laptop 5 10 12
tv 10 23 34
and so on and so forth.
Is there a way for me to manipulate the dataframe so that the csv that is created instead looks like the below?
best buy best buy best buy
count mean max
laptop 5 10 12
tv 10 23 34
You can pass tupleize_cols=False to DataFrame.to_csv():
In [59]: import numpy as np; from pandas import DataFrame; from numpy.random import poisson
In [60]: df = DataFrame(poisson(50, size=(10, 2)), columns=['laptop', 'tv'])
In [61]: df
Out[61]:
laptop tv
0 48 57
1 48 45
2 48 49
3 61 47
4 49 47
5 45 65
6 49 40
7 58 39
8 46 65
9 43 53
In [62]: df['store'] = np.random.choice(['best_buy', 'amazon'], len(df))
In [63]: df
Out[63]:
laptop tv store
0 48 57 best_buy
1 48 45 best_buy
2 48 49 best_buy
3 61 47 best_buy
4 49 47 amazon
5 45 65 amazon
6 49 40 amazon
7 58 39 best_buy
8 46 65 amazon
9 43 53 best_buy
In [64]: res = df.groupby('store').agg(['mean', 'std', 'min', 'max']).T
In [65]: res
Out[65]:
store amazon best_buy
laptop mean 47.250 51.000
std 2.062 6.928
min 45.000 43.000
max 49.000 61.000
tv mean 54.250 48.333
std 12.738 6.282
min 40.000 39.000
max 65.000 57.000
In [66]: u = res.unstack()
In [67]: u
Out[67]:
store amazon best_buy
mean std min max mean std min max
laptop 47.25 2.062 45 49 51.000 6.928 43 61
tv 54.25 12.738 40 65 48.333 6.282 39 57
In [68]: u.to_csv('the_csv.csv', tupleize_cols=False, sep='\t')
In [69]: cat the_csv.csv
store amazon amazon amazon amazon best_buy best_buy best_buy best_buy
mean std min max mean std min max
laptop 47.25 2.0615528128088303 45.0 49.0 51.0 6.928203230275509 43.0 61.0
tv 54.25 12.737739202856996 40.0 65.0 48.333333333333336 6.282250127674532 39.0 57.0
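If you instead want a single header row, a sketch that flattens the MultiIndex columns yourself before writing (the output filename is just an example):
# join each (store, stat) tuple into one "store stat" header string
u.columns = ['{} {}'.format(store, stat) for store, stat in u.columns]
u.to_csv('the_csv_flat.csv', sep='\t')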