I have a data frame that looks like this:
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
And I would like to transform it into a panel data structure, following the answer to the question Fixed effect in Pandas or Statsmodels, so I can use PanelOLS with fixed effects.
My first attempt was to do this transformation:
df1 = df.ix[:,['Permits_13', 'Score_13']].T
df2 = df.ix[:,['Permits_14', 'Score_14']].T
df3 = df.ix[:,['Permits_15', 'Score_15']].T
pf = pandas.Panel({'df1':df1,'df2':df2,'df3':df3})
However, this doesn't seem to be the correct way, since I end up with no information about time. Here, columns ending with 13, 14 and 15 represent observations for the years 2013, 2014 and 2015, in that order.
Do I have to create a data frame for each one of the rows in the original data?
This is my first trial using Pandas, and any help would be appreciated.
The docstring of DataFrame.to_panel() says:
Transform long (stacked) format (DataFrame) into wide (3D, Panel)
format.
Currently the index of the DataFrame must be a 2-level MultiIndex.
This may be generalized later
So that means you need to do:
Stack your dataframe (as it's currently "wide", not "long")
Pick two columns that can uniquely define the index of your dataframe
Set those columns as your index
Call to_panel()
So that's:
df.stack().set_index(['first_col', 'other_col']).to_panel()
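For the dataframe in the question, a minimal sketch of that recipe could use pd.wide_to_long (an alternative to the manual stack for suffixed columns like these); this assumes an older pandas release, since Panel and to_panel() were removed in pandas 0.25:
import pandas as pd

# Melt the suffixed columns (Permits_13, Score_13, ...) into long
# format: one row per (Name, year), with Permits and Score columns.
long_df = pd.wide_to_long(df, stubnames=['Permits', 'Score'],
                          i='Name', j='year', sep='_')

# The resulting 2-level (Name, year) MultiIndex is exactly what
# to_panel() expects; the call only exists in pandas < 0.25.
panel = long_df.to_panel()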
Related
I feel like this should be an easy solution, but it has eluded me a bit (long week).
Say I have the following Pandas Dataframe (df):
day  x_count  x_max  y_count  y_max
1    8        230    18       127
1    6        174    12       121
1    5        218    21       184
1    11       91     32       162
2    11       128    17       151
2    13       156    16       148
2    18       191    22       120
Etc. How can I collapse it down so that I have one row per day, with each of the columns in my example summed across all rows for that day?
For example:
day  x_count  x_max  y_count  y_max
1    40       713    93       594
2    42       475    55       419
Is it best to reshape it or simply create a new one?
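One approach (a sketch, not from the original thread) is a groupby on day followed by sum:
import pandas as pd

# Sample rows from the question (the sample ends with "Etc.", so
# the sums won't fully match the expected output above).
df = pd.DataFrame({
    'day':     [1, 1, 1, 1, 2, 2, 2],
    'x_count': [8, 6, 5, 11, 11, 13, 18],
    'x_max':   [230, 174, 218, 91, 128, 156, 191],
    'y_count': [18, 12, 21, 32, 17, 16, 22],
    'y_max':   [127, 121, 184, 162, 151, 148, 120],
})

# Collapse to one row per day, summing every other column.
collapsed = df.groupby('day', as_index=False).sum()
print(collapsed)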
I started this question yesterday and have done more work on it.
Thanks @AMC, @ALollz
I have a dataframe of surgical activity data with 58 columns and 200,000 records, where each row corresponds to a patient encounter. One of the columns, 'TRETSPEF', holds the treatment specialty, and I want to see the relative contribution of medical specialties. I used `pd.read_csv('csv', usecols=['TRETSPEF'])` to import the series.
df
TRETSPEF
0 150
1 150
2 150
3 150
4 150
... ...
218462 150
218463 &
218464 150
218465 150
218466 218
The most common treatment specialty is neurosurgery (code 150). So here's the problem: when I apply .value_counts() I get two groups for the 150 code (and the 218 code):
df['TRETSPEF'].value_counts()
150 140411
150 40839
218 13692
108 10552
218 4143
...
501 1
120 1
302 1
219 1
106 1
Name: TRETSPEF, Length: 69, dtype: int64
There are some '&' in there (454 of them), so I wondered if the fact that they aren't integers was messing things up. I changed them to empty strings and ran value_counts again.
df['TRETSPEF'].str.replace("&", "").value_counts()
150 140411
218 13692
108 10552
800 858
110 835
811 692
191 580
323 555
454
100 271
400 116
420 47
301 45
812 38
214 24
215 23
180 22
300 17
370 15
421 11
258 11
314 5
422 4
260 4
192 4
242 4
171 4
350 2
307 2
302 2
328 2
160 1
219 1
120 1
107 1
101 1
143 1
501 1
144 1
320 1
104 1
106 1
430 1
264 1
Name: TRETSPEF, dtype: int64
So now I seem to have lost the second group of 150 (about 40,000 records) by changing '&' to empty strings. The empty strings are still showing up in .value_counts() though. The length of the series has gone down to 45 from 69.
I tried stripping whitespace - no difference. Not sure what tests to run to see why this is happening. I feel it must somehow be due to the data.
This is 100% a data cleansing issue. Try to force the column to be numeric.
pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
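A small sketch of what is likely going on (hypothetical data): when read_csv meets a column mixing numbers and strings such as '&', it can leave both the integer 150 and the string '150' in the same column, and value_counts() treats them as distinct groups:
import pandas as pd

# An object column holding both int and str versions of the codes.
s = pd.Series([150, '150', 150, '&', 218, '218'])
print(s.value_counts())   # separate groups for int 150 and str '150'

# Coercing to numeric merges them; '&' becomes NaN, which
# value_counts() drops by default.
print(pd.to_numeric(s, errors='coerce').value_counts())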
I'm learning how to use Python for data analysis, and I have my first few dataframes to work with, pulled from video games I play.
The dataframe I'm currently working with uses the header row for all the player names (8 players), and all the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one instance (row) for each observation, in your case, player. And one feature (column) for each variable.
This is called "tidy data" (from the paper published by Hadley Wickham). Tidy data works more or less like guidelines for us data scientists, much like normalization rules for relational database people.
Also, most frameworks/programs/data structures are implemented with this organization in mind. For instance, with a pandas dataframe holding this data, checking the average head shot kills would just be df['Head Shot Kills'].mean() (if it were transposed).
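As a sketch against the table above, transposing puts each player on a row and each statistic in a column (percentage columns such as 'Headshot %' would still need cleaning before numeric work):
import pandas as pd

# Transpose: rows become players, columns become statistics.
tidy = df.T

# Column-wise operations then read naturally, e.g. the average
# head shot kills across players (coerced to numeric, since the
# transposed frame holds object dtype).
print(pd.to_numeric(tidy['Head Shot Kills']).mean())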
I created the following python pandas pivot table.
df_pv = pd.pivot_table(df,index=["Fiscal_Week"],columns=["Year"],values=["Category","Sales","Traffic"],
aggfunc={"Category":len,"Sales":np.sum,"Traffic":np.sum},fill_value=0)
            Category          Sales                  Traffic
Year        2014 2015 2016    2014    2015    2016   2014 2015 2016
Fiscal_Week
FW01 4 3 4 35678 654654 47547 567 231 765
FW02 2 6 7 6565 4686 34554 297 464 564
FW03 4 4 5 5867 56856 34346 287 45 324
FW04 2 5 3 8568 45745 3564 546 765 978
FW05 2 5 5 5685 3464 4754 325 235 654
FW06 4 3 2 56765 35663 3643 456 935 936
FW07 1 6 2 8686 2454 2463 324 728 598
FW08 6 2 3 34634 34543 4754 198 436 234
I would like to create the two following plots:
Scatterplot: number of campaigns by Sales, with each year in its own color.
The second graph should be Traffic by Fiscal Week.
I tried this unsuccessfully:
df_pv.plot(x="Fiscal_Week", y="Sales")
KeyError: 'Fiscal_Week'
Is there a better way, for example not pivoting at all and instead specifying the aggregations when plotting?
You're trying to use the index as a normal column. That's not possible.
Ways to overcome this (a sketch follows this list):
Reset the index with reset_index()
Use the index explicitly: .plot(x=df_pv.index, y="Sales")
Use the index implicitly: .plot(y="Sales", use_index=True)
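As a sketch for the two requested plots (matplotlib assumed; df_pv is the pivot table above): selecting one top-level group of the MultiIndex columns leaves one column per year, so each year automatically gets its own color.
import matplotlib.pyplot as plt

# Scatterplot: campaign counts (the Category len aggregation)
# against Sales, one color per year.
for year in df_pv['Sales'].columns:
    plt.scatter(df_pv['Category'][year], df_pv['Sales'][year],
                label=str(year))
plt.xlabel('Number of campaigns')
plt.ylabel('Sales')
plt.legend()
plt.show()

# Traffic by fiscal week: plot off the index, one line per year.
df_pv['Traffic'].plot(use_index=True)
plt.show()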
Given time series data, I'm trying to use panel OLS with fixed effects in Python. I found this way to do it:
Fixed effect in Pandas or Statsmodels
My input data looks like this (I will call it df):
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
So first I have to transform it to a MultiIndex (_13, _14 and _15 represent data from 2013, 2014 and 2015, in that order):
df = df.dropna()
df = df.drop_duplicates()
rng = pandas.date_range(start=pandas.datetime(2013, 1, 1), periods=3, freq='A')
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
d1 = numpy.array(df.ix[:, ['Score_13', 'Permits_13']])
d2 = numpy.array(df.ix[:, ['Score_14', 'Permits_14']])
d3 = numpy.array(df.ix[:, ['Score_15', 'Permits_15']])
data = numpy.concatenate((d1, d2, d3), axis=0)
s = pandas.DataFrame(data, index=index, columns=['y', 'x'])
s = s.drop_duplicates()
Which results in something like this:
y x
date id
2013-12-31 P.S. 015 ROBERTO CLEMENTE 284 12
P.S. 019 ASHER LEVY 296 18
P.S. 020 ANNA SILVER 294 9
P.S. 034 FRANKLIN D. ROOSEVELT 294 3
P.S. 064 ROBERT SIMON 287 3
P.S. 110 FLORENCE NIGHTINGALE 313 0
P.S. 134 HENRIETTA SZOLD 290 4
P.S. 137 JOHN L. BERNSTEIN 276 4
P.S. 140 NATHAN STRAUS 282 13
P.S. 142 AMALIA CASTRO 290 7
P.S. 184M SHUANG WEN 327 5
P.S. 188 THE ISLAND SCHOOL 279 4
HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 255 4
TECHNOLOGY, ARTS, AND SCIENCES STUDIO 282 18
THE EAST VILLAGE COMMUNITY SCHOOL 306 35
UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL 277 4
THE CHILDREN'S WORKSHOP SCHOOL 302 35
NEIGHBORHOOD SCHOOL 299 15
EARTH SCHOOL 305 3
SCHOOL FOR GLOBAL LEADERS 286 15
TOMPKINS SQUARE MIDDLE SCHOOL 306 3
P.S. 001 ALFRED E. SMITH 303 20
P.S. 002 MEYER LONDON 306 8
P.S. 003 CHARRETTE SCHOOL 325 62
P.S. 006 LILLIE D. BLAKE 333 89
P.S. 011 WILLIAM T. HARRIS 320 30
P.S. 033 CHELSEA PREP 313 5
P.S. 040 AUGUSTUS SAINT-GAUDENS 326 23
P.S. 041 GREENWICH VILLAGE 326 25
P.S. 042 BENJAMIN ALTMAN 314 30
... ... ... ...
2015-12-31 P.S. 054 CHARLES W. LENG 309 2
P.S. 055 HENRY M. BOEHM 311 3
P.S. 56 THE LOUIS DESARIO SCHOOL 323 4
P.S. 057 HUBERT H. HUMPHREY 287 2
SPACE SHUTTLE COLUMBIA SCHOOL 307 0
P.S. 060 ALICE AUSTEN 303 1
I.S. 061 WILLIAM A MORRIS 291 2
MARSH AVENUE SCHOOL FOR EXPEDITIONARY LEARNING 316 0
P.S. 069 DANIEL D. TOMPKINS 307 2
I.S. 072 ROCCO LAURIE 308 1
I.S. 075 FRANK D. PAULO 318 9
THE MICHAEL J. PETRIDES SCHOOL 310 0
STATEN ISLAND SCHOOL OF CIVIC LEADERSHIP 309 0
P.S. 075 MAYDA CORTIELLA 282 19
P.S. 086 THE IRVINGTON 286 38
P.S. 106 EDWARD EVERETT HALE 280 27
P.S. 116 ELIZABETH L FARRELL 291 3
P.S. 123 SUYDAM 287 14
P.S. 145 ANDREW JACKSON 285 4
P.S. 151 LYNDON B. JOHNSON 271 27
J.H.S. 162 THE WILLOUGHBY 283 22
P.S. 274 KOSCIUSKO 282 2
J.H.S. 291 ROLAND HAYES 279 13
P.S. 299 THOMAS WARREN FIELD 288 5
I.S. 347 SCHOOL OF HUMANITIES 284 45
I.S. 349 MATH, SCIENCE & TECH. 285 45
P.S. 376 301 9
P.S. 377 ALEJANDRINA B. DE GAUTIER 277 3
P.S. /I.S. 384 FRANCES E. CARTER 291 4
ALL CITY LEADERSHIP SECONDARY SCHOOL 325 18
However, when I try to call:
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get an error:
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
This is my first time using Pandas, so this may be a simple question, but I don't know what the problem is. As far as I can tell, I have a MultiIndex object as required.
I don't get why I have duplicates (I added several drop_duplicates() calls to try to get rid of any duplicated data, which I don't think is the answer, though). If I have data for the same school for three years, shouldn't some of the data look duplicated anyway (looking just at the Name column, for example)?
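One way to see where the non-unique index comes from (a sketch, not part of the original post) is to look at the repeated (date, id) pairs directly; two schools sharing the same Name would show up here:
# Rows whose (date, id) index entry occurs more than once.
dup_mask = s.index.duplicated(keep=False)
print(s[dup_mask])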
EDIT
df is 935 rows × 7 columns after getting rid of the NaN rows, so I expected s to be 2805 rows × 2 columns, which is exactly what I have.
If I run this:
s = s.reset_index().groupby(s.index.names).first()
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get another error:
ValueError: operands could not be broadcast together with shapes (2763,) (3,)
Thank you.
Using the provided pickle file, I ran the regression and it worked fine.
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x>
Number of Observations: 2763
Number of Degrees of Freedom: 4
R-squared: 0.0268
Adj R-squared: 0.0257
Rmse: 16.4732
F-stat (1, 2759): 25.3204, p-value: 0.0000
Degrees of Freedom: model 3, resid 2759
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.1666 0.0191 8.72 0.0000 0.1292 0.2041
---------------------------------End of Summary---------------------------------
I ran this in a Jupyter Notebook.