Pandas Multi-Index - Can't convert non-uniquely indexed DataFrame to Panel

Given time series data, I'm trying to run a panel OLS with fixed effects in Python. I found this way to do it:
Fixed effect in Pandas or Statsmodels
My input data looks like this (I will call it df):
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
So first I have to transform it to a MultiIndex (_13, _14, _15 represent data from 2013, 2014, and 2015, respectively):
import pandas
import numpy

df = df.dropna()
df = df.drop_duplicates()
# Three year-end timestamps: 2013, 2014, 2015 (freq='A' is annual, year-end)
rng = pandas.date_range(start=pandas.datetime(2013, 1, 1), periods=3, freq='A')
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
# Stack the per-year (score, permits) blocks on top of each other
d1 = numpy.array(df.ix[:, ['Score_13', 'Permits_13']])
d2 = numpy.array(df.ix[:, ['Score_14', 'Permits_14']])
d3 = numpy.array(df.ix[:, ['Score_15', 'Permits_15']])
data = numpy.concatenate((d1, d2, d3), axis=0)
s = pandas.DataFrame(data, index=index, columns=['y', 'x'])
s = s.drop_duplicates()
Which results in something like this:
y x
date id
2013-12-31 P.S. 015 ROBERTO CLEMENTE 284 12
P.S. 019 ASHER LEVY 296 18
P.S. 020 ANNA SILVER 294 9
P.S. 034 FRANKLIN D. ROOSEVELT 294 3
P.S. 064 ROBERT SIMON 287 3
P.S. 110 FLORENCE NIGHTINGALE 313 0
P.S. 134 HENRIETTA SZOLD 290 4
P.S. 137 JOHN L. BERNSTEIN 276 4
P.S. 140 NATHAN STRAUS 282 13
P.S. 142 AMALIA CASTRO 290 7
P.S. 184M SHUANG WEN 327 5
P.S. 188 THE ISLAND SCHOOL 279 4
HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 255 4
TECHNOLOGY, ARTS, AND SCIENCES STUDIO 282 18
THE EAST VILLAGE COMMUNITY SCHOOL 306 35
UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL 277 4
THE CHILDREN'S WORKSHOP SCHOOL 302 35
NEIGHBORHOOD SCHOOL 299 15
EARTH SCHOOL 305 3
SCHOOL FOR GLOBAL LEADERS 286 15
TOMPKINS SQUARE MIDDLE SCHOOL 306 3
P.S. 001 ALFRED E. SMITH 303 20
P.S. 002 MEYER LONDON 306 8
P.S. 003 CHARRETTE SCHOOL 325 62
P.S. 006 LILLIE D. BLAKE 333 89
P.S. 011 WILLIAM T. HARRIS 320 30
P.S. 033 CHELSEA PREP 313 5
P.S. 040 AUGUSTUS SAINT-GAUDENS 326 23
P.S. 041 GREENWICH VILLAGE 326 25
P.S. 042 BENJAMIN ALTMAN 314 30
... ... ... ...
2015-12-31 P.S. 054 CHARLES W. LENG 309 2
P.S. 055 HENRY M. BOEHM 311 3
P.S. 56 THE LOUIS DESARIO SCHOOL 323 4
P.S. 057 HUBERT H. HUMPHREY 287 2
SPACE SHUTTLE COLUMBIA SCHOOL 307 0
P.S. 060 ALICE AUSTEN 303 1
I.S. 061 WILLIAM A MORRIS 291 2
MARSH AVENUE SCHOOL FOR EXPEDITIONARY LEARNING 316 0
P.S. 069 DANIEL D. TOMPKINS 307 2
I.S. 072 ROCCO LAURIE 308 1
I.S. 075 FRANK D. PAULO 318 9
THE MICHAEL J. PETRIDES SCHOOL 310 0
STATEN ISLAND SCHOOL OF CIVIC LEADERSHIP 309 0
P.S. 075 MAYDA CORTIELLA 282 19
P.S. 086 THE IRVINGTON 286 38
P.S. 106 EDWARD EVERETT HALE 280 27
P.S. 116 ELIZABETH L FARRELL 291 3
P.S. 123 SUYDAM 287 14
P.S. 145 ANDREW JACKSON 285 4
P.S. 151 LYNDON B. JOHNSON 271 27
J.H.S. 162 THE WILLOUGHBY 283 22
P.S. 274 KOSCIUSKO 282 2
J.H.S. 291 ROLAND HAYES 279 13
P.S. 299 THOMAS WARREN FIELD 288 5
I.S. 347 SCHOOL OF HUMANITIES 284 45
I.S. 349 MATH, SCIENCE & TECH. 285 45
P.S. 376 301 9
P.S. 377 ALEJANDRINA B. DE GAUTIER 277 3
P.S. /I.S. 384 FRANCES E. CARTER 291 4
ALL CITY LEADERSHIP SECONDARY SCHOOL 325 18
However, when I try to call:
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get an error:
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
This is my first time using Pandas, so this may be a simple question, but I don't know what the problem is. As far as I can tell, I have a MultiIndex object as required.
I don't understand why I have duplicates (I added several drop_duplicates() calls to try to get rid of any duplicated data, though I don't think that's the answer). If I have data for the same school for three years, shouldn't I expect some repetition (looking just at the Name column, for example)?
EDIT
df is 935 rows × 7 columns after getting rid of the NaN rows, so I expected s to be 2805 rows × 2 columns, which is exactly what I have.
If I run this:
s = s.reset_index().groupby(s.index.names).first()
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get another error:
ValueError: operands could not be broadcast together with shapes (2763,) (3,)
Thank you.

Using the provided pickle file, I ran the regression and it worked fine.
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x>
Number of Observations: 2763
Number of Degrees of Freedom: 4
R-squared: 0.0268
Adj R-squared: 0.0257
Rmse: 16.4732
F-stat (1, 2759): 25.3204, p-value: 0.0000
Degrees of Freedom: model 3, resid 2759
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.1666 0.0191 8.72 0.0000 0.1292 0.2041
---------------------------------End of Summary---------------------------------
I ran this in a Jupyter Notebook.
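For reference, if you do hit the non-unique index error, one likely source is repeated school names in df['Name'], which MultiIndex.from_product then turns into duplicate (date, id) pairs. A quick sketch to inspect that (assuming the MultiIndexed frame s from the question):
# Show every row whose (date, id) index entry occurs more than once.
dupes = s[s.index.duplicated(keep=False)]
print(dupes.sort_index())
# And check whether the source column itself has repeated names:
print(df['Name'].value_counts().head())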

Related

Create optimum possible data frame of given length where total points are maximised and total cost does not exceed budget

Rider Cost Points Points/Cost
27 Mattia Cattaneo 8 620 77.500000
12 Rigoberto Urán 12 696 58.000000
2 Richard Carapaz 18 927 51.500000
31 Stefan Bissegger 8 402 50.250000
82 Andreas Kron 6 291 48.500000
8 Michael Woods 14 625 44.642857
63 Neilson Powless 6 263 43.833333
86 Gonzalo Serrano 6 250 41.666667
32 Stefan Küng 8 327 40.875000
20 Gino Mäder 10 404 40.400000
33 Eddie Dunbar 8 284 35.500000
18 Rui Costa 12 412 34.333333
3 Jakob Fuglsang 16 521 32.562500
0 Maximilian Schachmann 20 641 32.050000
142 Claudio Imhof 4 127 31.750000
This is a fantasy team problem.
I would like to create a data frame of 9 riders where the combined points are maximised but the total cost does not exceed the budget of 100 credits. How would I go about this?
Extract columns
df = df1[['Rider', 'Cost', 'Points']]
df['Points/Cost'] = df['Points']/df['Cost']
Set the maximum budget
budget = 100
Set number of riders/people in team
num_riders = 9
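The full rider table is too large to brute-force all 9-rider combinations, so one way is to treat this as a small integer program (a 0/1 knapsack with an extra "exactly 9 riders" constraint). Below is a sketch using the PuLP solver; note that PuLP is a separate package (pip install pulp), not part of pandas, and pick_team is just an illustrative name:
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def pick_team(df, budget=100, num_riders=9):
    riders = df.reset_index(drop=True)
    # One binary decision variable per rider: 1 means "in the team".
    x = [LpVariable(f"x{i}", cat=LpBinary) for i in riders.index]
    prob = LpProblem("fantasy_team", LpMaximize)
    # Objective: maximise the combined points of the selected riders.
    prob += lpSum(xi * p for xi, p in zip(x, riders["Points"]))
    # Constraints: stay within budget and pick exactly num_riders riders.
    prob += lpSum(xi * c for xi, c in zip(x, riders["Cost"])) <= budget
    prob += lpSum(x) == num_riders
    prob.solve()
    return riders[[xi.value() == 1 for xi in x]]

team = pick_team(df)
print(team, team['Points'].sum(), team['Cost'].sum())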

Why does my replace method not work with the string [B]?

I have the following dataset called world_top_ten:
Most populous countries 2000 2015 2030[A]
0 China[B] 1270 1376 1416
1 India 1053 1311 1528
2 United States 283 322 356
3 Indonesia 212 258 295
4 Pakistan 136 208 245
5 Brazil 176 206 228
6 Nigeria 123 182 263
7 Bangladesh 131 161 186
8 Russia 146 146 149
9 Mexico 103 127 148
10 World total 6127 7349 8501
I am trying to replace the [B] with "":
world_top_ten['Most populous countries'].str.replace(r'"[B]"', '')
And it returns:
0 China[B]
1 India
2 United States
3 Indonesia
4 Pakistan
5 Brazil
6 Nigeria
7 Bangladesh
8 Russia
9 Mexico
10 World total
Name: Most populous countries, dtype: object
What am I doing wrong here?
Because [ and ] are special regex characters, escape them:
world_top_ten['Most populous countries'].str.replace(r'\[B\]', '', regex=True)
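Note that str.replace returns a new Series rather than modifying the column in place, so assign the result back:
world_top_ten['Most populous countries'] = (
    world_top_ten['Most populous countries'].str.replace(r'\[B\]', '', regex=True)
)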

How do I explode my dataframe based on each word in a column?

I have the following df:
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
...
9366 345 32 City of Louisville closed off this summer
9367 1200 234 New york rats "stronger than ever", reports say
9368 432 123 Congolese militia shipwrecked in Norway
9369 594 203 Scientists now agree on how to use ice in drinks
9370 611 153 Historic drought hits Atlantis
Now I would like to create a new dataframe, df2, where I can see what score and how many comments each word gets. Like this:
Word score num_comments
Uhaul 134 518
implement 134 518
nicotine-free 134 518
hiring 134 518
policy 134 518
Orangutan 28 43
saves 28 43
child 28 43
from 28 43
a 28 43
giant 28 43
volcano 28 43
...
etc..
I have tried splitting the titles into separate words and then exploding:
df3['titles_split'] = df['titles'].str.split()
This gave me a column that looked like this:
Score num_comments titles_split
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy]
1 28 43 [Orangutan, saves, child, from, a, giant, volcano]
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour]
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue]
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud]
...
9366 345 32 [City, of, Louisville, closed, off, this, summer]
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say]
9368 432 123 [Congolese, militia, shipwrecked, in, Norway]
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks]
9370 611 153 [Historic, drought, hits, Atlantis]
Then I tried this code:
df3.explode(df3.assign(titles_split=df3.titles_split.str.split(',')), 'titles_split')
But I got the following error message:
ValueError: column must be a scalar, tuple, or list thereof
The same thing happened when I tried it for titles in df2.
I also tried creating new columns that repeated scores and num_comments as many times as there are words in titles (or titles_split). The idea was to create a dataframe like this:
Score num_comments titles_split score_repeated
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy] 134,134,134,134,134,134
1 28 43 [Orangutan, saves, child, from, a, giant, volcano] 28,28,28,28,28,28,28
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour] 30,30,30 etc..
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue] etc.
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud] etc
...
9366 345 32 [City, of, Louisville, closed, off, this, summer] etc
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say] etc
9368 432 123 [Congolese, militia, shipwrecked, in, Norway] etc
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks] etc
9370 611 153 [Historic, drought, hits, Atlantis] etc
And then explode on titles_split, score_repeated and comments_repeated like this:
df4.explode(['titles_split', 'score_repeated', 'comments_repeated'])
But I never got to that point because I couldn't get repeated columns. I tried the following code:
df3['score_repeat'] = df3.apply(lambda x: [x.score] * len(x.titles_split) , axis =1)
Which gave me this error message:
TypeError: object of type 'float' has no len()
Then I tried:
df3['score_repeat'] = [[y] * x for x, y in zip(df3['titles_split'].str.len(),df['score'])]
Which gave me:
TypeError: can't multiply sequence by non-int of type 'float'
But I am not even sure I am going about this the right way. Do I even need to create score_repeated and comments_repeated?
Assume this is the df:
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
5 345 32 City of Louisville closed off this summer
6 1200 234 New york rats "stronger than ever", reports say
7 432 123 Congolese militia shipwrecked in Norway
8 594 203 Scientists now agree on how to use ice in drinks
9 611 153 Historic drought hits Atlantis
You can try the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '') #data cleaning
df['titles'] = df['titles'].str.split() #split sentence into a list (of single words)
df2 = df.explode('titles', ignore_index=True)
df2.columns = ['score', 'num_comments', 'word']
print(df2)
score num_comments word
0 134 518 Uhaul
1 134 518 implement
2 134 518 nicotine-free
3 134 518 hiring
4 134 518 policy
5 28 43 Orangutan
6 28 43 saves
7 28 43 child
8 28 43 from
9 28 43 a
10 28 43 giant
11 28 43 volcano
12 30 114 Swimmer
13 30 114 dies
14 30 114 in
15 30 114 a
16 30 114 horrific
17 30 114 shark
18 30 114 attack
19 30 114 in
20 30 114 harbour
21 745 298 More
22 745 298 teenagers
23 745 298 than
24 745 298 ever
25 745 298 are
26 745 298 addicted
27 745 298 to
28 745 298 glue
29 40 67 Lebanese
30 40 67 lawyers
31 40 67 union
32 40 67 accuse
33 40 67 Al
34 40 67 Capone
35 40 67 of
36 40 67 fraud
37 345 32 City
38 345 32 of
39 345 32 Louisville
40 345 32 closed
41 345 32 off
42 345 32 this
43 345 32 summer
44 1200 234 New
45 1200 234 york
46 1200 234 rats
47 1200 234 stronger
48 1200 234 than
49 1200 234 ever
50 1200 234 reports
51 1200 234 say
52 432 123 Congolese
53 432 123 militia
54 432 123 shipwrecked
55 432 123 in
56 432 123 Norway
57 594 203 Scientists
58 594 203 now
59 594 203 agree
60 594 203 on
61 594 203 how
62 594 203 to
63 594 203 use
64 594 203 ice
65 594 203 in
66 594 203 drinks
67 611 153 Historic
68 611 153 drought
69 611 153 hits
70 611 153 Atlantis
Data cleaning needed
I noticed that there are strings containing " (double quotes) and , (commas), meaning the data is not clean. You could do the cleaning with the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '')
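If you then want one row per word rather than one row per occurrence, a groupby aggregation is one option (a sketch; mean() here, swap in sum() if totals are what you want):
word_stats = df2.groupby('word')[['score', 'num_comments']].mean()
print(word_stats.sort_values('score', ascending=False).head())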

Dataframe structures in pandas/seaborn. Do I need to be careful about observations vs variables?

I'm learning how to use python for data analysis and I have my first few dataframes to work with that I have pulled from video games I play.
So the dataframe I'm working with currently uses the header row for all the player names (8 players).
All the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one row for each observation (in your case, a player) and one column for each variable.
This is called "tidy data" (from Hadley Wickham's paper). Tidy data works more or less as a set of guidelines for us data scientists, much like normalization rules for relational-database people.
Also, most frameworks, programs, and data structures are implemented with this organization in mind. For instance, in pandas, once your data is transposed, checking the average head-shot kills is just df['Head Shot Kills'].mean().
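A minimal sketch of that transpose, assuming the stats table above is loaded as df with players as columns:
import pandas as pd

tidy = df.T                       # rows = players, columns = statistics
tidy.index.name = 'Player'
# The transpose leaves columns as object dtype, so convert before aggregating.
tidy['Head Shot Kills'] = pd.to_numeric(tidy['Head Shot Kills'])
print(tidy['Head Shot Kills'].mean())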

DataFrame to DataPanel in Pandas / Python

I have a data frame that looks like this:
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
And I would like to transform it into a panel data structure, as in the answer to this question: Fixed effect in Pandas or Statsmodels, so I can use PanelOLS with fixed effects.
My first attempt was to do this transformation:
df1 = df.ix[:,['Permits_13', 'Score_13']].T
df2 = df.ix[:,['Permits_14', 'Score_14']].T
df3 = df.ix[:,['Permits_15', 'Score_15']].T
pf = pandas.Panel({'df1':df1,'df2':df2,'df3':df3})
However, this doesn't seem to be the correct way, since I lose the information about time. Here, columns ending in 13, 14, and 15 represent observations for the years 2013, 2014, and 2015, respectively.
Do I have to create a data frame for each one of the rows in the original data?
This is my first time using Pandas, and any help would be appreciated.
The docstring of DataFrame.to_panel() says:
Transform long (stacked) format (DataFrame) into wide (3D, Panel)
format.
Currently the index of the DataFrame must be a 2-level MultiIndex.
This may be generalized later
So that means you need to do:
Stack your dataframe (as it's currently "wide", not "long")
Pick two columns that can uniquely define the index of your dataframe
Set those columns as your index
Call to_panel()
So that's:
df.stack().set_index(['first_col', 'other_col']).to_panel()
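For this particular frame, a hedged alternative sketch: pandas.wide_to_long can build the required 2-level MultiIndex directly from the year-suffixed columns, assuming Name uniquely identifies rows (and noting that Panel and to_panel() only exist in pandas versions before 0.25):
long_df = pandas.wide_to_long(
    df, stubnames=['Permits', 'Score'],
    i='Name', j='year', sep='_', suffix=r'\d+',
)
panel = long_df.to_panel()  # works only on pandas versions that still ship Panel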
