I have the following dataset called world_top_ten:
Most populous countries 2000 2015 2030[A]
0 China[B] 1270 1376 1416
1 India 1053 1311 1528
2 United States 283 322 356
3 Indonesia 212 258 295
4 Pakistan 136 208 245
5 Brazil 176 206 228
6 Nigeria 123 182 263
7 Bangladesh 131 161 186
8 Russia 146 146 149
9 Mexico 103 127 148
10 World total 6127 7349 8501
I am trying to replace the [B] with "":
world_top_ten['Most populous countries'].str.replace(r'"[B]"', '')
And it is returning me:
0 China[B]
1 India
2 United States
3 Indonesia
4 Pakistan
5 Brazil
6 Nigeria
7 Bangladesh
8 Russia
9 Mexico
10 World total
Name: Most populous countries, dtype: object
What am I doing wrong here?
Because [ and ] are special regex characters, you need to escape them:
world_top_ten['Most populous countries'].str.replace(r'\[B\]', '', regex=True)
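Alternatively, since you are replacing a literal substring rather than a pattern, you can turn regex matching off entirely. A minimal sketch (note that str.replace returns a new Series, so assign the result back):
# literal, non-regex replacement of the footnote marker
world_top_ten['Most populous countries'] = world_top_ten['Most populous countries'].str.replace('[B]', '', regex=False)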
The data frame is:
US Circuit #case
NY,N 202
NY,E 413
NY,S 304
NY,W 106
VT 15
DE 56
NJ 682
PA,E 147
PA,M 132
PA,W 209
VI 0
MD 453
NC,E 84
NC,M 60
NC,W 58
I aim to write Python code that detects the US circuits belonging to the same state and returns the sum of the cases for that state, like this:
US Circuit #case state #total case
NY,N 202 NY 1025
NY,E 413
NY,S 304
NY,W 106
VT 15 VT 15
DE 56 DE 56
NJ 682 NJ 682
PA,E 147 PA 488
PA,M 132
PA,W 209
VI 0 VI 0
If you don't need the empty rows, you can use groupby with transform:
df[['state', 'code']] = df['US Circuit'].str.split(',', expand=True)
df['total case'] = df.groupby('state')['#case'].transform('sum')
US Circuit #case state code total case
0 NY,N 202 NY N 1025
1 NY,E 413 NY E 1025
2 NY,S 304 NY S 1025
3 NY,W 106 NY W 1025
4 VT 15 VT None 15
5 DE 56 DE None 56
6 NJ 682 NJ None 682
7 PA,E 147 PA E 488
8 PA,M 132 PA M 488
9 PA,W 209 PA W 488
10 VI 0 VI None 0
11 MD 453 MD None 453
12 NC,E 84 NC E 202
13 NC,M 60 NC M 202
14 NC,W 58 NC W 202
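If you do want the blank repeats from your desired layout, you can mask everything but the first row of each state afterwards; a sketch building on the frame above:
# blank out state/total on all but the first row of each state
# (for display only; this turns the columns into strings)
first = ~df.duplicated('state')
df['state'] = df['state'].where(first, '')
df['total case'] = df['total case'].where(first, '')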
To get the sums, you can split on ",":
In [4]: df.groupby(df["US Circuit"].str.split(",").str[0])["#case"].sum().reset_index()
Out[4]:
US Circuit #case
0 DE 56
1 MD 453
2 NC 202
3 NJ 682
4 NY 1025
5 PA 488
6 VI 0
7 VT 15
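If you also want the output columns named as in your target ('state' and '#total case'), the same groupby can produce them directly; a sketch:
# rename the grouping key and the summed column to match the desired output
totals = (df.groupby(df["US Circuit"].str.split(",").str[0].rename("state"))["#case"]
            .sum()
            .reset_index(name="#total case"))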
My dataframe contains the number of matches for given fixtures, but only for home matches of a given team (i.e. the number of Argentina-Uruguay matches is 97, but for Uruguay-Argentina it is 80). In short, I want to sum both home-match counts for a given pair of teams, so that I have the total number of matches between them. The dataframe's top 30 rows look like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
In other words, I want something like this:
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example; I want to apply it to every pair of teams in the dataframe.
What should I do?
OK, you can follow the steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
Here you need to create a sorted list that will be the key for aggregation.
df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Here you convert this list to a tuple, so it becomes hashable and can be used as a groupby key.
def converter(team_list):
    # tuples are hashable, so they can serve as groupby keys
    return tuple(team_list)
df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
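Equivalently, the hashable key can be built in a single step; a compact sketch:
df1['sorted_list_team'] = [tuple(sorted(pair)) for pair in zip(df1['home_team'], df1['away_team'])]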
Do the aggregation to sum the 'how_many' values into another dataframe that I call 'df_sum':
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Then merge with 'df1' to get the summed result. The column 'how_many' exists in both dataframes, so pandas renames the one coming from df_sum to 'how_many_y':
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
As a final step, select only the columns you need from the result:
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
I found a relatively straightforward approach that hopefully does what you want, though it is slightly different from your desired output. Your output repeats information: once we stop caring about home vs. away and just want the game counts, we can get rid of that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just sum how_many where that new column matches:
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
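If you prefer your original layout, with the total repeated on both direction rows, a groupby transform on the same key gets there; a sketch reusing the 'teams' column built above:
# every Argentina-Uruguay and Uruguay-Argentina row now carries 177
df['total'] = df.groupby('teams')['how_many'].transform('sum')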
df1:
Id Country P_Type Sales
102 Portugal Industries 1265
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
df2:
Id Market Products Expenditure
123 Portugal ALL Wine 5642
136 Portugal St Wine 4568
158 India QA Housing 4529
168 India stm Housing 1576
749 Germany all Sports 4587
759 Germany sts Sports 4756
Output df:
Id Country P_Type Sales
102 Portugal Industries 1265
102 Portugal ALL Wine 5642
102 Portugal St Wine 4568
163 Portugal Office 1455
111 Portugal Clubs 1265
164 Portugal cars 1751
109 India House_hold 1651
109 India QA Housing 4529
109 India stm Housing 1576
104 India Office 1125
124 India Bakery 1752
112 India House_hold 1259
105 Germany Industries 1451
105 Germany all Sports 4587
105 Germany sts Sports 4756
103 Germany Office 1635
103 Germany Clubs 1520
103 Germany cars 1265
I need to append two dataframes, but the rows from df2 should be inserted at specific locations in df1.
For example, in df2 the first two rows' 'Market' column belongs to Portugal, and in df1 the first Portugal row has Id 102; those rows should be appended after the first Portugal row, with the same Id.
The same goes for the other countries.
I think I would do it by creating a pseudo sort key like this:
# align df2's column names with df1's
df2 = df2.set_axis(df1.columns, axis=1)
# first occurrence of each country gets sortkey 0, repeats get 2;
# df2 rows get 1, so they sort right after the first row of each country
df1['sortkey'] = df1['Country'].duplicated().replace({True: 2, False: 0})
df_sorted = (pd.concat([df1, df2.assign(sortkey=1)])
             .sort_values(['Country', 'sortkey'],
                          key=lambda x: x.astype(str).str.split(' ').str[0]))
# give every row in a country group the Id of that group's first row
df_sorted['Id'] = df_sorted.groupby(df_sorted['Country'].str.split(' ').str[0])['Id'].transform('first')
print(df_sorted.drop('sortkey', axis=1))
Output:
Id Country P_Type Sales
8 105 Germany Industries 1451
4 105 Germany all Sports 4587
5 105 Germany sts Sports 4756
9 105 Germany Office 1635
10 105 Germany Clubs 1520
11 105 Germany cars 1265
4 109 India House_hold 1651
2 109 India QA Housing 4529
3 109 India stm Housing 1576
5 109 India Office 1125
6 109 India Bakery 1752
7 109 India House_hold 1259
0 102 Portugal Industries 1265
0 102 Portugal ALL Wine 5642
1 102 Portugal St Wine 4568
1 102 Portugal Office 1455
2 102 Portugal Clubs 1265
3 102 Portugal cars 1751
Note: this uses the key parameter of sort_values, which requires pandas 1.1.0 or later.
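For older pandas without the key parameter, one workaround is to sort on an explicit helper column instead; a sketch under that assumption:
# build the sort key as a real column, sort, then drop it
df_sorted = (pd.concat([df1, df2.assign(sortkey=1)])
             .assign(country_key=lambda d: d['Country'].astype(str).str.split(' ').str[0])
             .sort_values(['country_key', 'sortkey'])
             .drop(columns='country_key'))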
import numpy as np
import pandas as pd
from itertools import chain
# ensure the columns match for both dataframes
# (note: in this answer `df` is the question's df1 and `df1` is the question's df2)
df1.columns = df.columns
# the Id from the original dataframe takes precedence, so we
# null out the Ids here and fill them down later
df1.Id = np.nan
# here we iterate through the groups of df:
# take the first row of each group,
# then the rows from df1 belonging to that group,
# then the remaining rows of the group;
# flatten the pieces with itertools' chain,
# concatenate, and fill down the null values in the Id column
merger = ((value.iloc[[0]],
           df1.loc[df1.Country.str.split().str[0].isin(value.Country)],
           value.iloc[1:])
          for key, value in df.groupby("Country", sort=False))
merger = chain.from_iterable(merger)
merger = pd.concat(merger, ignore_index=True).ffill().astype({"Id": "Int16"})
merger.head()
Id Country P_Type Sales
0 102 Portugal Industries 1265
1 102 Portugal ALL Wine 5642
2 102 Portugal St Wine 4568
3 163 Portugal Office 1455
4 111 Portugal Clubs 1265
df2.rename(columns = {'Market':'Country','Products':'P_Type','Expenditure':'Sales'}, inplace = True)
def Insert_row(row_number, df, row_value):
    # Starting value of upper half
    start_upper = 0
    # End value of upper half
    end_upper = row_number
    # Start value of lower half
    start_lower = row_number
    # End value of lower half
    end_lower = df.shape[0]
    # Create a list of upper_half index
    upper_half = [*range(start_upper, end_upper, 1)]
    # Create a list of lower_half index
    lower_half = [*range(start_lower, end_lower, 1)]
    # Increment the value of lower half by 1
    lower_half = [x + 1 for x in lower_half]
    # Combine the two lists
    index_ = upper_half + lower_half
    # Update the index of the dataframe
    df.index = index_
    # Insert the new row at the requested position
    df.loc[row_number] = row_value
    # Sort the index labels
    df = df.sort_index()
    # Return the dataframe
    return df
def proper_plc(index_2):
    # find the first row in df1 whose Country matches the current df2 country
    index_1 = 0
    for ids1 in df1.Country:
        if ids1 in ids:
            break
        index_1 += 1
    abc = list(df2.loc[index_2])
    # reuse the Id of the matching df1 row
    abc[0] = list(df1.loc[index_1])[0]
    return Insert_row(index_1 + 1, df1, abc)

index_2 = 0
for ids in df2.Country:
    df1 = proper_plc(index_2)
    index_2 += 1
I'm trying to scrape a web page for the table of countries and their areas.
My code compiles and runs but only outputs the top two rows, when I want them all.
I thought the problem might lie with .head(), so I played around with it, passing numbers and leaving it out altogether, but I can't get it to print more than two.
Any help would be appreciated!
from gazpacho import get, Soup
import pandas as pd
url = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html"
response = get(url)
soup = Soup(response)
df0 = pd.read_html(str(soup.find('table')))[0]
print(df0[['Rank', 'Country', '(SQ KM)']].head())
First off, there's no need to combine pandas' .read_html() with BeautifulSoup/requests and gazpacho: .read_html() can fetch the URL and parse the HTML on its own (it uses lxml or BeautifulSoup under the hood).
Secondly, I can't reproduce the two-row limit. Where are you running this? Is it possible you have a setting/preference that only displays a certain number of lines?
import pandas as pd
url = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html"
df0 = pd.read_html(url)[0]
print(df0[['Rank', 'Country', '(SQ KM)']])
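If your environment is truncating the display rather than the data, raising pandas' display limit is one standard fix; a minimal sketch:
import pandas as pd
# None removes the row cap so every row is printed
pd.set_option('display.max_rows', None)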
Output (using .to_string() to force every row to print):
Rank Country (SQ KM)
0 1 Russia 17098242
1 2 Antarctica 14000000
2 3 Canada 9984670
3 4 United States 9833517
4 5 China 9596960
5 6 Brazil 8515770
6 7 Australia 7741220
7 8 India 3287263
8 9 Argentina 2780400
9 10 Kazakhstan 2724900
10 11 Algeria 2381741
11 12 Congo, Democratic Republic of the 2344858
12 13 Greenland 2166086
13 14 Saudi Arabia 2149690
14 15 Mexico 1964375
15 16 Indonesia 1904569
16 17 Sudan 1861484
17 18 Libya 1759540
18 19 Iran 1648195
19 20 Mongolia 1564116
20 21 Peru 1285216
21 22 Chad 1284000
22 23 Niger 1267000
23 24 Angola 1246700
24 25 Mali 1240192
25 26 South Africa 1219090
26 27 Colombia 1138910
27 28 Ethiopia 1104300
28 29 Bolivia 1098581
29 30 Mauritania 1030700
30 31 Egypt 1001450
31 32 Tanzania 947300
32 33 Nigeria 923768
33 34 Venezuela 912050
34 35 Namibia 824292
35 36 Mozambique 799380
36 37 Pakistan 796095
37 38 Turkey 783562
38 39 Chile 756102
39 40 Zambia 752618
40 41 Burma 676578
41 42 Afghanistan 652230
42 43 South Sudan 644329
43 44 France 643801
44 45 Somalia 637657
45 46 Central African Republic 622984
46 47 Ukraine 603550
47 48 Madagascar 587041
48 49 Botswana 581730
49 50 Kenya 580367
50 51 Yemen 527968
51 52 Thailand 513120
52 53 Spain 505370
53 54 Turkmenistan 488100
54 55 Cameroon 475440
55 56 Papua New Guinea 462840
56 57 Sweden 450295
57 58 Uzbekistan 447400
58 59 Morocco 446550
59 60 Iraq 438317
60 61 Paraguay 406752
61 62 Zimbabwe 390757
62 63 Japan 377915
63 64 Germany 357022
64 65 Congo, Republic of the 342000
65 66 Finland 338145
66 67 Vietnam 331210
67 68 Malaysia 329847
68 69 Norway 323802
69 70 Cote d'Ivoire 322463
70 71 Poland 312685
71 72 Oman 309500
72 73 Italy 301340
73 74 Philippines 300000
74 75 Ecuador 283561
75 76 Burkina Faso 274200
76 77 New Zealand 268838
77 78 Gabon 267667
78 79 Western Sahara 266000
79 80 Guinea 245857
80 81 United Kingdom 243610
81 82 Uganda 241038
82 83 Ghana 238533
83 84 Romania 238391
84 85 Laos 236800
85 86 Guyana 214969
86 87 Belarus 207600
87 88 Kyrgyzstan 199951
88 89 Senegal 196722
89 90 Syria 185180
90 91 Cambodia 181035
91 92 Uruguay 176215
92 93 Suriname 163820
93 94 Tunisia 163610
94 95 Bangladesh 148460
95 96 Nepal 147181
96 97 Tajikistan 144100
97 98 Greece 131957
98 99 Nicaragua 130370
99 100 Korea, North 120538
100 101 Malawi 118484
101 102 Eritrea 117600
102 103 Benin 112622
103 104 Honduras 112090
104 105 Liberia 111369
105 106 Bulgaria 110879
106 107 Cuba 110860
107 108 Guatemala 108889
108 109 Iceland 103000
109 110 Korea, South 99720
110 111 Hungary 93028
111 112 Portugal 92090
112 113 Jordan 89342
113 114 Azerbaijan 86600
114 115 Austria 83871
115 116 United Arab Emirates 83600
116 117 Czechia 78867
117 118 Serbia 77474
118 119 Panama 75420
119 120 Sierra Leone 71740
120 121 Ireland 70273
121 122 Georgia 69700
122 123 Sri Lanka 65610
123 124 Lithuania 65300
124 125 Latvia 64589
125 126 Svalbard 62045
126 127 Togo 56785
127 128 Croatia 56594
128 129 Bosnia and Herzegovina 51197
129 130 Costa Rica 51100
130 131 Slovakia 49035
131 132 Dominican Republic 48670
132 133 Estonia 45228
133 134 Denmark 43094
134 135 Netherlands 41543
135 136 Switzerland 41277
136 137 Bhutan 38394
137 138 Guinea-Bissau 36125
138 139 Taiwan 35980
139 140 Moldova 33851
140 141 Belgium 30528
141 142 Lesotho 30355
142 143 Armenia 29743
143 144 Solomon Islands 28896
144 145 Albania 28748
145 146 Equatorial Guinea 28051
146 147 Burundi 27830
147 148 Haiti 27750
148 149 Rwanda 26338
149 150 Macedonia 25713
150 151 Djibouti 23200
151 152 Belize 22966
152 153 El Salvador 21041
153 154 Israel 20770
154 155 Slovenia 20273
155 156 New Caledonia 18575
156 157 Fiji 18274
157 158 Kuwait 17818
158 159 Swaziland 17364
159 160 Timor-Leste 14874
160 161 Bahamas, The 13880
161 162 Montenegro 13812
162 163 Vanuatu 12189
163 164 Falkland Islands (Islas Malvinas) 12173
164 165 Qatar 11586
165 166 Gambia, The 11300
166 167 Jamaica 10991
167 168 Kosovo 10887
168 169 Lebanon 10400
169 170 Cyprus 9251
170 171 Puerto Rico 9104
171 172 West Bank 5860
172 173 Brunei 5765
173 174 Trinidad and Tobago 5128
174 175 French Polynesia 4167
175 176 Cabo Verde 4033
176 177 South Georgia and South Sandwich Islands 3903
177 178 Samoa 2831
178 179 Luxembourg 2586
179 180 Comoros 2235
180 181 Mauritius 2040
181 182 Virgin Islands 1910
182 183 Faroe Islands 1393
183 184 Hong Kong 1108
184 185 Sao Tome and Principe 964
185 186 Turks and Caicos Islands 948
186 187 Kiribati 811
187 188 Bahrain 760
188 189 Dominica 751
189 190 Tonga 747
190 191 Micronesia, Federated States of 702
191 192 Singapore 697
192 193 Saint Lucia 616
193 194 Isle of Man 572
194 195 Guam 544
195 196 Andorra 468
196 197 Northern Mariana Islands 464
197 198 Palau 459
198 199 Seychelles 455
199 200 Curacao 444
200 201 Antigua and Barbuda 443
201 202 Barbados 430
202 203 Heard Island and McDonald Islands 412
203 204 Saint Helena, Ascension, and Tristan da Cunha 394
204 205 Saint Vincent and the Grenadines 389
205 206 Jan Mayen 377
206 207 Gaza Strip 360
207 208 Grenada 344
208 209 Malta 316
209 210 Maldives 298
210 211 Cayman Islands 264
211 212 Saint Kitts and Nevis 261
212 213 Niue 260
213 214 Saint Pierre and Miquelon 242
214 215 Cook Islands 236
215 216 American Samoa 199
216 217 Marshall Islands 181
217 218 Aruba 180
218 219 Liechtenstein 160
219 220 British Virgin Islands 151
220 221 Wallis and Futuna 142
221 222 Christmas Island 135
222 223 Dhekelia 131
223 224 Akrotiri 123
224 225 Jersey 116
225 226 Montserrat 102
226 227 Anguilla 91
227 228 Guernsey 78
228 229 San Marino 61
229 230 British Indian Ocean Territory 60
230 231 French Southern and Antarctic Lands 55
231 232 Saint Martin 54
232 233 Bermuda 54
233 234 Bouvet Island 49
234 235 Pitcairn Islands 47
235 236 Norfolk Island 36
236 237 Sint Maarten 34
237 238 Macau 28
238 239 Tuvalu 26
239 240 Saint Barthelemy 25
240 241 United States Pacific Island Wildlife Refuges 22
241 242 Nauru 21
242 243 Cocos (Keeling) Islands 14
243 244 Tokelau 12
244 245 Paracel Islands 8
245 246 Gibraltar 7
246 247 Wake Island 7
247 248 Clipperton Island 6
248 249 Navassa Island 5
249 250 Spratly Islands 5
250 251 Ashmore and Cartier Islands 5
251 252 Coral Sea Islands 3
252 253 Monaco 2
253 254 Holy See (Vatican City) 0
You can also use lxml for this
import requests
import lxml.html
url = 'https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html'
response = requests.get(url, timeout=5)
tree = lxml.html.fromstring(response.text)
# Extract the table
table = tree.get_element_by_id('rankOrder')
data = table.xpath('.//tr/td//text()')  # the leading './/' keeps the search inside the table
# Separate the columns
rank = data[0::4]
country = data[1::4]
sq_km = data[2::4]
date_of_info = data[3::4]
If you need a data frame, the rest is just formatting:
# If you want a data frame
import pandas
df = pandas.DataFrame(dict(country=country, sq_km=sq_km, date_of_info=date_of_info))
df
country sq_km date_of_info
0 Russia 17,098,242 \r
1 Antarctica 14,000,000 \r
2 Canada 9,984,670 \r
3 United States 9,833,517 \r
4 China 9,596,960 \r
.. ... ... ...
249 Spratly Islands 5 \r
250 Ashmore and Cartier Islands 5 \r
251 Coral Sea Islands 3 \r
252 Monaco 2 \r
253 Holy See (Vatican City) 0 \r
[254 rows x 3 columns]
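The stray \r entries and thousands separators can then be cleaned up; a short sketch on the columns built above:
# strip whitespace/carriage returns and make sq_km numeric
df['date_of_info'] = df['date_of_info'].str.strip()
df['sq_km'] = df['sq_km'].str.replace(',', '', regex=False).astype(int)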
Given time series data, I'm trying to use panel OLS with fixed effects in Python. I found this way to do it:
Fixed effect in Pandas or Statsmodels
My input data looks like this (I will call it df):
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
So first I have to transform it to Multi-index (_13, _14, _15 represent data from 2013, 2014 and 2015, in that order):
df = df.dropna()
df = df.drop_duplicates()
rng = pandas.date_range(start=pandas.datetime(2013, 1, 1), periods=3, freq='A')
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
d1 = numpy.array(df.ix[:, ['Score_13', 'Permits_13']])
d2 = numpy.array(df.ix[:, ['Score_14', 'Permits_14']])
d3 = numpy.array(df.ix[:, ['Score_15', 'Permits_15']])
data = numpy.concatenate((d1, d2, d3), axis=0)
s = pandas.DataFrame(data, index=index, columns=['y', 'x'])
s = s.drop_duplicates()
Which results in something like this:
y x
date id
2013-12-31 P.S. 015 ROBERTO CLEMENTE 284 12
P.S. 019 ASHER LEVY 296 18
P.S. 020 ANNA SILVER 294 9
P.S. 034 FRANKLIN D. ROOSEVELT 294 3
P.S. 064 ROBERT SIMON 287 3
P.S. 110 FLORENCE NIGHTINGALE 313 0
P.S. 134 HENRIETTA SZOLD 290 4
P.S. 137 JOHN L. BERNSTEIN 276 4
P.S. 140 NATHAN STRAUS 282 13
P.S. 142 AMALIA CASTRO 290 7
P.S. 184M SHUANG WEN 327 5
P.S. 188 THE ISLAND SCHOOL 279 4
HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 255 4
TECHNOLOGY, ARTS, AND SCIENCES STUDIO 282 18
THE EAST VILLAGE COMMUNITY SCHOOL 306 35
UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL 277 4
THE CHILDREN'S WORKSHOP SCHOOL 302 35
NEIGHBORHOOD SCHOOL 299 15
EARTH SCHOOL 305 3
SCHOOL FOR GLOBAL LEADERS 286 15
TOMPKINS SQUARE MIDDLE SCHOOL 306 3
P.S. 001 ALFRED E. SMITH 303 20
P.S. 002 MEYER LONDON 306 8
P.S. 003 CHARRETTE SCHOOL 325 62
P.S. 006 LILLIE D. BLAKE 333 89
P.S. 011 WILLIAM T. HARRIS 320 30
P.S. 033 CHELSEA PREP 313 5
P.S. 040 AUGUSTUS SAINT-GAUDENS 326 23
P.S. 041 GREENWICH VILLAGE 326 25
P.S. 042 BENJAMIN ALTMAN 314 30
... ... ... ...
2015-12-31 P.S. 054 CHARLES W. LENG 309 2
P.S. 055 HENRY M. BOEHM 311 3
P.S. 56 THE LOUIS DESARIO SCHOOL 323 4
P.S. 057 HUBERT H. HUMPHREY 287 2
SPACE SHUTTLE COLUMBIA SCHOOL 307 0
P.S. 060 ALICE AUSTEN 303 1
I.S. 061 WILLIAM A MORRIS 291 2
MARSH AVENUE SCHOOL FOR EXPEDITIONARY LEARNING 316 0
P.S. 069 DANIEL D. TOMPKINS 307 2
I.S. 072 ROCCO LAURIE 308 1
I.S. 075 FRANK D. PAULO 318 9
THE MICHAEL J. PETRIDES SCHOOL 310 0
STATEN ISLAND SCHOOL OF CIVIC LEADERSHIP 309 0
P.S. 075 MAYDA CORTIELLA 282 19
P.S. 086 THE IRVINGTON 286 38
P.S. 106 EDWARD EVERETT HALE 280 27
P.S. 116 ELIZABETH L FARRELL 291 3
P.S. 123 SUYDAM 287 14
P.S. 145 ANDREW JACKSON 285 4
P.S. 151 LYNDON B. JOHNSON 271 27
J.H.S. 162 THE WILLOUGHBY 283 22
P.S. 274 KOSCIUSKO 282 2
J.H.S. 291 ROLAND HAYES 279 13
P.S. 299 THOMAS WARREN FIELD 288 5
I.S. 347 SCHOOL OF HUMANITIES 284 45
I.S. 349 MATH, SCIENCE & TECH. 285 45
P.S. 376 301 9
P.S. 377 ALEJANDRINA B. DE GAUTIER 277 3
P.S. /I.S. 384 FRANCES E. CARTER 291 4
ALL CITY LEADERSHIP SECONDARY SCHOOL 325 18
However, when I try to call:
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get an error:
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
This is my first time using Pandas; it may be a simple question, but I don't know what the problem is. As far as I can tell, I have a multi-index object as required.
I don't get why I have duplicates (I added several drop_duplicates() calls to try to get rid of any duplicated data, though I don't think that's the answer). If I have data for the same school across three years, shouldn't there be some apparent duplication anyway (looking just at the Name column, for example)?
EDIT
df is 935 rows × 7 columns after dropping the NaN rows.
So I expected s to be 2805 rows × 2 columns, which is exactly what I have.
If I run this:
s = s.reset_index().groupby(s.index.names).first()
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get another error:
ValueError: operands could not be broadcast together with shapes (2763,) (3,)
Thank you.
Using the provided pickle file, I ran the regression and it worked fine.
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x>
Number of Observations: 2763
Number of Degrees of Freedom: 4
R-squared: 0.0268
Adj R-squared: 0.0257
Rmse: 16.4732
F-stat (1, 2759): 25.3204, p-value: 0.0000
Degrees of Freedom: model 3, resid 2759
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.1666 0.0191 8.72 0.0000 0.1292 0.2041
---------------------------------End of Summary---------------------------------
I ran this in a Jupyter Notebook.
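For anyone hitting the same non-unique index error: it means some (date, id) pairs occur more than once, and dropping those duplicates before the regression is one way to get past it; a sketch, not the exact code used above:
# keep only the first row for each duplicated (date, id) pair
s = s[~s.index.duplicated(keep='first')]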