pandas categorical doesn't sort multiindex - python

I've pulled some data from SQL as a CSV:
Year,Decision,Residency,Class,Count
2019,Applied,Resident,Freshmen,1143
2019,Applied,Resident,Transfer,404
2019,Applied," ",Grad/Postbacc,418
2019,Applied,Non-Resident,Freshmen,1371
2019,Applied,Non-Resident,Transfer,371
2019,Admitted,Resident,Freshmen,918
2019,Admitted,Resident,Transfer,358
2019,Admitted," ",Grad/Postbacc,311
2019,Admitted,Non-Resident,Freshmen,1048
2019,Admitted,Non-Resident,Transfer,313
2020,Applied,Resident,Freshmen,1094
2020,Applied,Resident,Transfer,406
2020,Applied," ",Grad/Postbacc,374
2020,Applied,Non-Resident,Freshmen,1223
2020,Applied,Non-Resident,Transfer,356
2020,Admitted,Resident,Freshmen,1003
2020,Admitted,Resident,Transfer,354
2020,Admitted," ",Grad/Postbacc,282
2020,Admitted,Non-Resident,Freshmen,1090
2020,Admitted,Non-Resident,Transfer,288
I've written a transform as follows:
import numpy as np
import pandas as pd

data = pd.read_csv("Data.csv")
# Categorize the rows
data["Class"] = pd.Categorical(data["Class"], ["Freshmen", "Transfer", "Grad/Postbacc", "Grand"], ordered=True)
data["Decision"] = pd.Categorical(data["Decision"], ["Applied", "Admitted"], ordered=True)
data["Residency"] = pd.Categorical(data["Residency"], ["Resident", "Non-Resident"], ordered=True)
# Subtotal classes
tmp = data.groupby(["Year", "Class", "Decision"], sort=False).sum("Count")
tmp["Residency"] = "Total"
tmp.reset_index(inplace=True)
tmp = pd.concat([data, tmp], ignore_index=True)
# Grand total
tmp2 = data.groupby(["Year", "Decision"], sort=False).sum("Count")
tmp2["Class"] = "Grand"
tmp2["Residency"] = "Total"
tmp2.reset_index(inplace=True)
tmp = pd.concat([tmp, tmp2], ignore_index=True)
# Crosstab it
tmp = pd.crosstab(index=[tmp["Year"], tmp["Class"], tmp["Residency"]],
                  columns=[tmp["Decision"]],
                  values=tmp["Count"],
                  aggfunc="sum")
tmp = tmp.loc[~(tmp == 0).all(axis=1)]
tmp["%"] = np.round(100 * tmp["Admitted"] / tmp["Applied"], 1)
tmp = tmp.stack().unstack(["Year", "Decision"])
print(tmp)
and it outputs as follows:
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
Transfer Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Resident 404.0 358.0 88.6 406.0 354.0 87.2
Total 775.0 671.0 86.6 762.0 642.0 84.3
Expected output is
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Total 775.0 671.0 86.6 762.0 642.0 84.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
The categories sort correctly right up until I feed the DataFrame to pd.crosstab, at which point it all falls apart. What's going on, and how do I fix it?
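One quick diagnostic (a sketch, assuming the code above has run up to the crosstab step): check the dtypes just before the pd.crosstab call. Assigning the plain strings "Total" and "Grand" and then concatenating demotes Class and Residency from category to object, so crosstab falls back to lexicographic sorting.

# Run just before the pd.crosstab call: the ordering is already gone by then.
print(data.dtypes)  # Class and Residency are category here
print(tmp.dtypes)   # after the concats, Class and Residency are plain object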

I couldn't fix your code, but I got the expected result doing this:
import pandas as pd

df = pd.read_csv("Data.csv")
df["Class"] = pd.Categorical(df["Class"], ["Freshmen", "Transfer", "Grad/Postbacc", "Grand"], ordered=True)
df["Decision"] = pd.Categorical(df["Decision"], ["Applied", "Admitted", "%"], ordered=True)
df["Residency"] = pd.Categorical(df["Residency"], ["Resident", "Non-Resident", " "], ordered=True)
df_grouped = df.groupby(['Year', 'Decision', 'Class', 'Residency'], as_index=False)['Count'].sum()
df_pivot = df_grouped.pivot_table(columns=["Year", "Decision"], index=["Class", "Residency"], values="Count", aggfunc='sum')
# Create subtotals for rows
df_totals = pd.concat([y.append(y.sum().rename((x, 'Total'))) for x, y in df_pivot.groupby(level=0)]).append(df_pivot.sum().rename(('Grand', 'Total')))
# Drop unwanted rows
df_totals = df_totals[~(df_totals.values == 0).all(axis=1)].drop_duplicates(keep="last")
# Calculate "%" columns
for year in df_totals.columns.get_level_values('Year').unique():
    df_totals[year, '%'] = (100 * df_totals[year, 'Admitted'] / df_totals[year, 'Applied']).round(1)
df_totals
Output:
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Total 775.0 671.0 86.6 762.0 642.0 84.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
Note: I got a FutureWarning about df.append(), which is deprecated (and removed entirely in pandas 2.0).
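For newer pandas, the subtotal step can be rewritten with pd.concat instead of the removed df.append — a sketch that assumes df_pivot from the code above:

import pandas as pd

# Build each Class block with a "Total" row appended, then a grand-total row.
pieces = []
for cls, grp in df_pivot.groupby(level=0):
    total = grp.sum().to_frame().T                      # column sums as a one-row frame
    total.index = pd.MultiIndex.from_tuples([(cls, "Total")])
    pieces.append(pd.concat([grp, total]))
grand = df_pivot.sum().to_frame().T
grand.index = pd.MultiIndex.from_tuples([("Grand", "Total")])
df_totals = pd.concat(pieces + [grand])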

Related

How to turn a pre-made list in rows and columns just like a matrix into a csv file

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
url = "https://www.bls.gov/web/ximpim/beaimp.htm"
page = requests.get(url)
doc = BeautifulSoup(page.text, "html.parser")
table1 = doc.find('table', id='main-content-table')
data_points = doc.find_all("span",{"class":"datavalue"})
numbers = [x.text for x in data_points]
numbers1 = numbers[651:1107]
numbers2 = numbers[2431:2900]
df1 = np.matrix(numbers1)
df2 = np.matrix(numbers2)
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
table1 = headers[72:111]
table2 = headers[225:265]
datafile = pd.DataFrame([table1,numbers1])
datafile = datafile.transpose()
datafile.columns=['Category','0']
datafile.head()
datafile.to_csv('hello.csv', header=False)
The idea is to get this into a CSV file. The columns are supposed to include the title, followed by a 0 for each column. I thought I could convert it into a 38-by-12 matrix.
The output I want is:
All imports excluding petroleum
1985 0 0 0 0 0 0 0 0 0 0 0 0
- - 73.9 - - 74.3 - - 74.8 - - 76.8
1986
- - 79.1 - - 79.9 - - 82.5 - - 82.9
1987
- - 84.8 - - 87.0 - - 87.7 - - 90.2
1988
- - 92.3 - - 94.6 - - 94.0 - - 96.2
1989
96.9 96.1 96.7 96.4 96.7 95.9 95.2 95.1 95.2 95.5 95.7 96.0
1990
96.1 96.4 96.9 96.7 96.4 96.4 96.3 96.8 97.6 98.0 98.3 98.9
1991
99.0 99.3 99.6 98.9 98.8 98.3 97.8 97.8 98.0 98.5 98.7 99.1
1992
99.8 100.0 99.8 99.1 99.0 99.5 99.9 100.3 100.7 101.2 100.9 99.9
1993
99.9 99.7 99.9 100.2 100.6 100.5 100.7 100.8 100.9 101.4 101.3 101.3
1994
101.6 101.5 101.8 102.1 102.3 102.6 103.1 103.8 104.1 104.8 105.1 105.2
1995
105.5 105.9 106.4 107.0 107.7 107.6 108.0 108.0 107.8 107.5 107.7 107.7
1996
107.4 107.4 107.1 107.0 106.7 106.2 105.9 105.7 106.1 105.8 105.7 105.8
1997
105.4 105.3 104.9 104.3 104.2 104.3 104.1 103.8 103.7 103.4 103.3 102.8
1998
102.2 101.7 101.4 101.1 100.9 100.5 100.0 99.6 99.4 99.5 99.6 99.4
1999
99.5 99.4 99.0 98.8 99.0 98.8 98.6 98.7 98.9 99.0 99.4 99.4
2000
99.4 99.7 100.0 100.1 99.9 99.9 100.2 100.3 100.0 100.0 99.9 100.7
2001
101.6 100.8 100.0 99.5 99.2 98.9 97.8 97.5 97.3 96.8 96.6 96.2
2002
96.1 95.7 95.8 96.3 96.2 96.2 96.2 96.3 96.4 96.4 96.3 96.5
2003
96.8 97.1 98.1 97.1 96.9 97.3 97.3 97.0 97.3 97.2 97.4 97.7
2004
98.5 98.9 99.1 99.4 99.6 99.7 99.7 100.0 100.1 100.0 100.9 101.3
2005
101.6 101.7 102.0 102.4 102.2 102.0 101.8 101.9 102.8 103.8 103.7 103.7
2006
104.0 103.3 103.0 103.1 103.8 104.2 104.2 104.7 104.8 104.2 105.2 105.7
2007
105.6 105.6 105.9 106.2 106.8 107.1 107.2 107.2 107.1 107.7 108.5 108.9
2008
109.7 110.4 111.6 113.1 113.9 114.9 115.6 115.1 114.0 113.0 111.1 109.9
2009
109.0 108.2 107.3 107.1 107.3 107.4 107.2 107.6 107.9 108.4 109.1 109.7
2010
110.3 110.4 110.3 110.8 111.2 110.7 110.5 110.7 111.0 111.3 112.2 112.6
2011
113.6 114.4 115.0 115.9 116.4 116.4 116.5 116.8 117.0 116.6 116.3 116.4
2012
116.4 116.3 116.7 116.7 116.6 116.3 115.9 115.8 116.0 116.4 116.4 116.5
2013
116.6 116.6 116.5 116.5 116.1 115.7 115.0 114.8 114.8 114.9 115.0 115.2
2014
115.7 116.0 116.5 116.1 116.0 115.8 115.8 115.7 115.6 115.4 115.1 115.1
2015
114.3 114.0 113.5 113.0 112.9 112.8 112.5 112.1 111.9 111.5 111.2 110.8
2016
110.7 110.5 110.4 110.4 110.8 110.5 111.0 111.1 111.2 111.1 111.1 111.1
2017
111.2 111.5 111.6 111.9 111.9 112.0 111.9 112.1 112.5 112.5 112.7 112.6
2018
113.3 113.7 113.7 113.8 113.9 113.5 113.4 113.1 113.2 113.4 113.5 113.7
2019
113.0 113.2 113.2 112.7 112.4 112.0 111.9 111.8 111.8 111.7 111.8 112.0
2020
112.1 112.3 112.1 111.5 111.6 111.9 112.1 113.0 113.7 113.6 113.6 114.1
2021
115.1 115.7 116.7 117.6 118.8 119.6 119.7 119.7 119.9 120.7 121.5 122.0
2022
123.8 124.8 126.2 126.8 126.7 126.0 125.2 124.9 124.5 124.3 123.9 124.9
The BLS has a data API that is preferable to web scraping. The Series ID is a bit hard to find; it's not what's shown on the page (typical government issue). You can use the BLS Data Finder to get it:
import requests
import pandas as pd

# You can only query 10 years at a time without registering for an API key
step = 10
data = []
for start_year in range(1985, 2023, step):
    payload = {
        # All imports excluding petroleum
        "seriesid": ["EIUIREXPET"],
        "startyear": start_year,
        "endyear": start_year + step - 1,
    }
    r = requests.post("https://api.bls.gov/publicAPI/v2/timeseries/data/", json=payload)
    r.raise_for_status()
    data.extend(r.json()["Results"]["series"][0]["data"])

df = pd.DataFrame(data)[["year", "period", "value"]]
df["period"] = df["period"].str.strip("M").astype(int)
df = df.pivot(index="year", columns="period")
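To land this in a CSV shaped like the expected output, one might flatten the pivoted columns and sort the rows — a sketch (the output filename is hypothetical):

df.columns = df.columns.droplevel(0)  # drop the "value" level, keeping month numbers 1-12
df = df.sort_index()                  # rows in chronological year order
df.to_csv("all_imports_excluding_petroleum.csv")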

NPV Calculation on a PANDAS dataframe of values

I am getting errors doing an NPV calculation with numpy_financial.npv(rate, values) on a DataFrame.
Am I able to use DataFrames for NPV calculation?
I'm not sure how to fix this. Manually looping through each row?
npvValues = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5', 'RFR']
round(df[npvValues].sample(5), 1).sort_index(ascending=False)
DATE value_1 value_2 value_3 value_4 value_5 RFR
2017-04-03 38.5 92.8 168.7 257.0 354.0 2.1
2016-01-11 35.7 86.1 156.6 238.7 328.6 2.3
2013-07-29 28.1 67.8 123.3 187.8 258.6 2.3
2011-05-02 24.2 58.3 106.1 161.6 222.5 3.4
2010-01-18 24.4 58.8 107.0 163.0 224.5 3.8
NPV Calculation
value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
df['NPV_IV'] = npf.npv(rate=df['RFR']/100, values=df[value])
df['NPV_IV']
Here is the full error trace:
[full error trace]
The values parameter in npf.npv only takes a one-dimensional array, so you need to loop through the rows:
import numpy_financial as npf
import pandas as pd

df = pd.read_csv('t.csv')
value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
num = df[value].to_numpy()
rate_lists = df['RFR'].tolist()
new_col = []
for index, i in enumerate(num):
    n = npf.npv(rate=rate_lists[index] / 100, values=i)
    new_col.append(n)
df['NPV_IV'] = new_col
print(df)
DATE value_1 value_2 value_3 value_4 value_5 RFR NPV_IV
0 2017-04-03 38.5 92.8 168.7 257.0 354.0 2.1 858.450837
1 2016-01-11 35.7 86.1 156.6 238.7 328.6 2.3 792.491237
2 2013-07-29 28.1 67.8 123.3 187.8 258.6 2.3 623.725804
3 2011-05-02 24.2 58.3 106.1 161.6 222.5 3.4 520.644433
4 2010-01-18 24.4 58.8 107.0 163.0 224.5 3.8 519.488982
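The same row-wise computation can also be written as a comprehension over the zipped columns — a sketch reusing df, value, and npf from the answer above:

# One NPV per row: pair each row's rate with its cash-flow values.
df['NPV_IV'] = [
    npf.npv(rate=r / 100, values=vals)
    for r, vals in zip(df['RFR'], df[value].to_numpy())
]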

How to title a pandas dataframe

I have the following code that prints descriptive statistics with df.describe() for each class of a categorical variable:
for i in list(merged.Response.unique()):
    print(merged[(merged.Response == i)].describe().round(2))
and it returns
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 687.00 687.00 687.00 687.00 687.00
mean 24.75 13.45 4.56 9.61 243.91
std 7.04 3.35 0.17 1.95 107.45
min 11.00 7.00 4.13 5.85 83.27
25% 20.00 11.00 4.45 8.18 167.44
50% 24.00 13.00 4.57 9.34 213.08
75% 29.00 15.00 4.67 10.51 289.74
max 51.00 24.00 4.97 15.75 700.80
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 1099.0 1099.00 1099.00 1099.00 1099.00
mean 17.2 6.85 4.08 5.18 97.88
std 12.8 2.47 0.24 1.45 101.26
min 1.0 2.00 3.24 2.40 5.72
25% 7.0 5.00 3.89 4.12 31.38
50% 14.0 7.00 4.13 5.21 62.58
75% 24.0 8.00 4.22 5.86 130.90
max 55.0 21.00 4.91 13.46 686.46
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.00 392.00 392.00 392.00 392.00
mean 12.41 11.46 4.44 10.13 125.04
std 3.75 3.34 0.19 1.94 43.91
min 3.00 6.00 4.02 6.98 36.92
25% 10.00 9.00 4.31 8.71 92.68
50% 13.00 10.00 4.38 9.30 121.58
75% 15.00 13.00 4.51 11.00 148.64
max 26.00 22.00 4.94 16.25 266.56
Is there any way I can title each summary table so I know which class is which?
I tried the following with the pandas Styler, but despite titling the DataFrame, it only printed one of them, and it doesn't look as good (I'm in Google Colab, btw):
for i in list(merged.Response.unique()):
    test = merged[(merged.Response == i)].describe().round(2).style.set_caption(i)
test
AmznPrime
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.000000 392.000000 392.000000 392.000000 392.000000
mean 12.410000 11.460000 4.440000 10.130000 125.040000
std 3.750000 3.340000 0.190000 1.940000 43.910000
min 3.000000 6.000000 4.020000 6.980000 36.920000
25% 10.000000 9.000000 4.310000 8.710000 92.680000
50% 13.000000 10.000000 4.380000 9.300000 121.580000
75% 15.000000 13.000000 4.510000 11.000000 148.640000
max 26.000000 22.000000 4.940000 16.250000 266.560000
All help is appreciated. Thanks!
How about:
merged.groupby("Response").describe().round(2)
To match your expected output, do stack/unstack:
merged.groupby("Response").describe().stack(level=1).unstack(level=0)

How do I calculate mean value for each month in the dataset?

Sample wind dataset:
            RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL
DATE
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
1961-01-06 13.21 8.12 9.96 6.67 5.37 4.50 10.67 4.42 7.17 7.50 8.12 13.17
1961-02-07 13.50 14.29 9.50 4.96 12.29 8.33 9.17 9.29 7.58 7.96 13.96 13.79
1961-02-08 10.96 9.75 7.62 5.91 9.62 7.29 14.29 7.62 9.25 10.46 16.62 16.46
1961-03-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1962-03-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
1962-06-06 13.21 8.12 9.96 6.67 5.37 4.50 10.67 4.42 7.17 7.50 8.12 13.17
1968-07-07 13.50 14.29 9.50 4.96 12.29 8.33 9.17 9.29 7.58 7.96 13.96 13.79
1968-07-08 10.96 9.75 7.62 5.91 9.62 7.29 14.29 7.62 9.25 10.46 16.62 16.46
1976-08-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1976-08-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
1978-09-06 13.21 8.12 9.96 6.67 5.37 4.50 10.67 4.42 7.17 7.50 8.12 13.17
1978-09-07 13.50 14.29 9.50 4.96 12.29 8.33 9.17 9.29 7.58 7.96 13.96 13.79
1978-12-08 10.96 9.75 7.62 5.91 9.62 7.29 14.29 7.62 9.25 10.46 16.62 16.46
The Complete Dataset is [here][1].
In this dataset, the columns are locations and the values are wind speeds. I want to calculate the mean wind speed for each month in the dataset, but I want to treat January 1961 and January 1962 as different months.
I tried doing it with a for-loop. First I created a column named 'Month', and then I assigned the values using the for-loop like this:
for i in range(len(data.index)):
    if data.index[i].month == 1:
        if data.index[i].year == 1961:
            data['Month'][i] = 'January 61'
        elif data.index[i].year == 1962:
            data['Month'][i] = 'January 62'
        else:
            data['Month'][i] = 'January'
    elif data.index[i].month == 2:
        data['Month'][i] = 'February'
    elif data.index[i].month == 3:
        data['Month'][i] = 'March'
    elif data.index[i].month == 4:
        data['Month'][i] = 'April'
    elif data.index[i].month == 5:
        data['Month'][i] = 'May'
    elif data.index[i].month == 6:
        data['Month'][i] = 'June'
    elif data.index[i].month == 7:
        data['Month'][i] = 'July'
    elif data.index[i].month == 8:
        data['Month'][i] = 'August'
    elif data.index[i].month == 9:
        data['Month'][i] = 'September'
    elif data.index[i].month == 10:
        data['Month'][i] = 'October'
    elif data.index[i].month == 11:
        data['Month'][i] = 'November'
    elif data.index[i].month == 12:
        data['Month'][i] = 'December'
And then I would use groupby on data['Month'] and take the mean. But it takes forever to run, and I don't want to wait that long every time I run this program. How else could I solve this problem?
Note: the actual dataset isn't quite the same as the sample. I combined the columns ['Yr', 'Mo', 'Dy'] into one column named 'DATE', and then made 'DATE' the index. I have also removed all the NaN values using data.dropna(inplace=True).
Try:
df.index = pd.to_datetime(df.index)
df.groupby([df.index.year, df.index.month]).mean()
RPT VAL ROS ... CLO BEL MAL
DATE DATE ...
1961 1 12.373333 9.333333 11.043333 ... 7.906667 8.833333 11.960
2 12.230000 12.020000 8.560000 ... 9.210000 15.290000 15.125
3 10.580000 6.630000 11.750000 ... 5.880000 5.460000 10.880
1962 3 13.330000 13.250000 11.420000 ... 10.340000 12.920000 11.830
6 13.210000 8.120000 9.960000 ... 7.500000 8.120000 13.170
1968 7 12.230000 12.020000 8.560000 ... 9.210000 15.290000 15.125
1976 8 11.955000 9.940000 11.585000 ... 8.110000 9.190000 11.355
1978 9 13.355000 11.205000 9.730000 ... 7.730000 11.040000 13.480
12 10.960000 9.750000 7.620000 ... 10.460000 16.620000 16.460
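An equivalent spelling, assuming the index is already a DatetimeIndex, converts each date to a monthly period, so January 1961 and January 1962 stay distinct by construction:

# One label per calendar month per year (e.g. 1961-01), then the mean of each group.
monthly_means = df.groupby(df.index.to_period('M')).mean()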
If you keep the year and month as separate columns instead of a DatetimeIndex, the groupby approach you tried is also the way to go:
df.groupby(['year', 'month'])['RPT'].mean().reset_index()

Filter Excel data based on date range selection on pandas

I would like to know how to filter Excel data for a specific date range using pandas in Python.
For an example:
(sheet1.xlsx) contains:
DATE 51 52 53 54 55 56
20110706 28.52 27.52 26.52 25.52 24.52 23.52
20110707 28.97 27.97 26.97 25.97 24.97 23.97
20110708 28.52 27.52 26.52 25.52 24.52 23.52
20110709 28.97 27.97 26.97 25.97 24.97 23.97
20110710 30.5 29.5 28.5 27.5 26.5 25.5
20110711 32.93 31.93 30.93 29.93 28.93 27.93
20110712 35.54 34.54 33.54 32.54 31.54 30.54
20110713 33.02 32.02 31.02 30.02 29.02 28.02
20110730 35.99 34.99 33.99 32.99 31.99 30.99
20110731 30.5 29.5 28.5 27.5 26.5 25.5
20110801 32.48 31.48 30.48 29.48 28.48 27.48
20110802 31.04 30.04 29.04 28.04 27.04 26.04
20110803 32.03 31.03 30.03 29.03 28.03 27.03
20110804 34.01 33.01 32.01 31.01 30.01 29.01
20110805 27.44 26.44 25.44 24.44 23.44 22.44
20110806 32.48 31.48 30.48 29.48 28.48 27.48
If I want to filter this data to the range 20110708-20110803, the result would be:
DATE 51 52 53 54 55 56
20110708 28.52 27.52 26.52 25.52 24.52 23.52
20110709 28.97 27.97 26.97 25.97 24.97 23.97
20110710 30.5 29.5 28.5 27.5 26.5 25.5
20110711 32.93 31.93 30.93 29.93 28.93 27.93
20110712 35.54 34.54 33.54 32.54 31.54 30.54
20110713 33.02 32.02 31.02 30.02 29.02 28.02
20110730 35.99 34.99 33.99 32.99 31.99 30.99
20110731 30.5 29.5 28.5 27.5 26.5 25.5
20110801 32.48 31.48 30.48 29.48 28.48 27.48
20110802 31.04 30.04 29.04 28.04 27.04 26.04
20110803 32.03 31.03 30.03 29.03 28.03 27.03
How would I go about doing this?
If you set DATE as the index of your DataFrame df (df.set_index('DATE', inplace=True)), you can then use loc to slice your DataFrame:
df.loc[20110708:20110803]
You should find examples here: http://pandas.pydata.org/pandas-docs/stable/10min.html
PS: I assumed that the dtype of your index (the DATE column) was int64.
If you'd rather keep DATE as a standard column (not your index), you can also do this:
df = df[(20110708 <= df.DATE) & (df.DATE <= 20110803)]
The indexing isn't quite as pretty and it'll be a little slower, but it works on columns.
This assumes you've already read the Excel file in with df = pd.read_excel(filename).
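A variant that parses DATE into real datetimes first, so the slice is date-aware rather than integer arithmetic — a sketch, with the filename and column name taken from the question:

import pandas as pd

df = pd.read_excel("sheet1.xlsx")
df["DATE"] = pd.to_datetime(df["DATE"], format="%Y%m%d")  # 20110708 -> 2011-07-08
mask = df["DATE"].between("2011-07-08", "2011-08-03")
print(df.loc[mask])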
