pandas categorical doesn't sort multiindex - python
I've pulled some data from SQL as a CSV:
Year,Decision,Residency,Class,Count
2019,Applied,Resident,Freshmen,1143
2019,Applied,Resident,Transfer,404
2019,Applied," ",Grad/Postbacc,418
2019,Applied,Non-Resident,Freshmen,1371
2019,Applied,Non-Resident,Transfer,371
2019,Admitted,Resident,Freshmen,918
2019,Admitted,Resident,Transfer,358
2019,Admitted," ",Grad/Postbacc,311
2019,Admitted,Non-Resident,Freshmen,1048
2019,Admitted,Non-Resident,Transfer,313
2020,Applied,Resident,Freshmen,1094
2020,Applied,Resident,Transfer,406
2020,Applied," ",Grad/Postbacc,374
2020,Applied,Non-Resident,Freshmen,1223
2020,Applied,Non-Resident,Transfer,356
2020,Admitted,Resident,Freshmen,1003
2020,Admitted,Resident,Transfer,354
2020,Admitted," ",Grad/Postbacc,282
2020,Admitted,Non-Resident,Freshmen,1090
2020,Admitted,Non-Resident,Transfer,288
I've written a transform as follows:
import numpy as np
import pandas as pd

data = pd.read_csv("Data.csv")
#Categorize the rows
data["Class"] = pd.Categorical(data["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
data["Decision"] = pd.Categorical(data["Decision"],["Applied","Admitted"],ordered=True)
data["Residency"] = pd.Categorical(data["Residency"],["Resident","Non-Resident"],ordered=True)
#Subtotal classes
tmp = data.groupby(["Year","Class","Decision"],sort=False).sum("Count")
tmp["Residency"] = "Total"
tmp.reset_index(inplace=True)
tmp = pd.concat([data,tmp],ignore_index=True)
#Grand total
tmp2 = data.groupby(["Year","Decision"],sort=False).sum("Count")
tmp2["Class"] = "Grand"
tmp2["Residency"] = "Total"
tmp2.reset_index(inplace=True)
tmp = pd.concat([tmp,tmp2],ignore_index=True)
#Crosstab it
tmp = pd.crosstab(index=[tmp["Year"], tmp["Class"], tmp["Residency"]],
                  columns=[tmp["Decision"]],
                  values=tmp["Count"],
                  aggfunc="sum")
tmp = tmp.loc[~(tmp==0).all(axis=1)]
tmp["%"] = np.round(100*tmp["Admitted"]/tmp["Applied"],1)
tmp = tmp.stack().unstack(["Year","Decision"])
print(tmp)
and it outputs as follows:
Year                              2019                   2020
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency
Freshmen      Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4
Transfer      Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Total          775.0    671.0  86.6   762.0    642.0  84.3
Expected output is:

Year                              2019                   2020
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency
Freshmen      Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Transfer      Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Total          775.0    671.0  86.6   762.0    642.0  84.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4
The categories sort correctly right up until I throw the DataFrame into pd.crosstab, at which point it all falls apart. What's going on, and how do I fix it?
I couldn't fix your code, but I got the expected result doing this:
import pandas as pd
df = pd.read_csv("Data.csv")
df["Class"] = pd.Categorical(df["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
df["Decision"] = pd.Categorical(df["Decision"],["Applied","Admitted","%"],ordered=True)
df["Residency"] = pd.Categorical(df["Residency"],["Resident","Non-Resident"," "],ordered=True)
df_grouped = df.groupby(['Year', 'Decision', 'Class', 'Residency'],as_index=False)['Count'].sum()
df_pivot = df_grouped.pivot_table(columns=["Year","Decision"],index=["Class","Residency"], values="Count",aggfunc='sum')
#Create subtotal for rows
df_totals = pd.concat([y.append(y.sum().rename((x, 'Total'))) for x, y in df_pivot.groupby(level=0)]).append(df_pivot.sum().rename(('Grand', 'Total')))
#Drop not wanted rows
df_totals = df_totals[~(df_totals.values == 0).all(axis=1)].drop_duplicates(keep="last")
#Calculate "%" columns
for year in df_totals.columns.get_level_values('Year').unique():
    df_totals[year, '%'] = (100 * df_totals[year, 'Admitted'] / df_totals[year, 'Applied']).round(1)
df_totals
Output:
Year                              2019                   2020
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency
Freshmen      Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Transfer      Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Total          775.0    671.0  86.6   762.0    642.0  84.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4
Note: I got a warning about df.append()
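Since that warning flags that df.append() is deprecated (it was removed in pandas 2.0), here is a minimal sketch of the same subtotal construction using pd.concat instead; it assumes the df_pivot built above and should produce the same rows:

# Sketch: rebuild the per-class subtotals and the grand total with pd.concat
# instead of the deprecated DataFrame.append.
pieces = []
for x, y in df_pivot.groupby(level=0):
    subtotal = y.sum().to_frame().T                        # one-row class subtotal
    subtotal.index = pd.MultiIndex.from_tuples([(x, 'Total')])
    pieces.append(pd.concat([y, subtotal]))
grand = df_pivot.sum().to_frame().T                        # one-row grand total
grand.index = pd.MultiIndex.from_tuples([('Grand', 'Total')])
df_totals = pd.concat(pieces + [grand])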
Related
How to turn a pre-made list in rows and columns just like a matrix into a csv file
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

url = "https://www.bls.gov/web/ximpim/beaimp.htm"
page = requests.get(url)
doc = BeautifulSoup(page.text, "html.parser")
table1 = doc.find('table', id='main-content-table')
data_points = doc.find_all("span", {"class": "datavalue"})
numbers = [x.text for x in data_points]
numbers1 = numbers[651:1107]
numbers2 = numbers[2431:2900]
df1 = np.matrix(numbers1)
df2 = np.matrix(numbers2)
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
table1 = headers[72:111]
table2 = headers[225:265]
datafile = pd.DataFrame([table1, numbers1])
datafile = datafile.transpose()
datafile.columns = ['Category', '0']
datafile.head()
datafile.to_csv('hello.csv', header=False)

The idea is to have it in a CSV file where the columns include the title, followed by a 0 for each column. I thought I could convert it into a 38-by-12 matrix. The output I want is:

All imports excluding petroleum
1985 0 0 0 0 0 0 0 0 0 0 0 0 - - 73.9 - - 74.3 - - 74.8 - - 76.8
1986 - - 79.1 - - 79.9 - - 82.5 - - 82.9
1987 - - 84.8 - - 87.0 - - 87.7 - - 90.2
1988 - - 92.3 - - 94.6 - - 94.0 - - 96.2
1989 96.9 96.1 96.7 96.4 96.7 95.9 95.2 95.1 95.2 95.5 95.7 96.0
1990 96.1 96.4 96.9 96.7 96.4 96.4 96.3 96.8 97.6 98.0 98.3 98.9
1991 99.0 99.3 99.6 98.9 98.8 98.3 97.8 97.8 98.0 98.5 98.7 99.1
1992 99.8 100.0 99.8 99.1 99.0 99.5 99.9 100.3 100.7 101.2 100.9 99.9
1993 99.9 99.7 99.9 100.2 100.6 100.5 100.7 100.8 100.9 101.4 101.3 101.3
1994 101.6 101.5 101.8 102.1 102.3 102.6 103.1 103.8 104.1 104.8 105.1 105.2
1995 105.5 105.9 106.4 107.0 107.7 107.6 108.0 108.0 107.8 107.5 107.7 107.7
1996 107.4 107.4 107.1 107.0 106.7 106.2 105.9 105.7 106.1 105.8 105.7 105.8
1997 105.4 105.3 104.9 104.3 104.2 104.3 104.1 103.8 103.7 103.4 103.3 102.8
1998 102.2 101.7 101.4 101.1 100.9 100.5 100.0 99.6 99.4 99.5 99.6 99.4
1999 99.5 99.4 99.0 98.8 99.0 98.8 98.6 98.7 98.9 99.0 99.4 99.4
2000 99.4 99.7 100.0 100.1 99.9 99.9 100.2 100.3 100.0 100.0 99.9 100.7
2001 101.6 100.8 100.0 99.5 99.2 98.9 97.8 97.5 97.3 96.8 96.6 96.2
2002 96.1 95.7 95.8 96.3 96.2 96.2 96.2 96.3 96.4 96.4 96.3 96.5
2003 96.8 97.1 98.1 97.1 96.9 97.3 97.3 97.0 97.3 97.2 97.4 97.7
2004 98.5 98.9 99.1 99.4 99.6 99.7 99.7 100.0 100.1 100.0 100.9 101.3
2005 101.6 101.7 102.0 102.4 102.2 102.0 101.8 101.9 102.8 103.8 103.7 103.7
2006 104.0 103.3 103.0 103.1 103.8 104.2 104.2 104.7 104.8 104.2 105.2 105.7
2007 105.6 105.6 105.9 106.2 106.8 107.1 107.2 107.2 107.1 107.7 108.5 108.9
2008 109.7 110.4 111.6 113.1 113.9 114.9 115.6 115.1 114.0 113.0 111.1 109.9
2009 109.0 108.2 107.3 107.1 107.3 107.4 107.2 107.6 107.9 108.4 109.1 109.7
2010 110.3 110.4 110.3 110.8 111.2 110.7 110.5 110.7 111.0 111.3 112.2 112.6
2011 113.6 114.4 115.0 115.9 116.4 116.4 116.5 116.8 117.0 116.6 116.3 116.4
2012 116.4 116.3 116.7 116.7 116.6 116.3 115.9 115.8 116.0 116.4 116.4 116.5
2013 116.6 116.6 116.5 116.5 116.1 115.7 115.0 114.8 114.8 114.9 115.0 115.2
2014 115.7 116.0 116.5 116.1 116.0 115.8 115.8 115.7 115.6 115.4 115.1 115.1
2015 114.3 114.0 113.5 113.0 112.9 112.8 112.5 112.1 111.9 111.5 111.2 110.8
2016 110.7 110.5 110.4 110.4 110.8 110.5 111.0 111.1 111.2 111.1 111.1 111.1
2017 111.2 111.5 111.6 111.9 111.9 112.0 111.9 112.1 112.5 112.5 112.7 112.6
2018 113.3 113.7 113.7 113.8 113.9 113.5 113.4 113.1 113.2 113.4 113.5 113.7
2019 113.0 113.2 113.2 112.7 112.4 112.0 111.9 111.8 111.8 111.7 111.8 112.0
2020 112.1 112.3 112.1 111.5 111.6 111.9 112.1 113.0 113.7 113.6 113.6 114.1
2021 115.1 115.7 116.7 117.6 118.8 119.6 119.7 119.7 119.9 120.7 121.5 122.0
2022 123.8 124.8 126.2 126.8 126.7 126.0 125.2 124.9 124.5 124.3 123.9 124.9
The BLS has a data API that would be preferable over web scraping. The Series ID is a bit hard to find; it's not what's shown on the page (typical government issue). You can use the BLS Data Finder to get that:

import requests
import pandas as pd

# You can only query 10 years at a time without registering for an API key
step = 10
data = []
for start_year in range(1985, 2023, step):
    payload = {
        # All imports excluding petroleum
        "seriesid": ["EIUIREXPET"],
        "startyear": start_year,
        "endyear": start_year + step - 1,
    }
    r = requests.post("https://api.bls.gov/publicAPI/v2/timeseries/data/", json=payload)
    r.raise_for_status()
    data.extend(r.json()["Results"]["series"][0]["data"])

df = pd.DataFrame(data)[["year", "period", "value"]]
df["period"] = df["period"].str.strip("M").astype(int)
df = df.pivot(index="year", columns="period")
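One caveat (an assumption based on the API returning its numbers as JSON strings): cast the value column to float before pivoting if you want numeric cells.

df["value"] = df["value"].astype(float)  # the API's "value" fields arrive as strings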
NPV Calculation on a PANDAS dataframe of values
I am getting errors doing an NPV calculation using numpy_financial.npv(rate, values) on a DataFrame. Am I able to use DataFrames for NPV calculation? Not sure how to fix this. Manually looping through each row?

npvValues = ['value_1', ' value_2', 'value_3', 'value_4', 'value_5', 'RFR']
round(df[npvValues].sample(5), 1).sort_index(ascending=False)

DATE        value_1  value_2  value_3  value_4  value_5  RFR
2017-04-03     38.5     92.8    168.7    257.0    354.0  2.1
2016-01-11     35.7     86.1    156.6    238.7    328.6  2.3
2013-07-29     28.1     67.8    123.3    187.8    258.6  2.3
2011-05-02     24.2     58.3    106.1    161.6    222.5  3.4
2010-01-18     24.4     58.8    107.0    163.0    224.5  3.8

NPV calculation:

value = ['value_1', ' value_2', 'value_3', 'value_4', 'value_5']
df['NPV_IV'] = npf.npv(rate=df['RFR']/100, values=df[value])
df['NPV_IV']

Here is the full error trace:
Full Error Trace
The values parameter in npv takes only a 1-dimensional array, so you need to loop through it:

import numpy_financial as npf
import pandas as pd

df = pd.read_csv('t.csv')
value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
num = df[value].to_numpy()
rate_lists = df['RFR'].tolist()
new_col = []
for index, i in enumerate(num):
    n = npf.npv(rate=rate_lists[index] / 100, values=i)
    new_col.append(n)
df['NPV_IV'] = new_col
print(df)

         DATE  value_1  value_2  value_3  value_4  value_5  RFR      NPV_IV
0  2017-04-03     38.5     92.8    168.7    257.0    354.0  2.1  858.450837
1  2016-01-11     35.7     86.1    156.6    238.7    328.6  2.3  792.491237
2  2013-07-29     28.1     67.8    123.3    187.8    258.6  2.3  623.725804
3  2011-05-02     24.2     58.3    106.1    161.6    222.5  3.4  520.644433
4  2010-01-18     24.4     58.8    107.0    163.0    224.5  3.8  519.488982
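If you prefer to avoid the explicit loop, here is a more compact sketch using DataFrame.apply over rows (it still iterates internally, so don't expect it to be faster):

value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
df['NPV_IV'] = df.apply(
    lambda row: npf.npv(rate=row['RFR'] / 100, values=row[value].to_numpy()),
    axis=1,  # pass one row at a time, keeping values 1-dimensional
)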
How to title a pandas dataframe
I have the following code that prints out descriptive statistics with df.describe for each class of a categorical variable:

for i in list(merged.Response.unique()):
    print(merged[(merged.Response==i)].describe().round(2))

and it returns:

       OrderCount  OrderAvgSize  AvgDeliverCost  AvgOrderValue  CustomerValue
count      687.00        687.00          687.00         687.00         687.00
mean        24.75         13.45            4.56           9.61         243.91
std          7.04          3.35            0.17           1.95         107.45
min         11.00          7.00            4.13           5.85          83.27
25%         20.00         11.00            4.45           8.18         167.44
50%         24.00         13.00            4.57           9.34         213.08
75%         29.00         15.00            4.67          10.51         289.74
max         51.00         24.00            4.97          15.75         700.80

       OrderCount  OrderAvgSize  AvgDeliverCost  AvgOrderValue  CustomerValue
count     1099.0        1099.00         1099.00        1099.00        1099.00
mean        17.2           6.85            4.08           5.18          97.88
std         12.8           2.47            0.24           1.45         101.26
min          1.0           2.00            3.24           2.40           5.72
25%          7.0           5.00            3.89           4.12          31.38
50%         14.0           7.00            4.13           5.21          62.58
75%         24.0           8.00            4.22           5.86         130.90
max         55.0          21.00            4.91          13.46         686.46

       OrderCount  OrderAvgSize  AvgDeliverCost  AvgOrderValue  CustomerValue
count      392.00        392.00          392.00         392.00         392.00
mean        12.41         11.46            4.44          10.13         125.04
std          3.75          3.34            0.19           1.94          43.91
min          3.00          6.00            4.02           6.98          36.92
25%         10.00          9.00            4.31           8.71          92.68
50%         13.00         10.00            4.38           9.30         121.58
75%         15.00         13.00            4.51          11.00         148.64
max         26.00         22.00            4.94          16.25         266.56

Is there any way I can title each summary table so I know which class is which? I tried the following with the pandas Styler, but despite titling the dataframe, it only printed one of them, and it doesn't look as good (I'm in Google Colab, btw):

for i in list(merged.Response.unique()):
    test = merged[(merged.Response==i)].describe().round(2).style.set_caption(i)
    test

AmznPrime
       OrderCount  OrderAvgSize  AvgDeliverCost  AvgOrderValue  CustomerValue
count  392.000000    392.000000      392.000000     392.000000     392.000000
mean    12.410000     11.460000        4.440000      10.130000     125.040000
std      3.750000      3.340000        0.190000       1.940000      43.910000
min      3.000000      6.000000        4.020000       6.980000      36.920000
25%     10.000000      9.000000        4.310000       8.710000      92.680000
50%     13.000000     10.000000        4.380000       9.300000     121.580000
75%     15.000000     13.000000        4.510000      11.000000     148.640000
max     26.000000     22.000000        4.940000      16.250000     266.560000

All help is appreciated. Thanks!
How about:

merged.groupby("Response").describe().round(2)

To match your expected output, do stack/unstack:

merged.groupby("Response").describe().stack(level=1).unstack(level=0)
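As for why the Styler loop in the question only rendered one table: in a notebook, only the last bare expression in a cell is auto-displayed. A sketch of the fix, calling IPython's display explicitly on each Styler:

from IPython.display import display

for i in merged.Response.unique():
    # Render every captioned table, not just the final loop iteration's
    display(merged[merged.Response == i].describe().round(2).style.set_caption(str(i)))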
How do I calculate mean value for each month in the dataset?
Sample wind dataset:

              RPT    VAL    ROS   KIL    SHA   BIR    DUB   CLA    MUL    CLO    BEL    MAL
DATE
1961-01-04  10.58   6.63  11.75  4.58   4.54  2.88   8.63  1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42  6.17  10.71  8.21  11.92  6.54  10.92  10.34  12.92  11.83
1961-01-06  13.21   8.12   9.96  6.67   5.37  4.50  10.67  4.42   7.17   7.50   8.12  13.17
1961-02-07  13.50  14.29   9.50  4.96  12.29  8.33   9.17  9.29   7.58   7.96  13.96  13.79
1961-02-08  10.96   9.75   7.62  5.91   9.62  7.29  14.29  7.62   9.25  10.46  16.62  16.46
1961-03-04  10.58   6.63  11.75  4.58   4.54  2.88   8.63  1.79   5.83   5.88   5.46  10.88
1962-03-05  13.33  13.25  11.42  6.17  10.71  8.21  11.92  6.54  10.92  10.34  12.92  11.83
1962-06-06  13.21   8.12   9.96  6.67   5.37  4.50  10.67  4.42   7.17   7.50   8.12  13.17
1968-07-07  13.50  14.29   9.50  4.96  12.29  8.33   9.17  9.29   7.58   7.96  13.96  13.79
1968-07-08  10.96   9.75   7.62  5.91   9.62  7.29  14.29  7.62   9.25  10.46  16.62  16.46
1976-08-04  10.58   6.63  11.75  4.58   4.54  2.88   8.63  1.79   5.83   5.88   5.46  10.88
1976-08-05  13.33  13.25  11.42  6.17  10.71  8.21  11.92  6.54  10.92  10.34  12.92  11.83
1978-09-06  13.21   8.12   9.96  6.67   5.37  4.50  10.67  4.42   7.17   7.50   8.12  13.17
1978-09-07  13.50  14.29   9.50  4.96  12.29  8.33   9.17  9.29   7.58   7.96  13.96  13.79
1978-12-08  10.96   9.75   7.62  5.91   9.62  7.29  14.29  7.62   9.25  10.46  16.62  16.46

The complete dataset is [here][1]. In this dataset, the columns are the locations and the values are wind speeds. I want to calculate the mean wind speed for each month in the dataset, but I want to treat January 1961 and January 1962 as different months. I tried doing it with a for-loop: first I created a column named 'Month' and then assigned the values like this:

for i in range(len(data.index)):
    if data.index[i].month == 1:
        if data.index[i].year == 1961:
            data['Month'][i] = 'January 61'
        elif data.index[i].year == 1962:
            data['Month'][i] = 'January 62'
        else:
            data['Month'][i] = 'January'
    elif data.index[i].month == 2:
        data['Month'][i] = 'February'
    elif data.index[i].month == 3:
        data['Month'][i] = 'March'
    elif data.index[i].month == 4:
        data['Month'][i] = 'April'
    elif data.index[i].month == 5:
        data['Month'][i] = 'May'
    elif data.index[i].month == 6:
        data['Month'][i] = 'June'
    elif data.index[i].month == 7:
        data['Month'][i] = 'July'
    elif data.index[i].month == 8:
        data['Month'][i] = 'August'
    elif data.index[i].month == 9:
        data['Month'][i] = 'September'
    elif data.index[i].month == 10:
        data['Month'][i] = 'October'
    elif data.index[i].month == 11:
        data['Month'][i] = 'November'
    elif data.index[i].month == 12:
        data['Month'][i] = 'December'

And then I would use groupby on data['Month'] and find the mean. But it's taking forever to run, and I don't want to have to wait that long every time I run this program. How else could I have solved this problem?

Note: the actual dataset isn't quite the same as the sample. I combined the columns ['Yr', 'Mo', 'Dy'] into one column named 'DATE', and then made 'DATE' the index. I have also removed all the NaN values using data.dropna(inplace=True).

[1]:
Try:

df.index = pd.to_datetime(df.index)
df.groupby([df.index.year, df.index.month]).mean()

                 RPT        VAL        ROS  ...        CLO        BEL     MAL
DATE DATE                                   ...
1961 1     12.373333   9.333333  11.043333  ...   7.906667   8.833333  11.960
     2     12.230000  12.020000   8.560000  ...   9.210000  15.290000  15.125
     3     10.580000   6.630000  11.750000  ...   5.880000   5.460000  10.880
1962 3     13.330000  13.250000  11.420000  ...  10.340000  12.920000  11.830
     6     13.210000   8.120000   9.960000  ...   7.500000   8.120000  13.170
1968 7     12.230000  12.020000   8.560000  ...   9.210000  15.290000  15.125
1976 8     11.955000   9.940000  11.585000  ...   8.110000   9.190000  11.355
1978 9     13.355000  11.205000   9.730000  ...   7.730000  11.040000  13.480
     12    10.960000   9.750000   7.620000  ...  10.460000  16.620000  16.460
I think that the groupby approach you tried is the way to go:

df.groupby(['year', 'month'])['RPT'].mean().reset_index()
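If you want a single group key per calendar month (so January 1961 and January 1962 stay distinct without building string labels), here is a minimal sketch using monthly periods, assuming the DATE index parses as datetimes:

df.index = pd.to_datetime(df.index)
monthly_means = df.groupby(df.index.to_period('M')).mean()  # one row per year-month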
Filter Excel data based on date range selection on pandas
I would like to know how to filter Excel data based on a specific date range using pandas via Python. For example, (sheet1.xlsx) contains:

DATE        51     52     53     54     55     56
20110706 28.52  27.52  26.52  25.52  24.52  23.52
20110707 28.97  27.97  26.97  25.97  24.97  23.97
20110708 28.52  27.52  26.52  25.52  24.52  23.52
20110709 28.97  27.97  26.97  25.97  24.97  23.97
20110710  30.5   29.5   28.5   27.5   26.5   25.5
20110711 32.93  31.93  30.93  29.93  28.93  27.93
20110712 35.54  34.54  33.54  32.54  31.54  30.54
20110713 33.02  32.02  31.02  30.02  29.02  28.02
20110730 35.99  34.99  33.99  32.99  31.99  30.99
20110731  30.5   29.5   28.5   27.5   26.5   25.5
20110801 32.48  31.48  30.48  29.48  28.48  27.48
20110802 31.04  30.04  29.04  28.04  27.04  26.04
20110803 32.03  31.03  30.03  29.03  28.03  27.03
20110804 34.01  33.01  32.01  31.01  30.01  29.01
20110805 27.44  26.44  25.44  24.44  23.44  22.44
20110806 32.48  31.48  30.48  29.48  28.48  27.48

If I want to filter this data to the range 20110708-20110803, the result would be:

DATE        51     52     53     54     55     56
20110708 28.52  27.52  26.52  25.52  24.52  23.52
20110709 28.97  27.97  26.97  25.97  24.97  23.97
20110710  30.5   29.5   28.5   27.5   26.5   25.5
20110711 32.93  31.93  30.93  29.93  28.93  27.93
20110712 35.54  34.54  33.54  32.54  31.54  30.54
20110713 33.02  32.02  31.02  30.02  29.02  28.02
20110730 35.99  34.99  33.99  32.99  31.99  30.99
20110731  30.5   29.5   28.5   27.5   26.5   25.5
20110801 32.48  31.48  30.48  29.48  28.48  27.48
20110802 31.04  30.04  29.04  28.04  27.04  26.04
20110803 32.03  31.03  30.03  29.03  28.03  27.03

How would I go about doing this?
Set DATE as the index of your DataFrame:

df.set_index('DATE', inplace=True)

You can then use loc to slice the DataFrame:

df.loc[20110708:20110803]

You should find examples here: http://pandas.pydata.org/pandas-docs/stable/10min.html

PS: I assumed that the dtype of your index (the DATE column) is int64.
If you'd rather keep DATE as a standard column (not your index), you can also do this:

df = df[(20110708 <= df.DATE) & (df.DATE <= 20110803)]

The indexing isn't quite as pretty and it'll be a little slower, but it works on columns. This assumes you've already read the Excel file in with df = pd.read_excel(filename).
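Equivalently, Series.between keeps the boolean mask a bit tidier; a sketch, assuming the same integer-typed DATE column and the file name from the question:

import pandas as pd

df = pd.read_excel("sheet1.xlsx")
df = df[df["DATE"].between(20110708, 20110803)]  # inclusive on both ends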