How to replace slow 'apply' method in pandas DataFrame - python

I have a DataFrame of currency transactions:
import pandas as pd

data = [[1653663281618, -583.8686, 'USD'],
        [1653741652125, -84.0381, 'USD'],
        [1653776860252, -33.8723, 'CHF'],
        [1653845294504, -465.4614, 'CHF'],
        [1653847155140, 22.285, 'USD'],
        [1653993629537, -358.04640000000006, 'USD']]
df = pd.DataFrame(data=data, columns=['time', 'qty', 'currency_1'])
I need to add a new column "balance" that holds, for each row, the sum of 'qty' over all previous transactions in the same currency. I have a simple function:
def balance(row):
    table = df[(df['time'] < row['time']) & (df['currency_1'] == row['currency_1'])]
    return table['qty'].sum()

df['balance'] = df.apply(balance, axis=1)
But my real DataFrame is very large, and the .apply method is extremely slow.
Is there a way to avoid using apply in this case?
Something like np.where?

You could just use pandas' cumsum here:
EDIT
After the currency condition was added to the question:
I don't know how transform performs compared to apply; I'd say just try it on your real data. I can't think of an easier solution at the moment.
df['balance'] = df.groupby('currency_1')['qty'].transform(lambda x: x.shift().cumsum())
print(df)
            time       qty currency_1    balance
0  1653663281618 -583.8686        USD        NaN
1  1653741652125  -84.0381        USD  -583.8686
2  1653776860252  -33.8723        CHF        NaN
3  1653845294504 -465.4614        CHF   -33.8723
4  1653847155140   22.2850        USD  -667.9067
5  1653993629537 -358.0464        USD  -645.6217
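Note that shift().cumsum() follows the existing row order, so the line above assumes the frame is already sorted by 'time' within each currency (true for the sample data). If that is not guaranteed, a small sketch that sorts first:
# Sketch: sort by time so "previous transactions" means earlier timestamps
df = df.sort_values('time')
df['balance'] = (df.groupby('currency_1')['qty']
                 .transform(lambda x: x.shift().cumsum()))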
old answer:
df['Balance'] = df['qty'].shift(fill_value=0).cumsum()
print(df)
            time       qty currency_1    Balance
0  1653663281618 -583.8686        USD     0.0000
1  1653741652125  -84.0381        USD  -583.8686
2  1653776860252  -33.8723        CHF  -667.9067
3  1653845294504 -465.4614        CHF  -701.7790
4  1653847155140   22.2850        USD -1167.2404
5  1653993629537 -358.0464        USD -1144.9554


Faster alternative for implementing this pandas solution

I have a dataframe that contains product sales information. I need to write a function that, based on the product id, product type, and date, calculates the average sales over all records at or before the given date.
This is how I have implemented it, but this approach takes a lot of time, and I was wondering if there is a faster way to do this.
Dataframe:
import numpy as np
import pandas as pd

product_type = ['A', 'B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
Current code:
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id'] == product) & (df['prod_type'] == ptype) & (df['sales_time'] <= pdate)]
    return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you run the cal_avg function only rarely, then I suggest ignoring my answer. Otherwise, it might be beneficial to simply calculate the expanding-window average for each product/product type once up front. It might be slow depending on your dataset size (in which case maybe run it only on specific product types?), but you only need to run it once. First sort by the column you want to perform the 'expanding' on (expanding is missing the 'on' parameter) to ensure the proper row order. Then groupby and transform each group (to keep the indices of the original dataframe) with your expanding-window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
   prod_id prod_type          sales_time  sale_amt  exp_mean_sales
0        2         B 2018-01-01 00:00:00         8        8.000000
1        2         B 2018-01-01 03:00:00        72       40.000000
2        2         B 2018-01-01 06:00:00        33       37.666667
3        2         A 2018-01-01 09:00:00        81       81.000000
4        2         B 2018-01-01 12:00:00        83       49.000000
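With exp_mean_sales precomputed, a call like cal_avg(2, 'A', '2018-02-12 15:00:00') becomes a simple lookup. A sketch, assuming df is still sorted by sales_time and at least one matching row exists:
# Take the last expanding mean at or before the requested date
sub = df[(df['prod_id'] == 2) & (df['prod_type'] == 'A')
         & (df['sales_time'] <= '2018-02-12 15:00:00')]
avg = sub['exp_mean_sales'].iloc[-1]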
Check the code below, with a %%timeit comparison (run on Google Colab):
import numpy as np
import pandas as pd

product_type = ['A', 'B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
## OP's function
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id'] == product) & (df['prod_type'] == ptype) & (df['sales_time'] <= pdate)]
    return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function:
%%timeit
cal_avg(2, 'A', '2018-02-12 15:00:00')
Numpy version (note: np.logical_and only accepts two condition arrays, its third argument is `out`, so the conditions are combined with & instead):
%%timeit -n 1000
cal_vals = [2, 'A', '2018-02-12 15:00:00']
mask = ((prod_id_array == cal_vals[0])
        & (prod_type_array == cal_vals[1])
        & (sales_time_array <= np.datetime64(cal_vals[2])))
np.mean(values[mask])

Pythonic way to convert Pandas dataframe from wide to long [duplicate]

This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 1 year ago.
I have a JSON file, which I then convert to a Pandas dataframe called stocks. The stocks dataframe is in wide format and I'd like to convert it to long format.
Here's what the stocks dataframe looks like after it's ingested and converted from JSON:
     TSLA    MSFT     GE   DELL
0  993.22  320.72  93.19  57.25
I would like to convert the stocks dataframe into the following format:
  ticker   price
0   TSLA  993.22
1   MSFT  320.72
2     GE   93.19
3   DELL   57.25
Here is my attempt (which works):
stocks = pd.read_json('stocks.json', lines=True).T.reset_index()
stocks.columns = ['ticker', 'price']
Is there a more Pythonic way to do this? Thanks!
pandas provides the melt function for this job.
pd.melt(stocks, var_name="ticker", value_name="price")
#   ticker   price
# 0   TSLA  993.22
# 1   MSFT  320.72
# 2     GE   93.19
# 3   DELL   57.25
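Since stocks has exactly one row, another option (a sketch, not part of the melt answer) is to take that row as a Series and reset its index:
# One-row frame -> Series; the Series index holds the tickers
out = stocks.iloc[0].rename_axis('ticker').reset_index(name='price')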
A perhaps more intuitive method than melt would be to transpose and reset the index:
df = df.T.reset_index().set_axis(['ticker', 'price'], axis=1)
Output:
>>> df
  ticker   price
0   TSLA  993.22
1   MSFT  320.72
2     GE   93.19
3   DELL   57.25
Edit: oops, saw that the OP already did that! :)

creating a function for mathematical data imputation python

I am performing a number of similar operations and would like to write a function, but I am not sure how to approach this. I am filling in the 0 values for the following series; the formula is 2 * (value in 2001) - (value in 2002).
I currently do it one by one in Python:
print(full_data.loc['Croatia', 'fertile_age_pct'])
print(full_data.loc['Croatia', 'working_age_pct'])
print(full_data.loc['Croatia', 'young_age'])
print(full_data.loc['Croatia', 'old_age'])
full_data.replace(to_replace={'fertile_age_pct': {0:(2*46.420061-46.326103)}}, inplace=True)
full_data.replace(to_replace={'working_age_pct': {0:(2*67.038157-66.889212)}}, inplace=True)
full_data.replace(to_replace={'young_age': {0:(2*0.723475-0.715874)}}, inplace=True)
full_data.replace(to_replace={'old_age': {0:(2*0.692245-0.709597)}}, inplace=True)
Data frame (full_data):
geo_full  year  fertile_age_pct  working_age_pct  young_age   old_age
Croatia   2000         0.000000         0.000000   0.000000  0.000000
Croatia   2001        46.420061        67.038157   0.723475  0.692245
Croatia   2002        46.326103        66.889212   0.715874  0.709597
Croatia   2003        46.111822        66.771187   0.706091  0.724440
Croatia   2004        45.929829        66.782133   0.694854  0.735333
Croatia   2005        45.695932        66.742514   0.686534  0.747083
So you are trying to fill the 0 values in year 2000 with your formula. If you have any other country in the DataFrame then it can get messy.
Assuming the year with 0's is always the first year for each country, try this:
full_data.set_index('year', inplace=True)
fixed_data = {}
for country, df in full_data.groupby('geo_full')[full_data.columns[1:]]:
    if df.iloc[0].sum() == 0:
        # stated formula: 2 * 2001 value - 2002 value
        df.iloc[0] = df.iloc[1] * 2 - df.iloc[2]
    fixed_data[country] = df
fixed_data = pd.concat(list(fixed_data.values()), keys=fixed_data.keys(),
                       names=['geo_full'], axis=0)
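A vectorized alternative, as a sketch only: it assumes rows are sorted by year within each country and that a 0 in these columns always means "missing first year":
cols = ['fertile_age_pct', 'working_age_pct', 'young_age', 'old_age']
g = full_data.groupby('geo_full')[cols]
# stated formula: 2 * next year's value - the value of the year after that
imputed = 2 * g.shift(-1) - g.shift(-2)
full_data[cols] = full_data[cols].mask(full_data[cols].eq(0), imputed)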

Format number based on conditional

I am new to Python and struggling with a simple formatting issue. I have a table with two columns, metrics and value. I want to format the value based on the name of the metric (in the metrics column), but can't seem to get it to work. I'd like plain numbers to show as #,### and metrics named 'Pct ...' as #.#%. The code runs OK, but no changes are made. Also, some of the values may be null; I am not sure how to handle that.
# format numbers and percentages
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
for x in pct_options:
    if x in df['metrics']:
        df.value.mul(100).astype('float64').astype(str).add('%')
    else:
        df.value.astype('float64')
IIUC, you can do it with isin, try
#first convert your column to float if necessary note you need to reassign the column
df.value = df.value.astype('float64')
#then change only the rows with the right metrics with a mask created with isin
mask_pct = df.metrics.isin(pct_options)
df.loc[mask_pct, 'value'] = df.loc[mask_pct, 'value'].mul(100).astype(str).add('%')
EDIT: here may be what you want:
#example df
import numpy as np
import pandas as pd

df = pd.DataFrame({'metrics': ['val', 'Pct Conversion', 'Pct Gross Churn', 'ind', 'Pct Net Churn'],
                   'value': [12345.5432, 0.23245436, 0.4, 13, 0.000004]})
print (df)
           metrics         value
0              val  12345.543200
1   Pct Conversion      0.232454
2  Pct Gross Churn      0.400000
3              ind     13.000000
4    Pct Net Churn      0.000004
#change the formatting with np.where
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
df.value = np.where(df.metrics.isin(pct_options), df.value.mul(100).map('{:.2f}%'.format), df.value.map('{:,.2f}'.format))
           metrics      value
0              val  12,345.54
1   Pct Conversion     23.25%
2  Pct Gross Churn     40.00%
3              ind      13.00
4    Pct Net Churn      0.00%
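The question also mentions nulls, which the map('{:.2f}%'.format) calls above would render as the string 'nan'. A sketch that leaves missing values untouched and uses the asked-for #,### and #.#% formats (the fmt helper is mine, not pandas API):
def fmt(value, metric):
    if pd.isna(value):                 # pass nulls through unformatted
        return value
    if metric in pct_options:
        return f'{value * 100:.1f}%'   # #.#%
    return f'{value:,.0f}'             # #,###

df['value'] = [fmt(v, m) for v, m in zip(df['value'], df['metrics'])]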

pandas how to use groupby to group columns by date in the label?

I have a dataframe of 10730 rows × 249 columns, with these columns:
Index(['RegionID', 'Metro', 'CountyName', 'SizeRank', '1996-04', '1996-05',
'1996-06', '1996-07', '1996-08', '1996-09',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=249)
What I need to do is group the columns by quarter: January to March is Q1, and so on through Q4, using the mean of the values. I know how to group 3 columns, for example, but how do I group all the columns, since I cannot specify the column names one by one?
This is the dataframe head in csv to use for testing:
'State,RegionName,RegionID,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,1999-11,1999-12,2000-01,2000-02,2000-03,2000-04,2000-05,2000-06,2000-07,2000-08,2000-09,2000-10,2000-11,2000-12,2001-01,2001-02,2001-03,2001-04,2001-05,2001-06,2001-07,2001-08,2001-09,2001-10,2001-11,2001-12,2002-01,2002-02,2002-03,2002-04,2002-05,2002-06,2002-07,2002-08,2002-09,2002-10,2002-11,2002-12,2003-01,2003-02,2003-03,2003-04,2003-05,2003-06,2003-07,2003-08,2003-09,2003-10,2003-11,2003-12,2004-01,2004-02,2004-03,2004-04,2004-05,2004-06,2004-07,2004-08,2004-09,2004-10,2004-11,2004-12,2005-01,2005-02,2005-03,2005-04,2005-05,2005-06,2005-07,2005-08,2005-09,2005-10,2005-11,2005-12,2006-01,2006-02,2006-03,2006-04,2006-05,2006-06,2006-07,2006-08,2006-09,2006-10,2006-11,2006-12,2007-01,2007-02,2007-03,2007-04,2007-05,2007-06,2007-07,2007-08,2007-09,2007-10,2007-11,2007-12,2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08\nNY,New York,6181,New York,Queens,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,432600.0,438700.0,440500.0,433900.0,422000.0,415700.0,421200.0,431100.0,435100.0,431900.0,428400.0,430700.0,438800.0,446800.0,455400.0,465500.0,472600.0,478200.0,487600.0,498600.0,508800.0,515300.0,517000.0,517800.0,520800.0,521500.0,523000.0,526300.0,524800.0,519100.0,516200.0,516400.0,516300.0,515500.0,512200.0,509200.0,509800.0,511600.0,512700.0,514000.0,513400.0,510700.0,508100.0,506700.0,505200.0,503700.0,502900.0,502400.0,500500.0,496400.0,491900.0,487500.0,484400.0,481700.0,477900.0,473600.0,469700.0,466100.0,461700.0,457700.0,455300.0,454800.0,456000.0,457800.0,461300.0,466100.0,470200.0,472800.0,475300.0,477100.0,478400.0,479100.0,478900.0,477700.0,476700.0,477100.0,478000.0,478000.0,476800.0,475300.0,473800.0,472000.0,470600.0,469900.0,469500.0,468200.0,465800.0,463500.0,461800.0,460100.0,459700.0,460800.0,461700.0,462500.0,463900.0,466000.0,467500.0,468200.0,468700.0,469400.0,469400,469100.0,468700,469300,470300,472100,474300,477600,481400,485100,488800,492600,495900,499500,503500,506400,509900,515700,520800,522200,522400,523800,526200,528400,529600,530800,532200,533800,536200,540600,545600,551400,557200,563000,568700,573600,576200,578400,582200,588000,592200,592500,590200,588000,586400\nCA,Los Angeles,12447,Los Angeles-Long Beach-Anaheim,Los 
Angeles,2,155000.0,154600.0,154400.0,154200.0,154100.0,154300.0,154300.0,154200.0,154800.0,155900.0,157000.0,157700.0,158200.0,158600.0,158800.0,158900.0,159100.0,159800.0,160700.0,161900.0,163400.0,165400.0,167000.0,168500.0,169900.0,171400.0,172900.0,174300.0,175800.0,177800.0,180100.0,182600.0,184400.0,185600.0,186900.0,188200.0,189600.0,191300.0,193100.0,194700.0,196300.0,197700.0,199100.0,200700.0,202300.0,204400.0,207000.0,209800.0,212300.0,214500.0,216600.0,219000.0,221100.0,222800.0,224300.0,226100.0,228100.0,230600.0,233000.0,235400.0,237300.0,239100.0,240900.0,242900.0,245000.0,247300.0,250100.0,253100.0,255900.0,258800.0,261900.0,265200.0,268600.0,272600.0,276900.0,281800.0,287000.0,292200.0,297000.0,302100.0,307600.0,313400.0,319000.0,324300.0,329600.0,334600.0,339300.0,344500.0,350600.0,356800.0,363400.0,370700.0,378400.0,386500.0,394900.0,404300.0,414600.0,425500.0,436600.0,447400.0,456700.0,464400.0,471200.0,477400.0,483500.0,489100.0,494700.0,501400.0,509700.0,518300.0,527200.0,536100.0,545400.0,555200.0,564500.0,571900.0,576800.0,579700.0,581800.0,583800.0,585300.0,587300.0,589900.0,592200.0,593300.0,593400.0,593100.0,592900.0,591600.0,590900.0,591800.0,592600.0,592100.0,590200.0,586200.0,581600.0,577500.0,572800.0,567600.0,562100.0,554400.0,545000.0,535500.0,525400.0,513600.0,502000.0,491200.0,480200.0,469000.0,459300.0,451200.0,443900.0,436800.0,430900.0,426100.0,421800.0,417800.0,413700.0,410200.0,407900.0,406300.0,404900.0,404200.0,402900.0,405900.0,412000.0,415000.0,413100.0,412100.0,411300.0,410100.0,408400.0,406800.0,405100.0,403300.0,401900.0,401000.0,399200.0,397100.0,395000.0,392700.0,390200.0,387400.0,384700.0,382100.0,379500.0,377200.0,375700.0,373800.0,371500.0,370000.0,370300.0,372100.0,375300.0,378600.0,382100.0,385600.0,389000.0,391800.0,396400.0,401500,405700.0,410700,418200,425500,432700,440400,448100,455200,461900,467800,472300,475700,479400,484000,489400,494200,498100,501800,505600,509000,512600,516000,518900,521700,525100,528900,532400,535300,538200,541000,544000,547200,550600,554200,558200,560800,562800,565600,569700,574000,577800,580600,583000,585100\nIL,Chicago,17426,Chicago,Cook,3,109700.0,109400.0,109300.0,109300.0,109100.0,109000.0,109000.0,109600.0,110200.0,110800.0,111300.0,111700.0,112200.0,112300.0,112100.0,112200.0,113000.0,113700.0,114200.0,114800.0,115500.0,116200.0,117100.0,117600.0,117800.0,118300.0,119200.0,120000.0,120600.0,121500.0,122300.0,122700.0,122900.0,123300.0,123700.0,124500.0,125700.0,127300.0,128800.0,130200.0,131400.0,132600.0,133700.0,134600.0,135500.0,136800.0,138300.0,140100.0,141900.0,143700.0,145300.0,146700.0,147900.0,149000.0,150400.0,152000.0,154000.0,155600.0,157000.0,158200.0,159900.0,161800.0,163700.0,165300.0,166400.0,167500.0,168800.0,170400.0,172100.0,173900.0,175600.0,177000.0,177800.0,177600.0,177300.0,177700.0,178800.0,180400.0,182300.0,183800.0,185000.0,185600.0,186800.0,188900.0,191300.0,194100.0,197500.0,200200.0,202300.0,203700.0,204000.0,204000.0,204400.0,205300.0,206300.0,207000.0,207600.0,208600.0,209600.0,210900.0,212800.0,214600.0,216400.0,218300.0,220300.0,222300.0,224000.0,225400.0,226900.0,228600.0,230100.0,231800.0,233200.0,234500.0,236000.0,237500.0,239000.0,240800.0,242500.0,243900.0,244900.0,245300.0,245400.0,245800.0,245800.0,245500.0,245900.0,246900.0,247300.0,247400.0,247300.0,247000.0,246700.0,246400.0,246100.0,246100.0,246300.0,246400.0,246700.0,247100.0,246700.0,245300.0,243900.0,242000.0,239800.0,237900.0,236000.0,233500.0,231800.0,230700.0,229200.0,226700.0,225200.0,224500.0,223800.0,
223000.0,221900.0,219700.0,217500.0,215600.0,213800.0,212900.0,212300.0,211900.0,210800.0,209300.0,207300.0,205300.0,204200.0,204100.0,203100.0,201100.0,199000.0,196700.0,193800.0,191100.0,189200.0,188100.0,187600.0,186500.0,184400.0,181700.0,178700.0,175900.0,174100.0,172800.0,171400.0,170100.0,169100.0,167900.0,166700.0,166200.0,166400.0,166800.0,167900.0,168900.0,168400.0,167100.0,166900.0,167300.0,167500,167700.0,168300,169100,170400,172400,175100,178200,181000,183200,184600,185800,187200,189100,191100,192500,192600,192400,192900,193900,195600,197800,200100,201700,202000,201200,200500,201500,204000,206500,207600,207700,208100,209100,209000,207800,206900,206200,205800,206200,207300,208200,209100,211000,213000\nPA,Philadelphia,13271,Philadelphia,Philadelphia,4,50000.0,49900.0,49600.0,49400.0,49400.0,49300.0,49300.0,49400.0,49700.0,49600.0,49500.0,49700.0,49800.0,49700.0,49700.0,49800.0,49700.0,49700.0,49800.0,49900.0,49900.0,50000.0,50300.0,50600.0,50800.0,50800.0,50800.0,50800.0,50700.0,50500.0,50500.0,50700.0,50700.0,50800.0,50900.0,51100.0,51200.0,51400.0,51500.0,51400.0,51500.0,51800.0,52100.0,52100.0,52300.0,52700.0,53100.0,53200.0,53400.0,53700.0,53800.0,53800.0,54100.0,54500.0,54700.0,54600.0,54800.0,55100.0,55400.0,55500.0,55400.0,55500.0,55700.0,55900.0,56300.0,56600.0,57000.0,57500.0,58100.0,58600.0,59100.0,59700.0,60300.0,60700.0,61200.0,61800.0,62200.0,62500.0,63000.0,63600.0,63900.0,64200.0,64700.0,65300.0,65700.0,66100.0,66800.0,67700.0,68500.0,69200.0,69800.0,70700.0,71700.0,72800.0,73700.0,74700.0,75700.0,76700.0,77800.0,79100.0,80500.0,82100.0,84000.0,85600.0,87000.0,88200.0,89600.0,91300.0,93000.0,94900.0,96700.0,98400.0,100200.0,101900.0,103400.0,104900.0,106400.0,107500.0,108200.0,109300.0,110800.0,112500.0,113800.0,114800.0,115600.0,116000.0,116400.0,116700.0,116800.0,116900.0,117300.0,117800.0,118200.0,118600.0,119300.0,120200.0,120900.0,121400.0,121300.0,120900.0,120200.0,119600.0,119600.0,119500.0,118800.0,118100.0,117500.0,117100.0,117000.0,116700.0,116300.0,115800.0,115500.0,115900.0,116300.0,116400.0,116400.0,116100.0,116000.0,116200.0,116700.0,117300.0,118000.0,118200.0,119500.0,120900.0,121300.0,121300.0,122100.0,123000.0,123300.0,122300.0,120000.0,118200.0,117600.0,117900.0,117800.0,117400.0,117000.0,116900.0,116700.0,116500.0,115700.0,115300.0,115500.0,115600.0,115200.0,114800.0,114100.0,113500.0,112900.0,111800.0,110800.0,110400.0,110400.0,110200.0,109900.0,109700.0,110000.0,110700.0,111800,112100.0,111900,112000,112200,111800,111200,111000,110900,111100,111800,112700,112900,113100,113900,114200,113600,113500,114100,114900,115500,115500,115400,115600,116000,116100,116100,116400,117000,117900,119000,120100,121300,122300,122700,122300,121600,121800,123300,125200,126400,127000,127400,128300,129100\nAZ,Phoenix,40326,Phoenix,Maricopa,5,87200.0,87700.0,88200.0,88400.0,88500.0,88900.0,89400.0,89700.0,90100.0,90700.0,91400.0,91700.0,91800.0,92000.0,92300.0,92600.0,93000.0,93400.0,94000.0,94600.0,95300.0,96100.0,96800.0,97300.0,97700.0,98400.0,99200.0,100100.0,100500.0,100700.0,100900.0,101700.0,102600.0,103400.0,103900.0,104400.0,105100.0,105900.0,106200.0,106600.0,107400.0,108300.0,109000.0,109700.0,110400.0,111000.0,111700.0,112800.0,113700.0,114300.0,115100.0,115600.0,115900.0,116500.0,117200.0,117400.0,117600.0,118400.0,119700.0,120700.0,121200.0,121500.0,122000.0,122400.0,122700.0,123000.0,123600.0,124300.0,125000.0,125800.0,126600.0,127200.0,127900.0,128400.0,128800.0,129500.0,130500.0,131600.0,132500.0,133200.0,134000.0,134900.0,135700.0,136500.0,137200.0,13
8000.0,138600.0,138900.0,139200.0,139400.0,139600.0,140300.0,141400.0,142500.0,143700.0,144900.0,145900.0,147100.0,148400.0,150300.0,153100.0,156200.0,159400.0,162900.0,166500.0,170000.0,173900.0,178800.0,185000.0,192300.0,200700.0,209400.0,217000.0,223600.0,229800.0,234900.0,238600.0,241300.0,243000.0,244100.0,244800.0,245400.0,245600.0,245600.0,245300.0,244600.0,243800.0,243400.0,243400.0,243600.0,243200.0,242200.0,241300.0,240200.0,238400.0,236400.0,234700.0,233300.0,231600.0,229100.0,226100.0,222800.0,218800.0,214300.0,209500.0,205200.0,201100.0,197300.0,193700.0,190300.0,186700.0,182800.0,180500.0,179600.0,178000.0,175100.0,172100.0,168400.0,164200.0,160000.0,156000.0,151800.0,147600.0,143900.0,138900.0,133400.0,130200.0,129200.0,127700.0,126200.0,124800.0,123100.0,120700.0,118500.0,117000.0,115800.0,114800.0,114100.0,113200.0,111800.0,110100.0,108000.0,105900.0,104100.0,102900.0,102300.0,102400.0,103000.0,104100.0,105800.0,107600.0,109100.0,111200.0,114000.0,117200.0,120400.0,123300.0,125800.0,128300.0,130500.0,132500,134400.0,136200,138400,141600,144700,147400,150500,153600,156100,158100,160000,161600,162700,163300,163700,164100,164200,164500,164700,165200,166200,167200,168400,169900,171000,171500,172100,172900,174100,175500,177100,179100,181000,182400,183800,185300,186600,188000,189100,190200,191300,192800,194500,195900\n'
I changed the column index to dates by dropping the non-date columns from the df:
quarter = df.drop(['RegionID','Metro','CountyName','SizeRank'], axis=1)
then converted the columns to datetimes:
quarter.columns = pd.to_datetime(quarter.columns)
Then I would like to do something like
quarter = quarter.groupby(pd.TimeGrouper(freq='3M'), axis=1)
but it's not working; afterwards I would merge it back with the non-date columns. Also, with this approach I wouldn't know how to give the groups the right labels, like [2015Q4, 2016Q1, 2016Q2, 2016Q3, 2016Q4].
Here is a vectorized solution which uses pd.PeriodIndex and groupby(..., axis=1):
Data:
In [69]: x
Out[69]:
   2016-01  2016-02  2016-03  2016-04  2016-05  2016-06
0        1        0        1        0        0        0
1        2        0        1        0        0        0
2        1        1        2        0        1        0
Solution:
In [70]: x.groupby(pd.PeriodIndex(x.columns, freq='Q'), axis=1).mean()
Out[70]:
     2016Q1    2016Q2
0  0.666667  0.000000
1  1.000000  0.000000
2  1.333333  0.333333
Explanation:
In [71]: pd.PeriodIndex(x.columns, freq='Q')
Out[71]: PeriodIndex(['2016Q1', '2016Q1', '2016Q1', '2016Q2', '2016Q2', '2016Q2'], dtype='period[Q-DEC]', freq='Q-DEC')
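Applied to the question's frame, the non-date columns can be set aside and joined back after aggregating. A sketch, assuming the column names from the question's CSV and the same groupby(..., axis=1) as above:
meta_cols = ['State', 'RegionName', 'RegionID', 'Metro', 'CountyName', 'SizeRank']
dates = df.drop(columns=meta_cols)
quarters = dates.groupby(pd.PeriodIndex(dates.columns, freq='Q'), axis=1).mean()
result = df[meta_cols].join(quarters)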
It's not pretty, but this is the first thing I thought of. It sounds like the date columns can be manipulated separately. Break those out into a separate dataframe of the form below. If the other fields are kept, the conversion to datetime will throw an error.
import numpy as np
import pandas as pd
csv_df = pd.DataFrame({'2016-01':[1,2,1], '2016-02':[0,0,1], '2016-03':[1,1,2], '2016-04':[0,0,0], '2016-05':[0,0,1], '2016-06':[0,0,0]})
# convert columns into datetime format
csv_df.rename(columns=lambda x: pd.to_datetime(x, format='%Y-%m'), inplace=True)
# now strip out the year and the quarter
csv_df.rename(columns=lambda x: str(x.year) + 'Q' + str(x.quarter), inplace=True)
# #lucarlig improved my suggestion by using groupby as follows
csv_df = csv_df.groupby(csv_df.columns, axis=1).mean()
Consider melting the dataframe from the wide format to the long format, parsing out the quarter and year using datetime, and running pivot_table() to transform back from long to wide, aggregating the values with mean:
import pandas as pd
import datetime as dt
import numpy as np
...
# MELT DATAFRAME
meltdf = pd.melt(df, id_vars=['State', 'RegionName', 'RegionID',
                              'Metro', 'CountyName', 'SizeRank'],
                 var_name='Date', value_name='Data')
# EXTRACT QUARTER
meltdf['Date'] = pd.to_datetime(meltdf['Date'] + '-01')
meltdf['YearQuarter'] = meltdf['Date'].dt.year.astype(str) + 'Q' + \
meltdf['Date'].dt.quarter.astype(str)
# PIVOT DATAFRAME
pivotdf = pd.pivot_table(meltdf, index=['State', 'RegionName', 'RegionID',
                                        'Metro', 'CountyName', 'SizeRank'],
                         columns=['YearQuarter'], values='Data', aggfunc=np.mean)
Output
print(pivotdf.head())
# State  RegionName    RegionID  Metro           CountyName    SizeRank  1996Q2       1996Q3       1996Q4       1997Q1       1997Q2       1997Q3      ...
# AZ     Phoenix       40326     Phoenix         Maricopa      5         87700        88600        89733.33333  91266.66667  92033.33333  93000
# CA     Los Angeles   12447     Los Angeles...  Los Angeles   2         154666.6667  154200       154433.3333  156866.6667  158533.3333  159266.6667
# IL     Chicago       17426     Chicago         Cook          3         109466.6667  109133.3333  109600       111266.6667  112200       112966.6667
# NY     New York      6181      New York        Queens        1
# PA     Philadelphia  13271     Philadelphia    Philadelphia  4         49833.33333  49366.66667  49466.66667  49600        49733.33333  49733.33333
