Expand pandas dataframe based on range in a column - python

I have a pandas dataframe like this:
Name SICs
Agric 0100-0199
Agric 0910-0919
Agric 2048-2048
Food 2000-2009
Food 2010-2019
Soda 2097-2097
The SICs column gives a range of integer values that match the Name given in the first column (although they're stored as a string).
I need to expand this DataFrame so that it has one row for each integer in the range:
Agric 100
Agric 101
Agric 102
...
Agric 199
Agric 910
Agric 911
...
Agric 919
Agric 2048
Food 2000
...
Is there a particularly good way to do this? I was going to do something like this
ranges = {i:r.split('-') for i, r in enumerate(inds['SICs'])}
ranges_expanded = {}
for r in ranges:
ranges_expanded[r] = range(int(ranges[r][0]),int(ranges[r][1])+1)
but I wonder if there's a better way or perhaps a pandas feature to do this. (Also, I'm not sure this will work, as I don't yet see how to read the ranges_expanded dictionary into a DataFrame.)

Quick and dirty but I think this gets you to what you need:
from io import StringIO
import pandas as pd
players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")
df = pd.DataFrame.from_csv(players, sep=",", parse_dates=False).reset_index()
df2 = pd.DataFrame(columns=('Name', 'SIC'))
count = 0
for idx,r in df.iterrows():
data = r['SICs'].split("-")
for i in range(int(data[0]), int(data[1])+1):
df2.loc[count] = (r['Name'], i)
count += 1

The neatest way I found (building on from Andy Hayden's answer):
# Extract date min and max
df = df.set_index("Name")
df = df['SICs'].str.extract("(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')
# Enumerate dates into wide table
enumerated_dates = [np.arange(row['min'], row['max']+1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_dates, index=df.index)
# Convert from wide to long table
df = df.stack().reset_index(1, drop=True)
It is however slow due to the for loop. A vectorised solution would be amazing but I cant find one.

You can use str.extract to get strings from a regular expression:
In [11]: df
Out[11]:
Name SICs
0 Agri 0100-0199
1 Agri 0910-0919
2 Food 2000-2009
First take out the name as that's the thing we want to keep:
In [12]: df1 = df.set_index("Name")
In [13]: df1
Out[13]:
SICs
Name
Agri 0100-0199
Agri 0910-0919
Food 2000-2009
In [14]: df1['SICs'].str.extract("(\d+)-(\d+)")
Out[14]:
0 1
Name
Agri 0100 0199
Agri 0910 0919
Food 2000 2009
Then flatten this with stack (which adds a MultiIndex):
In [15]: df1['SICs'].str.extract("(\d+)-(\d+)").stack()
Out[15]:
Name
Agri 0 0100
1 0199
0 0910
1 0919
Food 0 2000
1 2009
dtype: object
If you must you can remove the 0-1 level of the MultiIndex:
In [16]: df1['SICs'].str.extract("(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri 0100
Agri 0199
Agri 0910
Agri 0919
Food 2000
Food 2009
dtype: object

Related

pandas average across dynamic number of columns

I have a dataframe like as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # i also tried this but how to do for only two columns (and not all columns)
But if I wish to compute average for past 12 months, then it may not be elegant to write this way. Is there any other better or efficient way to write this? I can just key in number of columns to look back and it can compute the average based on keyed in input
I expect my output to be like as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1867 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
if column order it not guaranteed
Sort them with natural sorting
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
you could try something like this in reference to this post:
n_months = 4 # you could also do this in a loop for all months range(1, 12)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:-1].mean(axis=1)

Delete entire dataframe rows based on condition and stack remaining rows

I have the following dataframe:
Date Name Grade Hobby
01/01/2005 Albert 4 Drawing
08/04/1996 Martha 6 Horseback riding
03/03/2003 Jack 5 Singing
07/01/2001 Millie 5 Netflix
24/09/2000 Julie 7 Sleeping
...
I want to filter the df to only contain the rows for repeat dates, so where df['Date'].value_counts()>=2
And then groupby dates sorted in chronological order so that I can have something like:
Date Name Grade Hobby
08/08/1996 Martha 6 Horseback riding
Matt 4 Sleeping
Paul 5 Cooking
24/09/2000 Julie 7 Sleeping
Simone 4 Sleeping
...
I have tried some code, but I get stuck on the first step. I tried something like:
same=df['Date'].value_counts()
same=same.loc[lambda x:x >=2]
mult=same.index.to_list()
for i in df['Date']:
if i not in mult:
df.drop(df[df['Date'==i]].index)
I also tried
new=df.loc[df['Date'].isin(mult)]
plot=pd.pivot_table(new, index=['Date'],columns=['Name'])
But this only gets 1 of the rows per each repeat dates instead of all the rows with the same date
Think this should do the job
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df_new = df[df['Date'].duplicated(keep=False)].sort_values('Date')
Convert Date to datetimes by to_datetime, then filter rows in boolean indexing and last sorting by DataFrame.sort_values:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
same=df['Date'].value_counts()
df1 = df[df['Date'].map(same) >= 2].sort_values(['Date','Grade'], ascending=[True, False])
Or use Series.duplicated with keep=False for count 2 and more, what is same like duplicates:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df[df['Date'].duplicated(keep=False)].sort_values(['Date','Grade'], ascending=[True, False])

localizing rows from a dataframe

I working with a dataframe which has 20 columns but I'm only going to use three of them in my task, which are named "Price","Retail" and "Profit" and are like this:
cols = ['Price', 'Retail','Profit']
df3.loc[:,cols]
Price Retail Profit
0 861.5 1315.233051 453.733051
1 901.5 1315.233051 413.733051
2 911.0 1315.233051 404.233051
3 901.5 1315.233051 413.733051
4 901.5 1315.233051 413.733051
... ... ... ...
2678 14574.0 21546.730769 6972.730769
2679 35708.5 52026.764706 16318.264706
2680 35708.5 52026.764706 16318.264706
2681 163276.5 250882.500000 87606.000000
2682 7369.5 11785.729730 4416.229730
2683 rows × 3 columns
My goal is to find the lines where the prices are lower than 5000 and sort by the biggest values of profit. How can I make it?
You for example use query and sort_values
df3.query("Price < 5000").sort_values("Profit", ascending=False)
You can do this:
df_pofit_less_5000 = df[df['Price']<5000]
You can start by subsetting the dataframe where price is lower than 5000:
df = df[df["Price"] < 5000]
Then sort by Profit descending:
df = df.sort_values(by = ["Profit"], ascending=False)

pandas how to use groupby to group columns by date in the label?

I have a dataframe 10730 rows × 249 columns, i have columns:
Index(['RegionID', 'Metro', 'CountyName', 'SizeRank', '1996-04', '1996-05',
'1996-06', '1996-07', '1996-08', '1996-09',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=249)
so what i need to do is group the columns by the quarter, jan to march Q1, and so on till Q4(using mean for the values). i know how to group 3 columns for example, but how do i group all the columns since i cannot specify the name of the column one by one.
This is the dataframe head in csv to use for testing:
'State,RegionName,RegionID,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,1999-11,1999-12,2000-01,2000-02,2000-03,2000-04,2000-05,2000-06,2000-07,2000-08,2000-09,2000-10,2000-11,2000-12,2001-01,2001-02,2001-03,2001-04,2001-05,2001-06,2001-07,2001-08,2001-09,2001-10,2001-11,2001-12,2002-01,2002-02,2002-03,2002-04,2002-05,2002-06,2002-07,2002-08,2002-09,2002-10,2002-11,2002-12,2003-01,2003-02,2003-03,2003-04,2003-05,2003-06,2003-07,2003-08,2003-09,2003-10,2003-11,2003-12,2004-01,2004-02,2004-03,2004-04,2004-05,2004-06,2004-07,2004-08,2004-09,2004-10,2004-11,2004-12,2005-01,2005-02,2005-03,2005-04,2005-05,2005-06,2005-07,2005-08,2005-09,2005-10,2005-11,2005-12,2006-01,2006-02,2006-03,2006-04,2006-05,2006-06,2006-07,2006-08,2006-09,2006-10,2006-11,2006-12,2007-01,2007-02,2007-03,2007-04,2007-05,2007-06,2007-07,2007-08,2007-09,2007-10,2007-11,2007-12,2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08\nNY,New York,6181,New York,Queens,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,432600.0,438700.0,440500.0,433900.0,422000.0,415700.0,421200.0,431100.0,435100.0,431900.0,428400.0,430700.0,438800.0,446800.0,455400.0,465500.0,472600.0,478200.0,487600.0,498600.0,508800.0,515300.0,517000.0,517800.0,520800.0,521500.0,523000.0,526300.0,524800.0,519100.0,516200.0,516400.0,516300.0,515500.0,512200.0,509200.0,509800.0,511600.0,512700.0,514000.0,513400.0,510700.0,508100.0,506700.0,505200.0,503700.0,502900.0,502400.0,500500.0,496400.0,491900.0,487500.0,484400.0,481700.0,477900.0,473600.0,469700.0,466100.0,461700.0,457700.0,455300.0,454800.0,456000.0,457800.0,461300.0,466100.0,470200.0,472800.0,475300.0,477100.0,478400.0,479100.0,478900.0,477700.0,476700.0,477100.0,478000.0,478000.0,476800.0,475300.0,473800.0,472000.0,470600.0,469900.0,469500.0,468200.0,465800.0,463500.0,461800.0,460100.0,459700.0,460800.0,461700.0,462500.0,463900.0,466000.0,467500.0,468200.0,468700.0,469400.0,469400,469100.0,468700,469300,470300,472100,474300,477600,481400,485100,488800,492600,495900,499500,503500,506400,509900,515700,520800,522200,522400,523800,526200,528400,529600,530800,532200,533800,536200,540600,545600,551400,557200,563000,568700,573600,576200,578400,582200,588000,592200,592500,590200,588000,586400\nCA,Los Angeles,12447,Los Angeles-Long Beach-Anaheim,Los Angeles,2,155000.0,154600.0,154400.0,154200.0,154100.0,154300.0,154300.0,154200.0,154800.0,155900.0,157000.0,157700.0,158200.0,158600.0,158800.0,158900.0,159100.0,159800.0,160700.0,161900.0,163400.0,165400.0,167000.0,168500.0,169900.0,171400.0,172900.0,174300.0,175800.0,177800.0,180100.0,182600.0,184400.0,185600.0,186900.0,188200.0,189600.0,191300.0,193100.0,194700.0,196300.0,197700.0,199100.0,200700.0,202300.0,204400.0,207000.0,209800.0,212300.0,214500.0,216600.0,219000.0,221100.0,222800.0,224300.0,226100.0,228100.0,230600.0,233000.0,235400.0,237300.0,239100.0,240900.0,242900.0,245000.0,247300.0,250100.0,253100.0,255900.0,258800.0,261900.0,265200.0,268600.0,272600.0,276900.0,281800.0,287000.0,292200.0,297000.0,302100.0,307600.0,313400.0,319000.0,324300.0,329600.0,334600.0,339300.0,344500.0,350600.0,356800.0,363400.0,370700.0,378400.0,386500.0,394900.0,404300.0,414600.0,425500.0,436600.0,447400.0,456700.0,464400.0,471200.0,477400.0,483500.0,489100.0,494700.0,501400.0,509700.0,518300.0,527200.0,536100.0,545400.0,555200.0,564500.0,571900.0,576800.0,579700.0,581800.0,583800.0,585300.0,587300.0,589900.0,592200.0,593300.0,593400.0,593100.0,592900.0,591600.0,590900.0,591800.0,592600.0,592100.0,590200.0,586200.0,581600.0,577500.0,572800.0,567600.0,562100.0,554400.0,545000.0,535500.0,525400.0,513600.0,502000.0,491200.0,480200.0,469000.0,459300.0,451200.0,443900.0,436800.0,430900.0,426100.0,421800.0,417800.0,413700.0,410200.0,407900.0,406300.0,404900.0,404200.0,402900.0,405900.0,412000.0,415000.0,413100.0,412100.0,411300.0,410100.0,408400.0,406800.0,405100.0,403300.0,401900.0,401000.0,399200.0,397100.0,395000.0,392700.0,390200.0,387400.0,384700.0,382100.0,379500.0,377200.0,375700.0,373800.0,371500.0,370000.0,370300.0,372100.0,375300.0,378600.0,382100.0,385600.0,389000.0,391800.0,396400.0,401500,405700.0,410700,418200,425500,432700,440400,448100,455200,461900,467800,472300,475700,479400,484000,489400,494200,498100,501800,505600,509000,512600,516000,518900,521700,525100,528900,532400,535300,538200,541000,544000,547200,550600,554200,558200,560800,562800,565600,569700,574000,577800,580600,583000,585100\nIL,Chicago,17426,Chicago,Cook,3,109700.0,109400.0,109300.0,109300.0,109100.0,109000.0,109000.0,109600.0,110200.0,110800.0,111300.0,111700.0,112200.0,112300.0,112100.0,112200.0,113000.0,113700.0,114200.0,114800.0,115500.0,116200.0,117100.0,117600.0,117800.0,118300.0,119200.0,120000.0,120600.0,121500.0,122300.0,122700.0,122900.0,123300.0,123700.0,124500.0,125700.0,127300.0,128800.0,130200.0,131400.0,132600.0,133700.0,134600.0,135500.0,136800.0,138300.0,140100.0,141900.0,143700.0,145300.0,146700.0,147900.0,149000.0,150400.0,152000.0,154000.0,155600.0,157000.0,158200.0,159900.0,161800.0,163700.0,165300.0,166400.0,167500.0,168800.0,170400.0,172100.0,173900.0,175600.0,177000.0,177800.0,177600.0,177300.0,177700.0,178800.0,180400.0,182300.0,183800.0,185000.0,185600.0,186800.0,188900.0,191300.0,194100.0,197500.0,200200.0,202300.0,203700.0,204000.0,204000.0,204400.0,205300.0,206300.0,207000.0,207600.0,208600.0,209600.0,210900.0,212800.0,214600.0,216400.0,218300.0,220300.0,222300.0,224000.0,225400.0,226900.0,228600.0,230100.0,231800.0,233200.0,234500.0,236000.0,237500.0,239000.0,240800.0,242500.0,243900.0,244900.0,245300.0,245400.0,245800.0,245800.0,245500.0,245900.0,246900.0,247300.0,247400.0,247300.0,247000.0,246700.0,246400.0,246100.0,246100.0,246300.0,246400.0,246700.0,247100.0,246700.0,245300.0,243900.0,242000.0,239800.0,237900.0,236000.0,233500.0,231800.0,230700.0,229200.0,226700.0,225200.0,224500.0,223800.0,223000.0,221900.0,219700.0,217500.0,215600.0,213800.0,212900.0,212300.0,211900.0,210800.0,209300.0,207300.0,205300.0,204200.0,204100.0,203100.0,201100.0,199000.0,196700.0,193800.0,191100.0,189200.0,188100.0,187600.0,186500.0,184400.0,181700.0,178700.0,175900.0,174100.0,172800.0,171400.0,170100.0,169100.0,167900.0,166700.0,166200.0,166400.0,166800.0,167900.0,168900.0,168400.0,167100.0,166900.0,167300.0,167500,167700.0,168300,169100,170400,172400,175100,178200,181000,183200,184600,185800,187200,189100,191100,192500,192600,192400,192900,193900,195600,197800,200100,201700,202000,201200,200500,201500,204000,206500,207600,207700,208100,209100,209000,207800,206900,206200,205800,206200,207300,208200,209100,211000,213000\nPA,Philadelphia,13271,Philadelphia,Philadelphia,4,50000.0,49900.0,49600.0,49400.0,49400.0,49300.0,49300.0,49400.0,49700.0,49600.0,49500.0,49700.0,49800.0,49700.0,49700.0,49800.0,49700.0,49700.0,49800.0,49900.0,49900.0,50000.0,50300.0,50600.0,50800.0,50800.0,50800.0,50800.0,50700.0,50500.0,50500.0,50700.0,50700.0,50800.0,50900.0,51100.0,51200.0,51400.0,51500.0,51400.0,51500.0,51800.0,52100.0,52100.0,52300.0,52700.0,53100.0,53200.0,53400.0,53700.0,53800.0,53800.0,54100.0,54500.0,54700.0,54600.0,54800.0,55100.0,55400.0,55500.0,55400.0,55500.0,55700.0,55900.0,56300.0,56600.0,57000.0,57500.0,58100.0,58600.0,59100.0,59700.0,60300.0,60700.0,61200.0,61800.0,62200.0,62500.0,63000.0,63600.0,63900.0,64200.0,64700.0,65300.0,65700.0,66100.0,66800.0,67700.0,68500.0,69200.0,69800.0,70700.0,71700.0,72800.0,73700.0,74700.0,75700.0,76700.0,77800.0,79100.0,80500.0,82100.0,84000.0,85600.0,87000.0,88200.0,89600.0,91300.0,93000.0,94900.0,96700.0,98400.0,100200.0,101900.0,103400.0,104900.0,106400.0,107500.0,108200.0,109300.0,110800.0,112500.0,113800.0,114800.0,115600.0,116000.0,116400.0,116700.0,116800.0,116900.0,117300.0,117800.0,118200.0,118600.0,119300.0,120200.0,120900.0,121400.0,121300.0,120900.0,120200.0,119600.0,119600.0,119500.0,118800.0,118100.0,117500.0,117100.0,117000.0,116700.0,116300.0,115800.0,115500.0,115900.0,116300.0,116400.0,116400.0,116100.0,116000.0,116200.0,116700.0,117300.0,118000.0,118200.0,119500.0,120900.0,121300.0,121300.0,122100.0,123000.0,123300.0,122300.0,120000.0,118200.0,117600.0,117900.0,117800.0,117400.0,117000.0,116900.0,116700.0,116500.0,115700.0,115300.0,115500.0,115600.0,115200.0,114800.0,114100.0,113500.0,112900.0,111800.0,110800.0,110400.0,110400.0,110200.0,109900.0,109700.0,110000.0,110700.0,111800,112100.0,111900,112000,112200,111800,111200,111000,110900,111100,111800,112700,112900,113100,113900,114200,113600,113500,114100,114900,115500,115500,115400,115600,116000,116100,116100,116400,117000,117900,119000,120100,121300,122300,122700,122300,121600,121800,123300,125200,126400,127000,127400,128300,129100\nAZ,Phoenix,40326,Phoenix,Maricopa,5,87200.0,87700.0,88200.0,88400.0,88500.0,88900.0,89400.0,89700.0,90100.0,90700.0,91400.0,91700.0,91800.0,92000.0,92300.0,92600.0,93000.0,93400.0,94000.0,94600.0,95300.0,96100.0,96800.0,97300.0,97700.0,98400.0,99200.0,100100.0,100500.0,100700.0,100900.0,101700.0,102600.0,103400.0,103900.0,104400.0,105100.0,105900.0,106200.0,106600.0,107400.0,108300.0,109000.0,109700.0,110400.0,111000.0,111700.0,112800.0,113700.0,114300.0,115100.0,115600.0,115900.0,116500.0,117200.0,117400.0,117600.0,118400.0,119700.0,120700.0,121200.0,121500.0,122000.0,122400.0,122700.0,123000.0,123600.0,124300.0,125000.0,125800.0,126600.0,127200.0,127900.0,128400.0,128800.0,129500.0,130500.0,131600.0,132500.0,133200.0,134000.0,134900.0,135700.0,136500.0,137200.0,138000.0,138600.0,138900.0,139200.0,139400.0,139600.0,140300.0,141400.0,142500.0,143700.0,144900.0,145900.0,147100.0,148400.0,150300.0,153100.0,156200.0,159400.0,162900.0,166500.0,170000.0,173900.0,178800.0,185000.0,192300.0,200700.0,209400.0,217000.0,223600.0,229800.0,234900.0,238600.0,241300.0,243000.0,244100.0,244800.0,245400.0,245600.0,245600.0,245300.0,244600.0,243800.0,243400.0,243400.0,243600.0,243200.0,242200.0,241300.0,240200.0,238400.0,236400.0,234700.0,233300.0,231600.0,229100.0,226100.0,222800.0,218800.0,214300.0,209500.0,205200.0,201100.0,197300.0,193700.0,190300.0,186700.0,182800.0,180500.0,179600.0,178000.0,175100.0,172100.0,168400.0,164200.0,160000.0,156000.0,151800.0,147600.0,143900.0,138900.0,133400.0,130200.0,129200.0,127700.0,126200.0,124800.0,123100.0,120700.0,118500.0,117000.0,115800.0,114800.0,114100.0,113200.0,111800.0,110100.0,108000.0,105900.0,104100.0,102900.0,102300.0,102400.0,103000.0,104100.0,105800.0,107600.0,109100.0,111200.0,114000.0,117200.0,120400.0,123300.0,125800.0,128300.0,130500.0,132500,134400.0,136200,138400,141600,144700,147400,150500,153600,156100,158100,160000,161600,162700,163300,163700,164100,164200,164500,164700,165200,166200,167200,168400,169900,171000,171500,172100,172900,174100,175500,177100,179100,181000,182400,183800,185300,186600,188000,189100,190200,191300,192800,194500,195900\n'
I changed the column index to date by dropping the non dates from the df quarter = df.drop(['RegionID','Metro','CountyName','SizeRank'],axis=1)
then change the columns to date quarter.columns = pd.to_datetime(quarter.columns) then i would like to do something likequarter = quarter.groupby(pd.TimeGrouper(freq='3M'),axis=1) but it's not working, then i would merge it back to the non-date columns. Also with this approach i wouldnt know how to put the right label for it like [2015Q4,2016Q1,2016Q2,2016Q3,2016Q4]
Here is a vectorized solution which uses pd.PeriodIndex and groupby(..., axis=1):
Data:
In [69]: x
Out[69]:
2016-01 2016-02 2016-03 2016-04 2016-05 2016-06
0 1 0 1 0 0 0
1 2 0 1 0 0 0
2 1 1 2 0 1 0
Solution:
In [70]: x.groupby(pd.PeriodIndex(x.columns, freq='Q'), axis=1).mean()
Out[70]:
2016Q1 2016Q2
0 0.666667 0.000000
1 1.000000 0.000000
2 1.333333 0.333333
Explanation:
In [71]: pd.PeriodIndex(x.columns, freq='Q')
Out[71]: PeriodIndex(['2016Q1', '2016Q1', '2016Q1', '2016Q2', '2016Q2', '2016Q2'], dtype='period[Q-DEC]', freq='Q-DEC')
It's not pretty, but this is the first thing I thought of. It sounds like the date columns can be manipulated separately. Break those out into a separate dataframe of the form below. If the other fields are kept, the conversion to datetime will throw an error.
import numpy as np
import pandas as pd
csv_df = pd.DataFrame({'2016-01':[1,2,1], '2016-02':[0,0,1], '2016-03':[1,1,2], '2016-04':[0,0,0], '2016-05':[0,0,1], '2016-06':[0,0,0]})
# convert columns into datetime format
csv_df.rename(columns=lambda x: pd.to_datetime(x, format='%Y-%m'), inplace=True)
# now strip out the year and the quarter
csv_df.rename(columns=lambda x: str(x.year) + 'Q' + str(x.quarter), inplace=True)
# #lucarlig improved my suggestion by using groupby as follows
csv_df = csv_df.groupby(csv_df.columns, axis=1).mean()
Consider melting the dataframe from the wide format to long format, parse out the quarter and year using datetime and run a pivot_table() to transform back from long to wide aggregating the values with mean:
import pandas as pd
import datetime as dt
import numpy as np
...
# MELT DATAFRAME
meltdf = pd.melt(df, id_vars = ['State','RegionName','RegionID',
'Metro','CountyName','SizeRank'],
var_name = 'Date', value_name = 'Data')
# EXTRACT QUARTER
meltdf['Date'] = pd.to_datetime(meltdf['Date'] + '-01')
meltdf['YearQuarter'] = meltdf['Date'].dt.year.astype(str) + 'Q' + \
meltdf['Date'].dt.quarter.astype(str)
# PIVOT DATAFRAME
pivotdf = pd.pivot_table(meltdf, index=['State','RegionName','RegionID',
'Metro','CountyName','SizeRank'],
columns=['YearQuarter'], values='Data', aggfunc=np.mean)
Output
print(pivotdf.head())
# State RegionName RegionID Metro CountyName SizeRank 1996Q2 1996Q3 1996Q4 1997Q1 1997Q2 1997Q3 ...
# AZ Phoenix 40326 Phoenix Maricopa 5 87700 88600 89733.33333 91266.66667 92033.33333 93000
# CA Los Angeles 12447 Los Angeles...Los Angeles 2 154666.6667 154200 154433.3333 156866.6667 158533.3333 159266.6667
# IL Chicago 17426 Chicago Cook 3 109466.6667 109133.3333 109600 111266.6667 112200 112966.6667
# NY New York 6181 New York Queens 1
# PA Philadelphia 13271 Philadelphia Philadelphia 4 49833.33333 49366.66667 49466.66667 49600 49733.33333 49733.33333

pandas replace null values for a subset of columns

I have a data frame with many columns, say:
df:
name salary age title
John 100 35 eng
Bill 200 NaN adm
Lena NaN 28 NaN
Jane 120 45 eng
I want to replace the null values in salary and age, but no in the other columns. I know I can do something like this:
u = df[['salary', 'age']]
df[['salary', 'age']] = u.fillna(-1)
But this seems terse as it involves copying. Is there a more efficient way to do this?
According to Pandas documentation in 23.3
values = {'salary': -1, 'age': -1}
df.fillna(value=values, inplace=True)
Try this:
subset = ['salary', 'age']
df.loc[:, subset] = df.loc[:, subset].fillna(-1)
It is not so beautiful, but it works:
df.salary.fillna(-1, inplace=True)
df.age.fillna(-1, inplace=True)
df
>>> name salary age title
0 John 101.0 35.0 eng
1 Bill 200.0 -1.0 adm
2 Lena -1.0 28.0 NaN
3 Jane 120.0 45.0 eng
I was hoping fillna() had subset parameter like drop(), maybe should post request to pandas however this is the cleanest version in my opinion.
df[["salary", "age"]] = df[["salary", "age"]].fillna(-1)
You can do:
df = df.assign(
salary=df.salary.fillna(-1),
age=df.age.fillna(-1),
)
if you want to chain it with other operations.

Categories

Resources