Slicing a multiindex on Pandas - python

I've got a dataframe with a multiindex of the form:
(label, date)
where label is a string and date is a DateTimeIndex.
I want to slice my dataframe by date; say for example, I want to get all the rows between 2007 and 2009:
df.loc[:, '2007':'2009']
It seems like the second part (where I've put the date) is actually slicing the column.
How do I slice on date?

You can use DatetimeIndex partial string indexing, which also works on a DataFrame with a MultiIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 1),
                  columns=['A'],
                  index=pd.MultiIndex.from_product(
                      [['a', 'b'],
                       pd.date_range('20050101', periods=10, freq='10M')]))
idx = pd.IndexSlice
df1 = df.loc[idx[:, '2007':'2009'], :]
print (df1)
A
a 2007-07-31 0.325027
2008-05-31 -1.307117
2009-03-31 -0.556454
b 2007-07-31 1.808920
2008-05-31 1.245404
2009-03-31 -0.425046
Another idea is to use loc with the axis=0 parameter:
df1 = df.loc(axis=0)[:, '2007':'2009']
print (df1)
A
a 2007-07-31 0.325027
2008-05-31 -1.307117
2009-03-31 -0.556454
b 2007-07-31 1.808920
2008-05-31 1.245404
2009-03-31 -0.425046
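A further option (a sketch of my own, not from the answers above, assuming the MultiIndex is lexsorted, e.g. after df.sort_index()): pass a tuple of slices directly to loc, with slice(None) selecting every label on the first level:
# A minimal sketch; this is what pd.IndexSlice builds under the hood.
df1 = df.loc[(slice(None), slice('2007', '2009')), :]
print (df1)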

Related

DataFrame 'groupby' is fixing group columns with index

I have used a simple 'groupby' to condense rows in a Pandas dataframe:
df = df.groupby(['col1', 'col2', 'col3']).sum()
In the new DataFrame 'df', the three columns that were used in the 'groupby' function are now fixed within the index and are no longer column indexes 0, 1 and 2 - what was previously column index 4 is now column index 0.
How do I stop this from happening / reinclude the three 'groupby' columns along with the original data?
Try -
df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
#or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
Try resetting the index
df = df.reset_index()
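A minimal sketch of the difference (my own example; 'val' is a hypothetical value column, not from the question):
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'x', 'y'],
                   'col2': ['a', 'a', 'b'],
                   'col3': [1, 1, 2],
                   'val': [10, 20, 30]})

# Grouping keys move into the index (the behaviour described in the question).
print (df.groupby(['col1', 'col2', 'col3']).sum())

# as_index=False keeps the grouping keys as ordinary columns.
print (df.groupby(['col1', 'col2', 'col3'], as_index=False).sum())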

Python sum with condition using a date and a condition

I have two dataframes and I am using pandas.
I want to do a cumulative sum starting from a variable date, per value in a column.
I want to add a second column to df2 that shows the first day on which the cumulative sum of the AVG column exceeds 100, counting from date2 in df2.
For example with df1 and df2 being the dataframe I start with and df3 what I want and df3['date100'] is the day the sum of avg is greater than 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014',
                              '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
                    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers but most of them use groupby, and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution assumes that your DataFrame is sorted by date.
The solution:
# For this solution your DataFrame needs to be sorted by date.
import pandas as pd

limit = 100
df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014',
              '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

result = []
for row in df2.to_dict('records'):
    # For each date, I want to select the dates that come AFTER this one.
    # Then, I take the .cumsum(), because it's the agg you wish to do.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict and appending it to a list makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[(df['date1'] >= row['date2']) & (df['Place'] == row['Place'])]\
            .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ndf['avgsum'] >= limit]
    # Error handling, in case there is no avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1': 'date100'})
        result.append(final_df.to_dict())
    except IndexError:
        continue

df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
# date2 Place avgsum date100
# 0 1/1/2014 A 123.0 3/1/2014
# 1 2/1/2014 C NaN NaN
Here is a direct solution, with the following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum())
                                       .query('AVG>=100').iloc[0]).transpose()
                          for d in df2.date2])
                 .rename_axis('ix').reset_index())\
         .join(df1.drop(columns='AVG'), on='ix')\
         .rename(columns={'AVG': 'sum', 'date1': 'date100'})\
         .drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2, find the first date when the cumulative sum of AVG reaches at least 100
combine the results into one single dataframe indexed by the index of that line in df1
store that index in an ix column and reset the index to join that dataframe to df2
join that to df1 minus the AVG column using the ix column
rename the columns, remove the ix column, and re-order everything
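One caveat of my own, not part of either answer: both solutions compare the sample dates as plain strings, which happens to work for this small example; with real data you would normally parse the dates first, for instance:
# A minimal sketch; the format '%m/%d/%Y' is an assumption about the sample dates.
df['date1'] = pd.to_datetime(df['date1'], format='%m/%d/%Y')
df2['date2'] = pd.to_datetime(df2['date2'], format='%m/%d/%Y')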

Inner join of dataframes based on datetime

I have two dataframes df1 and df2.
df1.index
DatetimeIndex(['2001-09-06', '2002-08-04', '2000-01-22', '2000-12-19',
'2008-02-09', '2010-07-07', '2011-06-04', '2007-03-14',
'2003-05-17', '2016-02-27',..dtype='datetime64[ns]', name=u'DateTime', length=6131, freq=None)
df2.index
DatetimeIndex(['2002-01-01 01:00:00', '2002-01-01 10:00:00',
'2002-01-01 11:00:00', '2002-01-01 12:00:00',
'2002-01-01 13:00:00', '2002-01-01 14:00:00',..dtype='datetime64[ns]', length=129273, freq=None)
i.e. df1 has an index of days and df2 has an index of datetimes. I want to perform an inner join of df1 and df2 on the indexes, such that if the date corresponding to an hour in df2 is available in df1, we consider the inner join as true, else false.
I want to obtain two df11 and df22 as output. df11 will have common dates and corresponding columns from df1. df22 will have common date-hours and corresponding columns from df2.
E.g. '2002-08-04' in df1 and '2002-08-04 01:00:00' in df2 is considered present in both.
If however '1802-08-04' in df1 has no hour in df2, it is not present in df11.
If however '2045-08-04 01:00:00' in df2 has no date in df1, it is not present in df22.
Right now I am using numpy in1d and pandas normalize functions to achieve this task in a lengthy manner. I was looking for a more Pythonic way to achieve this.
Consider a dummy DF constructed as shown:
import numpy as np
import pandas as pd

idx1 = pd.date_range(start='2000/1/1', periods=100, freq='12D')
idx2 = pd.date_range(start='2000/1/1', periods=100, freq='300H')
np.random.seed([42, 314])
DF whose DatetimeIndex carries only a date attribute:
df1 = pd.DataFrame(np.random.randint(0,10,(100,2)), idx1)
df1.head()
DF whose DatetimeIndex carries a date + time attribute:
df2 = pd.DataFrame(np.random.randint(0,10,(100,2)), idx2)
df2.head()
Get common index considering only matching dates as the distinguishing parameter.
intersect = pd.Index(df2.index.date).intersection(df1.index)
First common-index DF, containing the columns of its original dataframe:
df11 = df1.loc[intersect]
df11
Second common-index DF, containing the columns of its original dataframe:
df22 = df2.iloc[np.where(df2.index.date.reshape(-1,1) == intersect.values)[0]]
df22
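An alternative sketch of my own (not part of the answer above), assuming df1's index holds timestamps at midnight: normalize df2's index to dates and use isin in both directions:
# normalize() floors each timestamp in df2 to midnight, so it can be
# compared directly against df1's date-only DatetimeIndex.
df11 = df1[df1.index.isin(df2.index.normalize())]
df22 = df2[df2.index.normalize().isin(df1.index)]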

Converting rows in pandas dataframe to columns

I want to convert rows in the following pandas dataframe to column headers:
transition area
0 A_to_B -9.339710e+10
1 B_to_C 2.135599e+02
result:
A_to_B B_to_C
0 -9.339710e+10 2.135599e+02
I tried using pivot table, but that does not seem to give the result I want.
I think you can first use set_index with the column transition, then transpose with T, remove the columns name with rename_axis, and finally reset_index:
print (df.set_index('transition').T.rename_axis(None, axis=1).reset_index(drop=True))
A_to_B B_to_C
0 -9.339710e+10 213.5599
Alternatively, transpose and promote the first row to the column header:
df = df.T
df.columns = df.iloc[0, :]  # the transition labels become the header
df = df.iloc[1:, :]         # drop that first row, keeping only the area values
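Since the question mentions pivot tables, here is a sketch of my own using pivot_table, applied to the original df from the question (before the transpose above); with a single value per transition, aggfunc='first' just keeps that value:
# With no index passed, pivot_table aggregates over all rows and returns a
# single row; clean up the axis names afterwards to match the desired output.
out = df.pivot_table(columns='transition', values='area', aggfunc='first')
print (out.rename_axis(None, axis=1).reset_index(drop=True))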

pandas compare two dataframes with criteria

I have two dataframes, df1 and df2.
I would like to get whatever values are common to df1 and df2 where df2's dt value is greater than df1's dt value.
In this case, the expected value is 'fee'.
import pandas as pd

df1 = pd.DataFrame([['2015-01-01 06:00', 'foo'],
                    ['2015-01-01 07:00', 'fee'],
                    ['2015-01-01 08:00', 'fum']],
                   columns=['dt', 'value'])
df1.dt = pd.to_datetime(df1.dt)

df2 = pd.DataFrame([['2015-01-01 06:10', 'zoo'],
                    ['2015-01-01 07:10', 'fee'],
                    ['2015-01-01 08:10', 'feu'],
                    ['2015-01-01 09:10', 'boo']],
                   columns=['dt', 'value'])
df2.dt = pd.to_datetime(df2.dt)
One way would be to merge on the 'value' column, so this produces only the matching rows; you can then filter the merged df using the 'dt_x' and 'dt_y' columns:
In [15]:
merged = df2.merge(df1, on='value')
merged[merged['dt_x'] > merged['dt_y']]
Out[15]:
dt_x value dt_y
0 2015-01-01 07:10:00 fee 2015-01-01 07:00:00
You can't do something like the following because the lengths don't match:
df2[ (df2['value'].isin(df1['value'])) & (df2['dt'] > df1['dt']) ]
raises:
ValueError: Series lengths must match to compare
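If you only need the matching values themselves, a small follow-up sketch on the merged frame from above:
# Keep only the 'value' column of rows where df2's timestamp is later than df1's.
common = merged.loc[merged['dt_x'] > merged['dt_y'], 'value']
print (common.tolist())  # ['fee']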
