Groupby on categorical column with date and period producing unexpected result - python

The output below was produced with Python 2.7.11 and Pandas 0.17.1.
When grouping on a categorical column together with both a period column and a date column, unexpected rows appear in the result. Is this a Pandas bug, or could it be something else?
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping by month and date together with a regular (non-categorical) column such as cat2 works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on the categorical column produces unexpected results. Notice in the index that the extra dates do not correspond to the grouped month:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month period alone or by date alone works fine; the problem only appears when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5

By design, grouping on a categorical always produces the full set of categories (and hence the full cross-product with the other group keys), even when no data falls in a given category; see the example in the groupby section of the docs.
You can either avoid the categorical dtype, or add a .dropna(how='all') after your grouping step.
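For example, with the df from the question, dropping the all-NaN rows after the grouping step restores the expected shape:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack().dropna(how='all')
On newer pandas versions you can also pass observed=True to groupby so that unused category combinations are never materialised in the first place.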

Related

Copying a column from one pandas dataframe to another

I have the following code, where I try to copy the EXPIRATION column from the recent dataframe to the EXPIRATION column in the destination dataframe:
import pandas as pd

recent = pd.read_excel(r'Y:\Attachments' + '\\' + '962021.xlsx')
print('HERE\n', recent)
print('HERE2\n', recent['EXPIRATION'])
destination = pd.read_excel(r'Y:\Attachments' + '\\' + 'Book1.xlsx')
print('HERE3\n', destination)
destination['EXPIRATION'] = recent['EXPIRATION']
print('HERE4\n', destination)
The problem is that destination has fewer rows than recent, so some of the lower rows of the EXPIRATION column from recent do not end up in the destination dataframe. I want all the EXPIRATION values from recent to be in the destination dataframe, even if all the other values in those rows are NaN.
Example Output:
HERE
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 0 21 6/9/2021 B 08/06/2021 BNP FP E C 12 12 CONDORI 12 9:23:40 ETF NASDAQ
1 1 22 6/9/2021 B 16/06/2021 BNP FP E P 12 12 GOLD/SILVER 12 10:9:19 ETF NASDAQ
2 2 23 6/9/2021 B 16/06/2021 TEST P 12 12 CONDORI 21 10:32:12 EQT TEST
3 3 24 6/9/2021 B 22/06/2021 TEST P 12 12 GOLD/SILVER 12 10:35:5 EQT NASDAQ
4 4 0 6/9/2021 B 26/06/2021 TEST P 12 12 GOLD/SILVER 12 10:37:11 ETF FTSE100
HERE2
0 08/06/2021
1 16/06/2021
2 16/06/2021
3 22/06/2021
4 26/06/2021
Name: EXPIRATION, dtype: object
HERE3
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
HERE4
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 08/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joining on a key is generally the best approach, but you have no id column apart from the default pandas index, and destination contains only NaNs, so if you are sure that row ordering is not a problem you can simply take destination's other columns and recent's EXPIRATION and let concat align them on the index:
>>> destination = pd.concat([destination.drop(columns='EXPIRATION'), recent[['EXPIRATION']]], axis=1)[destination.columns]
>>> destination
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION ...
0 NaN NaN NaN NaN 08/06/2021 ...
1 NaN NaN NaN NaN 16/06/2021 ...
2 NaN NaN NaN NaN 16/06/2021 ...
3 NaN NaN NaN NaN 22/06/2021 ...
4 NaN NaN NaN NaN 26/06/2021 ...
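Equivalently, you can reindex destination to recent's index first and then keep the question's assignment style (a sketch, assuming both frames use the default RangeIndex):
# stretch destination to recent's length, then assign; the index aligns the values
destination = destination.reindex(recent.index)
destination['EXPIRATION'] = recent['EXPIRATION']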

How can I split this excel file into two data frames?

When I try to load this Excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas DataFrames.
In:
import sys
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8
Try the usecols parameter of read_excel. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd

df1 = pd.read_excel(excel_file, usecols=[0, 2, 4])
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12])
This should create two dataframes with the columns you want.
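A minimal sketch of how that might look for the layout shown above; the header row used here is an assumption and may need adjusting so that the "Date / column 1 / column 2" row becomes the real header:
import pandas as pd

excel_file = 'EXAMPLE.xlsx'

# Class I occupies spreadsheet columns 0, 2 and 4; Class A occupies 8, 10 and 12.
# header=9 assumes the sub-header row ("Date", "column 1", "column 2") is the
# header row; dropna(how='all') removes rows that are entirely blank padding.
df1 = pd.read_excel(excel_file, header=9, usecols=[0, 2, 4]).dropna(how='all')
df2 = pd.read_excel(excel_file, header=9, usecols=[8, 10, 12]).dropna(how='all')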

replace values in dataframe with zeros and ones

I want to replace values in a dataframe with a 0 where there is a NaN and a 1 where there is a value.
Here is my data:
AA AAPL FB GOOG TSLA XOM
Date
2018-02-28 NaN 0.068185 NaN NaN -0.031752 NaN
2018-03-31 -0.000222 NaN NaN NaN NaN -0.014920
2018-04-30 0.138790 NaN NaN NaN 0.104347 NaN
2018-05-31 NaN 0.135124 0.115 NaN NaN NaN
2018-06-30 NaN NaN NaN 0.028258 0.204474 NaN
2018-07-31 NaN 0.027983 NaN 0.091077 NaN NaN
2018-08-31 0.032355 0.200422 NaN NaN NaN NaN
2018-09-30 NaN -0.008303 NaN NaN NaN 0.060496
2018-10-31 NaN -0.030478 NaN NaN 0.274011 NaN
2018-11-30 NaN NaN NaN 0.016401 0.039013 NaN
2018-12-31 NaN NaN NaN -0.053745 -0.050445 NaN
Use mask and fillna:
df = df.mask(df.notna(), 1).fillna(0, downcast='infer')
Use:
df[df.notnull()] = 1
df.fillna(0, inplace=True)
Cast the Boolean values to int.
df.notnull().astype(int)
AA AAPL FB GOOG TSLA XOM
2018-02-28 0 1 0 0 1 0
2018-03-31 1 0 0 0 0 1
2018-04-30 1 0 0 0 1 0
2018-05-31 0 1 1 0 0 0
2018-06-30 0 0 0 1 1 0

How to select specific columns from an unstacked Pandas dataframe?

I have some data in a text file that I am reading into Pandas. A simplified version of the txt read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
...
And the code I use to read in is as follows:
import pandas as pd

mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
                     dtype={'idx_level1': 'int',
                            'idx_level2': 'int',
                            'idx_level3': 'str',
                            'idx_level4': 'float',
                            'START_NODE': 'str',
                            'END_NODE': 'str',
                            'OtherData...': 'str'},
                     parse_dates=['idx_level3'],
                     index_col=['idx_level1', 'idx_level2', 'idx_level3', 'idx_level4'])
At some point I unstack this data:
temp_df = mydata.loc[(slice(None)), ['START_NODE', 'END_NODE', 'OtherData...']].unstack()
My Data now looks like
START_NODE ... OtherData...
idx_level4 1.0 2.0 3.0 ... 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0
idx_level1 idx_level2 idx_level3 ...
353386066294006 1033 2018-09-03 14:52:27 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:32 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:37 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:42 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:47 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:52 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
2018-09-03 14:52:57 18260004553260 18260005729143 18260004553259 ... NaN NaN NaN NaN NaN NaN NaN NaN
...
Is there now a way that I can select specific columns to apply some action on, say shift(1) on the 'START_NODE' column where idx_level4 = 1.0?
You can select a single column by its tuple:
s = df[('START_NODE', 1.0)].shift(1)
EDIT:
For multiple MultiIndex columns, use boolean indexing with loc to select columns by a mask:
mux = pd.MultiIndex.from_product([['START_NODE','END_NODE'], range(1, 5)])
df = pd.DataFrame([[1] * 8], columns=mux)
print (df)
START_NODE END_NODE
1 2 3 4 1 2 3 4
0 1 1 1 1 1 1 1 1
v = [('START_NODE', 4.0), ('END_NODE', 3.0)]
df1 = df.loc[:, df.columns.isin(v)]
print (df1)
START_NODE END_NODE
4 3
0 1 1
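Applied back to the unstacked frame from the question, the shift the asker describes would look roughly like this (a sketch, assuming temp_df is the frame shown above and that the idx_level4 column labels are the floats 1.0, 2.0, ...):
# shift only the START_NODE column whose idx_level4 is 1.0
temp_df[('START_NODE', 1.0)] = temp_df[('START_NODE', 1.0)].shift(1)

# or shift several MultiIndex columns at once using the boolean-mask selection
cols = [('START_NODE', 1.0), ('END_NODE', 1.0)]
temp_df.loc[:, temp_df.columns.isin(cols)] = temp_df.loc[:, temp_df.columns.isin(cols)].shift(1)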

pandas DataFrame filter by rows and columns

I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get a subset of the DataFrame where there is one or more '1' in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, removing all rows and columns that do not contain a 1.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1) > 0]
And then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0) > 0]
but I get this error after the second line:
IndexingError('Unalignable boolean Series key provided')
Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum along the rows and threshold at 0, I get:
In [113]: row_sum = df.sum()
In [114]: row_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of row_sum is the columns of df, it doesn't make sense in this case to try to use the values of row_sum > 0 to fancy-index into the rows of df, since their row indices are not aligned and they cannot be aligned.
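For contrast, the same mask does align when it is used on the column axis, which is why the .loc form works:
In [115]: df.loc[:, row_sum > 0]   # keeps columns a, b, c and e; drops d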
Alternatively, to remove all-NaN rows or columns you can use .any() too:
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN
