How can I split this Excel file into two data frames? - python

When I try to load this Excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas dataframes.
In:
import sys
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8

Try the usecols parameter. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file, usecols=[0, 2, 4])    # Class I block
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12])  # Class A block
This should create two dataframes with the columns you want.
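A slightly fuller sketch, assuming the sub-header row ('Date', 'column 1', 'column 2') sits one Excel row below the header=8 used in the question, so that header=9 picks those names up directly (class_i and class_a are hypothetical names):
import pandas as pd

excel_file = 'EXAMPLE.xlsx'
# Class I block lives in columns 0, 2, 4; Class A block in columns 8, 10, 12
class_i = pd.read_excel(excel_file, header=9, usecols=[0, 2, 4])
class_a = pd.read_excel(excel_file, header=9, usecols=[8, 10, 12])
print(class_i.head())
print(class_a.head())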

Related

Copying a column from one pandas dataframe to another

I have the following code, where I try to copy EXPIRATION from the recent dataframe to the EXPIRATION column in the destination dataframe:
import pandas as pd

recent = pd.read_excel(r'Y:\Attachments' + '\\' + '962021.xlsx')
print('HERE\n',recent)
print('HERE2\n', recent['EXPIRATION'])
destination= pd.read_excel(r'Y:\Attachments' + '\\' + 'Book1.xlsx')
print('HERE3\n', destination)
destination['EXPIRATION']= recent['EXPIRATION']
print('HERE4\n', destination)
The problem is that destination has fewer rows than recent, so some of the lower rows of the EXPIRATION column from recent do not end up in the destination dataframe. I want all the EXPIRATION values from recent to be in the destination dataframe, even if all the other values are NaN.
Example Output:
HERE
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 0 21 6/9/2021 B 08/06/2021 BNP FP E C 12 12 CONDORI 12 9:23:40 ETF NASDAQ
1 1 22 6/9/2021 B 16/06/2021 BNP FP E P 12 12 GOLD/SILVER 12 10:9:19 ETF NASDAQ
2 2 23 6/9/2021 B 16/06/2021 TEST P 12 12 CONDORI 21 10:32:12 EQT TEST
3 3 24 6/9/2021 B 22/06/2021 TEST P 12 12 GOLD/SILVER 12 10:35:5 EQT NASDAQ
4 4 0 6/9/2021 B 26/06/2021 TEST P 12 12 GOLD/SILVER 12 10:37:11 ETF FTSE100
HERE2
0 08/06/2021
1 16/06/2021
2 16/06/2021
3 22/06/2021
4 26/06/2021
Name: EXPIRATION, dtype: object
HERE3
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
HERE4
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 08/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joining is generally the best approach, but since you have no id column apart from the native pandas index, and destination holds only NaNs, you can rely on the row order. Plain column assignment aligns on index labels, and destination only has labels 0-2, so recent's rows 3 and 4 have nowhere to land; reindex destination to recent's longer index first (the new rows fill with NaN), then assign:
>>> destination = destination.reindex(recent.index)
>>> destination['EXPIRATION'] = recent['EXPIRATION']
>>> destination
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION ...
0 NaN NaN NaN NaN 08/06/2021 ...
1 NaN NaN NaN NaN 16/06/2021 ...
2 NaN NaN NaN NaN 16/06/2021 ...
3 NaN NaN NaN NaN 22/06/2021 ...
4 NaN NaN NaN NaN 26/06/2021 ...
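If you do want the join mentioned above, a minimal sketch that joins on the shared default index (note that EXPIRATION ends up as the last column):
result = destination.drop(columns='EXPIRATION').join(recent['EXPIRATION'], how='outer')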

Cannot append pandas DataFrame to another DataFrame

I have two pandas dataframes: df and temp_df. df has a header, but temp_df does not.
df =
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
temp_df =
0 1 2 3
0 4189915 2017-10-25 13:49:09.640 NaN NaN
1 4100334 2017-10-25 14:28:04.000 111 ABC
2 4102833 2017-10-26 09:55:14.930 NaN NaN
3 4102845 2017-10-26 09:59:11.187 NaN NaN
I want to append temp_df to df. Expected result:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
I tried:
result = df.append(temp_df)
I was expecting to get a new dataframe with 9 rows. Instead I got a result with 9 rows and 8 columns: since the column names do not match, the frames are aligned on the union of their columns rather than stacked. I also tried the following, but got the same wrong result:
result = pd.concat([df,temp_df],axis=1,ignore_index=True) # and axis=0
You can rename the columns in temp_df to match the original df and then concatenate (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer choice):
temp_df.columns = ['IdTravel', 'TravelDate', 'IdProfesional', 'InformAsign']
result = pd.concat([df, temp_df], ignore_index=True)
print(result)
Prints:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123.0 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111.0 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
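If you would rather not hard-code the names, the same result can be obtained by copying the header across with set_axis (this assumes, as in the example, that temp_df's columns are in the same order as df's):
result = pd.concat([df, temp_df.set_axis(df.columns, axis=1)], ignore_index=True)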

How to equalize two dataframes?

Hello, I have two DFs (rateQualityOut and subsetOut):
rateQualityOut is an empty DF that I created to store the temporary output subsetOut. The idea is that, once the loop is finished, all outputs should be stored in that DF.
rateQualityOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']].loc[currLength:currLength+addLength,:]
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
and another DF which has the temporary output
subsetOut
subsetOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']]
pID carry_dt position product_type positionLength
2739 1 2018-11-01 CITI_52299G66_201210 Physical 5
2738 1 2018-11-02 CITI_52299G66_201210 Physical 5
2737 1 2018-11-05 CITI_52299G66_201210 Physical 5
2736 1 2018-11-06 CITI_52299G66_201210 Physical 5
2735 1 2018-11-07 CITI_52299G66_201210 Physical 5
I am looking to store the temporary output subsetOut into rateQualityOut. What I have done in the past is simply this:
rateQualityOut.loc[currLength:currLength+addLength,:] = subsetOut
However, it is not working as planned: the output shows that the NaNs are not populated as expected.
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
Can I please have some suggestions? Thank you so much.
Typically it is easier and faster not to put subsetOut into rateQualityOut on each iteration. Instead, you can collect the subsets in a list and concatenate them once at the end:
import pandas as pd

rateQualityOut = []  # collect the pieces in a plain list
for i in someIterator:
    # do something here that produces subsetOut
    rateQualityOut.append(subsetOut)
rateQualityOut = pd.concat(rateQualityOut)
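For reference, the original .loc assignment left the NaNs in place because assigning a DataFrame aligns on index labels, and subsetOut's labels (2739, 2738, ...) never match the target's 0-5. A minimal sketch with hypothetical data:
import pandas as pd

target = pd.DataFrame({'pID': [None, None, None]})           # index 0, 1, 2
source = pd.DataFrame({'pID': [1, 1, 1]}, index=[2739, 2738, 2737])

target.loc[0:2, :] = source             # aligns on labels; none match, so target stays NaN
target.loc[0:2, :] = source.to_numpy()  # bypasses alignment; values land positionally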

Replace values in dataframe with zeros and ones

I want to replace values in a dataframe: 0 where there is a NaN and 1 where there is a value.
Here is my data:
AA AAPL FB GOOG TSLA XOM
Date
2018-02-28 NaN 0.068185 NaN NaN -0.031752 NaN
2018-03-31 -0.000222 NaN NaN NaN NaN -0.014920
2018-04-30 0.138790 NaN NaN NaN 0.104347 NaN
2018-05-31 NaN 0.135124 0.115 NaN NaN NaN
2018-06-30 NaN NaN NaN 0.028258 0.204474 NaN
2018-07-31 NaN 0.027983 NaN 0.091077 NaN NaN
2018-08-31 0.032355 0.200422 NaN NaN NaN NaN
2018-09-30 NaN -0.008303 NaN NaN NaN 0.060496
2018-10-31 NaN -0.030478 NaN NaN 0.274011 NaN
2018-11-30 NaN NaN NaN 0.016401 0.039013 NaN
2018-12-31 NaN NaN NaN -0.053745 -0.050445 NaN
Use mask and fillna:
df = df.mask(df.notna(), 1).fillna(0, downcast='infer')
Alternatively, use boolean indexing; the notnull() mask is already boolean, so comparing it with True is redundant:
df[df.notnull()] = 1
df.fillna(0, inplace=True)
Cast the Boolean values to int.
df.notnull().astype(int)
AA AAPL FB GOOG TSLA XOM
2018-02-28 0 1 0 0 1 0
2018-03-31 1 0 0 0 0 1
2018-04-30 1 0 0 0 1 0
2018-05-31 0 1 1 0 0 0
2018-06-30 0 0 0 1 1 0
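All three approaches produce the same 0/1 frame. A small self-contained sketch with made-up data (notna is the modern alias of notnull):
import numpy as np
import pandas as pd

df = pd.DataFrame({'AA': [np.nan, -0.000222], 'AAPL': [0.068185, np.nan]})
print(df.notna().astype(int))
#    AA  AAPL
# 0   0     1
# 1   1     0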

Groupby on categorical column with date and period producing unexpected result

The issue below was created in Python 2.7.11 with Pandas 0.17.1
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
As defined in pandas, grouping on a categorical always produces the full set of categories for every group combination, even when no data falls in a category (see the groupby section on categoricals in the pandas documentation). You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
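Newer pandas versions (0.23 and later) also expose a direct switch for this; a minimal sketch, reusing the df from the question:
# observed=True keeps only the category combinations that actually occur in the data
df.groupby(['month', 'date', 'cat1'], observed=True).sum().unstack()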
