How can I split this Excel file into two data frames? - python

When I try to load this Excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas dataframes.
In:
import sys
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8

Try the usecols parameter. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file, usecols=[0, 2, 4])    # Class I block
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12])  # Class A block
This should create two dataframes with the columns you want.
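A slightly fuller sketch, assuming the sub-header row ('Date', 'column 1', 'column 2') sits one Excel row below the header=8 used in the question, so that header=9 picks those names up directly (class_i and class_a are hypothetical names):
import pandas as pd

excel_file = 'EXAMPLE.xlsx'
# Class I block lives in columns 0, 2, 4; Class A block in columns 8, 10, 12
class_i = pd.read_excel(excel_file, header=9, usecols=[0, 2, 4])
class_a = pd.read_excel(excel_file, header=9, usecols=[8, 10, 12])
print(class_i.head())
print(class_a.head())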

Related

Copying a column from one pandas dataframe to another

I have the following code, where I try to copy EXPIRATION from the recent dataframe to the EXPIRATION column in the destination dataframe:
import pandas as pd

recent = pd.read_excel(r'Y:\Attachments' + '\\' + '962021.xlsx')
print('HERE\n',recent)
print('HERE2\n', recent['EXPIRATION'])
destination= pd.read_excel(r'Y:\Attachments' + '\\' + 'Book1.xlsx')
print('HERE3\n', destination)
destination['EXPIRATION']= recent['EXPIRATION']
print('HERE4\n', destination)
The problem is that destination has fewer rows than recent, so some of the lower rows of the EXPIRATION column from recent do not end up in the destination dataframe. I want all the EXPIRATION values from recent to be in the destination dataframe, even if all the other values are NaN.
Example Output:
HERE
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 0 21 6/9/2021 B 08/06/2021 BNP FP E C 12 12 CONDORI 12 9:23:40 ETF NASDAQ
1 1 22 6/9/2021 B 16/06/2021 BNP FP E P 12 12 GOLD/SILVER 12 10:9:19 ETF NASDAQ
2 2 23 6/9/2021 B 16/06/2021 TEST P 12 12 CONDORI 21 10:32:12 EQT TEST
3 3 24 6/9/2021 B 22/06/2021 TEST P 12 12 GOLD/SILVER 12 10:35:5 EQT NASDAQ
4 4 0 6/9/2021 B 26/06/2021 TEST P 12 12 GOLD/SILVER 12 10:37:11 ETF FTSE100
HERE2
0 08/06/2021
1 16/06/2021
2 16/06/2021
3 22/06/2021
4 26/06/2021
Name: EXPIRATION, dtype: object
HERE3
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
HERE4
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION NAME OPTION_TYPE PRICE QUANTITY STRATEGY STRIKE TIME_TRADE TYPE UNDERLYING
0 NaN NaN NaN NaN 08/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 16/06/2021 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joining is generally the best approach, but since you have no id column apart from the native pandas index, and destination holds only NaNs, you can rely on the row order. Plain column assignment aligns on index labels, and destination only has labels 0-2, so recent's rows 3 and 4 have nowhere to land; reindex destination to recent's longer index first (the new rows fill with NaN), then assign:
>>> destination = destination.reindex(recent.index)
>>> destination['EXPIRATION'] = recent['EXPIRATION']
>>> destination
Unnamed: 0 IGNORE DATE_TRADE DIRECTION EXPIRATION ...
0 NaN NaN NaN NaN 08/06/2021 ...
1 NaN NaN NaN NaN 16/06/2021 ...
2 NaN NaN NaN NaN 16/06/2021 ...
3 NaN NaN NaN NaN 22/06/2021 ...
4 NaN NaN NaN NaN 26/06/2021 ...
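If you do want the join mentioned above, a minimal sketch that joins on the shared default index (note that EXPIRATION ends up as the last column):
result = destination.drop(columns='EXPIRATION').join(recent['EXPIRATION'], how='outer')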

Cannot append pandas DataFrame to another DataFrame

I have two pandas dataframes: df and temp_df. df has a header, but temp_df does not.
df =
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
temp_df =
0 1 2 3
0 4189915 2017-10-25 13:49:09.640 NaN NaN
1 4100334 2017-10-25 14:28:04.000 111 ABC
2 4102833 2017-10-26 09:55:14.930 NaN NaN
3 4102845 2017-10-26 09:59:11.187 NaN NaN
I want to append temp_df to df. Expected result:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
I tried:
result = df.append(temp_df)
I was expecting to get a new dataframe with 9 rows. Instead I got a result with 9 rows and 8 columns: since the column names do not match, the frames are aligned on the union of their columns rather than stacked. I also tried the following, but got the same wrong result:
result = pd.concat([df,temp_df],axis=1,ignore_index=True) # and axis=0
You can rename the columns in temp_df to match the original df and then concatenate (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer choice):
temp_df.columns = ['IdTravel', 'TravelDate', 'IdProfesional', 'InformAsign']
result = pd.concat([df, temp_df], ignore_index=True)
print(result)
Prints:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123.0 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111.0 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
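If you would rather not hard-code the names, the same result can be obtained by copying the header across with set_axis (this assumes, as in the example, that temp_df's columns are in the same order as df's):
result = pd.concat([df, temp_df.set_axis(df.columns, axis=1)], ignore_index=True)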

How to equalize two dataframes?

Hello, I have two DFs (rateQualityOut and subsetOut):
rateQualityOut is an empty DF that I created to store the temporary output subsetOut. The idea is that, once the loop is finished, all outputs should be stored in that DF.
rateQualityOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']].loc[currLength:currLength+addLength,:]
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
and another DF which has the temporary output
subsetOut
subsetOut[['pID', 'carry_dt','position', 'product_type' ,'positionLength']]
pID carry_dt position product_type positionLength
2739 1 2018-11-01 CITI_52299G66_201210 Physical 5
2738 1 2018-11-02 CITI_52299G66_201210 Physical 5
2737 1 2018-11-05 CITI_52299G66_201210 Physical 5
2736 1 2018-11-06 CITI_52299G66_201210 Physical 5
2735 1 2018-11-07 CITI_52299G66_201210 Physical 5
I am looking to store the temporary output subsetOut into rateQualityOut. What I have done in the past is simply this:
rateQualityOut.loc[currLength:currLength+addLength,:] = subsetOut
However, it is not working as planned: the output shows that the NaNs are not populated as expected.
pID carry_dt position product_type positionLength
0 NaN NaT NaN NaN NaN
1 NaN NaT NaN NaN NaN
2 NaN NaT NaN NaN NaN
3 NaN NaT NaN NaN NaN
4 NaN NaT NaN NaN NaN
5 NaN NaT NaN NaN NaN
Can I please have some suggestions? Thank you so much.
Typically it is easier and faster not to put subsetOut into rateQualityOut on each iteration. Instead, you can collect the subsets in a list and concatenate them once at the end:
import pandas as pd

rateQualityOut = []  # collect the pieces in a plain list
for i in someIterator:
    # do something here that produces subsetOut
    rateQualityOut.append(subsetOut)
rateQualityOut = pd.concat(rateQualityOut)
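For reference, the original .loc assignment left the NaNs in place because assigning a DataFrame aligns on index labels, and subsetOut's labels (2739, 2738, ...) never match the target's 0-5. A minimal sketch with hypothetical data:
import pandas as pd

target = pd.DataFrame({'pID': [None, None, None]})           # index 0, 1, 2
source = pd.DataFrame({'pID': [1, 1, 1]}, index=[2739, 2738, 2737])

target.loc[0:2, :] = source             # aligns on labels; none match, so target stays NaN
target.loc[0:2, :] = source.to_numpy()  # bypasses alignment; values land positionally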

Replace values in dataframe with zeros and ones

I want to replace values in a dataframe: 0 where there is a NaN and 1 where there is a value.
Here is my data:
AA AAPL FB GOOG TSLA XOM
Date
2018-02-28 NaN 0.068185 NaN NaN -0.031752 NaN
2018-03-31 -0.000222 NaN NaN NaN NaN -0.014920
2018-04-30 0.138790 NaN NaN NaN 0.104347 NaN
2018-05-31 NaN 0.135124 0.115 NaN NaN NaN
2018-06-30 NaN NaN NaN 0.028258 0.204474 NaN
2018-07-31 NaN 0.027983 NaN 0.091077 NaN NaN
2018-08-31 0.032355 0.200422 NaN NaN NaN NaN
2018-09-30 NaN -0.008303 NaN NaN NaN 0.060496
2018-10-31 NaN -0.030478 NaN NaN 0.274011 NaN
2018-11-30 NaN NaN NaN 0.016401 0.039013 NaN
2018-12-31 NaN NaN NaN -0.053745 -0.050445 NaN
Use mask and fillna:
df = df.mask(df.notna(), 1).fillna(0, downcast='infer')
Alternatively, use boolean indexing; the notnull() mask is already boolean, so comparing it with True is redundant:
df[df.notnull()] = 1
df.fillna(0, inplace=True)
Cast the Boolean values to int.
df.notnull().astype(int)
AA AAPL FB GOOG TSLA XOM
2018-02-28 0 1 0 0 1 0
2018-03-31 1 0 0 0 0 1
2018-04-30 1 0 0 0 1 0
2018-05-31 0 1 1 0 0 0
2018-06-30 0 0 0 1 1 0
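All three approaches produce the same 0/1 frame. A small self-contained sketch with made-up data (notna is the modern alias of notnull):
import numpy as np
import pandas as pd

df = pd.DataFrame({'AA': [np.nan, -0.000222], 'AAPL': [0.068185, np.nan]})
print(df.notna().astype(int))
#    AA  AAPL
# 0   0     1
# 1   1     0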

Groupby on categorical column with date and period producing unexpected result

The issue below was created in Python 2.7.11 with Pandas 0.17.1
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
As defined in pandas, grouping on a categorical always produces the full set of categories for every group combination, even when no data falls in a category (see the groupby section on categoricals in the pandas documentation). You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
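Newer pandas versions (0.23 and later) also expose a direct switch for this; a minimal sketch, reusing the df from the question:
# observed=True keeps only the category combinations that actually occur in the data
df.groupby(['month', 'date', 'cat1'], observed=True).sum().unstack()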
