Cannot append pandas DataFrame to another DataFrame - python

I have two pandas dataframes: df and temp_df. df has a header, but temp_df does not.
df =
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
temp_df =
0 1 2 3
0 4189915 2017-10-25 13:49:09.640 NaN NaN
1 4100334 2017-10-25 14:28:04.000 111 ABC
2 4102833 2017-10-26 09:55:14.930 NaN NaN
3 4102845 2017-10-26 09:59:11.187 NaN NaN
I want to append temp_df to df. Expected result:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
I tried:
result = df.append(temp_df)
I was expecting a new dataframe with 9 rows and the original 4 columns, but instead result has 9 rows and 8 columns.
Also, I tried this, but got the same wrong result:
result = pd.concat([df,temp_df],axis=1,ignore_index=True) # and axis=0

You can rename columns in temp_df to match the original df:
temp_df.columns = ['IdTravel', 'TravelDate', 'IdProfesional', 'InformAsign']
result = df.append(temp_df).reset_index(drop=True)
print(result)
Prints:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123.0 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111.0 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
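Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions, a minimal equivalent sketch with pd.concat (same idea, stacking rows along axis=0) would be:
import pandas as pd

# align temp_df's columns with df's, then stack the rows
temp_df.columns = df.columns
result = pd.concat([df, temp_df], axis=0, ignore_index=True)
print(result)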

Related

How can I split this excel file into two data frames?

When I try to load this Excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas dataframes.
In:
import sys
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8
Try to use the parameter usecols. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file,usecols=[0,2,4])
df2 = pd.read_excel(excel_file,usecols=[8,10,12])
This should create two dataframes with the columns you want.
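A minimal sketch combining usecols with the header offset from the question (header=8 is carried over from the question's own read_excel call and may need adjusting; the dropna calls are an assumption added here to strip the blank spacer rows the question mentions):
import pandas as pd

excel_file = 'EXAMPLE.xlsx'

# Class I block: columns 0, 2 and 4
df1 = pd.read_excel(excel_file, usecols=[0, 2, 4], header=8).dropna(how='all')
# Class A block: columns 8, 10 and 12
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12], header=8).dropna(how='all')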

Display the first 5 largest values of a DataFrame for each month

I'm trying to work on a dataframe with a lot of columns (505) and I want to select only the top 5 values for each month.
Here is the sample:
Dates 1 2 3 4 5 6
2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974
How can I do it? I already tried df.max(axis=1), but I would like the next 4 largest values after the maximum as well.
Thanks for your help
I assume that you want the 5 largest values in each row, since that is how I interpret your question. The following picks only the top 2 per date for your example input, since it has just 4 non-NaN columns.
import io
import re
import pandas as pd
# First read in the data you supplied.
data=io.StringIO(re.sub(" +","\t",
"""Dates 1 2 3 4 5 6
2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974"""))
df = pd.read_csv(data,sep="\t")
# Then we preprocess the data, so it is in a long format instead of a wide
df = df.melt(id_vars='Dates',var_name='Column_name',value_name='Value')
# Finally extract the top 2 values for each date, but first set the index so the output knows what column the input came from
print(df.set_index('Column_name').groupby('Dates')['Value'].apply(lambda grp: grp.nlargest(2)))
and the output is
Dates Column_name
2002-07-31 3 -5.208684
6 -7.308558
2002-08-31 3 3.665286
6 3.134669
2002-09-30 3 4.114474
6 -0.986986
2002-10-31 3 -0.486158
6 -0.787117
2002-11-30 1 19.766328
3 -5.298877
2002-12-31 1 1.996514
3 -7.570497
2003-01-31 1 -0.366083
4 -5.434475
2003-02-28 4 1.009837
6 -11.616974
Name: Value, dtype: float64
It is hard to give a more precise answer unless you are more explicit about exactly what output you want.
Reading the docstrings, maybe you're looking for the nlargest method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
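For example, a minimal sketch of nlargest applied per date (assuming df is the wide sample frame from the question and reusing the melt idea from the answer above; with only 4 non-NaN columns in the sample, at most 4 values will appear per date):
# long format: one row per (Dates, column, value)
long_df = df.melt(id_vars='Dates', var_name='column', value_name='value')

# keep the 5 largest values per date; NaNs are dropped by nlargest
top5 = (long_df.set_index('column')
               .groupby('Dates')['value']
               .nlargest(5))
print(top5)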
You can try this:
df['Dates'] = pd.to_datetime(df['Dates'])
df = df.groupby(pd.Grouper(key='Dates', freq='1M'))
df2 = df.apply(lambda x: x.sort_values(['1', '2', '3', '4', '5', '6'], ascending=False))
df3 = df2.reset_index(drop=True)
print(df3.groupby(pd.Grouper(key='Dates', freq='1M')).head(5))
Output:
Dates 1 2 3 4 5 6
0 2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
1 2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2 2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
3 2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
4 2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
5 2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
6 2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
7 2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974
8 2003-02-28 -18.869297 NaN -20.075997 1.009837 NaN -11.616974
9 2003-02-28 -19.869297 NaN -20.075997 1.009837 NaN -11.616974
10 2003-02-28 -20.869297 NaN -20.075997 1.009837 NaN -11.616974
11 2003-02-28 -21.869297 NaN -20.075997 1.009837 NaN -11.616974

replace values in dataframe with zeros and ones

I want to replace values in a dataframe: a 0 where there is a NaN and a 1 where there is a value.
Here is my data:
AA AAPL FB GOOG TSLA XOM
Date
2018-02-28 NaN 0.068185 NaN NaN -0.031752 NaN
2018-03-31 -0.000222 NaN NaN NaN NaN -0.014920
2018-04-30 0.138790 NaN NaN NaN 0.104347 NaN
2018-05-31 NaN 0.135124 0.115 NaN NaN NaN
2018-06-30 NaN NaN NaN 0.028258 0.204474 NaN
2018-07-31 NaN 0.027983 NaN 0.091077 NaN NaN
2018-08-31 0.032355 0.200422 NaN NaN NaN NaN
2018-09-30 NaN -0.008303 NaN NaN NaN 0.060496
2018-10-31 NaN -0.030478 NaN NaN 0.274011 NaN
2018-11-30 NaN NaN NaN 0.016401 0.039013 NaN
2018-12-31 NaN NaN NaN -0.053745 -0.050445 NaN
Use mask and fillna:
df = df.mask(df.notna(), 1).fillna(0, downcast='infer')
Use:
df[df.notnull()] = 1
df.fillna(0, inplace=True)
Cast the Boolean values to int.
df.notnull().astype(int)
AA AAPL FB GOOG TSLA XOM
2018-02-28 0 1 0 0 1 0
2018-03-31 1 0 0 0 0 1
2018-04-30 1 0 0 0 1 0
2018-05-31 0 1 1 0 0 0
2018-06-30 0 0 0 1 1 0

Compare two dataframes, one column, and add certain values on match?

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare the columns premdf['share'] and eqdf['symbol'], and if there is a match, the premperc, d_strike and strike values should be added to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
but I keep getting this error:
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
Rename and merge:
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), 'left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
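If the extra lprice column is not wanted, a small follow-up sketch (an assumption here, dropping it and reordering to match the expected output in the question) could be:
result = (eqdf.merge(premdf.rename(columns={'share': 'symbol'}), how='left')
              .drop(columns='lprice')
              [['symbol', 'qty', 'premperc', 'd_strike', 'strike']])
print(result)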

Groupby on categorical column with date and period producing unexpected result

The issue below was created in Python 2.7.11 with Pandas 0.17.1
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
By design in pandas, grouping on a categorical will always include the full set of categories, even if there is no data for a given category; see the example on categoricals in the groupby documentation.
You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
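Both workarounds might look like this minimal sketch (the value columns are selected explicitly; the observed=True option assumes a newer pandas than the 0.17.1 in the question, since that keyword was added later):
# Option 1: drop the all-NaN cartesian-product rows after grouping
out = df.groupby(['month', 'date', 'cat1'])[['val1', 'val2']].sum().unstack().dropna(how='all')

# Option 2: only group on category combinations that actually occur in the data
out = df.groupby(['month', 'date', 'cat1'], observed=True)[['val1', 'val2']].sum().unstack()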
