Cannot append pandas DataFrame to another DataFrame - python

I have two pandas dataframes: df and temp_df. df has a header, but temp_df does not.
df =
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
temp_df =
0 1 2 3
0 4189915 2017-10-25 13:49:09.640 NaN NaN
1 4100334 2017-10-25 14:28:04.000 111 ABC
2 4102833 2017-10-26 09:55:14.930 NaN NaN
3 4102845 2017-10-26 09:59:11.187 NaN NaN
I want to append temp_df to df. Expected result:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
I tried:
result = df.append(temp_df)
I was expecting a new dataframe with 9 rows and the original 4 columns, but instead result has 9 rows and 8 columns.
Also, I tried this, but got the same wrong result:
result = pd.concat([df,temp_df],axis=1,ignore_index=True) # and axis=0

You can rename columns in temp_df to match the original df:
temp_df.columns = ['IdTravel', 'TravelDate', 'IdProfesional', 'InformAsign']
result = df.append(temp_df).reset_index(drop=True)
print(result)
Prints:
IdTravel TravelDate IdProfesional InformAsign
0 8178429 2017-10-25 11:25:16.550 NaN NaN
1 8180074 2017-10-25 13:49:09.640 NaN NaN
2 8180287 2017-10-25 14:28:04.000 123.0 ABC
3 8182810 2017-10-26 09:55:14.930 NaN NaN
4 8182849 2017-10-26 09:59:11.187 NaN NaN
5 4189915 2017-10-25 13:49:09.640 NaN NaN
6 4100334 2017-10-25 14:28:04.000 111.0 ABC
7 4102833 2017-10-26 09:55:14.930 NaN NaN
8 4102845 2017-10-26 09:59:11.187 NaN NaN
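Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions, a minimal equivalent sketch with pd.concat (same idea, stacking rows along axis=0) would be:
import pandas as pd

# align temp_df's columns with df's, then stack the rows
temp_df.columns = df.columns
result = pd.concat([df, temp_df], axis=0, ignore_index=True)
print(result)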

Related

How can I split this excel file into two data frames?

When I try to load this Excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas dataframes.
In:
import sys
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8
Try to use the parameter usecols. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file,usecols=[0,2,4])
df2 = pd.read_excel(excel_file,usecols=[8,10,12])
This should create two dataframes with the columns you want.
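A minimal sketch combining usecols with the header offset from the question (header=8 is carried over from the question's own read_excel call and may need adjusting; the dropna calls are an assumption added here to strip the blank spacer rows the question mentions):
import pandas as pd

excel_file = 'EXAMPLE.xlsx'

# Class I block: columns 0, 2 and 4
df1 = pd.read_excel(excel_file, usecols=[0, 2, 4], header=8).dropna(how='all')
# Class A block: columns 8, 10 and 12
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12], header=8).dropna(how='all')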

Display the first 5 largest values of a DataFrame for each month

I'm trying to work on a dataframe with a lot of columns (505) and I want to select only the top 5 values for each month.
Here is the sample:
Dates 1 2 3 4 5 6
2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974
How can I do it? I already tried df.max(axis=1), but I would like the next 4 largest values after the maximum as well.
Thanks for your help
I assume that you want the 5 largest values in each row, since that is how I interpret your question. The following picks only the top 2 per date for your example input, since it has just 4 non-NaN columns.
import io
import re
import pandas as pd
# First read in the data you supplied.
data=io.StringIO(re.sub(" +","\t",
"""Dates 1 2 3 4 5 6
2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974"""))
df = pd.read_csv(data,sep="\t")
# Then we preprocess the data, so it is in a long format instead of a wide
df = df.melt(id_vars='Dates',var_name='Column_name',value_name='Value')
# Finally extract the top 2 values for each date, but first set the index so the output knows what column the input came from
print(df.set_index('Column_name').groupby('Dates')['Value'].apply(lambda grp: grp.nlargest(2)))
and the output is
Dates Column_name
2002-07-31 3 -5.208684
6 -7.308558
2002-08-31 3 3.665286
6 3.134669
2002-09-30 3 4.114474
6 -0.986986
2002-10-31 3 -0.486158
6 -0.787117
2002-11-30 1 19.766328
3 -5.298877
2002-12-31 1 1.996514
3 -7.570497
2003-01-31 1 -0.366083
4 -5.434475
2003-02-28 4 1.009837
6 -11.616974
Name: Value, dtype: float64
It is hard to give a more precise answer unless you are more explicit about exactly what output you want.
Reading the docstrings, maybe you're looking for the nlargest method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
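For example, a minimal sketch of nlargest applied per date (assuming df is the wide sample frame from the question and reusing the melt idea from the answer above; with only 4 non-NaN columns in the sample, at most 4 values will appear per date):
# long format: one row per (Dates, column, value)
long_df = df.melt(id_vars='Dates', var_name='column', value_name='value')

# keep the 5 largest values per date; NaNs are dropped by nlargest
top5 = (long_df.set_index('column')
               .groupby('Dates')['value']
               .nlargest(5))
print(top5)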
You can try this:
df['Dates'] = pd.to_datetime(df['Dates'])
df = df.groupby(pd.Grouper(key='Dates', freq='1M'))
df2 = df.apply(lambda x: x.sort_values(['1', '2', '3', '4', '5', '6'], ascending=False))
df3 = df2.reset_index(drop=True)
print(df3.groupby(pd.Grouper(key='Dates', freq='1M')).head(5))
Output:
Dates 1 2 3 4 5 6
0 2002-07-31 -31.710916 NaN -5.208684 -29.773404 NaN -7.308558
1 2002-08-31 -44.941351 NaN 3.665286 -23.987135 NaN 3.134669
2 2002-09-30 -36.725548 NaN 4.114474 -19.536571 NaN -0.986986
3 2002-10-31 -25.377286 NaN -0.486158 -5.887594 NaN -0.787117
4 2002-11-30 19.766328 NaN -5.298877 -10.672174 NaN -21.057946
5 2002-12-31 1.996514 NaN -7.570497 -9.257122 NaN -19.630112
6 2003-01-31 -0.366083 NaN -14.124492 -5.434475 NaN -8.053424
7 2003-02-28 -17.869297 NaN -20.075997 1.009837 NaN -11.616974
8 2003-02-28 -18.869297 NaN -20.075997 1.009837 NaN -11.616974
9 2003-02-28 -19.869297 NaN -20.075997 1.009837 NaN -11.616974
10 2003-02-28 -20.869297 NaN -20.075997 1.009837 NaN -11.616974
11 2003-02-28 -21.869297 NaN -20.075997 1.009837 NaN -11.616974

replace values in dataframe with zeros and ones

I want to replace values in a dataframe: a 0 where there is a NaN and a 1 where there is a value.
Here is my data:
AA AAPL FB GOOG TSLA XOM
Date
2018-02-28 NaN 0.068185 NaN NaN -0.031752 NaN
2018-03-31 -0.000222 NaN NaN NaN NaN -0.014920
2018-04-30 0.138790 NaN NaN NaN 0.104347 NaN
2018-05-31 NaN 0.135124 0.115 NaN NaN NaN
2018-06-30 NaN NaN NaN 0.028258 0.204474 NaN
2018-07-31 NaN 0.027983 NaN 0.091077 NaN NaN
2018-08-31 0.032355 0.200422 NaN NaN NaN NaN
2018-09-30 NaN -0.008303 NaN NaN NaN 0.060496
2018-10-31 NaN -0.030478 NaN NaN 0.274011 NaN
2018-11-30 NaN NaN NaN 0.016401 0.039013 NaN
2018-12-31 NaN NaN NaN -0.053745 -0.050445 NaN
Use mask and fillna:
df = df.mask(df.notna(), 1).fillna(0, downcast='infer')
Use:
df[df.notnull()] = 1
df.fillna(0, inplace=True)
Cast the Boolean values to int.
df.notnull().astype(int)
AA AAPL FB GOOG TSLA XOM
2018-02-28 0 1 0 0 1 0
2018-03-31 1 0 0 0 0 1
2018-04-30 1 0 0 0 1 0
2018-05-31 0 1 1 0 0 0
2018-06-30 0 0 0 1 1 0

Compare two dataframes, one column, and add certain values on match?

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare the columns premdf['share'] and eqdf['symbol'], and if there is a match, the premperc, d_strike and strike values should be added to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
but I keep getting this error:
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
Rename and merge:
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), 'left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
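If the extra lprice column is not wanted, a small follow-up sketch (an assumption here, dropping it and reordering to match the expected output in the question) could be:
result = (eqdf.merge(premdf.rename(columns={'share': 'symbol'}), how='left')
              .drop(columns='lprice')
              [['symbol', 'qty', 'premperc', 'd_strike', 'strike']])
print(result)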

Groupby on categorical column with date and period producing unexpected result

The issue below was created in Python 2.7.11 with Pandas 0.17.1
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
By design in pandas, grouping on a categorical will always include the full set of categories, even if there is no data for a given category; see the example on categoricals in the groupby documentation.
You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
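Both workarounds might look like this minimal sketch (the value columns are selected explicitly; the observed=True option assumes a newer pandas than the 0.17.1 in the question, since that keyword was added later):
# Option 1: drop the all-NaN cartesian-product rows after grouping
out = df.groupby(['month', 'date', 'cat1'])[['val1', 'val2']].sum().unstack().dropna(how='all')

# Option 2: only group on category combinations that actually occur in the data
out = df.groupby(['month', 'date', 'cat1'], observed=True)[['val1', 'val2']].sum().unstack()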
