I use read_csv to fill a pandas DataFrame. This DataFrame has columns that are entirely NaN, and that becomes a problem when I use pivot_table.
Here is my situation:
import numpy as np
import pandas as pd

d = {'dates': ['01/01/20', '01/02/20', '01/03/20'], 'country': ['Fra', 'Fra', 'Fra'], 'val': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
piv = df.pivot_table(index='country', values='val', columns='dates')
print(piv)
Empty DataFrame
Columns: []
Index: []
I would like to have this:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
From the docs, set dropna=False in DataFrame.pivot_table:
piv = df.pivot_table(index='country',values='val',columns='dates', dropna=False)
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Just use the dropna argument of pivot_table:
df.pivot_table(index='country', columns='dates', values='val', dropna=False)
The output is:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Related
I have a pandas series called mean_return:
BHP 0.094214
GOOG 0.180892
INTC -0.179899
MRK 0.065741
MSFT 0.205519
MXL 0.153332
SHEL 0.001714
TSM 0.162741
WBD -0.233863
dtype: float64
pandas.core.series.Series
When I try to merge the above with the dataframe below, I get NaN values for mean_return (excuse the formatting; I'm not sure how to copy and paste the dataframe). I see the tickers are not ordered the same in the series as in the DF. What can I do to merge the DF and the series?
       0   mkt_value   weights    investment      shares  mean_return
0   GOOG   51.180000  0.115308  14413.469698  281.623087          NaN
1    BHP   99.570000  0.140488  17560.996495  176.368349          NaN
2   INTC   25.719999  0.092804  11600.473577  451.029311          NaN
3    MXL   87.599998  0.110175  13771.865664  157.213081          NaN
4    MRK  234.240005  0.102416  12801.944297   54.653108          NaN
5   MSFT   34.220001  0.123298  15412.217160  450.386225          NaN
6   SHEL   51.970001  0.142114  17764.225757  341.816920          NaN
7    TSM   69.750000  0.134963  16870.389838  241.869388          NaN
8    WBD   11.980000  0.038435   4804.417515   40
Here is the code I used:
df = pd.DataFrame(tickers_list)
df.rename({'index': 'tickers_list'}, axis='columns', inplace=True)
df['mkt_value'] = data.values[-1]
df['weights'] = weights
df['investment'] = port_size * weights
df['shares'] = df['investment'] / df['mkt_value']
df['mean_return'] = mean_return
df
If the tickers in the df are unique, I would set them as the index and then use pd.concat, which concatenates based on index.
df = df.set_index('tickers_list')
mean_return.name = "mean_return"
df = pd.concat([df, mean_return], axis=1)
Set the name attribute, as it will become the name of the new column.
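For illustration, here is a minimal, self-contained sketch of that approach; the small frames below are made up, and only tickers_list, mean_return, and the ticker symbols come from the question:

import pandas as pd

# Hypothetical data: the frame and the series list the tickers in different orders.
df = pd.DataFrame({'tickers_list': ['GOOG', 'BHP', 'INTC'],
                   'weights': [0.115, 0.140, 0.093]})
mean_return = pd.Series({'BHP': 0.094214, 'GOOG': 0.180892, 'INTC': -0.179899})

df = df.set_index('tickers_list')  # align on tickers instead of row position
mean_return.name = 'mean_return'   # this becomes the new column's name
df = pd.concat([df, mean_return], axis=1)
# Each mean_return value now lands on its matching ticker,
# regardless of the original row order.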
I am surprised that my reindex is producing NaNs across the whole dataframe, when the original dataframe does have numerical values in it. Why?
Code:
df =
A ... D
Unnamed: 0 ...
2022-04-04 11:00:05 NaN ... 2419.0
2022-04-04 11:00:10 NaN ... 2419.0
## exp start and end times
exp_start, exp_end = '2022-04-04 11:00:00','2022-04-04 13:00:00'
## one second index
onesec_idx = pd.date_range(start=exp_start,end=exp_end,freq='1s')
## map new index to the df
df = df.reindex(onesec_idx)
Result:
df =
A ... D
2022-04-04 11:00:00 NaN ... NaN
2022-04-04 11:00:01 NaN ... NaN
2022-04-04 11:00:02 NaN ... NaN
2022-04-04 11:00:03 NaN ... NaN
2022-04-04 11:00:04 NaN ... NaN
2022-04-04 11:00:05 NaN ... NaN
From the documentation you can see that df.reindex() places NA/NaN in locations that had no value in the previous index.
However, you can also provide a value to fill those missing locations with (it defaults to NaN):
df.reindex(onesec_idx, fill_value='')
If you want to replace the NaN in a particular column, or even in the whole dataframe, you can run something like this after the reindex:
df.fillna('', inplace=True)  # replace NaN in the entire df with ''
df['D'] = df['D'].fillna(0)  # replace all NaN in column D with 0
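One more thing worth checking: in the question's result, even the timestamps that existed before (11:00:05 and 11:00:10) come back as NaN. A likely cause, hinted at by the 'Unnamed: 0' header, is that the original index was read in as strings, so none of the Timestamp labels in onesec_idx match it. A minimal sketch of that fix, assuming this is the case:

import pandas as pd

# Hypothetical frame whose index labels are strings, not Timestamps
df = pd.DataFrame({'D': [2419.0, 2419.0]},
                  index=['2022-04-04 11:00:05', '2022-04-04 11:00:10'])

df.index = pd.to_datetime(df.index)  # convert the string labels to Timestamps first
onesec_idx = pd.date_range('2022-04-04 11:00:00', '2022-04-04 11:00:10', freq='1s')
df = df.reindex(onesec_idx)          # existing rows now keep their values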
Sources:
Documentation for reindex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Documentation for fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
I have a pandas dataframe whose index is created by pd.bdate_range. The index consists of business days (Monday through Friday) starting 1993-01-05. The first 12 rows are:
df_xx[0:12]
Out[163]:
aaa aaa_f
1993-01-05 125.25 NaN
1993-01-06 124.84 NaN
1993-01-07 125.09 NaN
1993-01-08 125.42 NaN
1993-01-11 125.36 NaN
1993-01-12 125.05 NaN
1993-01-13 125.87 NaN
1993-01-14 125.65 NaN
1993-01-15 126.05 NaN
1993-01-18 125.82 NaN
1993-01-19 125.46 NaN
1993-01-20 125.39 NaN
How can I create a subset with only Friday data?
Get the names of the days with DatetimeIndex.day_name and filter the DataFrame by boolean indexing:
df = df[df.index.day_name() == 'Friday']
print(df)
aaa aaa_f
1993-01-08 125.42 NaN
1993-01-15 126.05 NaN
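If you prefer numeric weekday codes, an equivalent filter is (Monday is 0, so Friday is 4):
df = df[df.index.dayofweek == 4]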
My question is related to this one. I have a file named 'test.csv' with 'NA' as a value for region. I want to read this in as 'NA', not 'NaN'. However, there are missing values in other columns in test.csv, which I want to retain as 'NaN'. How can I do this?
# test.csv looks like this:
region,date,expenses
NA,1/1/2019,53
EU,1/2/2019,
Here's what I've tried:
import pandas as pd
# This reads NA as NaN
df = pd.read_csv('test.csv')
df
region date expenses
0 NaN 1/1/2019 53
1 EU 1/2/2019 NaN
# This reads NA as NA, but doesn't read missing expense as NaN
df = pd.read_csv('test.csv', keep_default_na=False, na_values='_')
df
region date expenses
0 NA 1/1/2019 53
1 EU 1/2/2019
# What I want:
region date expenses
0 NA 1/1/2019 53
1 EU 1/2/2019 NaN
The problem with adding the argument keep_default_na=False is that the second value for expenses does not get read in as NaN. So if I then try pd.isnull(df['expenses'][1]), it returns False.
For me, this works:
df = pd.read_csv('file.csv', keep_default_na=False, na_values=[''])
which gives:
region date expenses
0 NA 1/1/2019 53.0
1 EU 1/2/2019 NaN
But I'd rather play it safe, given possible NaNs in other columns, and do
df = pd.read_csv('file.csv')
df['region'] = df['region'].fillna('NA')
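Either way, a quick sanity check against the question's pd.isnull test (shown here with the first variant, assuming the same file as above):

import pandas as pd

df = pd.read_csv('file.csv', keep_default_na=False, na_values=[''])
print(df['region'][0])               # 'NA' is kept as a literal string
print(pd.isnull(df['expenses'][1]))  # True: the empty cell is still NaN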
When specifying keep_default_na=False, the default values are no longer treated as NaN, so you should specify them yourself:
use keep_default_na=False, na_values=['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NULL', 'NaN', 'n/a', 'nan', 'null']
This approach works for me:
import pandas as pd
df = pd.read_csv('Test.csv')
co1 col2 col3 col4
a b c d e
NaN NaN NaN NaN NaN
2 3 4 5 NaN
I copied the values that pandas interprets as NaN by default into a list, then commented out "NA", which I wanted to be read as a regular value rather than NaN. This approach still treats all the other values as NaN, but leaves "NA" intact.
# You can also create your own list of values that should be treated as NaN
# and then pass it to na_values with keep_default_na=False.
na_values = ["",
"#N/A",
"#N/A N/A",
"#NA",
"-1.#IND",
"-1.#QNAN",
"-NaN",
"-nan",
"1.#IND",
"1.#QNAN",
"<NA>",
"N/A",
# "NA",
"NULL",
"NaN",
"n/a",
"nan",
"null"]
df1 = pd.read_csv('Test.csv', na_values=na_values, keep_default_na=False)
co1 col2 col3 col4
a b c d e
NaN NA NaN NA NaN
2 3 4 5 NaN
This has been the bane of my life for the past couple of days. I have numerous pandas DataFrames that contain time series data with irregular frequencies, and I am trying to align them into a single dataframe.
Below is some code, with representative dataframes df1, df2, and df3 (I actually have n=5, and would appreciate a solution that works for all n>2):
# df1, df2, df3 are given at the bottom
import pandas as pd
# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis=1)
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
error: Reindexing only valid with uniquely valued Index objects
I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:
combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3)  # But I still get the same error
error: Reindexing only valid with uniquely valued Index objects
Why am I getting this error? And even if this worked, it is completely manual and ugly. How can I align more than two time series and combine them into a single dataframe?
Data:
df1 = pd.DataFrame({'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
                   index=pd.to_datetime(['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
                                         '2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000',
                                         '2008-06-01 06:17:01.654000']))
df2 = pd.DataFrame({'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
                   index=pd.to_datetime(['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
                                         '2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000',
                                         '2008-06-01 06:15:42.082000']))
df3 = pd.DataFrame({'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
                   index=pd.to_datetime(['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
                                         '2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000',
                                         '2008-06-01 06:15:05.321000']))
Your specific error is due to the column names of combined_1_n_2 having duplicates (both columns are named 'price'). You could rename the columns, and the second align would then work.
An alternative is to chain the join method, which merges frames on their indexes, as below.
In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]:
price price_1 price_2
2008-06-01 06:03:52.281000 NaN NaN 67.6560
2008-06-01 06:03:52.359000 NaN NaN 67.8750
2008-06-01 06:03:59.614000 62.1250 NaN NaN
2008-06-01 06:03:59.692000 62.2500 NaN NaN
2008-06-01 06:13:34.524000 NaN 241.0625 NaN
2008-06-01 06:13:34.602000 NaN 241.5000 NaN
2008-06-01 06:13:34.848000 NaN NaN 67.8125
2008-06-01 06:13:34.926000 NaN NaN 67.7500
2008-06-01 06:15:05.321000 NaN NaN 67.6875
2008-06-01 06:15:05.399000 NaN 241.3750 NaN
2008-06-01 06:15:05.399000 NaN 241.2500 NaN
2008-06-01 06:15:42.004000 62.2375 NaN NaN
2008-06-01 06:15:42.082000 NaN 241.3750 NaN
2008-06-01 06:15:42.083000 61.9250 NaN NaN
2008-06-01 06:17:01.654000 61.9125 NaN NaN
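To generalize the chained join to any number of frames, one option is a loop that applies a distinct suffix at each step; a sketch, assuming every frame has a single 'price' column as in the question:

frames = [df1, df2, df3]  # extend with df4, df5, ... as needed

combined = frames[0]
for i, frame in enumerate(frames[1:], start=1):
    # rsuffix keeps the repeated 'price' column names distinguishable
    combined = combined.join(frame, how='outer', rsuffix=f'_{i}')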