I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a new DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between EVENT2TIME and the first row's EVENT1TIME.
You want to store the result in DELTA.
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print(df)
df['DELTA'] = df.iloc[:, 2] - df.iloc[0, 1]  # EVENT2TIME minus the EVENT1TIME in row 0
print(df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If EVENT1TIME is populated only every so often, forward or back fill the empty rows for EVENT1TIME. The fill below happens on the fly and is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
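Note that a plain ffill/bfill can carry one USERID's EVENT1TIME into another user's rows (bfill in particular pulls values up from the next user). A safer per-user sketch, grouping before filling (same column names as above):
# Fill EVENT1TIME forward within each USERID only, so one user's value
# never leaks into another user's rows, then subtract row-wise.
df['DELTA'] = df['EVENT2TIME'] - df.groupby('USERID')['EVENT1TIME'].ffill()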
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np
locations = list(df[~np.isnan(df.EVENT1TIME)].index)  # rows where EVENT1TIME is populated
vals = df.EVENT1TIME.loc[locations]  # all EVENT1TIME values
locations.append(df.index[-1] + 1)  # one past the last row index
last_loc = locations[0]
for next_loc in locations[1:]:
    # assign through df.loc (not chained indexing) so the write sticks
    df.loc[last_loc:next_loc - 1, 'DELTA'] = df.loc[last_loc:next_loc - 1, 'EVENT2TIME'] - vals[last_loc]
    last_loc = next_loc
I am making a small mistake somewhere and I'm not sure how to merge two DataFrames correctly. I want to merge on IBES_cusip to get gvkey into df1.
I try the following, but it just expands the dataset out and does not match correctly:
df1 = df1.merge(df2, how = 'left', on =['IBES_cusip'])
df1
IBES_cusip pends pdicity ... ltg_eps futepsgrowth
0 00036110 1983-05-31 ANN ... NaN NaN
1 00036110 1983-05-31 ANN ... NaN NaN
2 00036110 1983-05-31 ANN ... NaN NaN
3 98970110 1983-05-31 ANN ... NaN NaN
4 98970110 1983-05-31 ANN ... NaN NaN
... ... ... ... ... ...
373472 98970111 2018-12-31 ANN ... 10.00 0.381119
373473 98970111 2018-12-31
df2
gvkey IBES_cusip
0 024538 86037010
1 004678 33791510
2 066367 26357810
3 137024 06985P20
4 137024 06985P20
... ...
833796 028955 33975610
833797 061676 17737610
833798 011096 92035510
833799 005774 44448210
833800 008286 69489010
Your main problem is that df2 contains duplicate values in the IBES_cusip column.
From the sample you gave I can see that
3 137024 06985P20
4 137024 06985P20
are the same values; this causes you to get unwanted results (duplicate rows in the output).
Try this:
df1 = df1.merge(df2.drop_duplicates(subset=['IBES_cusip']), how='left', on='IBES_cusip')
This should just add a gvkey column to your df1.
It assumes you are fairly sure that you don't have rows with the same IBES_cusip matched with different gvkey values; otherwise you need to sort that out first.
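One way to check that up front (a quick sanity sketch, using the column names from your frames):
# Count the distinct gvkey values per IBES_cusip; any count above 1
# means the same cusip maps to more than one gvkey and needs cleaning.
conflicts = df2.groupby('IBES_cusip')['gvkey'].nunique()
print(conflicts[conflicts > 1])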
I use read_csv to fill a pandas DataFrame. The DataFrame has a column that is entirely NaN, and this becomes a problem when I use pivot_table.
Here my situation:
import numpy as np
import pandas as pd
d = {'dates': ['01/01/20','01/02/20','01/03/20'], 'country': ['Fra','Fra','Fra'], 'val': [np.nan,np.nan,np.nan]}
df = pd.DataFrame(data=d)
piv = df.pivot_table(index='country', values='val', columns='dates')
print(piv)
Empty DataFrame
Columns: []
Index: []
I would like to have this :
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
From the docs, set dropna=False in DataFrame.pivot_table:
piv = df.pivot_table(index='country',values='val',columns='dates', dropna=False)
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Just use the dropna argument of pivot_table:
df.pivot_table(index='country',columns='dates', values='val', dropna = False)
The output is:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
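If you would rather see zeros than NaN in the kept cells, pivot_table also takes a fill_value argument (a sketch on the same toy frame; note it fills every missing cell in the result, not just this column):
# dropna=False keeps the all-NaN column; fill_value then replaces the
# remaining NaN cells in the result with 0.
piv = df.pivot_table(index='country', values='val', columns='dates', dropna=False, fill_value=0)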
I have a pandas dataframe whose index is created by pd.bdate_range. The index consists of business days (Monday through Friday) starting 1993-01-05. The first 12 rows are:
df_xx[0:12]
Out[163]:
aaa aaa_f
1993-01-05 125.25 NaN
1993-01-06 124.84 NaN
1993-01-07 125.09 NaN
1993-01-08 125.42 NaN
1993-01-11 125.36 NaN
1993-01-12 125.05 NaN
1993-01-13 125.87 NaN
1993-01-14 125.65 NaN
1993-01-15 126.05 NaN
1993-01-18 125.82 NaN
1993-01-19 125.46 NaN
1993-01-20 125.39 NaN
How can I create a subset with only Friday data?
Get the day names with DatetimeIndex.day_name and filter the DataFrame by boolean indexing:
df = df[df.index.day_name() == 'Friday']
print(df)
aaa aaa_f
1993-01-08 125.42 NaN
1993-01-15 126.05 NaN
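An equivalent filter, assuming the index really is a DatetimeIndex, compares day numbers instead of strings via DatetimeIndex.dayofweek:
# Monday is 0, so Friday is 4
df = df[df.index.dayofweek == 4]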
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP’s comment, the NA is a string rather than NaN. So dropna() is no good here. One of many possible options for filtering out the string value ‘NA’ is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one removes any non-numeric strings, not just 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
This will work even if the NA in your df is NaN (np.nan); NaN would not stop you from getting the mean of the column. The problem only arises when the NA is the string 'NA'.
df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric values become NaN and do not affect the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
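Following the same idea, a sketch that coerces only the one column and then takes the mean (names as in the question):
# 'NA' and the stray header value 'Age' become NaN; mean() skips NaN.
df["_c2"] = pd.to_numeric(df["_c2"], errors='coerce')
average_age = df["_c2"].mean()
print(average_age)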