I am working in JupyterLab with pandas version 0.20.1. I have a pivot table with a DatetimeIndex such as
In [1]:
pivot = df.pivot_table(index='Date', columns=['State'], values='B',
                       fill_value=0, aggfunc='count')
pivot
Out [1]:
State SAFE UNSAFE
Date
2017-11-18 1 0
2017-11-22 57 42
2017-11-23 155 223
The table counts all occurrences of events on a specific date, which can be either SAFE or UNSAFE. I need to resample the resulting table and sum the results.
Resampling the table with a daily frequency introduces NaNs on the days without data. Surprisingly, I cannot impute those NaNs with pandas' fillna().
In [2]:
pivot = pivot.resample('D').sum().fillna(0.)
pivot
Out [2]:
State SAFE UNSAFE
Date
2017-11-18 1.0 0.0
2017-11-19 NaN NaN
2017-11-20 NaN NaN
2017-11-21 NaN NaN
2017-11-22 57.0 42.0
2017-11-23 155.0 223.0
Can anyone explain why this happens and how I can get rid of those NaNs? I could do something along the lines of
for col in ['SAFE', 'UNSAFE']:
    mov.loc[mov[col].isnull(), col] = 0
However, that looks rather ugly; plus, I'd like to understand why the first approach is not working.
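For reference, a minimal sketch of a workaround (my own assumption, not the accepted answer) that sidesteps resample entirely by reindexing against a full daily date_range with fill_value=0:
import pandas as pd

# assumes `pivot` still has its original (sparse) DatetimeIndex, as in Out [1]
full_range = pd.date_range(pivot.index.min(), pivot.index.max(), freq='D')
pivot = pivot.reindex(full_range, fill_value=0)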
I have a dataframe like this (the real DF has 94 columns and 40 rows):
NAME       TIAS          EFGA          SOE           KERA          CODE  SURVIVAL
SOAP corp  1.391164e+10  1.265005e+10  0.000000e+00  186522000.0    366        21
NiANO inc  42673.0       0.0           0.0           42673.0        366         3
FFS jv     9.523450e+05  NaN           NaN           8.754379e+09   737         4
KELL Corp  1.045967e+07  9.935970e+05  0.000000e+00  NaN            737         4
Os inc     7.732654e+10  4.046270e+07  1.391164e+10  8.754379e+09   737         4
I need to compute a correlation for each group in the frame, grouped by CODE. The target value is the SURVIVAL column.
I tried this:
df = df.groupby('CODE').corr()[['SURVIVAL']]
but it returns something like this:
               SURVIVAL
CODE
366  TIAS          NaN
     EFGA          NaN
     SOE           NaN
     KERA          NaN
     SURVIVAL      NaN
737  TIAS          NaN
     EFGA          NaN
     SOE           NaN
     KERA          NaN
     SURVIVAL      NaN
Why is it NaN in all columns?
I tried filling the NaNs in the DataFrame with mean values before computing the correlations:
df = df.fillna(df.mean())
or dropping them, but it does not work.
But when I compute the correlation over the whole dataframe, without any modifications, like this:
df.corr()[['SURVIVAL']]
everything works fine and I get correlations, not NaNs.
All types are float64 and int64.
Is there a way to get correlations by group without NaNs? I have no idea why it works on the whole dataframe but does not work on the groups.
Thank you in advance for your help!
You can do it this way:
df = df.groupby('CODE')[['SURVIVAL']].corr()
Try this:
survival_corr = lambda x: x.corrwith(x['SURVIVAL'])
by_code = df.groupby('CODE')
by_code.apply(survival_corr)
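To illustrate, here is a self-contained sketch of that corrwith approach on a tiny made-up frame (the column names mirror the question; the numbers are invented purely so the correlations are defined):
import pandas as pd

df = pd.DataFrame({
    'TIAS':     [1.39e10, 42673.0, 9.52e05, 1.05e07, 7.73e10],
    'KERA':     [1.87e08, 42673.0, 8.75e09, 5.00e06, 8.75e09],
    'CODE':     [366, 366, 737, 737, 737],
    'SURVIVAL': [21, 3, 4, 5, 6],
})

# correlate every column with SURVIVAL, separately within each CODE group
survival_corr = lambda x: x.corrwith(x['SURVIVAL'])
print(df.groupby('CODE').apply(survival_corr))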
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
import pandas as pd
from datetime import datetime

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")

portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # `cash` is defined elsewhere in the programme
date = datetime.date(datetime.now())

df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it, the code adds a duplicate date, which then causes the reindex that asfreq performs internally to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or does the assignment of today's date need changing?
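One possible workaround (my own sketch, not from the original thread): drop any duplicate dates, keeping the most recent row, before calling asfreq, so that re-running the programme on a date that already exists no longer trips the duplicate-axis check:
import pandas as pd

df2.Date = pd.to_datetime(df2.Date)
# keep only the last row for each date before reindexing to a daily frequency
df2 = (df2.drop_duplicates(subset='Date', keep='last')
          .set_index('Date')
          .asfreq('D')
          .reset_index())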
Pandas has an asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has a reindex method: given a list of indices, it keeps only the indices from that list, adding empty rows for indices that are not already present.
In your case, you can create all the dates you want, with date_range for example, and then give that to reindex. You might need a simple set_index and reset_index around it, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex with the full list of dates (generated by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in the places that had no former value.
I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds on that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot, I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T, I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to work. I suggest adding a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID", "MeanTravelTimeSeconds",
       "RangeLowerBoundTravelTimeSeconds"]
# build a small random long-format frame to work with
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])

# starting point....
df = df.sort_values(["Date", "Observation"]).set_index(["Date", "Observation"])

# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq

# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
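As a follow-up for the seaborn part of the question, a minimal sketch (my own assumption, reusing the synthetic dfdd frame built in the answer above) that draws one box per date from the MeanTravelTimeSeconds values:
import seaborn as sns
import matplotlib.pyplot as plt

# back to long format: columns Date, Observation, SeqNo, Value
plot_df = dfdd.reset_index()
plot_df = plot_df[plot_df["Observation"] == "MeanTravelTimeSeconds"]

# one box per date, built from all MeanTravelTimeSeconds values on that date
sns.boxplot(x="Date", y="Value", data=plot_df)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()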
I've got about 7.5 million rows of data in the following format:
ndc_description effective_date ...
12-HR DECONGEST 120MG CAPLET 2015-08-19 2015-08-26 G NaN NaN 1 0.36062 36800005452 Y C/I EA
2015-07-22 2015-08-12 G NaN NaN 1 0.37681 36800005452 Y C/I EA
2015-06-17 2015-07-15 G NaN NaN 1 0.36651 36800005452 Y C/I EA
Some Other drug 2016-11-21 2015-08-26 G NaN NaN 1 0.36062 36800005452 Y C/I EA
2016-07-23 2015-08-12 G NaN NaN 1 0.37681 36800005452 Y C/I EA
2016-05-17 2015-07-15 G NaN NaN 1 0.36651 36800005452 Y C/I EA
ndc_description and effective_date form a MultiIndex.
I have an additional dataset that I'm merging with the above. They'll be merged by the columns ndc_description and effective_date (the other columns shown are purely for demonstration that other data of various types exists in the dataset).
Current problem: The dates in each dataset don't match up. In the dataset above, they're (mostly) on a weekly basis, but that can't be guaranteed. In the other dataset, there's also no guaranteed regularity. Therefore, I suppose I'll need to add rows for all dates in between those listed in the effective_date column above so I can then make the merge on ndc_description and effective_date. Is this the best way to go about this process? Because of the amount of data involved, I'd like to optimize the code involved before I end up running all the data through.
Potential solutions: I've seen that .resample() may be of value here, but I've not yet been able to get it to work. Something like this: Cleaned_Price_Data.effective_date.resample('1D', fill_method = 'ffill', level = 1)
I think it would also improve efficiency to include the max and min dates somewhere in the above so that it doesn't ffill anything more than necessary. Also, the ndc_description value should be taken into account so that dates added to the effective_date column aren't duplicated for each identical ndc_description value.
EDIT:
Here's some code illustrating the current state of my dataframe and how it should look after transformations are complete.
I'm trying to transform a dataframe such as this:
import pandas as pd

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-17', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']
pre_transform = pd.DataFrame('-', idx, col)
pre_transform
To one like this (note dates are added):
idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-18', '2015-08-17',
                                   '2015-08-16', '2015-08-15', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']
post_change = pd.DataFrame('-', idx, col)
post_change
EDIT 2: I've come up with the following code (via Parfait's answer here) which seems to do the trick:
def expand_dates(ser):
    return pd.DataFrame({'effective_date': pd.date_range(ser['effective_date'].min(),
                                                          ser['effective_date'].max(),
                                                          freq='D')})

price_cols = list(Cleaned_Price_Data.columns)
all_effective_dates = Cleaned_Price_Data.groupby(['ndc']).apply(expand_dates).reset_index().merge(Cleaned_Price_Data, how='left')[price_cols].ffill()
However, at 55 million rows, the file is quite bloated, and I'll be attempting to merge this with another dataset. Any attempts at optimizing this (or suggestions for a more efficient alternative) would be greatly appreciated.
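For what it's worth, a hedged alternative sketch (my own, not from the thread) that builds the daily rows and forward-fills within each ndc group directly, without the intermediate merge, assuming the ndc and effective_date columns used above and at most one row per date within each ndc:
import pandas as pd

Cleaned_Price_Data['effective_date'] = pd.to_datetime(Cleaned_Price_Data['effective_date'])

filled = (Cleaned_Price_Data
          .sort_values('effective_date')
          .set_index('effective_date')
          .groupby('ndc', group_keys=False)
          .apply(lambda g: g.asfreq('D').ffill())  # daily rows per ndc, forward-filled within the group
          .reset_index())
Because the forward-fill happens inside each group, values never bleed from one ndc_description into another.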
I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month data. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31". It should be just one row. For example, the end of January 2016 should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be nice enough and could conceivably include mid-month data in a random column. In that case, I don't want to make any adjustment, so that any prior data-collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10, and 2016-01. However, the rows at 2015-07 and 2015-08 simply do not have data, so I would like to leave 2015-07 and 2015-08 as NaN while merging the end-of-month rows at 2015-05, 2015-10, and 2016-01. Hopefully, this provides more insight into what I am trying to do.
You can use:
df = df.groupby(pd.TimeGrouper('M')).fillna(method='ffill')
df = df.resample(rule='M', how='last')
to create a new DatetimeIndex ending on the last day of each month and sample the last available data point for each month. fillna() ensures that, for columns with missing data on the last available date, you use the prior available value.
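Note that pd.TimeGrouper and the how= argument of resample have since been removed from pandas; on current versions, a roughly equivalent sketch (my own translation, under that assumption) would be:
df = df.groupby(pd.Grouper(freq='M')).ffill()  # forward-fill within each calendar month
df = df.resample('M').last()                   # keep the last available value per month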