I have two DataFrames. One has the workdays and the stock price for the Apple stock. The other one holds quarterly data on the EPS. The lists of dates differ, but both are in chronological order. I want to add the chronological values of the eps frame to the existing price DataFrame.
date close
0 2020-07-06 373.85
1 2020-07-02 364.11
2 2020-07-01 364.11
3 2020-06-30 364.80
4 2020-06-29 361.78
... ... ...
9969 1980-12-18 0.48
9970 1980-12-17 0.46
9971 1980-12-16 0.45
9972 1980-12-15 0.49
9973 1980-12-12 0.51
EPS:
date eps
0 2020-03-28 2.59
1 2019-12-28 5.04
2 2019-09-28 3.05
3 2019-06-29 2.22
4 2019-03-30 2.48
... ... ...
71 2002-06-29 0.09
72 2002-03-30 0.11
73 2001-12-29 0.11
74 2001-09-29 -0.11
75 2001-06-30 0.17
So my result should look something like this:
close eps
date
...
2020-04-01 240.91 NaN
2020-03-31 254.29 NaN
2020-03-30 254.81 NaN
2020-03-28 NaN 2.59
2020-03-27 247.74 NaN
2020-03-26 258.44 NaN
Notice the value "2020-03-28", which previously existed only in the eps frame and now sits neatly where it belongs.
However, I can't get it to work. First I thought there must be a simple join, merge or concat that does this and fits the data right where it belongs, in chronological order, but so far I couldn't find it.
My failed attempts:
pd.concat([df, eps], axis=0, sort=True) - simply appends the two dataframes
pd.merge_ordered(df, eps, fill_method="ffill", left_by="date") - simply ignores the eps dates
The goal is to plot this DataFrame with two graphs - one with the stock price, and the other with the EPS data.
I think you need:
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
You can simply sort the concatenated dataframe afterwards by index. Thanks to @jezrael for the tip!
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
For some reason, the sort argument in the concat function doesn't sort my dataframe.
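To get the plot with two graphs, here is a minimal sketch building on the accepted approach (assuming the date columns are already parsed as datetimes; matplotlib is used only for illustration):
import pandas as pd
import matplotlib.pyplot as plt

# Combine price and EPS into one frame, newest dates first
combined = pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)

# Price and EPS on separate y-axes; dropna() keeps matplotlib from
# breaking each line at the NaNs introduced by the concat
fig, ax_price = plt.subplots()
combined['close'].dropna().plot(ax=ax_price, label='close')
ax_eps = ax_price.twinx()
combined['eps'].dropna().plot(ax=ax_eps, color='tab:orange', label='eps')
plt.show()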
Related
Why can't the pandas DataFrame append appropriately to form one DataFrame in this loop?
# Produce the overall data frame
def processed_data(data1_, f_loc, open, close):
    """data1_: is the csv file to be modified
    f_loc: is the location of csv files to be processed
    open and close: are the columns to undergo computations
    returns a new dataframe of modified columns"""
    main_file = drop_col(data1_)  # Dataframe to append more data columns to
    for i in files_path(f_loc):
        data = get_data_frame(i[0])  # takes the file path location of the csv file and returns the data frame
        perc = perc_df(data, open, close, i[1])  # Dataframe to append
        copy_data = main_file.append(perc)
    return copy_data
Here's the output:
Date WTRX-USD
0 2021-05-27 NaN
1 2021-05-28 NaN
2 2021-05-29 NaN
3 2021-05-30 NaN
4 2021-05-31 NaN
.. ... ...
79 NaN -2.311576
80 NaN 5.653349
81 NaN 5.052950
82 NaN -2.674435
83 NaN -3.082957
[450 rows x 2 columns]
My intention is to return something like this (where each append operation adds a column):
Date Open High Low Close Adj Close Volume
0 2021-05-27 0.130793 0.136629 0.124733 0.128665 0.128665 70936563
1 2021-05-28 0.128659 0.129724 0.111244 0.113855 0.113855 71391441
2 2021-05-29 0.113752 0.119396 0.108206 0.111285 0.111285 62049940
3 2021-05-30 0.111330 0.115755 0.107028 0.112185 0.112185 70101821
4 2021-05-31 0.112213 0.126197 0.111899 0.125617 0.125617 83502219
.. ... ... ... ... ... ... ...
361 2022-05-23 0.195637 0.201519 0.185224 0.185231 0.185231 47906144
362 2022-05-24 0.185242 0.190071 0.181249 0.189553 0.189553 33312065
363 2022-05-25 0.189550 0.193420 0.183710 0.183996 0.183996 33395138
364 2022-05-26 0.184006 0.186190 0.165384 0.170173 0.170173 57218888
365 2022-05-27 0.170636 0.170660 0.165052 0.166864 0.166864 63560568
[366 rows x 7 columns]
pandas.concat
pandas.DataFrame.append has been deprecated. Use pandas.concat instead.
Combine the DataFrame objects horizontally along the x-axis by passing in axis=1:
copy_data = pd.concat([copy_data, perc], axis=1)
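Applied to the loop above, a possible rewrite looks like this (a sketch; drop_col, files_path, get_data_frame and perc_df are the asker's own helpers and are assumed to behave as described in the comments):
import pandas as pd

def processed_data(data1_, f_loc, open, close):
    """Build one DataFrame by adding one computed column per processed csv file."""
    copy_data = drop_col(data1_)  # base frame to extend with more columns
    for i in files_path(f_loc):
        data = get_data_frame(i[0])  # load one csv as a DataFrame
        perc = perc_df(data, open, close, i[1])  # column(s) to add for this file
        copy_data = pd.concat([copy_data, perc], axis=1)  # add as new column(s)
    return copy_data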
I have the following two datasets:
df_ff.head()
Out[382]:
Date Mkt-RF SMB HML RF
0 192607 2.96 -2.38 -2.73 0.22
1 192608 2.64 -1.47 4.14 0.25
2 192609 0.36 -1.39 0.12 0.23
3 192610 -3.24 -0.13 0.65 0.32
4 192611 2.53 -0.16 -0.38 0.31
df_ibm.head()
Out[384]:
Date Open High ... Close Adj_Close Volume
0 2012-01-01 178.518158 184.608032 ... 184.130020 128.620193 115075689
1 2012-02-01 184.713196 190.468445 ... 188.078400 131.378296 82435156
2 2012-03-01 188.556412 199.923523 ... 199.474182 139.881134 92149356
3 2012-04-01 199.770554 201.424469 ... 197.973236 138.828659 90586736
4 2012-05-01 198.068832 199.741867 ... 184.416824 129.322250 89961544
Regarding the type of the date variable, we have the following:
df_ff.dtypes
Out[383]:
Date int64
df_ibm.dtypes
Out[385]:
Date datetime64[ns]
I would like to merge (in SQL language: "inner join") these two data sets and am therefore writing:
testMerge = pd.merge(df_ibm, df_ff, on = 'Date')
This yields the error:
ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
This merge does not work due to the different formats of the date variable. Any tips on how I could solve this? My first thought was to translate the dates in the df_ff data set from the format "192607" to the format "1926-07-01", but I did not manage to do it.
Use pd.to_datetime:
df['Date2'] = pd.to_datetime(df['Date'].astype(str), format="%Y%m")
print(df)
# Output
Date Date2
0 192607 1926-07-01
1 192608 1926-08-01
2 192609 1926-09-01
3 192610 1926-10-01
4 192611 1926-11-01
The first step is to convert to datetime64[ns] and harmonize the Date column:
df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')
Then convert them into Indexes (since it's more efficient):
df_ff = df_ff.set_index('Date')
df_ibm = df_ibm.set_index('Date')
Finally pd.merge the two pd.DataFrame:
out = pd.merge(df_ff, df_ibm, left_index=True, right_index=True)
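Alternatively, once both Date columns are datetime64[ns], the original column-based merge works as well; pd.merge performs an inner join by default. A sketch, reusing the conversion above:
import pandas as pd

# Harmonize the integer YYYYMM dates to datetime64[ns]
df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')

# Inner join on the Date column (how='inner' is the default)
testMerge = pd.merge(df_ibm, df_ff, on='Date')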
I have a df that contains some revenue values and I want to interpolate the values to the dates that are not included in the index. To do so, I am finding the difference between rows and interpolating:
rev_diff = df.revenue.diff().fillna(0)
df = df.resample("M").mean()
df["revenue"] = df.revenue.interpolate().diff()
I have this in a function and it is looped over thousands of such calculations (each one creating such a df). This works for most cases, but there are a few where the 'checkout till' resets and thus the diff is negative:
revenue
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-06-13 0.0
2016-09-27 30.0
2017-03-14 77.0
2017-09-19 128.0
2018-09-19 0.0
2018-03-19 10.0
2019-03-22 287.0
2020-03-20 398.0
The above code will give out negative interpolated values, so I am wondering whether there is a quick way to take that into account when it happens, without putting too much toll on the execution time, because it's called thousands of times. The end result for the revenue df (before the interpolation is carried out) should be:
revenue
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-09-27 331.0
2017-03-14 378.0
2017-09-19 429.0
2018-03-19 439.0
2019-03-22 716.0
2020-03-20 827.0
So basically, if there is a 'reset', the value from the row above should be added to the reset row and to all rows below it.
I hope this makes sense. I am struggling to find a way of doing it which is not costly computationally.
Thanks in advance.
No magic. Steps:
Identify the breakpoints by computing revenue difference.
Populate the revenue values to be added for subsequent data.
Sum it up.
Remove duplicate records.
Code
import pandas as pd
import numpy as np
df.reset_index(inplace=True)
# 1. compute difference
df["rev_diff"] = 0.0
df.loc[1:, "rev_diff"] = df["revenue"].values[1:] - df["revenue"].values[:-1]
# get breakpoint locations
breakpoints = df[df["rev_diff"] < 0].index.values
# 2. accumulate the values to be added
df["rev_add"] = 0.0
for idx in breakpoints:
    add_value = df.at[idx-1, "revenue"]
    df.loc[idx:, "rev_add"] += add_value  # accumulate
# 3. sum up
df["rev_new"] = df["revenue"] + df["rev_add"]
# 4. remove duplicate rows
df_new = df[["index", "rev_new"]].drop_duplicates().set_index("index")
df_new.index.name = None
Result
df_new
Out[85]:
rev_new
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-09-27 331.0
2017-03-14 378.0
2017-09-19 429.0
2018-03-19 439.0
2019-03-22 716.0
2020-03-20 827.0
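If the Python-level loop over the breakpoints ever becomes a bottleneck, the same correction can be expressed with vectorized operations. A sketch, assuming (as in the example data) that the adjusted running total only repeats on the duplicated reset rows:
# Negative jumps mark the till resets; accumulate their absolute size
# and add it back to every row from the reset onward
reset_adjustment = (-df["revenue"].diff().clip(upper=0)).fillna(0).cumsum()
df["rev_new"] = df["revenue"] + reset_adjustment

# Drop the reset rows, whose adjusted value repeats the row above
df_new = df.drop_duplicates(subset="rev_new")[["rev_new"]]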
I'm sorry for not posting the data, but it wouldn't really help. The thing is, I need to make a graph and I have a csv file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized' and 'States' as categories. It goes in order by date and has the number of cases, deaths and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start, so I can't post my data. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum this categories". I'm assuming you mean that for each date, you want to sum the values across all different regions to come up with the total values for Spain?
In which case, you will want to groupby date, then .sum() the columns (you can drop the States column).
grouped_df = df.groupby("date")[["Cases", "Deaths", ...]].sum()
grouped_df.plot()  # "date" is already the index after the groupby
This snippet will probably not work directly; you may need to reformat the dates etc., but it should be enough to get you started.
I think you are looking for a groupby followed by a cumsum, not including the dates.
columns_to_group = ['Cases', 'Deaths',
'Recoveries', 'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum',
'Recoveries_sum', 'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y = 'value', hue='variable')
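To also get the running totals mentioned above (how the totals increase over time), one option is a cumsum on the per-date sums before melting. A sketch, building on df_grouped from the snippet above:
# Running totals over time, built from the per-date sums
df_cumulative = df_grouped.set_index('date').cumsum().reset_index()

df_melted = df_cumulative.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y='value', hue='variable')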
I am working in JupyterLab with pandas version 0.20.1. I have a pivot table with a DatetimeIndex such as
In [1]:
pivot = df.pivot_table(index='Date', columns=['State'], values='B',
                       fill_value=0, aggfunc='count')
pivot
Out [1]:
State SAFE UNSAFE
Date
2017-11-18 1 0
2017-11-22 57 42
2017-11-23 155 223
The table counts all occurrences of events on a specific date, which can be either SAFE or UNSAFE. I need to resample the resulting table and sum the results.
Resampling the table with a daily frequency introduces NaNs on the days without data. Surprisingly, I cannot impute those NaNs with pandas' fillna().
In [2]:
pivot = pivot.resample('D').sum().fillna(0.)
pivot
Out [2]:
State SAFE UNSAFE
Date
2017-11-18 1.0 0.0
2017-11-19 NaN NaN
2017-11-20 NaN NaN
2017-11-21 NaN NaN
2017-11-22 57.0 42.0
2017-11-23 155.0 223.0
Can anyone explain why this happens, and how can I get rid of those NaNs? I could do something along the lines of
for col in ['SAFE', 'UNSAFE']:
    mov.loc[mov[col].isnull(), col] = 0
However, that looks rather ugly; plus, I'd like to understand why the first approach is not working.