pandas series add previous row if diff negative - python

I have a df that contains some revenue values and I want to interpolate the values to the dates that are not included in the index. To do so, I am finding the difference between rows and interpolating:
rev_diff = df.revenue.diff().fillna(0)
df = df.resample("M").mean()
df["revenue"] = df.revenue.interpolate().diff()
I have this in a function and it is looped over thousands of such calculations (each one creating such a df). This works for most cases, but there are a few where the 'checkout till' resets and thus the diff is negative:
revenue
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-06-13 0.0
2016-09-27 30.0
2017-03-14 77.0
2017-09-19 128.0
2018-09-19 0.0
2018-03-19 10.0
2019-03-22 287.0
2020-03-20 398.0
The above code will produce negative interpolated values, so I am wondering whether there is a quick way to take that into account when it happens, without taking too much of a toll on the execution time, because it's called thousands of times. The end result for the revenue df (before the interpolation is carried out) should be:
revenue
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-09-27 331.0
2017-03-14 378.0
2017-09-19 429.0
2018-03-19 439.0
2019-03-22 716.0
2020-03-20 827.0
So basically, if there is a 'reset', the value from the row just above the reset should be added to the reset row and to every row below it.
I hope this makes sense. I am struggling to find a way of doing this that is not computationally costly.
Thanks in advance.

No magic. Steps:
Identify the breakpoints by computing revenue difference.
Populate the revenue values to be added for subsequent data.
Sum it up.
Remove duplicate records.
Code
import pandas as pd
import numpy as np
df.reset_index(inplace=True)
# 1. compute difference
df["rev_diff"] = 0.0
df.loc[1:, "rev_diff"] = df["revenue"].values[1:] - df["revenue"].values[:-1]
# get breakpoint locations
breakpoints = df[df["rev_diff"] < 0].index.values
# 2. accumulate the values to be added
df["rev_add"] = 0.0
for idx in breakpoints:
    add_value = df.at[idx-1, "revenue"]
    df.loc[idx:, "rev_add"] += add_value  # accumulate
# 3. sum up
df["rev_new"] = df["revenue"] + df["rev_add"]
# 4. remove duplicate rows
df_new = df[["index", "rev_new"]].drop_duplicates().set_index("index")
df_new.index.name = None
Result
df_new
Out[85]:
rev_new
2015-10-19 203.0
2016-04-03 271.0
2016-06-13 301.0
2016-09-27 331.0
2017-03-14 378.0
2017-09-19 429.0
2018-03-19 439.0
2019-03-22 716.0
2020-03-20 827.0
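If the Python loop over the breakpoints ever becomes a bottleneck, the same accumulation can be written without it. A minimal sketch that should be equivalent to steps 1-3 above (rev_new is the same column name used there):
reset = df["revenue"].diff() < 0                            # rows where the till was reset
offset = df["revenue"].shift(1).where(reset, 0.0).cumsum()  # running amount to add back
df["rev_new"] = df["revenue"] + offset
Duplicate rows can then be removed with drop_duplicates exactly as in step 4.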

Related

Appending from one dataframe to another dataframe (with different sizes) when two values match

I have two pandas dataframes, and some of the values overlap. I'd like to append to the original dataframe if the time_hour value and the origin value are the same.
Here is my original dataframe, called flightsDF, which is very long. It has the format:
year month origin dep_time dep_delay arr_time time_hour
2001 01 EWR 15:00 15 17:00 2013-01-01T06:00:00Z
I have another dataframe weatherDF (much shorter than flightsDF) with some extra information for some of the values in the original dataframe:
origin temp dewp humid wind_dir wind_speed precip visib time_hour
0 EWR 39.02 26.06 59.37 270.0 10.35702 0.0 10.0 2013-01-01T06:00:00Z
1 EWR 39.02 26.96 61.63 250.0 8.05546 0.0 10.0 2013-01-01T07:00:00Z
2 LGH 39.02 28.04 64.43 240.0 11.50780 0.0 10.0 2013-01-01T08:00:00Z
I'd like to append the extra information (temp, dewp, humid,...) from weatherDF to the original data frame if both the time_hour and origin match with the original dataframe flightsDF
I have tried
for x in weatherDF:
    if x['time_hour'] == flightsDF['time_hour'] & flightsDF['origin']=='EWR':
        flights_df.append(x)
and some other similar ways but I can't seem to get it working, can anyone help?
I am planning to append all the corresponding values and then dropping any from the combined dataframe that don't have those values.
You are probably looking for pd.merge:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='left')
print(out)
# Output
year month origin dep_time dep_delay arr_time time_hour temp dewp humid wind_dir wind_speed precip visib
0 2001 1 EWR 15:00 15 17:00 2013-01-01T06:00:00Z 39.02 26.06 59.37 270.0 10.35702 0.0 10.0
If I'm right, take the time to read Pandas Merging 101.
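Since you mention dropping the rows with no matching weather data afterwards, here is one way (a sketch, assuming the merged frame is named out as above):
# either keep only flights that have a matching weather record:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='inner')
# or do the left merge first, then drop rows whose weather columns came back empty:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='left').dropna(subset=['temp'])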

How to group a df based on one column and apply a function to another column in pandas

I am quite new to pandas. I have been stuck on this issue for weeks, so as a last resort I have come to this forum.
Below is my dataframe
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 26.250000 420.00
2 15.00 2019-05-19 36.000000 180.00
3 7.50 2019-05-19 34.500000 172.50
4 7.50 2019-05-21 32.894737 625.00
I am trying to plot a graph where my primary y-axis will have the S2Rate and my secondary y-axis will have the Sale Average. But I would like my x-axis to have the date, for which I will need my df to look like this (below):
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 to 2019-05-19 31.1250000 600.00
2 7.50 2019-05-19 to 2019-05-21 33.690000 797.50
That is, for S2Rate 15 the min date is 2019-05-18 and the max date is 2019-05-19, so it needs to pick the min and max date for each S2Rate being grouped, because there can be situations where the same S2Rate spans many days.
Can anyone guide me towards this? Please don't take this as directly asking for help/code; even pointing me to the right concepts will do. I kinda have no clue how to proceed further.
Any help is much appreciated. TIA !
First, since S2Rate values can recur, consecutive runs of the same S2Rate must be identified. This can be done with a diff-cumsum trick. Skip this step if you'd like to group all rows with the same S2Rate together.
# identify consecutive groups of S2Rate
df["S2RateGroup"] = (df["S2Rate"].diff() != 0).cumsum()
df
Out[268]:
S2Rate S2BillDate Sale Average Total Sale S2RateGroup
0 20.0 2019-05-18 20.000000 20.0 1
1 15.0 2019-05-18 26.250000 420.0 2
2 15.0 2019-05-19 36.000000 180.0 2
3 7.5 2019-05-19 34.500000 172.5 3
4 7.5 2019-05-21 32.894737 625.0 3
Next, just write your custom title-producing function and put it into .agg() using Named Aggregation:
import numpy as np

def date_agg(col):
    dmin = col.min()
    dmax = col.max()
    return f"{dmin} to {dmax}" if dmax > dmin else f"{dmin}"

df.groupby("S2RateGroup").agg(  # or .groupby("S2Rate")
    s2rate=pd.NamedAgg("S2Rate", np.min),
    date=pd.NamedAgg("S2BillDate", date_agg),
    sale_avg=pd.NamedAgg("Sale Average", np.mean),
    total_sale=pd.NamedAgg("Total Sale", np.sum)
)
# result
Out[270]:
s2rate date sale_avg total_sale
S2RateGroup
1 20.0 2019-05-18 20.000000 20.0
2 15.0 2019-05-18 to 2019-05-19 31.125000 600.0
3 7.5 2019-05-19 to 2019-05-21 33.697368 797.5
Since you are new to pandas, it would also be helpful to go through the official how-to.
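For the dual-axis plot the question asks about, a minimal sketch, assuming the .agg() result above is stored in a variable named result (an illustrative name); date is already an ordinary column in that output, so it can serve directly as the x-axis:
ax = result.plot(x="date", y="s2rate", marker="o")                        # primary y-axis: S2Rate
result.plot(x="date", y="sale_avg", marker="o", secondary_y=True, ax=ax)  # secondary y-axis: Sale Average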

Join two dataframes based on different dates

I have two dataframes. One has the workdays and the stock price for the Apple stock. The other one holds quarterly data on the EPS. The lists of dates differ, but both are in chronological order. I want to add the values of the eps frame to the existing price dataframe in chronological order.
date close
0 2020-07-06 373.85
1 2020-07-02 364.11
2 2020-07-01 364.11
3 2020-06-30 364.80
4 2020-06-29 361.78
... ... ...
9969 1980-12-18 0.48
9970 1980-12-17 0.46
9971 1980-12-16 0.45
9972 1980-12-15 0.49
9973 1980-12-12 0.51
EPS:
date eps
0 2020-03-28 2.59
1 2019-12-28 5.04
2 2019-09-28 3.05
3 2019-06-29 2.22
4 2019-03-30 2.48
... ... ...
71 2002-06-29 0.09
72 2002-03-30 0.11
73 2001-12-29 0.11
74 2001-09-29 -0.11
75 2001-06-30 0.17
So my result should look something like this:
close eps
date
...
2020-04-01 240.91 NaN
2020-03-31 254.29 NaN
2020-03-30 254.81 NaN
2020-03-28 NaN 2.59
2020-03-27 247.74 NaN
2020-03-26 258.44 NaN
Notice the value "2020-03-28", which previously only existed in the eps frame, and now sits neatly where it belongs.
However, I can't get it to work. First I thought there must be a simple join, merge or concat that has this function and fits the data right where it belongs, in chronological order, but so far I couldn't find it.
My failed attempts:
pd.concat([df, eps], axis=0, sort=True) - simply appends the two dataframes
pd.merge_ordered(df, eps, fill_method="ffill", left_by="date") - simply ignores the eps dates
The goal is to plot this Dataframe with two graphs - One the stock price, and the other one with the eps data.
I think you need:
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
You can simply sort the concatenated dataframe afterwards by index. Thanks to @jezrael for the tip!
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
For some reason, the sort argument in the concat function doesn't sort my dataframe.
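As an aside, the merge_ordered attempt from the question is close; it needs on='date' rather than left_by='date'. A sketch, assuming date is still a regular column in both frames:
out = pd.merge_ordered(df, eps, on='date')               # ordered outer merge on date
out = out.set_index('date').sort_index(ascending=False)  # match the descending layout above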

Python Pandas totals and dates

I'm sorry for not posting the data, but it wouldn't really help. The thing is I need to make a graph, and I have a csv file full of information organised by date. It has 'Cases' 'Deaths' 'Recoveries' 'Critical' 'Hospitalized' 'States' as categories. It goes in order by date and has the amount of cases, deaths, recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start so I can't post my data. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum these categories". I'm assuming you mean that for each date, you want to sum the values across all the different regions to come up with the total values for Spain?
In which case, you will want to group by date, then .sum() the columns (you can drop the States column).
grouped_df = df.groupby("date")[["Cases", "Deaths", ...]].sum()
grouped_df.plot()
This snippet may not work directly; you may need to reformat the dates etc., but it should be enough to get you started.
I think you are looking for a groupby followed by a cumsum, not including the dates.
columns_to_group = ['Cases', 'Deaths',
                    'Recoveries', 'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum',
               'Recoveries_sum', 'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y = 'value', hue='variable')
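Note that the snippet above plots the per-day totals; the cumulative sum mentioned at the start could be added like this (a sketch, assuming df_grouped from above; df_cum is an illustrative name):
cats = ['Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized']
df_cum = df_grouped.set_index('date')[cats].cumsum()  # running totals per category
df_cum.plot()                                          # one increasing line per category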

Remove row with null value from pandas data frame

I'm trying to remove a row from my data frame in which one of the columns has a value of null. Most of the help I can find relates to removing NaN values which hasn't worked for me so far.
Here I've created the data frame:
# successfully created data frame
df1 = ut.get_data(symbols, dates) # column heads are 'SPY', 'BBD'
# can't get rid of row containing null val in column BBD
# tried each of these with the others commented out but always had an
# error or sometimes I was able to get a new column of boolean values
# but i just want to drop the row
df1 = pd.notnull(df1['BBD']) # drops rows with null val, not working
df1 = df1.drop(2010-05-04, axis=0)
df1 = df1[df1.'BBD' != null]
df1 = df1.dropna(subset=['BBD'])
df1 = pd.notnull(df1.BBD)
# I know the date to drop but still wasn't able to drop the row
df1.drop([2015-10-30])
df1.drop(['2015-10-30'])
df1.drop([2015-10-30], axis=0)
df1.drop(['2015-10-30'], axis=0)
with pd.option_context('display.max_row', None):
    print(df1)
Here is my output:
Can someone please tell me how I can drop this row, preferably both by identifying the row by the null value and how to drop by date?
I haven't been working with pandas very long and I've been stuck on this for an hour. Any advice would be much appreciated.
This should do the work:
df = df.dropna(how='any',axis=0)
It will erase every row (axis=0) that has "any" Null value in it.
EXAMPLE:
import numpy as np
import pandas as pd

#Recreate random DataFrame with NaN values
df = pd.DataFrame(index = pd.date_range('2017-01-01', '2017-01-10', freq='1d'))
# Average speed in miles per hour
df['A'] = np.random.randint(low=198, high=205, size=len(df.index))
df['B'] = np.random.random(size=len(df.index))*2
#Create dummy NaN value on 2 cells
df.iloc[2,1]=None
df.iloc[5,0]=None
print(df)
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-03 198.0 NaN
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-06 NaN 0.234882
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
#Delete row with dummy value
df = df.dropna(how='any',axis=0)
print(df)
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
See the reference for further detail.
If everything is OK with your DataFrame, dropping NaNs should be as easy as that. If this is still not working, make sure you have the proper datatypes defined for your column (pd.to_numeric comes to mind...)
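A minimal sketch of that datatype fix, assuming the problem column is BBD and its 'null' entries are strings that should become real NaN values:
df1['BBD'] = pd.to_numeric(df1['BBD'], errors='coerce')  # strings like 'null' become NaN
df1 = df1.dropna(subset=['BBD'])                         # now dropna can remove those rows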
To drop rows with a null in any column:
df = df.dropna(how='any',axis=0)
If you want to drop rows based on one column only (here B):
df[~df['B'].isnull()]
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-06 NaN 0.234882
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
Only the 2017-01-03 row (NaN in column B) is dropped; 2017-01-06 is kept even though A is NaN.
Please forgive any mistakes.
To remove all rows with null values, the dropna() method will be helpful:
df.dropna(inplace=True)
To remove rows which contain a null value in a particular column, use this code:
df.dropna(subset=['column_name_to_remove'], inplace=True)
It appears that the value in your column is the string "null" and not a true NaN, which is what dropna is meant for. So I would try:
df[df.BBD != 'null']
or, if the value is actually a NaN then,
df[pd.notnull(df.BBD)]
I recommend giving one of these two lines a try:
df_clean = df1[df1['BBD'].isnull() == False]
df_clean = df1[df1['BBD'].isna() == False]
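The question also asks how to drop the row by its date. A minimal sketch, assuming the index is a DatetimeIndex (note that the unquoted 2015-10-30 in the attempts above is evaluated as integer subtraction, which is one reason they fail):
df1 = df1.drop(pd.Timestamp('2015-10-30'))  # drop by date label
df1 = df1[df1['BBD'].notna()]               # or drop by the null value in BBD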
