I'm trying to identify the IDs in a dataframe whose values have increased over the past XYZ cycles.
My data looks like below. I would like to identify IDs like A, because the values for A have been increasing over the past 4 months, from Jan to Apr. My data is updated monthly, covers 5 years in total, and I always look at the most recent cycles.
ID Month Value
A Jan 1
A Feb 2
A Mar 3
A Apr 4
B Jan 1
B Feb 1
B Mar 3
B Apr 2
I have tried to define a function for a list and apply it to the dataframe, but it doesn't work:
import numpy as np

def Consecutive(values):
    # True if the sorted values form an unbroken run of consecutive integers
    n = len(values) - 1
    return sum(np.diff(sorted(values)) == 1) >= n
Any suggestion is appreciated! Thank you!
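For what it's worth, here is a minimal sketch of one way to do it on the sample above with pandas, assuming the rows can be put in chronological order (with the full 5 years of data you would sort by a real date column rather than the month name) and that "increasing" means strictly increasing over the last N cycles per ID:

import pandas as pd

df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'] * 2,
    'Value': [1, 2, 3, 4, 1, 1, 3, 2],
})

# put months in chronological order so the most recent cycles are the last rows
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month'] = pd.Categorical(df['Month'], categories=month_order, ordered=True)
df = df.sort_values(['ID', 'Month'])

n = 4  # number of most recent cycles to check

def strictly_increasing_last_n(values, n):
    """True if there are at least n values and the last n are strictly increasing."""
    recent = values.tail(n)
    return len(recent) == n and recent.is_monotonic_increasing and recent.nunique() == n

increasing_ids = [id_ for id_, grp in df.groupby('ID')
                  if strictly_increasing_last_n(grp['Value'], n)]
print(increasing_ids)  # ['A'] for the sample data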
Related
I have the below dataframe:
OUTLET_UNQ_CODE Category_Code month
0 2018020000065 SSSI January 21
1 2018020000066 SSSI January 21
2 2018020000067 SSSI January 21
...
512762 2021031641195 CH March 21
512763 2021031642445 CH March 21
512764 2021031643357 GM March 21
512765 2021031643863 GM March 21
There are a few OUTLET_UNQ_CODE values that have changed their Category_Code within a month, and in the next month as well. I need to count the number of hops every outlet has made. For example: if 2021031643863 had Category_Code GM in Jan 21, then CH in Jan 21 again, CH in Feb, and Kirana in March, this counts as 2 hops.
This is what I have tried:
s = pd.to_numeric(new_df.Category_Code, errors='coerce')
df = new_df.assign(New=s.bfill())[s.isnull()].groupby('OUTLET_UNQ_CODE').agg({'Category_Code': list})
df.reset_index(inplace=True)
O/P is:
OUTLET_UNQ_CODE Category_Code
0 2021031643863 [GM,CH,CH,Kirana]
Regardless of whether there is a better way starting from the beginning to achieve the goal, here is a piece of code, based on your output, that gets the number of changes in the list:
cat_lst = ['GM', 'CH', 'CH', 'Kirana']
a = sum(1 for i, x in enumerate(cat_lst[:-1]) if x != cat_lst[i+1])
# in this case the result of a is 2
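If useful, the hop count per outlet can also be computed directly on the dataframe, without building the lists first; here is a sketch (assuming new_df is already in chronological order within each outlet):

# previous Category_Code within the same outlet (NaN for each outlet's first row)
prev = new_df.groupby('OUTLET_UNQ_CODE')['Category_Code'].shift()

# a hop is any row whose category differs from the previous one for that outlet
hops = ((new_df['Category_Code'] != prev) & prev.notna()) \
        .groupby(new_df['OUTLET_UNQ_CODE']).sum()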
I have the following column in my dataframe
year-month
2020-01
2020-01
2020-01
2020-02
2020-02
...
2021-06
This column is stored as an "object" type in my dataframe. I didn't convert it to a "datetime" type from the outset because then my values would change to "2020-01-01" instead(?)
Anyway, I wanted to do a value_counts() by month so that I can plot it subsequently. How can I order the value_counts() by month while reflecting the months as "Jan", "Feb"..."Dec" at the same time?
I've tried this:
pd.to_datetime(df['year-month']).dt.month.value_counts().sort_index()
However, the months are reflected as "1", "2"..."12", which isn't what I want.
I then tried this:
pd.to_datetime(df['year-month']).dt.strftime('%b').value_counts().sort_index()
This gives me the months as "Jan", "Feb"..."Dec" indeed, but now they're sorted in alphabetical order instead of the actual month sequence.
From this point of yours:
result = pd.to_datetime(df["year-month"]).dt.strftime("%b").value_counts()
we can reindex the result so that the index becomes the month name abbreviations in order. This can be borrowed from the calendar module:
import calendar
# slicing out the first since it is empty string
month_names = calendar.month_abbr[1:]
# reindex and put 0 to those that didn't appear at all
result = result.reindex(month_names, fill_value=0)
to get
>>> result
Jan 3
Feb 2
Mar 0
Apr 0
May 0
Jun 1
Jul 0
Aug 0
Sep 0
Oct 0
Nov 0
Dec 0
(The reason calendar.month_abbr has an empty string at the beginning is that Python is 0-indexed but we say the 2nd month is February; putting an empty string there results in month_abbr[2] == "Feb".)
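An alternative, if you prefer to keep the count-then-sort flow, is to count by month number, sort numerically, and only relabel afterwards; a small sketch of that idea (reusing the calendar import above):

# count by month number (1..12), sort numerically, then relabel with abbreviations
counts = pd.to_datetime(df["year-month"]).dt.month.value_counts().sort_index()
counts.index = [calendar.month_abbr[m] for m in counts.index]

Note that this variant only lists months that actually appear in the data; the reindex above also fills in the missing months with 0.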
I am working on a dataframe in python (mostly pandas and numpy). Example is given below.
Name ID Guess Date Topic Delta
0 a 23 5 2019 1 (Person A's Guess in 2019 - Guess in 2018)
1 a 23 8 2018 1
2 c 7 7 2019 1 (Person C's Guess in 2019 - Guess in 2018)
3 c 7 4 2018 1
4 e 12 9 2018 1
5 a 23 3 2020 2
I want to fill the empty column Delta, which is just the difference between the current guess and the person's previous guess on the same topic. I am having trouble since I need to match on both the topic and the person's ID.
The dataset is fairly large (> 1 million entries), which is why my approach of iterating over it caused trouble on the full dataframe.
I sorted the dataframe in the above way in order to try and solve it with .shift(); however, I guess there must be a solution without sorting the df, since I have enough identifiers (ID, Date, Topic).
import numpy as np

for i in range(len(df) - 1):
    if df['ID'].iloc[i] == df['ID'].iloc[i + 1]:
        df['Delta'].iloc[i + 1] = df['Guess'].iloc[i + 1] - df['Guess'].iloc[i]
    else:
        df['Delta'].iloc[i + 1] = np.nan
If anyone knows a more efficient (maybe vectorized) solution to this problem, I would greatly appreciate hints and help.
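One possible vectorized version, sketched under the assumption that there is at most one guess per ID, Topic and Date: sort so each person's guesses on a topic run from oldest to newest, then take the within-group difference.

import pandas as pd

# sort so consecutive rows within each (ID, Topic) pair go from oldest to newest guess
df = df.sort_values(['ID', 'Topic', 'Date'])

# current guess minus the previous guess in the same group; NaN where there is no earlier guess
df['Delta'] = df.groupby(['ID', 'Topic'])['Guess'].diff()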
I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform the calculation
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1
over the VALUE column. What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod()-1
#-26714.522733572892
If you want cumulative product to create a new column use Series.cumprod:
df['new_column']=df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
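As a quick sanity check (just a sketch), the last entry of the cumulative column should match the scalar result from the first expression:

# the final cumulative value equals the overall product-minus-one
assert abs(df['new_column'].iloc[-1] - (df['VALUE'].add(1).prod() - 1)) < 1e-6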
I think you're after...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
I have a dataframe that looks as follows
SIM Sim_1 Sim_2
2015 100.0000 100.0000
2016 2.504613 0.123291
2017 3.802958 -0.919886
2018 4.513224 -1.976056
2019 -0.775783 3.914312
The following operation
df = sims.shift(1, axis=0) * (1 + sims / 100)
returns a dataframe which looks like this
SIMULATION Sim_1 Sim_2
2015 NaN NaN
2016 102.504613 100.123291
2017 2.599862 0.122157
2018 3.974594 -0.901709
The value in 2016 is exactly the one that should be calculated. But the calculation for 2017 should take the output of the formula for 2016 (102.504613 and 100.123291) as its input. Instead, the formula uses the original 2016 values, producing 2.599862 and 0.122157.
Is there a simple way to do this in Python?
You are trying to show the growth of 100 given subsequent returns. Your problem is that the initial 100 is not in the same units as the returns. If you replace it with zero (a 0% return) and then take a cumulative product, your problem is solved.
sims.iloc[0] = 0
sims.div(100).add(1).cumprod().mul(100)
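A minimal sketch of the same idea wrapped in a function (the function name and the start argument are just for illustration), assuming the first row always holds the starting level and the remaining rows hold percentage returns:

import pandas as pd

def compound_levels(sims: pd.DataFrame, start: float = 100.0) -> pd.DataFrame:
    """Convert percentage returns (first row = starting level) into compounded levels."""
    returns = sims.copy()
    returns.iloc[0] = 0  # treat the starting row as a 0% return
    return returns.div(100).add(1).cumprod().mul(start)

For Sim_1 in the sample this compounds to roughly 100.00, 102.50, 106.40, 111.21 and 110.34, i.e. each year builds on the previous year's result, which is the recursive behaviour the question asks for.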
Just a crude way of implementing this (column names adjusted to match the data above):
for i in range(len(df2)):
    try:
        df2['Sim_1'][i] = float(df2['Sim_1'][i]) + float(df2['Sim_1'][i - 1])
        df2['Sim_2'][i] = float(df2['Sim_2'][i]) + float(df2['Sim_2'][i - 1])
    except Exception:
        pass
There may be a better way to optimize this.