% Increase value in number record - python

my dataset
name date record
A 2018-09-18 95
A 2018-10-11 104
A 2018-10-30 230
A 2018-11-23 124
B 2020-01-24 95
B 2020-02-11 167
B 2020-03-07 78
As you can see, there are several records per name, ordered by date.
For each name, I would like to find the record that rose the most compared to the record before it.
The output I want:
name record_before_date record_before record_increase_date record_increase increase_rate
A 2018-10-11 104 2018-10-30 230 121.15
B 2020-01-24 95 2020-02-11 167 75.79
I'm not comparing the lowest record to the highest; I want to find, for each name, the record with the highest rate of increase over the record immediately before it, together with that rate of increase.
increase rate formula = (record_increase - record_before) / record_before * 100
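For example, for name A the largest jump is from 104 to 230, so the increase rate is (230 - 104) / 104 * 100 = 121.15 (rounded).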
Any help would be appreciated.
thanks for reading.

Use:
#get percent change per group
s = df.groupby("name")["record"].pct_change()
#get row with maximal percent change
df1 = df.loc[s.groupby(df['name']).idxmax()].add_suffix('_increase')
#get the row immediately before the maximal percent change
df2 = (df.loc[s.groupby(df['name'])
               .apply(lambda x: x.shift(-1).idxmax())].add_suffix('_before'))
#join together
df = pd.concat([df2.set_index('name_before'),
                df1.set_index('name_increase')], axis=1).rename_axis('name').reset_index()
#apply the formula
df['increase_rate'] = (df['record_increase'].sub(df['record_before'])
                                            .div(df['record_before'])
                                            .mul(100))
print (df)
name date_before record_before date_increase record_increase \
0 A 2018-10-11 104 2018-10-30 230
1 B 2020-01-24 95 2020-02-11 167
increase_rate
0 121.153846
1 75.789474
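For reference, a minimal reproduction sketch (not part of the original answer) that builds the question's sample frame so the steps above can be run end-to-end; dates are kept as strings here:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                   'date': ['2018-09-18', '2018-10-11', '2018-10-30', '2018-11-23',
                            '2020-01-24', '2020-02-11', '2020-03-07'],
                   'record': [95, 104, 230, 124, 95, 167, 78]})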

Related

Calculate Multiple Column Growth in Python Dataframe

The data I used looks like this:
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate the growth of each variable for each year, so the result will look like this:
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already do it manually for each variable and each year with code like this:
data['2001_X1_gro'] = (data['2001_X1'] - data['2000_X1']) / data['2000_X1']
data['2002_X1_gro'] = (data['2002_X1'] - data['2001_X1']) / data['2001_X1']
data['2001_X2_gro'] = (data['2001_X2'] - data['2000_X2']) / data['2000_X2']
data['2002_X2_gro'] = (data['2002_X2'] - data['2001_X2']) / data['2001_X2']
Is there a way to do it more efficiently, especially if I have more years and/or more variables?
import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
new_name = ''
year = ''
for i in range(1, len(df.columns) - 2):
    year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
    new_name = year + qqq
    df[new_name] = (df[year] - df[df.columns[i]]) / df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, and 1 is added to it. The value is converted back to a string and the '_Xn' suffix is re-attached. The new_name variable is then formed by appending the string '_gro', and a column with that name is created and filled with the calculated values.
If you want to compute growth over, say, three years, add 3 instead of 1. This assumes your columns are ordered by year. Also note that the loop does not go through all the columns: for i in range(1, len(df.columns) - 2) skips the Subject column and stops before the last two columns, so you need to know where to stop it.
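For larger data, here is a hedged alternative sketch (not from the original answer): reshape the wide columns into a long table, compute year-over-year growth per variable with groupby + pct_change, then pivot back to wide. It assumes every column except Subject is named 'YYYY_Xn'.
import pandas as pd

df = pd.DataFrame({'Subject': [1, 2, 3],
                   '2000_X1': [100, 95, 110], '2000_X2': [50, 40, 45],
                   '2001_X1': [120, 100, 100], '2001_X2': [45, 45, 45],
                   '2002_X1': [110, 105, 110], '2002_X2': [50, 50, 40]})

# wide -> long: one row per Subject / year / variable
long = df.melt(id_vars='Subject', var_name='year_var', value_name='value')
long[['year', 'var']] = long['year_var'].str.split('_', expand=True)
long['year'] = long['year'].astype(int)
long = long.sort_values(['Subject', 'var', 'year'])

# growth relative to the previous year within each Subject/variable pair
long['gro'] = long.groupby(['Subject', 'var'])['value'].pct_change()

# long -> wide: drop the first year (no previous value) and rebuild the column names
wide = (long.dropna(subset=['gro'])
            .assign(col=lambda d: d['year'].astype(str) + '_' + d['var'] + '_gro')
            .pivot(index='Subject', columns='col', values='gro')
            .reset_index())
print(wide)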

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind for me is calculating the total number of users, working out 20% of that, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now, calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
The top 2 users alone generate 23.3% of the total revenue.
This seems like a case for df.quantile: per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This prints the 80th and 100th percentile value of each column (with interpolation='nearest', these are actual values from the data):
user_id revenue
0.8 2873 489
1.0 8942 28934
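As a hedged follow-up (an assumption on my part, not stated in the original answer): if the goal is the actual rows for the top 20% of users by revenue, rather than the percentile values themselves, a common pattern is to filter against the 0.8 quantile of the revenue column:
top_20_pct_users = df[df['revenue'] >= df['revenue'].quantile(0.8)]
print(top_20_pct_users)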
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted, with a clear indication of the top contributing rows, and the created top_percent df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output
n_percent_revenue_generating_users(df, 'revenue', 20)

How to unify (collapse) multiple columns into one assigning unique values

Edited my previous question:
I want to distinguish each Device (FOUR types) attached to a particular Building's particular Elevator (represented by height).
As there are no unique IDs for the devices, I want to identify them and assign a unique ID to each by grouping on ('BldgID', 'BldgHt', 'Device') to identify any particular 'Device'.
I then want to count their testing results, i.e. how many times a device failed (NG) out of the total number of tests (NG + OK) on any particular date, over the entire duration of a few months.
The original dataframe looks like this:
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of 'NG' per day for each set (consisting of 'BldgID', 'BldgHt', 'Device').
#aggregate with both functions in a single groupby
df1 = mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])['Result']\
            .agg([('NG', lambda x: (x == 'NG').sum()),
                  ('ALL', 'count')]).round(2).reset_index()
#create New_ID by inserting a Series of zero-padded 3-digit values
s = pd.Series(np.arange(1, len(mel_df2) + 1),
              index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I groupby ['BldgID', 'BldgHt', 'Device', 'Date'] then I get the per-day 'NG'.
But that treats every day separately, whereas if I assign unique IDs I can plot how each unique Device behaves from one day to the next.
If I groupby ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' for that set (or unique Device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.
Use:
#aggregate with both functions in a single groupby
df1 = mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])['Result']\
            .agg([('NG', lambda x: (x == 'NG').sum()),
                  ('ALL', 'count')]).round(2).reset_index()
#filter out rows with 0 NG
mel_df2 = df1[df1.NG != 0]
#keep only the first row per Date
mel_df2 = mel_df2.drop_duplicates('Date')
#create New_ID by inserting a Series of zero-padded 3-digit values
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 760 2018/11/20 1 1
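A hedged sketch (not part of the answer above): if the goal is one ID per unique ('BldgID', 'BldgHt', 'Device') combination across all dates, rather than one ID per kept row, groupby().ngroup() can assign those IDs directly on the aggregated frame df1 built above:
group_id = df1.groupby(['BldgID', 'BldgHt', 'Device']).ngroup() + 1
df1.insert(0, 'New_ID', group_id.astype(str).str.zfill(3))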

Merging 2 csv data sets with Python a common ID column- one csv has multiple records for a unique ID

I'm very new to Python. Any support is much appreciated.
I have two csv files that I'm trying to merge on a Student_ID column to create a new csv file.
csv1: every entry has a unique Student_ID
Student_ID Age Course startYear
119 24 Bsc 2014
csv2: has multiple records per Student_ID, as there is a new entry for every subject the student is taking
Student_ID sub_name marks Sub_year_level
119 Botany1 60 2
119 Anatomy 70 2
119 cell bio 75 3
129 Physics1 78 2
129 Math1 60 1
I want to merge the two csv files so that I have all records and columns from csv1, plus new columns holding the average mark (which has to be calculated) from csv2 for each Sub_year_level per student. The final csv file will then have a unique Student_ID in every record.
What I want my new output csv file to look like:
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
119 24 Bsc 2014 60 65 70
You can use pivot_table with join.
Notice: the fill_value parameter replaces NaN with 0; remove it if that is not needed. The default aggregate function is mean.
df2 = df2.pivot_table(index='Student_ID',
                      columns='Sub_year_level',
                      values='marks',
                      fill_value=0) \
         .rename(columns='level{}_avg_mark'.format)
print (df2)
Sub_year_level level1_avg_mark level2_avg_mark level3_avg_mark
Student_ID
119 0 65 75
129 60 78 0
df = df1.join(df2, on='Student_ID')
print (df)
Student_ID Age Course startYear level1_avg_mark level2_avg_mark \
0 119 24 Bsc 2014 0 65
level3_avg_mark
0 75
EDIT:
A custom function is needed to ignore the zero marks:
print (df2)
Student_ID sub_name marks Sub_year_level
0 119 Botany1 0 2
1 119 Botany1 0 2
2 119 Anatomy 72 2
3 119 cell bio 75 3
4 129 Physics1 78 2
5 129 Math1 60 1
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level',
                       values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format).reset_index())
print (df2)
Sub_year_level Student_ID level1_avg_mark level2_avg_mark level3_avg_mark
0 119 NaN 72.0 75.0
1 129 60.0 78.0 NaN
You can use groupby to calculate the average marks per level.
Then unstack to get all levels in one row.
rename the columns.
Once that is done, the groupby + unstack has conveniently left Student_ID in the index which allows for an easy join. All that is left is to do the join and specify the on parameter.
d1.join(
    d2.groupby(
        ['Student_ID', 'Sub_year_level']
    ).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
    on='Student_ID'
)
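For the csv-to-csv workflow the question describes, here is a hedged end-to-end sketch; the file names 'students.csv', 'subjects.csv' and 'merged.csv' are hypothetical placeholders, and the column names follow the question:
import pandas as pd

d1 = pd.read_csv('students.csv')   # csv1: one row per Student_ID
d2 = pd.read_csv('subjects.csv')   # csv2: many rows per Student_ID

# average mark per student and year level, reshaped to one column per level
avg = (d2.groupby(['Student_ID', 'Sub_year_level'])['marks'].mean()
         .unstack()
         .rename(columns='level{}_avg_mark'.format))

out = d1.join(avg, on='Student_ID')
out.to_csv('merged.csv', index=False)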

calculate values between two pandas dataframe based on a column value

EDITED: let me copy the whole data set
df is the store sales/inventory data
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE [garbled] EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE [garbled] EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE [garbled] EEBW52301M 39 170 6 3 3 -3
dh is the transaction data (move 'amount' from store 'from' to store 'to')
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transactions, in exactly the same format as df, but with only the 'in_stock' numbers changed from the original df according to the amounts in dh.
Below is what I tried:
df['full_code'] = df['store'] + df['style'] + df['color'].astype(str) + df['size'].astype(str)
dh['from_code'] = dh['from'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)
dh['to_code'] = dh['to'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)
# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock
# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock
df.to_csv('d:/after_dh.csv')
But when I open the csv file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock has some problem. What is the correct way of updating the value?
ORIGINAL: I have two pandas dataframes: df1 for the inventory, df2 for the transactions.
df1 looks something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 looks something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there a pandas way of doing this? I got confused between the original dataframe and a view of it, and solved it by turning everything into numpy arrays and finding matching full_codes. But the resulting code is also a mess, and I wonder if there is a simpler way of doing this without converting everything to numpy arrays.
What I would do is set the index of df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass as the values is the result of grouping on 'full_code' and calling sum on the 'amount' column.
An additional param for sub is fill_value; this is because product 'CCC' does not exist on the rhs, so we want its value to be preserved, otherwise it becomes NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()
Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150
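A hedged sketch (not part of the original answer): the same set_index / sub idea applied to the edited question, where 'amount' leaves the 'from' store and enters the 'to' store. It assumes the full_code, from_code and to_code columns are built as in the question above.
moved_out = dh.groupby('from_code')['amount'].sum()
moved_in = dh.groupby('to_code')['amount'].sum()
net_change = moved_in.sub(moved_out, fill_value=0)   # +incoming, -outgoing per code

df = df.set_index('full_code')
# reindex restricts the change to codes that exist in df; untouched codes get 0
df['in_stock'] = df['in_stock'].add(net_change.reindex(df.index, fill_value=0))
df = df.reset_index()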
