Using pandas, how to get data for the next analysis step - python

I have this data:
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=False)
df['Level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if (len(x) < len(a)):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Next, I need to get the villages that were selected by the sampling.
So, I just want the selected village names with the corresponding Workers and Level columns only,
like this: (Excel photo)
I just want those village names, because I don't want to show each step.
The sampling selects 5 villages, and only that data should be shown. Any help?

It seems you need head:
result_df = df.head(n=5)
result_df
result_df will be:
Village Workers Level Sum_Level_wise Probability Sample Selected villages Selected village
7 Takali 127 Large 474 0.60 2.40 3 Takali
8 Gardhani 122 Large 474 0.60 2.40 3 Gardhani
9 Pi.Khand 120 Large 474 0.60 2.40 3 Pi.Khand
10 Pangri 105 Large 474 0.60 2.40 3
4 Dhokari 84 Medium 194 0.25 0.75 1 Dhokari
If you need just columns 'Village','Workers' and 'Level', then try with:
result_df[['Village','Workers','Level']]
It will give you:
Village Workers Level
7 Takali 127 Large
8 Gardhani 122 Large
9 Pi.Khand 120 Large
10 Pangri 105 Large
4 Dhokari 84 Medium
Update:
import numpy as np

df['Selected village'].replace('', np.nan, inplace=True)
df.dropna(subset=['Selected village'], inplace=True)
df[['Workers','Level','Selected village']]
it will give:
Workers Level Selected village
0 10 Small Aagar
4 84 Medium Dhokari
7 127 Large Takali
8 122 Large Gardhani
9 120 Large Pi.Khand
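An equivalent approach, as a minimal sketch, is a boolean mask, which avoids the in-place replace/dropna pair. The frame below is a hypothetical subset mirroring the result above:

```python
import pandas as pd

# Hypothetical subset of the frame above: some rows have an empty
# string in 'Selected village'.
df = pd.DataFrame({
    'Workers': [10, 84, 127, 122, 105],
    'Level': ['Small', 'Medium', 'Large', 'Large', 'Large'],
    'Selected village': ['Aagar', 'Dhokari', 'Takali', 'Gardhani', ''],
})

# Keep only rows where a village was actually selected.
selected = df.loc[df['Selected village'] != '',
                  ['Workers', 'Level', 'Selected village']]
print(selected)
```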

Related

Calculate Multiple Column Growth in Python Dataframe

The data I used look like this
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate each variable growth for each year so the result will look like this
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already do it manually for each variable for each year with code like this
data['2001_X1_gro'] = (data['2001_X1'] - data['2000_X1']) / data['2000_X1']
data['2002_X1_gro'] = (data['2002_X1'] - data['2001_X1']) / data['2001_X1']
data['2001_X2_gro'] = (data['2001_X2'] - data['2000_X2']) / data['2000_X2']
data['2002_X2_gro'] = (data['2002_X2'] - data['2001_X2']) / data['2001_X2']
Is there a way to do it more efficiently, especially if I have more years and/or more variables?
import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
for i in range(1, len(df.columns) - 2):
    year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
    new_name = year + qqq
    df[new_name] = (df[year] - df[df.columns[i]]) / df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, and 1 is added to it. The value is converted back to a string, keeping the '_Xn' suffix. A new_name variable is built by appending the string '_gro'. A column with that name is created and filled with the calculated values.
If you want to compute growth over, for example, three years, add 3 instead of 1. This assumes your data are ordered by year. Also note that the loop does not go through all the columns: for i in range(1, len(df.columns) - 2) skips the Subject column and stops short of the last two columns, so you need to know where to stop it.
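As a hedged alternative to the explicit loop, the 'YYYY_Xn' labels can be parsed into a (year, variable) MultiIndex so that pct_change does the year-over-year arithmetic; the data are taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Subject': [1, 2, 3],
    '2000_X1': [100, 95, 110], '2000_X2': [50, 40, 45],
    '2001_X1': [120, 100, 100], '2001_X2': [45, 45, 45],
    '2002_X1': [110, 105, 110], '2002_X2': [50, 50, 40],
}).set_index('Subject')

# Split 'YYYY_Xn' labels into a (year, var) MultiIndex.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in df.columns], names=['year', 'var'])

# Move the variable into the row index so the years line up as columns,
# then pct_change computes growth across consecutive years; the first
# year is all-NaN and is dropped.
wide = df.stack(level='var').sort_index(axis=1)
growth = wide.pct_change(axis=1).dropna(axis=1, how='all')
print(growth.round(6))
```

This scales to any number of years and variables without naming them by hand, at the cost of a reshape.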

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The methods that come closest to mind for me are calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
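The step-by-step result above can be turned into a selection directly; this sketch keeps every user whose preceding cumulative revenue is still under the 20% threshold (so the row crossing the mark is included):

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    'user_id': [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
    'revenue': [21, 20, 23, 23, 28, 22, 20, 24, 21, 25],
})
df = df.sort_values('revenue', ascending=False)
cum = df['revenue'].cumsum() / df['revenue'].sum()

# A row is kept while the revenue accumulated *before* it is under 20%.
top = df[cum.shift(fill_value=0.0) < 0.20]
print(top)
```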
This seems to be a case for df.quantile: per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile values you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This prints the values at the 0.8 and 1.0 quantiles. Note that quantiles are computed per column independently, so the rows below are not necessarily actual records from the dataframe:
user_id revenue
0.8 2873 489
1.0 8942 28934
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output
n_percent_revenue_generating_users(df, 'revenue', 20)
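A quick self-contained check of the approach above on the question's first five rows; the function is repeated here, lightly restructured so it does not mutate the input or leave helper columns behind:

```python
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    # Sort by revenue, take the cumulative share, and find the row whose
    # cumulative percentage is nearest to (but above) the target.
    df = df.sort_values(by=[col], ascending=False)
    csp = 100 * df[col].cumsum() / df[col].sum()
    over = csp[csp > n_percent]
    threshold = df.loc[(over - n_percent).abs().idxmin(), col]
    return df[df[col] >= threshold]

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
top = n_percent_revenue_generating_users(df, 'revenue', 20)
print(top)
```

On this small sample the single user 8942 already exceeds 20% of total revenue, so only that row is returned.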

Pandas GroupBy with special sum

Let's say I have data like this, and I want to group it by feature and type.
feature type size
Alabama 1 100
Alabama 2 50
Alabama 3 40
Wyoming 1 180
Wyoming 2 150
Wyoming 3 56
When I apply df=df.groupby(['feature','type']).sum()[['size']], I get this as expected.
size
(Alabama,1) 100
(Alabama,2) 50
(Alabama,3) 40
(Wyoming,1) 180
(Wyoming,2) 150
(Wyoming,3) 56
However, I want to sum sizes over the same type only, not both type and feature, while keeping the (feature, type) tuple as the index. I mean I want to get something like this:
size
(Alabama,1) 280
(Alabama,2) 200
(Alabama,3) 96
(Wyoming,1) 280
(Wyoming,2) 200
(Wyoming,3) 96
I am stuck trying to find a way to do this. I need some help, thanks.
Use set_index to create the MultiIndex, then transform with 'sum', which returns a Series of the same length as the original from the aggregate function:
df = df.set_index(['feature','type'])
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
EDIT: First aggregate both columns and then use transform
df = df.groupby(['feature','type']).sum()
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
Here is one way:
df['size_type'] = df['type'].map(df.groupby('type')['size'].sum())
df.groupby(['feature', 'type'])['size_type'].sum()
# feature type
# Alabama 1 280
# 2 200
# 3 96
# Wyoming 1 280
# 2 200
# 3 96
# Name: size_type, dtype: int64

Using pandas & pivot table, how to use column (level) groupby sum values for the next analysis steps?

I want to find out how many samples will be taken from each level using the proportional allocation method.
I have 3 levels in total: [Small, Medium, Large].
First, I want to take the sum for these 3 levels.
Next, I want to find the probability for these 3 levels.
Next, I want to multiply this probability by the number of villages in each level.
And the last step is: the samples will be selected as the top villages of each level.
Data :
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Let me explain; I am attaching an Excel photo.
In the first step, I want to get the sum for each level -> Small, Medium and Large, i.e. (10+32+34+42) = 118 for the Small level.
In the next step I want to find the probability for each level, rounded to 2 decimals,
i.e. (118/786) = 0.15 for the Small level.
Then, multiplying the length (size) of each level by its probability gives how many samples (villages) are taken from that level.
E.g. for the Medium level we have probability 0.25 and 3 villages, so 0.25 * 3 = 0.75 samples will be taken from the Medium level.
Rounding up to the next whole number, 0.75 ~ 1 sample is taken from the Medium level, and it will take the top village in this level; so in the Medium level the "Dhokari" village will be selected.
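The per-level arithmetic described above can be sketched compactly with one groupby (data taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Village': ['Aagar', 'Dhagewadi', 'Sherewadi', 'Shindwad', 'Dhokari',
                'Khanapur', 'Ambikanagar', 'Takali', 'Gardhani',
                'Pi.Khand', 'Pangri'],
    'Workers': [10, 32, 34, 42, 84, 65, 45, 127, 122, 120, 105],
    'Level': ['Small'] * 4 + ['Medium'] * 3 + ['Large'] * 4,
})

# Per-level sum and count, then probability and sample size as described.
g = df.groupby('Level')['Workers']
summary = pd.DataFrame({'sum': g.sum(), 'size': g.size()})
summary['probability'] = (summary['sum'] / df['Workers'].sum()).round(2)
summary['sample'] = np.ceil(summary['probability'] * summary['size']).astype(int)
print(summary)
```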
I have done some work,
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=True)
df['level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df
I used this command to get the sums for the levels; I am confused about what to do next:
df=df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible in Python to get the village names that I got using Excel?
You can use:
transform with 'sum' to broadcast the per-level sum across the column's full length
div to divide by the total sum, then round
another transform with 'size'
and last a custom function
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected village'] = (df.groupby('Level')
    .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0]))
    .reset_index(level=0)['Village'])
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
You can debug with a custom function:
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if (len(x) < len(a)):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')

Plotting a histogram in Pandas with very heavy-tailed data

I am often working with data that has a very 'long tail'. I want to plot histograms to summarize the distribution, but when I try using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.
Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.
In [10]: data.value_counts().sort_index()
Out[10]:
0 8012
25 3710
100 10794
200 11718
300 2489
500 7631
600 34
700 115
1000 3099
1200 1766
1600 63
2000 1538
2200 41
2500 208
2700 2138
5000 515
5500 201
8800 10
10000 10
10900 465
13000 9
16200 74
20000 518
21500 65
27000 64
53000 82
56000 1
106000 35
530000 3
I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 530000 into one group of >50000, etc.), and also changing the y axis to show the percentage of occurrences rather than the absolute number. However, I don't understand how I would go about doing that automatically.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
mydict = {0: 8012,25: 3710,100: 10794,200: 11718,300: 2489,500: 7631,600: 34,700: 115,1000: 3099,1200: 1766,1600: 63,2000: 1538,2200: 41,2500: 208,2700: 2138,5000: 515,5500: 201,8800: 10,10000: 10,10900: 465,13000: 9,16200: 74,20000: 518,21500: 65,27000: 64,53000: 82,56000: 1,106000: 35,530000: 3}
mylist = []
for key in mydict:
    for e in range(mydict[key]):
        mylist.insert(0, key)
df = pd.DataFrame(mylist,columns=['value'])
df2 = df[df.value <= 5000]
Plot as a bar:
fig = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")
As a histogram (limited to values 5000 & under which is >97% of your data):
I like using linspace to control buckets.
df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')
EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.
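Another option for heavy tails, sketched below, is logarithmically spaced bins, which keep the large values visible instead of cropping at 5000. The counts here are a subset of the value_counts() shown in the question:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
from matplotlib import pyplot as plt

# Subset of the question's value counts.
mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 1000: 3099,
          5000: 515, 20000: 518, 53000: 82, 530000: 3}
values = np.repeat(list(mydict.keys()), list(mydict.values()))

# logspace cannot start at 0, so shift every value by 1 before binning.
bins = np.logspace(0, np.log10(values.max() + 1), num=30)
plt.hist(values + 1, bins=bins)
plt.xscale('log')
plt.savefig('hist_log.png')
```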
