Here is a link to a working example on Google Colaboratory.
I have a dataset that represents the reviews (between 0.0 and 10.0) that users have left on various books. It looks like this:
user sum count mean
0 2 0.0 1 0.000000
60223 159665 8.0 1 8.000000
60222 159662 8.0 1 8.000000
60221 159655 8.0 1 8.000000
60220 159651 5.0 1 5.000000
... ... ... ... ...
13576 35859 6294.0 5850 1.075897
37356 98391 51418.0 5891 8.728230
58113 153662 17025.0 6109 2.786872
74815 198711 123.0 7550 0.016291
4213 11676 62092.0 13602 4.564917
The first rows have 1 review while the last ones have thousands. I want to see the distribution of the reviews across the user population. I researched percentiles and binning data with Pandas and found pd.qcut and pd.cut, but using those I was unable to get the output in the format I want.
This is what I'm looking to get.
# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...
I could not figure out a "Pandas" way to do it so I wrote a loop to generate the data in that format myself and graph it.
SLICE_NUMBERS = 5
step_size = int(user_count / SLICE_NUMBERS)   # user_count and most_active_list are defined earlier
labels = ['100-80', '80-60', '60-40', '40-20', '0-20']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
    start_index = step_size * chunk_i
    end_index = start_index + step_size
    slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
    count_per_percentile.append(slice_sum)

print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]
How can I achieve the same outcome more directly with the library?
I think you can use qcut to create the slices, inside a groupby(...).sum(). With the given sample data slightly modified to avoid duplicated bin edges on this small sample (I replaced all the 1s in count with 1, 2, 3, 4, 5):
count_per_percentile = (
df['count']
.groupby(pd.qcut(df['count'], q=[0,0.2,0.4,0.6,0.8,1])).sum()
.tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]
which is the same result as with your method.
In case your real data has too many 1s (which would make qcut fail on duplicated bin edges), you could also use np.array_split:
count_per_percentile = [_s.sum() for _s in np.array_split(df['count'].sort_values(),5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152] #same result
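If you also want the percentile labels attached to each bucket (as in the desired output at the top) rather than a bare list, a minimal sketch with made-up counts that mirror the modified sample could pass labels directly to qcut:

import pandas as pd

# made-up counts standing in for the real 'count' column
df = pd.DataFrame({'count': [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

labels = ['0-20%', '20-40%', '40-60%', '60-80%', '80-100%']
per_bucket = df['count'].groupby(pd.qcut(df['count'], q=5, labels=labels)).sum()
print(per_bucket)
# 0-20%          3
# 20-40%         7
# 40-60%      5855
# 60-80%     12000
# 80-100%    21152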
I have a numpy array like below:
[12,544,73,56,30,84,34,29,78,22,73,23,98,83,35,62,52,94,44,67]
There are 20 numbers in this data, divided into 4 groups of 5 numbers each. So, for example:
12,544,73,56,30
84,34,29,78,22 etc.
I want to find out the maximum number from each group and store them in a list.
Like:
sol=[544,84,98,94]
I am very new to Python, please help.
Something like this?
import pandas as pd

field = [12,544,73,56,30,84,34,29,78,22,73,23,98,83,35,62,52,94,44,67]
field = pd.DataFrame(field)
# rolling max over a window of 5, then keep every 5th row (the last of each group)
field.rolling(window=5, win_type=None).max().iloc[4::5]
gives:
4 544.0
9 84.0
14 98.0
19 94.0
Every 5th step
Update
and a much faster one:
import numpy as np

field = np.array([12,544,73,56,30,84,34,29,78,22,73,23,98,83,35,62,52,94,44,67])
# reshape into 4 rows of 5 and take the max along each row
field.reshape(-1, 5).max(axis=1)
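Another pandas option, not part of the original answer: group the positional index by integer division and take the max of each group of 5.

import pandas as pd

field = pd.Series([12,544,73,56,30,84,34,29,78,22,73,23,98,83,35,62,52,94,44,67])
# index // 5 maps rows 0-4 to group 0, rows 5-9 to group 1, and so on
field.groupby(field.index // 5).max().tolist()   # [544, 84, 98, 94]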
Try splitting first, then find the max:
import numpy as np

x = np.array([12,544,73,56,30,84,34,29,78,22,73,23,98,83,35,62,52,94,44,67])
n = 4
# split into n chunks, stack them into a 2D array, and take the max of each chunk
res = np.array(np.array_split(x, n)).max(axis=1)
res:
array([544, 84, 98, 94])
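One difference between the two approaches, worth noting as an aside (not from the answers above): reshape requires the array length to be an exact multiple of the group size, while np.array_split also handles uneven splits.

import numpy as np

x = np.arange(7)                           # 7 elements, not divisible by 3
[a.max() for a in np.array_split(x, 3)]    # [2, 4, 6] - groups of sizes 3, 2, 2
# x.reshape(-1, 3) would raise a ValueError because 7 is not divisible by 3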
Edit: fixed a misunderstanding on my part - I am getting a nested list, not an array.
I'm working with a function in a for loop - bootstrapping some model predictions.
The code looks like this:
def revenue(predict):
    # predict: array of model predictions for one bootstrap sample
    revenue = predict * 4500
    profit = revenue - 500000
    return profit
And the loop I am feeding it into looks like this:
from sklearn.linear_model import LinearRegression

# set up a loop to select 500 random samples from our region 2 data set
model = LinearRegression(fit_intercept=True, normalize=False)  # assumed to be fitted on training data elsewhere
features = r2_test.drop(['product'], axis=1)
values = []
for i in range(1000):
    subsample = r2_test.sample(500, replace=False)
    features = subsample.drop(['product'], axis=1)
    predict = model.predict(features)
    result = revenue(predict)
    values.append(result)
So this runs 1000 loops of predictions, each on 500 samples from this dataframe:
id f0 f1 f2 product
0 74613 -15.001348 -8.276000 -0.005876 3.179103
1 9753 14.272088 -3.475083 0.999183 26.953261
2 93502 6.263187 -5.948386 5.001160 134.766305
3 33405 -13.081196 -11.506057 4.999415 137.945408
4 16486 12.702195 -8.147433 5.004363 134.766305
5 27901 -3.327590 -2.205276 3.003647 84.038886
6 69620 -11.142655 -10.133399 4.002382 110.992147
7 78940 4.234715 -0.001354 2.004588 53.906522
8 56159 13.355129 -0.332068 4.998647 134.766305
9 73142 1.069227 -11.025667 4.997844 137.945408
10 12663 11.777049 -5.334084 2.003033 53.906522
11 39849 16.320755 -0.562946 -0.001783 0.000000
12 61800 7.736313 -6.093374 3.982531 107.813044
13 72213 6.695604 -0.749449 -0.007630 0.000000
14 5479 -10.985487 -5.605994 2.991130 84.038886
15 6297 -0.347599 -6.275884 -0.003448 3.179103
16 88123 12.300570 2.944454 2.005541 53.906522
17 68352 8.900460 -5.632857 4.994324 134.766305
18 99029 -13.412826 -4.729495 2.998590 84.038886
19 64238 -4.373526 -8.590017 2.995379 84.038886
Now, once I have my output, I want to select the top 200 predictions from each iteration. I'm using this loop:
# select the top 200 values of each of the 500 iterations
top_200 = []
for i in range(0, 500):
    profits = values.nlargest(200, [i], keep='all')
    top_200.append(profits)
The problem I am running into is that when I feed values into the top_200 loop, I end up with the selected 200 arranged by column:
[ 0 1 2 3 \
628 125790.297387 -10140.964686 -361625.210913 -243132.040492
32 125429.134599 -368765.455544 -249361.525792 -497190.522207
815 124522.095794 -1793.660411 -11410.126264 114928.508488
645 123891.732231 115946.193531 104048.117460 -246350.752024
119 123063.545808 -124032.987348 -367200.191889 -131237.863430
.. ... ... ... ...
But I'd like to turn it into a dataframe. However, I haven't figured out how to do that while preserving the structure where column 0 has its 200 values, column 1 has its 200 values, etc.
I thought I could do something like:
top_200 = pd.DataFrame(top_200,columns= range(0,500))
and it gives me 500 columns, but only column 0 has anything in it, and I end up with a [500, 500] dataframe instead of the anticipated 200 rows by 500 columns.
I'm fairly sure there is a good way to do this, but my searching thus far has not turned anything up. I'm also not sure what the thing I'm looking for is called, so I don't know exactly what to search for.
Any input would be appreciated! Thanks in advance.
Further edit:
So now that I know I'm getting a list of lists, not an array, I thought I'd try to write to a dataframe instead:
# calculate the top 200 values of each of the 500 iterations
top_200 = pd.DataFrame(columns=['profits'])
for i in range(0,500):
top_200.loc[i] = i
profits = values.nlargest(200,[i],keep = 'all')
top_200.append(profits)
top_200.head()
But I've futzed something up here, as my results are:
profits
0 0
1 1
2 2
3 3
4 4
Where my expected results would be something like:
col 1 col2 col3
0 first n_largest first n_largest first n_largest
1 second n_largest second n_largest second n_largest
3 third n_largest third n_largest third n_largest
So, after doing some research based on @CygnusX's recommended question, I figured out that I was laboring under the impression that I had an array as the output, but of course top_200 = [] is a list, which, when combined with nlargest, gives me a list of lists.
Now that I understood the problem better, I converted the list of lists into a dataframe and then transposed the data, which gave me the results I was looking for.
# take the mean of the top 200 values of each of the 500 iterations
top_200 = []
for i in range(0, 500):
    profits = (values.nlargest(200, [i], keep='all')).mean()
    top_200.append(profits)
test = pd.DataFrame(top_200)
test = test.transpose()
Output (screenshot, because 500 columns):
There is probably a more elegant way to accomplish this, like not using a list but a dataframe, but I couldn't get .append to work the way I wanted on a dataframe, since I wanted to preserve the list of 200 nlargest values, not just a sum or a mean (which append worked great for!).
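For the record, a sketch (my own addition, not from the post) of one way to keep all 200 values per iteration rather than collapsing them to a mean, assuming values is a DataFrame with one column per iteration: collect each nlargest Series and concatenate them column-wise.

import numpy as np
import pandas as pd

# stand-in for the real results: 500 rows of predictions x 500 iterations
values = pd.DataFrame(np.random.randn(500, 500))

top_200 = pd.concat(
    [values[i].nlargest(200).reset_index(drop=True) for i in values.columns],
    axis=1,
)
print(top_200.shape)   # (200, 500)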
I'm a new Python user and I'm trying to learn this so I can complete a research project on cryptocurrencies. What I want to do is retrieve the value right after having found a condition, and retrieve the value 7 rows later in another variable.
I'm working within an Excel spreadsheet which has 2250 rows and 25 columns. By adding 4 columns as detailed just below, I get to 29 columns. It has lots of 0s (where no pattern has been found) and a few 100s (where a pattern has been found). I want my program to get the row right after the one where 100 is present and return its Close price. That way, I can see the difference between the day of the pattern and the day after the pattern. I also want to do this for seven days down the line, to find the performance of the pattern over a week.
Here's a screenshot of the spreadsheet to illustrate this
You can see -100 cells too, those are bearish pattern recognition. For now I just want to work with the "100" cells so I can at least make this work.
I want this to happen:
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
df['Next Close'] = np.nan_to_num(0) #adding these next four columns to my dataframe so I can fill them up with the later variables#
df['Variation2'] = np.nan_to_num(0)
df['Next Week Close'] = np.nan_to_num(0)
df['Next Week Variation'] = np.nan_to_num(0)
df['Close'].astype(float)
for row in df.itertuples(index=True):
str(row[7:23])
if ((row[7:23]) == 100):
nextclose = np.where(row[7:23] == row[7:23]+1)[0] #(I Want this to be the next row after having found the condition)#
if (row.Index + 7 < len(df)):
nextweekclose = np.where(row[7:23] == row[7:23]+7)[0] #(I want this to be the 7th row after having found the condition)#
else:
nextweekclose = 0
The reason I want these values is to later compare them with these variables:
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = true)
My errors come from the fact that I do not know how to retrieve the row+1 value and the row+7 value. I have searched high and low all day online and haven't found a concrete way to do this. Whichever idea I try gives me either a "can only concatenate tuple (not "int") to tuple" error or an "AttributeError: 'Series' object has no attribute 'close'". The second one I get when I try:
for row in df.itertuples(index=True):
str(row[7:23])
if ((row[7:23]) == 100):
nextclose = df.iloc[row.Index + 1,:].close
if (row.Index + 7 < len(df)):
nextweekclose = df.iloc[row.Index + 7,:].close
else:
nextweekclose = 0
I would really love some help on this.
Using Jupyter Notebook.
EDIT: FIXED
I have finally succeeded ! As it often seems to be the case with programming (yeah, I'm new here...), the mistakes were because of my inability to think outside the box. I was persuaded a certain part of my code was the problem, when the issues ran deeper than that.
Thanks to BenB and Michael Gardner, I have fixed my code and it is now returning what I wanted. Here it is.
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
#Creating my four new columns. In my first message I thought I needed to fill them up
#with 0s (or NaNs) and then fill them up with their respective content later.
#It is actually much simpler to make the operations right now, keeping in mind
#that I need to reference df['Column Of Interest'] every time.
df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = (((df['Next Close'] - df['Close']) / df['Close']) * 100)
df['Next Week Close'] = df['Close'].shift(-7)
df['Next Week Variation'] = (((df['Next Week Close'] - df['Close']) / df['Close']) * 100)
#The only use of this is for me to have a visual representation of my newly created columns#
print(df)
for row in df.itertuples(index=True):
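    # Note: "100 or -100 in row[7:23]" is parsed as "100 or (-100 in row[7:23])",
    # which is always truthy; "100 in row[7:23] or -100 in row[7:23]" is likely
    # what was intended (the same applies to the second check below).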
if 100 or -100 in row[7:23]:
nextclose = df['Next Close']
if (row.Index + 7 < len(df)) and 100 or -100 in row[7:23]:
nextweekclose = df['Next Week Close']
else:
nextweekclose = 0
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)
df.to_csv('gatherinmahdata3.csv')
If I understand correctly, you should be able to use shift to move the rows by the amount you want and then do your conditional calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Close': np.arange(8)})
df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)
df.head(10)
Close Next Close Next Week Close
0 0 1.0 7.0
1 1 2.0 NaN
2 2 3.0 NaN
3 3 4.0 NaN
4 4 5.0 NaN
5 5 6.0 NaN
6 6 7.0 NaN
7 7 NaN NaN
df['Conditional Calculation'] = np.where(df['Close'].mod(2).eq(0), df['Close'] * df['Next Close'], df['Close'])
df.head(10)
Close Next Close Next Week Close Conditional Calculation
0 0 1.0 7.0 0.0
1 1 2.0 NaN 1.0
2 2 3.0 NaN 6.0
3 3 4.0 NaN 3.0
4 4 5.0 NaN 20.0
5 5 6.0 NaN 5.0
6 6 7.0 NaN 42.0
7 7 NaN NaN 7.0
From your update it becomes clear that the first if statement is meant to check whether the value 100 is in your row. You would do that with:
if 100 in row[7:23]:
This checks whether the integer 100 is in one of the elements of the tuple containing the columns 7 to 23 (23 itself is not included) of the row.
If you look closely at the error messages you get, you see where the problems are:
TypeError: can only concatenate tuple (not "int") to tuple
comes from
nextclose = np.where(row[7:23] == row[7:23]+1)[0]
row is a tuple and slicing it will just give you a shorter tuple to which you are trying to add an integer, as is said in the error message. Maybe have a look at the documentation of numpy.where and see how it works in general, but I think it is not really needed in this case.
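A tiny illustration of what is going on, with made-up values:

row = (0, 3.5, 100, 0)   # an itertuples row behaves like a tuple (made-up values here)
part = row[2:4]          # slicing gives another, shorter tuple: (100, 0)
# part + 1               # TypeError: can only concatenate tuple (not "int") to tuple
part + (1,)              # tuple + tuple is the only "+" defined here: (100, 0, 1)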
This brings us to your second error message:
AttributeError: 'Series' object has no attribute 'close'
This is case sensitive and for me it works if I just capitalize the close to "Close" (same reason why Index has to be capitalized):
nextclose = df.iloc[row.Index + 1,:].Close
You could in principle use the shift method mentioned in the other reply and I would suggest it for easiness, but I want to point out another method, because I think understanding them is important for working with dataframes:
nextclose = df.iloc[row[0]+1]["Close"]
nextclose = df.iloc[row[0]+1].Close
nextclose = df.loc[row.Index + 1, "Close"]
All of them work and there are probably even more possibilities. I can't really tell you which ones are the fastest or whether there are any differences, but they are very commonly used when working with dataframes. Therefore, I would recommend to have a closer look at the documentation of the methods you used and especially what kind of data type they return. Hope that helps understanding the topic a bit more.
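To tie the two answers together, a hedged sketch (the pattern column name here is made up) of doing the whole thing without a loop, combining shift with a vectorized condition:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Close':   [10.0, 11.0, 9.5, 12.0, 12.5, 13.0],
    'CDLDOJI': [0, 100, 0, 0, 100, 0],   # hypothetical pattern column
})

df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = np.where(
    df['CDLDOJI'].eq(100),
    (df['Next Close'] - df['Close']) / df['Close'] * 100,
    np.nan,                              # leave non-pattern rows empty
)
print(df)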
I am given a dataset called stocks_df. Each column has the prices of a different stock for each day. I am trying to normalize it and return it as a matrix, so each column will contain the normalized prices of one stock for each day.
I wrote up this function:
def normalized_prices(stocks_df):
normalized=np.zeros((stocks_df.shape[0],len(stocks_df.columns[1:])))
for i in range(1,len(stocks_df.columns[1:])+1):
for j in range(0,stocks_df.shape[0]+1):
normalized[i,j]=((stocks_df[i][j]/stocks_df[0][i]))
return normalized
And then tried to call the function:
normalized_prices(stocks_df)
But I'm getting this error:
What can be done to fix this?
From your code, it looks like you want to divide everything by the first column, so you can simply do:
import numpy as np
import pandas as pd
np.random.seed(123)
stocks_df = pd.DataFrame(np.random.uniform(0,1,(20,10)))
stocks_df.div(stocks_df[0],axis=0)
0 1 2 3 4 5 6 7 8 9
0 1.0 0.410843 0.325716 0.791585 1.033023 0.607502 1.408195 0.983288 0.690529 0.563008
1 1.0 2.124407 1.277973 0.173898 1.159877 2.150474 0.531770 0.511256 1.548909 1.549713
2 1.0 1.338951 1.141952 0.963150 1.138780 0.509077 0.570284 0.359809 0.462979 0.994601
3 1.0 4.708772 4.677955 5.360028 4.623317 3.390277 4.628973 9.699688 10.250916 5.448532
4 1.0 0.185300 0.508509 0.664836 1.388421 0.401401 0.774152 1.579542 0.832571 0.982277
This gives you every column divided by the first. Now you just need to subset this output:
stocks_df.div(stocks_df[0],axis=0).iloc[:,1:]
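Since the question asks for the result as a matrix, a possible sketch (my addition) wraps the same one-liner into the original function signature and converts the output to a NumPy array:

import numpy as np
import pandas as pd

def normalized_prices(stocks_df):
    # divide every column by the first (reference) column, drop that column,
    # and return the result as a NumPy array
    return stocks_df.div(stocks_df.iloc[:, 0], axis=0).iloc[:, 1:].to_numpy()

np.random.seed(123)
stocks_df = pd.DataFrame(np.random.uniform(0, 1, (20, 10)))
print(normalized_prices(stocks_df).shape)   # (20, 9)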
I have a series of dataframes containing daily rainfall totals (continuous data) and whether or not a flood occurs (binary data, i.e. 1 or 0). Each data frame represents a year (e.g. df01, df02, df03, etc.), which looks like this:
date ppt fld
01/02/2011 1.5 0
02/02/2011 0.0 0
03/02/2011 2.7 0
04/02/2011 4.6 0
05/02/2011 15.5 1
06/02/2011 1.5 0
...
I wish to perform logistic regression on each year of data, but the data is heavily imbalanced due to the very small number of flood events relative to the number of rainfall events. As such, I wish to upsample just the minority class (values of 1 in 'fld'). So far I know to split each dataframe into two according to the 'fld' value, upsample the resulting '1' dataframe, and then remerge into one dataframe.
# So if I apply it to one dataframe it looks like this:
import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
mask = df01.fld == 0
fld_0 = df01[mask]
fld_1 = df01[~mask]

# Upsample minority class
fld_1_upsampled = resample(fld_1,
                           replace=True,      # sample with replacement
                           n_samples=247,     # to match majority class
                           random_state=123)  # reproducible results

# Combine majority class with upsampled minority class
df01_upsampled = pd.concat([fld_0, fld_1_upsampled])
As I have 17 dataframes, it is inefficient to go dataframe-by-dataframe. Are there any thoughts on how I could be more efficient with this? So far I have tried this (it is probably evident I have no idea what I am doing with loops of this kind, I am quite new to Python):
df_all = [df01, df02, df03, df04,
df05, df06, df07, df08,
df09, df10, df11, df12,
df13, df14, df15, df16, df17]
# This is my list of annual data
for i in df_all:
fld_0 = i[mask]
fld_1 = i[~mask]
fld_1_upsampled = resample(fld_1,
replace=True, # sample with replacement
n_samples=len(fld_0), # to match majority class
random_state=123) # reproducible results
i_upsampled = pd.concat([fld_0, fld_1_upsampled])
return i_upsampled
Which returns the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-6fd782d4c469> in <module>()
11 replace=True, # sample with replacement
12 n_samples=247, # to match majority class
---> 13 random_state=123) # reproducible results
14 i_upsampled = pd.concat([fld_0, fld_1_upsampled])
15 return i_upsampled
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in resample(*arrays, **options)
259
260 if replace:
--> 261 indices = random_state.randint(0, n_samples, size=(max_n_samples,))
262 else:
263 indices = np.arange(n_samples)
mtrand.pyx in mtrand.RandomState.randint()
ValueError: low >= high
Any advice or comments greatly appreciated :)
UPDATE: one reply suggested that some of my dataframes may not contain any samples from the minority class. This was correct, so I have removed them, but the same error arises.
Giving you the benefit of the doubt that you're using the same mask syntax in your second code block as in your first, it looks like you may not have any samples to pass in to your resample in one or more of your DFs:
df=pd.DataFrame({'date':[1,2,3,4,5,6],'ppt':[1.5,0,2.7,4.6,15.5,1.5],'fld':[0,1,0,0,1,1]})
date ppt fld
1 1.5 0
2 0.0 1
3 2.7 0
4 4.6 0
5 15.5 1
6 1.5 1
resample(df[df.fld==1], replace=True, n_samples=3, random_state=123)
date ppt fld
6 1.5 1
5 15.5 1
6 1.5 1
resample(df[df.fld==2], replace=True, n_samples=3, random_state=123)
"...ValueError: low >= high"