Bin values into groups - python

The relevant data in my dataframe looks as follows:
Datapoint  Values
1          0.2
2          0.8
3          0.4
4          0.1
5          1.0
6          0.6
7          0.7
8          0.2
9          0.5
10         0.1
I am hoping to group the numbers in the Values column into three categories: less than 0.25 as 'low', between 0.25 and 0.75 as 'middle', and greater than 0.75 as 'high'.
I want to create a new column which returns 'low', 'middle' or 'high' for each row, based on the data in the Values column.
What I have tried:
def categorize_values("Values"):
    if "Values" > 0.75:
        return 'high'
    elif 'Values' < 0.25:
        return 'low'
    else:
        return 'middle'
However, this is returning an error for me.

If you're using a dataframe, pandas has a built-in function for this, pd.cut():
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO('''Datapoint Values
1 0.2
2 0.8
3 0.4
4 0.1
5 1.0
6 0.6
7 0.7
8 0.2
9 0.5
10 0.1'''), sep=r'\s+')  # the sample columns are whitespace-separated
df['category'] = pd.cut(df['Values'], [0, 0.25, 0.75, df['Values'].max()], labels=['low', 'middle', 'high'])
#output
>>> df
Datapoint Values category
0 1 0.2 low
1 2 0.8 high
2 3 0.4 middle
3 4 0.1 low
4 5 1.0 high
5 6 0.6 middle
6 7 0.7 middle
7 8 0.2 low
8 9 0.5 middle
9 10 0.1 low
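Note that pd.cut's bins are right-inclusive by default, so a value of exactly 0.25 would fall into 'low'. If you would rather not depend on the column maximum for the upper edge, here is a small variant (a sketch, not part of the original answer) using open-ended bins:
import numpy as np

# Open-ended bins: anything above 0.75 is 'high', no matter how large
df['category'] = pd.cut(df['Values'], [-np.inf, 0.25, 0.75, np.inf],
                        labels=['low', 'middle', 'high'])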

First of all, you cannot put a string literal in your function's parameter list.
You need to fix your function first, like this:
def categorize_values(Values):
    if Values > 0.75:
        return 'high'
    elif Values < 0.25:
        return 'low'
    else:
        return 'middle'
and then you can apply that function to your 'Values' column as below.
df['Category'] = df['Values'].apply(categorize_values)
df.head()
It will generate this DataFrame:
           Values Category
DataPoint
1            0.22      low
2            0.32   middle
3            0.55   middle
4            0.75   middle
5            0.12      low

You should remove the quotes ('') around Values.
That would look like this:
def categorize_values(Values):
    if Values > 0.75:
        return 'high'
    elif Values < 0.25:
        return 'low'
    else:
        return 'middle'
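If you want to avoid a row-by-row apply, the same categorization can also be vectorized with np.select (a sketch, not from the original answers):
import numpy as np

# Evaluate both conditions over the whole column at once
conditions = [df['Values'] > 0.75, df['Values'] < 0.25]
df['category'] = np.select(conditions, ['high', 'low'], default='middle')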

Related

How to create modified dataframe based on list values?

Consider a dataframe df of the following structure:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2
B Y 10 0.2 0.7 0.8
...
I would like to create duplicates for each row in this dataframe (specific to the Name and Slide) for the following combinations of Height and Weight shown by this list:-
list_combinations = [[3,0.1],[10,0.2],[5,1.3]]
The desired output:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2 #original
A X 10 0.2 0.5 0.2 # modified duplicate
A X 5 1.3 0.5 0.2 # modified duplicate
B Y 10 0.2 0.7 0.8 #original
B Y 3 0.1 0.7 0.8 # modified duplicate
B Y 5 1.3 0.7 0.8 # modified duplicate
etc. ...
Any suggestions and help would be much appreciated.
We can do a merge with how='cross':
out = (pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
         .merge(df, how='cross', suffixes=('', '_'))  # pair every combination with every row of df
         .reindex(columns=df.columns)                 # keep the new Height/Weight, drop the suffixed originals
         .sort_values('Name'))
Name Slide Height Weight Status General
0 A X 3 0.1 0.5 0.2
2 A X 10 0.2 0.5 0.2
4 A X 5 1.3 0.5 0.2
1 B Y 3 0.1 0.7 0.8
3 B Y 10 0.2 0.7 0.8
5 B Y 5 1.3 0.7 0.8
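For reference, a minimal setup that makes the snippet above runnable (assuming just the two rows shown in the question; merge(how='cross') requires pandas >= 1.2):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Slide': ['X', 'Y'],
                   'Height': [3, 10], 'Weight': [0.1, 0.2],
                   'Status': [0.5, 0.7], 'General': [0.2, 0.8]})
list_combinations = [[3, 0.1], [10, 0.2], [5, 1.3]]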

Summing subsets of many dataframes

I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df1.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file begins, until all files have been processed.
As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite being easy to code, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays would be orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes? For example, substituting loc with a faster slicing method.
Example:
import os
import pandas as pd

def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list

window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index in range(len(df1)):  # iterate over row positions (iterrows() would yield tuples)
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head(), then run a groupby aggregation after the individual pieces are concatenated.
import os
import pandas as pd

window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]   # ROW AND COLUMN SLICE
                   .head(window_size)                # SELECT FIRST WINDOW ROWS
                   .reset_index(drop=True)           # RESET INDEX TO 0, 1, 2, ...
                   ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                  # COMBINE WINDOW DFS
                .groupby(level=0).sum())             # AGGREGATE BY INDEX
    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(file) for file in os.listdir('.')])

# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, below are the outputs of the individual window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
With the final df_sum (note that, unlike the expected output above, this version also counts the trailing partial windows):
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
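As the question anticipates, a NumPy-based version is much faster. Below is a minimal per-file sketch (an addition, not from the original answer) using sliding_window_view, available in NumPy >= 1.20; it counts only full windows, which reproduces the expected output above:
import numpy as np
import pandas as pd

def window_sum(df1, window_size=3):
    arr = df1[['A', 'B', 'C']].to_numpy()
    # Shape (n_windows, n_cols, window_size): a zero-copy view of all full windows
    windows = np.lib.stride_tricks.sliding_window_view(arr, window_size, axis=0)
    # Sum across windows, then transpose to (position in window, column)
    return pd.DataFrame(windows.sum(axis=0).T, columns=['A', 'B', 'C'])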

Algorithm about sum of powers

I'm working on the question shown below, part 2. However when I implement it in python, it fails with "RecursionError: maximum recursion depth exceeded".
Here's my algorithm:
import math

def sumofpowers2(x):
    count = 1
    if math.isclose(x ** count, 0, rel_tol=0.001):
        return 0
    count += 1
    return 1 + x * sumofpowers2(x)

print(sumofpowers2(0.8))
In a nutshell, sumofpowers2(x) calls itself with the same argument, resulting in infinite recursion (unless the if condition is true right from the start, it will never be true).
Every time sumofpowers2() calls itself, a new variable called count gets created and set to 1. To make this code work, you need to figure out a way to carry the value of count across calls.
First, please learn basic debugging: add a simple print to track your values just before you depend on them:
def sumofpowers2(x):
    count = 1
    print(x, count, x**count)
    if math.isclose(x ** count, 0, rel_tol=0.001):
        ...
Output:
(0.8, 1, 0.8)
(0.8, 1, 0.8)
(0.8, 1, 0.8)
...
This points up the critical problem: you reset count to 1 every time you enter the function. The simple fix is to hoist the initialization outside the function:
count = 1

def sumofpowers2(x):
    global count
    print(x, count, x**count)
    # compare to zero with abs_tol: rel_tol alone can never match 0
    if math.isclose(x ** count, 0, abs_tol=0.001):
        return 0
    count += 1
    return 1 + x * sumofpowers2(x)
Output:
0.8 1 0.8
0.8 2 0.6400000000000001
0.8 3 0.5120000000000001
0.8 4 0.4096000000000001
0.8 5 0.3276800000000001
0.8 6 0.2621440000000001
0.8 7 0.20971520000000007
0.8 8 0.1677721600000001
0.8 9 0.13421772800000006
0.8 10 0.10737418240000006
0.8 11 0.08589934592000005
0.8 12 0.06871947673600004
0.8 13 0.054975581388800036
0.8 14 0.043980465111040035
0.8 15 0.03518437208883203
0.8 16 0.028147497671065624
0.8 17 0.022517998136852502
0.8 18 0.018014398509482003
0.8 19 0.014411518807585602
0.8 20 0.011529215046068483
0.8 21 0.009223372036854787
0.8 22 0.00737869762948383
0.8 23 0.005902958103587064
0.8 24 0.004722366482869652
0.8 25 0.0037778931862957215
0.8 26 0.0030223145490365774
0.8 27 0.002417851639229262
0.8 28 0.0019342813113834097
0.8 29 0.0015474250491067279
0.8 30 0.0012379400392853823
0.8 31 0.0009903520314283058
4.993810299803575
Better yet, make count an added parameter to your function:
def sumofpowers2(x, count):
    print(x, count, x**count)
    if math.isclose(x ** count, 0, abs_tol=0.001):
        return 0
    return 1 + x * sumofpowers2(x, count + 1)
Note that the cascaded arithmetic only approximates the value you might expect: the result 4.9938 approaches, but never reaches, the infinite-series limit 1/(1 - 0.8) = 5.
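Putting it together, a self-contained sketch with a default argument so the original call site still works (assuming the same 0.001 cutoff as in the question):
import math

def sumofpowers2(x, count=1):
    # Stop once x**count is within 0.001 of zero (abs_tol, since the target is 0)
    if math.isclose(x ** count, 0, abs_tol=0.001):
        return 0
    # 1 + x * (1 + x * (...)) expands to 1 + x + x**2 + ...
    return 1 + x * sumofpowers2(x, count + 1)

print(sumofpowers2(0.8))  # ~4.9938, close to the limit 1 / (1 - 0.8) = 5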

Pandas Timeseries Data - Calculating product over intervals of varying length

I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store the value in the row where the streak started. That is, what I want is a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.31
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need this "turn" at all). I am pretty sure I could write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange', but given that I plan on doing this on a fairly large data set (think minute or hourly data for a couple of years), I imagine nested for-loops would be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
import pandas as pd

df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
                   'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(1)
df.loc[df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn'] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange'] + 1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)        # True only on the first row of each streak
df.loc[~mask, 'TotalPriceChange'] = 0   # zero out the non-starting rows
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100
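As an aside, the turn/cumsum bookkeeping can also be derived straight from the sign of each change (a sketch, equivalent here because no change is exactly zero):
import numpy as np

# Start a new group whenever the sign of the change flips
sign = np.sign(df['PercPriceChange'])
df['group'] = (sign != sign.shift()).cumsum()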

Calculating a new column in pandas

I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula for records for winning candidates and copies a value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
           'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
                                      if x == 'w' else va2['pct_vote'])
This gave me a new column in which every cell contained the calculation for the entire pct_vote column (a whole Series per row) instead of a single value.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np

data = {
    'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
    'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
    'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
    df['winner'] == 'w',
    df['pct_vote'] - 0.5,  # if condition is true, use this value
    df['pct_vote']         # if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
This is because you are operating on a single element x against the entire Series va2['pct_vote']. What you need is an element-wise operation over va2['winner'] and va2['pct_vote'] together. You could use apply with axis=1 to achieve that.
Consider a as winner and b as pct_vote:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df
Out[23]:
   a  b  c
0  1  2  3
1  4  5  6
df['new'] = df[['a', 'b']].apply(lambda x: (-0.5) + x['b'] if x['a'] == 1 else x['b'], axis=1)
df
Out[42]:
   a  b  c  new
0  1  2  3  1.5
1  4  5  6  5.0
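Applied back to the question's frame, the same pattern would look something like this (a sketch, assuming the winner and pct_vote column names from the question):
va2['vote_waste'] = va2.apply(
    lambda row: row['pct_vote'] - 0.5 if row['winner'] == 'w' else row['pct_vote'],
    axis=1)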
