grouping values in pandas column - python

I have a pandas dataframe that contains scores such as:
score
0.1
0.15
0.2
0.3
0.35
0.4
0.5
etc
I want to group these values into buckets of width 0.2,
so if the score is between 0.1 and 0.2, the value for this row in score will be 0.2;
if the score is above 0.2 and up to 0.4, the value for score will be 0.4.
So, for example, if the max score is 1, I will have 5 buckets of score: 0.2, 0.4, 0.6, 0.8, 1.
desired output:
score
0.2
0.2
0.2
0.4
0.4
0.4
0.6

You can first define a function that does the rounding for you:
import numpy as np
def custom_round(x, base):
    # Round x up to the nearest multiple of base.
    return base * np.ceil(x / base)
Then use .apply() to apply the function to your column:
df.score.apply(lambda x: custom_round(x, base=.2))
Output:
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
Name: score, dtype: float64
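One caveat, sketched here on the assumption that the exact bucket values matter later (for example, if you compare or group on them): multiplying back by base can leave floating-point noise, so 3 * 0.2 is stored as 0.6000000000000001 even though pandas prints 0.6. Rounding the result to the precision of base is one possible tidy-up. The names raw and tidy are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5]})
base = 0.2

# Same ceil-based bucketing as above, applied to the whole column at once.
raw = base * np.ceil(df["score"] / base)

# The last bucket is 3 * 0.2, which is not exactly 0.6 in binary floating point.
print(raw.iloc[6] == 0.6)   # False

# Rounding to one decimal place (the precision of base) removes the noise.
tidy = raw.round(1)
print(tidy.tolist())        # [0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.6]
```

If base had more decimal places, the argument to round would need to change accordingly.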

Try np.ceil:
import pandas as pd
import numpy as np
data = {'score': {0: 0.1, 1: 0.15, 2: 0.2, 3: 0.3, 4: 0.35, 5: 0.4, 6: 0.5}}
df = pd.DataFrame(data)
base = 0.2
df['score'] = base * np.ceil(df.score/base)
print(df)
score
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
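As an alternative to the ceil arithmetic, pd.cut can assign each score to an explicit bin. This is only a sketch: it assumes the scores lie in (0, 1], and the column name bucket is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5]})

# Bin edges 0, 0.2, ..., 1.0; linspace sidesteps the off-by-one risk of a float step in arange.
edges = np.linspace(0, 1, 6)
labels = [0.2, 0.4, 0.6, 0.8, 1.0]   # each bucket is named after its upper edge

# right=True makes the intervals closed on the right, so 0.2 falls in (0, 0.2].
df["bucket"] = pd.cut(df["score"], bins=edges, labels=labels, right=True).astype(float)
print(df["bucket"].tolist())   # [0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.6]
```

With these settings a score of exactly 0 lies outside the first interval and becomes NaN; passing include_lowest=True to pd.cut would put it in the first bucket.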

Related

Creating a range of numbers

I'm trying to create a range of numbers in this dataframe:
pd.DataFrame({'Ranges': 1-np.arange(0, 1 , 0.1)
})
But the output expected is (in the same column):
0.9 - 1
0.8 - 0.9
0.7 - 0.8
0.6 - 0.7
0.5 - 0.6
0.4 - 0.5
0.3 - 0.4
0.2 - 0.3
0.1 - 0.2
0 - 0.1
I have tried several existing solutions, but none of them gets me any closer. Any suggestions?
PS: The specific format of the numbers doesn't matter (it could be 1.0 or 1, 0.5 or .5, for example).
As far as I understand, you need to build interval labels such as "0.9 - 1". Here's my suggestion:
# np.arange(0, 10) covers all ten intervals; starting at 1 would skip "0.9 - 1.0".
pd.DataFrame(
    {'Ranges': [str(x/10) + ' - ' + str(y/10) for x, y in zip(9 - np.arange(0, 10), 10 - np.arange(0, 10))]
})
You can use string concatenation on the shifted Series:
df = pd.DataFrame({'Ranges': 1-np.arange(0, 1 , 0.1)})
s = df['Ranges'].round(2).astype(str)
out = s.shift(-1, fill_value='0.0') + ' - ' + s
output:
0 0.9 - 1.0
1 0.8 - 0.9
2 0.7 - 0.8
3 0.6 - 0.7
4 0.5 - 0.6
5 0.4 - 0.5
6 0.3 - 0.4
7 0.2 - 0.3
8 0.1 - 0.2
9 0.0 - 0.1
Name: Ranges, dtype: object
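If the labels should look exactly like the ones in the question ("0.9 - 1" rather than "0.9 - 1.0"), one option is to build them from a list of bin edges and format each number with the g format specifier. This is a sketch only; edges and labels are illustrative names.

```python
import numpy as np
import pandas as pd

# 11 edges: 0.0, 0.1, ..., 1.0 (rounded to strip floating-point noise).
edges = np.linspace(0, 1, 11).round(1)

# Pair each lower edge with the next one, then reverse so "0.9 - 1" comes first.
labels = [f"{lo:g} - {hi:g}" for lo, hi in zip(edges[:-1], edges[1:])][::-1]

df = pd.DataFrame({"Ranges": labels})
print(df)
```

The g specifier drops trailing zeros, which is why 1.0 is printed as 1 and 0.0 as 0.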

How to apply cosine_similarity to my dataset?

I have a dataset:
0 1 2 3 ... n
0 0.7 0.1 0.55 0.8 ...0.7
1 0.4 0.8 0.5 0.1 ...0.1
...........................
n-2 0.1 0.1 0.5 0.5 ...0.2
n-1 0.2 0.2 0.2 0.1 ...0.4
n 0.1 0.1 0.1 0.4 ...0.7
It has shape n x n.
I want to apply sklearn's cosine_similarity (from sklearn.metrics.pairwise import cosine_similarity),
but when I call cosine_similarity(df, dense_output=False) I get the error:
Input contains NaN, infinity or a value too large for dtype('float64')
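The error message suggests that at least one cell is NaN or infinite. A possible first step, sketched below on a small made-up frame, is to locate those cells and decide how to handle them before calling cosine_similarity. Filling with 0 is just one assumption here; dropping the affected rows or imputing values may suit the data better.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with one NaN and one inf to reproduce the situation.
df = pd.DataFrame([[0.7, 0.1, np.nan],
                   [0.4, np.inf, 0.5],
                   [0.1, 0.1, 0.5]])

# Boolean mask of cells that are NaN or +/- infinity.
bad = ~np.isfinite(df.to_numpy(dtype=float))
print("non-finite cells:", int(bad.sum()))
print(df[bad.any(axis=1)])   # rows containing at least one bad cell

# One option (an assumption, not the only choice): treat inf as missing and fill with 0.
clean = df.replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(np.isfinite(clean.to_numpy()).all())
```

Once every cell is finite, the cleaned frame can be passed to cosine_similarity in place of df.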

reading data-frame with missing values

I am trying to read a file with a few columns and a few rows, where data are missing in some rows.
For example, the data look like this; the elements are sometimes separated by an uneven number of spaces:
0.5 0.03
0.1 0.2 0.3 2
0.2 0.1 0.1 0.3
0.5 0.03
0.1 0.2 0.3 2
Is there any way to extract this:
0.1 0.2 0.3 2
0.2 0.1 0.1 0.3
0.1 0.2 0.3 2
Any suggestions.
Thanks.
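You can parse your file manually:
import re
import pandas as pd
with open('data.txt') as fp:
    # Split each line on runs of whitespace; short rows are padded with None and then dropped.
    df = pd.DataFrame([re.split(r'\s+', l.strip()) for l in fp]).dropna(axis=0)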
Output:
>>> df
0 1 2 3
1 0.1 0.2 0.3 2
2 0.2 0.1 0.1 0.3
4 0.1 0.2 0.3 2
You can try this:
import pandas as pd
import numpy as np
df = {
    'col1': [0.5, 0.1, 0.2, 0.5, 0.1],
    'col2': [0.03, 0.2, 0.1, 0.03, 0.2],
    'col3': [np.nan, 0.3, 0.1, np.nan, 0.3],
    'col4': [np.nan, 2, 0.3, np.nan, 2]
}
data = pd.DataFrame(df)
print(data.dropna(axis=0))
Output:
col1 col2 col3 col4
1 0.1 0.2 0.3 2.0
2 0.2 0.1 0.1 0.3
4 0.1 0.2 0.3 2.0
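Another possibility is to let pandas do the parsing. The sketch below assumes the file has at most four columns; the column names a to d and the in-memory string standing in for the file are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the file on disk; real code would pass a path instead.
raw = """0.5 0.03
0.1  0.2   0.3 2
0.2 0.1 0.1    0.3
0.5   0.03
0.1 0.2 0.3  2
"""

# sep=r"\s+" treats any run of whitespace as one separator.
# names= fixes the expected number of columns, so short rows are padded with NaN.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, names=["a", "b", "c", "d"])

# Keep only the rows where every column is present.
complete = df.dropna()
print(complete)
```

Unlike the manual re.split approach, this yields numeric columns rather than strings.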

How to create modified dataframe based on list values?

Consider a dataframe df of the following structure:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2
B Y 10 0.2 0.7 0.8
...
I would like to create duplicates for each row in this dataframe (specific to the Name and Slide) for the following combinations of Height and Weight shown by this list:-
list_combinations = [[3,0.1],[10,0.2],[5,1.3]]
The desired output:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2 #original
A X 10 0.2 0.5 0.2 # modified duplicate
A X 5 1.3 0.5 0.2 # modified duplicate
B Y 10 0.2 0.7 0.8 #original
B Y 3 0.1 0.7 0.8 # modified duplicate
B Y 5 1.3 0.7 0.8 # modified duplicate
etc. ...
Any suggestions and help would be much appreciated.
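We can use merge with how='cross' (available in pandas 1.2 and later):
out = (
    pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
    .merge(df, how='cross', suffixes=('', '_'))  # pair every combination with every row of df
    .reindex(columns=df.columns)                 # keep the new Height/Weight, drop the suffixed originals
    .sort_values('Name')
)
Output: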
Name Slide Height Weight Status General
0 A X 3 0.1 0.5 0.2
2 A X 10 0.2 0.5 0.2
4 A X 5 1.3 0.5 0.2
1 B Y 3 0.1 0.7 0.8
3 B Y 10 0.2 0.7 0.8
5 B Y 5 1.3 0.7 0.8
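The desired output in the question lists each original row before its modified duplicates. A self-contained sketch of one way to get that ordering follows; the is_original column is a made-up helper, and the frame is rebuilt from the two rows shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B"], "Slide": ["X", "Y"],
    "Height": [3, 10], "Weight": [0.1, 0.2],
    "Status": [0.5, 0.7], "General": [0.2, 0.8],
})
list_combinations = [[3, 0.1], [10, 0.2], [5, 1.3]]

combos = pd.DataFrame(list_combinations, columns=["Height", "Weight"])

# Cross join: the suffixed columns Height_ and Weight_ hold each row's original values.
crossed = combos.merge(df, how="cross", suffixes=("", "_"))

# Flag the combination that matches the row's own Height/Weight.
crossed["is_original"] = (
    (crossed["Height"] == crossed["Height_"]) & (crossed["Weight"] == crossed["Weight_"])
)

# Sort by Name, with the original row first within each Name.
out = (
    crossed.sort_values(["Name", "is_original"], ascending=[True, False])
    .reindex(columns=list(df.columns) + ["is_original"])
    .reset_index(drop=True)
)
print(out)
```

Dropping is_original at the end gives a frame with the same columns as df.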

Rounding numbers in Pandas

How do you round a column to 1 decimal place?
df value
0 0.345
1 0.45
2 0.95
Expected Output
0.3
0.5
1.0
All the below give the wrong answers:
df.value.round(1)
df.value.apply(lambda x:round(x,1))
As far as floating-point operations are concerned, Python behaves like many popular languages, including C and Java. Many numbers that are easy to write in decimal notation cannot be represented exactly in binary floating point. The value actually stored for 0.95 is approximately 0.94999999999999996, which is slightly below the halfway point, so it rounds down to 0.9.
To check:
from decimal import Decimal
Decimal(0.95)
Output
Decimal('0.9499999999999999555910790149937383830547332763671875')
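Here's a useful "normal" rounding function that takes a number n and returns it rounded to the specified number of decimal places, with halves rounded away from zero:
import math
def normal_round(n, decimal):
    # Work on the magnitude so that negative numbers round symmetrically.
    exp = abs(n) * 10 ** decimal
    if exp - math.floor(exp) < 0.5:
        rounded = math.floor(exp)
    else:
        rounded = math.ceil(exp)
    # Restore the sign of the input.
    return math.copysign(rounded / 10 ** decimal, n)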
Original df
df = pd.DataFrame({'value': [0.345, 0.45, 0.95]})
value
0 0.345
1 0.450
2 0.950
code
df['value'] = df['value'].apply(lambda x: normal_round(x, 1))
Output df
value
0 0.3
1 0.5
2 1.0
More examples on rounding floating point:
df = pd.DataFrame({'value': [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]})
df['value_round'] = df['value'].apply(lambda x: round(x, 1))
df['value_normal_round'] = df['value'].apply(lambda x: normal_round(x, 1))
Output
value value_round value_normal_round
0 0.15 0.1 0.2
1 0.25 0.2 0.3
2 0.35 0.3 0.4
3 0.45 0.5 0.5
4 0.55 0.6 0.6
5 0.65 0.7 0.7
6 0.75 0.8 0.8
7 0.85 0.8 0.9
8 0.95 0.9 1.0
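Another way to get half-up rounding is the standard-library decimal module. The sketch below assumes the values should be rounded as they are written (0.95 meaning exactly 0.95), which is why each float goes through str first. The name round_half_up is made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

import pandas as pd

def round_half_up(x, places=1):
    # str() gives the short decimal form ("0.95"), not the binary expansion.
    quantum = Decimal(10) ** -places          # Decimal('0.1') for places=1
    return float(Decimal(str(float(x))).quantize(quantum, rounding=ROUND_HALF_UP))

df = pd.DataFrame({"value": [0.345, 0.45, 0.95]})
df["value_half_up"] = df["value"].apply(round_half_up)
print(df["value_half_up"].tolist())   # [0.3, 0.5, 1.0]
```

This is slower than a vectorised round, so it is probably best kept for small columns or cases where the tie-breaking rule matters.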
