reading data-frame with missing values - python

I am trying to read a data frame with a few columns and a few rows, where data is missing in some rows.
For example, the data looks like this; the elements are also sometimes separated by an uneven number of spaces:
0.5 0.03
0.1 0.2 0.3 2
0.2 0.1 0.1 0.3
0.5 0.03
0.1 0.2 0.3 2
Is there any way to extract this:
0.1 0.2 0.3 2
0.2 0.1 0.1 0.3
0.1 0.2 0.3 2
Any suggestions?
Thanks.

You can parse your file manually:
import re
import pandas as pd

with open('data.txt') as fp:
    # short rows are padded with None, so dropna removes them; values stay strings
    df = pd.DataFrame([re.split(r'\s+', l.strip()) for l in fp]).dropna(axis=0)
Output:
>>> df
0 1 2 3
1 0.1 0.2 0.3 2
2 0.2 0.1 0.1 0.3
4 0.1 0.2 0.3 2
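Alternatively, a minimal sketch using read_csv (reusing the data.txt name from the answer above and assuming at most four columns): giving explicit column names makes pandas pad the short rows with NaN, so dropna keeps only the complete rows.
import pandas as pd

# short rows are padded with NaN thanks to the explicit names, then dropped
df = pd.read_csv('data.txt', sep=r'\s+', header=None, names=[0, 1, 2, 3]).dropna()
print(df)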

You can try this:
import pandas as pd
import numpy as np
df = {
    'col1': [0.5, 0.1, 0.2, 0.5, 0.1],
    'col2': [0.03, 0.2, 0.1, 0.03, 0.2],
    'col3': [np.nan, 0.3, 0.1, np.nan, 0.3],
    'col4': [np.nan, 2, 0.3, np.nan, 2]
}
data = pd.DataFrame(df)
print(data.dropna(axis=0))
Output:
   col1  col2  col3  col4
1   0.1   0.2   0.3   2.0
2   0.2   0.1   0.1   0.3
4   0.1   0.2   0.3   2.0

Related

grouping values in pandas column

I have a pandas dataframe that contains scores such as
score
0.1
0.15
0.2
0.3
0.35
0.4
0.5
etc
I want to group these values into groups of 0.2,
so if the score is between 0 and 0.2, the value for this row in score will be 0.2;
if the score is between 0.2 and 0.4, then the value for score will be 0.4.
So, for example, if the max score is 1, I will have 5 buckets of score: 0.2, 0.4, 0.6, 0.8, 1.
desired output:
score
0.2
0.2
0.2
0.4
0.4
0.4
0.6
You can first define a function that does the rounding for you:
import numpy as np

def custom_round(x, base):
    return base * np.ceil(x / base)
Then use .apply() to apply the function to your column:
df.score.apply(lambda x: custom_round(x, base=.2))
Output:
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
Name: score, dtype: float64
Try np.ceil:
import pandas as pd
import numpy as np
data = {'score': {0: 0.1, 1: 0.15, 2: 0.2, 3: 0.3, 4: 0.35, 5: 0.4, 6: 0.5}}
df = pd.DataFrame(data)
base = 0.2
df['score'] = base * np.ceil(df.score/base)
print(df)
score
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
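If you prefer working with explicit bin edges, here is a hedged alternative sketch using pd.cut (assuming the scores fall in (0, 1] and each bucket should be labelled by its right edge):
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5]})
bins = np.arange(0, 1.2, 0.2)    # edges 0.0, 0.2, ..., 1.0
df['score'] = pd.cut(df['score'], bins=bins, labels=bins[1:]).astype(float)
print(df)                        # 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.6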

Creating a range of numbers

I'm trying to create a range of numbers in this dataframe:
pd.DataFrame({'Ranges': 1 - np.arange(0, 1, 0.1)})
But the output expected is (in the same column):
0.9 - 1
0.8 - 0.9
0.7 - 0.8
0.6 - 0.7
0.5 - 0.6
0.4 - 0.5
0.3 - 0.4
0.2 - 0.3
0.1 - 0.2
0 - 0.1
I have tried these 1, 2, 3 solutions, but none of them gets me any nearer to a solution. Any suggestions?
PS: The specific format of the numbers doesn't matter (it could be 1.0 or 1, 0.5 or .5, for example).
As far as I understand, you need to make intervals such as "0.9 - 1"; here's my suggestion:
pd.DataFrame({
    'Ranges': [str(x/10) + ' - ' + str(y/10)
               for x, y in zip(10 - np.arange(1, 11), 11 - np.arange(1, 11))]
})
This produces the ten strings from '0.9 - 1.0' down to '0.0 - 0.1'.
You can use string concatenation on the shifted Series:
df = pd.DataFrame({'Ranges': 1-np.arange(0, 1 , 0.1)})
s = df['Ranges'].round(2).astype(str)
out = s.shift(-1, fill_value='0.0') + ' - ' + s
Output:
0 0.9 - 1.0
1 0.8 - 0.9
2 0.7 - 0.8
3 0.6 - 0.7
4 0.5 - 0.6
5 0.4 - 0.5
6 0.3 - 0.4
7 0.2 - 0.3
8 0.1 - 0.2
9 0.0 - 0.1
Name: Ranges, dtype: object
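If actual interval objects are acceptable instead of formatted strings, a small sketch with pd.interval_range (note it yields pandas Interval values such as (0.9, 1.0], not the exact '0.9 - 1' text):
import pandas as pd

intervals = pd.interval_range(start=0, end=1, freq=0.1)   # (0.0, 0.1], ..., (0.9, 1.0]
df = pd.DataFrame({'Ranges': intervals[::-1]})            # reversed so (0.9, 1.0] comes first
print(df)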

How to apply cosine_similarity to my dataset?

I have a dataset:
       0    1    2     3  ...    n
0    0.7  0.1  0.55  0.8  ...  0.7
1    0.4  0.8  0.5   0.1  ...  0.1
...
n-2  0.1  0.1  0.5   0.5  ...  0.2
n-1  0.2  0.2  0.2   0.1  ...  0.4
n    0.1  0.1  0.1   0.4  ...  0.7
It has shape n×n.
I want to apply sklearn's cosine_similarity: from sklearn.metrics.pairwise import cosine_similarity
but when I do cosine_similarity(df, dense_output=False) I get the error:
Input contains NaN, infinity or a value too large for dtype('float64')
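The error means the frame still contains NaN (or infinite) entries, so they need to be filled or dropped before the call. A minimal sketch, assuming that replacing missing values with 0 is acceptable for your data (the small frame below is made up for illustration):
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame([[0.7, 0.1, np.nan],
                   [0.4, 0.8, 0.5],
                   [0.1, 0.1, 0.5]])

# turn infinities into NaN, then fill (or use df.dropna() to discard those rows)
clean = df.replace([np.inf, -np.inf], np.nan).fillna(0)
sim = cosine_similarity(clean)   # dense (3, 3) similarity matrix
print(sim)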

Rounding numbers in Pandas

How do you round a column to 1dp?
df:
   value
0 0.345
1 0.45
2 0.95
Expected Output
0.3
0.5
1.0
Both of the below give the wrong answer:
df.value.round(1)
df.value.apply(lambda x:round(x,1))
As far as floating-point operations are concerned, Python behaves like many popular languages, including C and Java. Many numbers that can be written easily in decimal notation cannot be expressed exactly in binary floating point; the value actually stored for 0.95 is closer to 0.94999999999999996.
To check:
from decimal import Decimal
Decimal(0.95)
Output
Decimal('0.9499999999999999555910790149937383830547332763671875')
Here's a useful "normal" rounding function which takes a number n and returns n rounded to the specified number of decimal places:
import math

def normal_round(n, decimal):
    exp = n * 10 ** decimal
    if abs(exp) - abs(math.floor(exp)) < 0.5:
        return math.floor(exp) / 10 ** decimal
    return math.ceil(exp) / 10 ** decimal
Original df
df = pd.DataFrame({'value': [0.345, 0.45, 0.95]})
value
0 0.345
1 0.450
2 0.950
code
df['value'] = df['value'].apply(lambda x: normal_round(x, 1))
Output df
value
0 0.3
1 0.5
2 1.0
More examples of rounding floating-point values:
df = pd.DataFrame({'value': [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]})
df['value_round'] = df['value'].apply(lambda x: round(x, 1))
df['value_normal_round'] = df['value'].apply(lambda x: normal_round(x, 1))
Output
value value_round value_normal_round
0 0.15 0.1 0.2
1 0.25 0.2 0.3
2 0.35 0.3 0.4
3 0.45 0.5 0.5
4 0.55 0.6 0.6
5 0.65 0.7 0.7
6 0.75 0.8 0.8
7 0.85 0.8 0.9
8 0.95 0.9 1.0
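Another option, sketched here as an alternative, is the standard-library decimal module: rounding the decimal text of the number with ROUND_HALF_UP sidesteps the binary representation issue (the half_up helper name is just illustrative):
from decimal import Decimal, ROUND_HALF_UP
import pandas as pd

df = pd.DataFrame({'value': [0.345, 0.45, 0.95]})

def half_up(x, places='0.1'):
    # quantize the decimal string of x rather than its binary approximation
    return float(Decimal(str(x)).quantize(Decimal(places), rounding=ROUND_HALF_UP))

print(df['value'].apply(half_up))   # 0.3, 0.5, 1.0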

How do you calculate the sum based on certain numbers in the dataframe?

I have variables like this:
a = pd.DataFrame(np.array([[1, 1, 2, 3, 2], [2, 2, 3, 3, 2], [1, 2, 3, 2, 3]]))
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])
Display a
0 1 2 3 4
0 1 1 2 3 2
1 2 2 3 3 2
2 1 2 3 2 3
Display b
[0.1 0.3 0.5 0.6 0.2]
The result I want is the sum of the values in b based on the values of a, where the column positions in a serve as the indices into b.
The final result that I want is like this:
0.4 0.7 0.6
0   0.6 1.1
0.1 0.9 0.7
How the first row is obtained, in detail:
0.4 0.7 0.6
0.4 is obtained from 0.1 + 0.3, based on the number 1 in the first row of a: its indices are 0 and 1, so we add b[0] and b[1].
0.7 is obtained from 0.5 + 0.2, based on the number 2, whose indices are 2 and 4, so we add b[2] and b[4].
0.6 is based on the number 3, which is just b[3] because its only index is 3.
You can create one-hot encoded matrices to use in a dot product:
from pandas.api.types import CategoricalDtype

n = a.max().max()
cat = CategoricalDtype(categories=np.arange(1, n + 1))
# one indicator column per (row of a, category) pair
dummies = pd.get_dummies(a.T.astype(cat))
# weight the indicators by b and fold the result back into one row per row of a
b.dot(dummies).reshape(n, n)
yields
array([[0.4, 0.7, 0.6],
[0. , 0.6, 1.1],
[0.1, 0.9, 0.7]])
This is one way you can do it; it is not optimized, but I think it follows your logic clearly:
df = pd.DataFrame(columns=range(1, a.max().max() + 1))
for i, r in a.iterrows():
    for c in list(df):
        # sum the entries of b at the positions where row r equals the value c
        df.loc[i, c] = np.sum(b[r[r == c].index.values])
df
1 2 3
0 0.4 0.7 0.6
1 0 0.6 1.1
2 0.1 0.9 0.7
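For larger inputs, a hedged vectorized sketch with np.add.at (assuming, as above, that the values in a run from 1 to n): it scatter-adds the entries of b into one output row per row of a.
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 1, 2, 3, 2], [2, 2, 3, 3, 2], [1, 2, 3, 2, 3]]))
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])

n = a.values.max()
out = np.zeros((a.shape[0], n))
rows = np.arange(a.shape[0])[:, None]      # row index broadcast against every column
np.add.at(out, (rows, a.values - 1), b)    # accumulate b into (row, value - 1) slots
print(out)
# [[0.4 0.7 0.6]
#  [0.  0.6 1.1]
#  [0.1 0.9 0.7]]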
