Pandas Dataframe: Binning in Multiple Dimensions - python

Suppose I have a dataframe, df, consisting of a class of two objects, S, a set of co-ordinates associated with them, X and Y, and a value, V, that was measured there.
For example, the dataframe looks like this:
S X Y V
0 3 3 1
0 4 3 2
1 6 0 1
1 3 3 8
I would like to know the commands that allow me to group the X and Y coordinates associated with the class, S in a new binning. In this new picture, the new value of V should be the sum of the values in the bin for each class, S.
For example, suppose this co-ordinate system was initially binned between 0 and 10 in both X and Y. I would like to rebin it into two bins per axis, labelled 0 and 1. This means:
Values with 0 <= X <= 5, 0 <= Y <= 5 in the old binning constitute the value 0;
Values with 5 < X <= 10, 5 < Y <= 10 in the old binning constitute the value 1.
Edit:
For further example, considering Dataframe df:
Row 1 has X = 3 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this falls into bin (0,0)
Row 2 has X = 4 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this also falls into bin (0,0).
Since Row 1 and 2 are observed in the same bin and are of the same class S, they are added along column V. This gives a combined row, X=0, Y=0, V = 1+2 =3
Row 3 has X = 6 and Y = 0. Since 5 < X <= 10 and 0 <= Y <= 5, this falls into bin (1,0).
Row 4 has X = 3 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this falls into bin (0,0). However, since the element is of class S=1, it is not added to anything, since we only add between shared classes.
The output should then be:
S X Y V
0 0 0 3
1 1 0 1
1 0 0 8
What commands must I use to achieve this?

This should do the trick, where data is the dataframe from the question:
data.loc[data['X'] <= 5, 'X'] = 0
data.loc[data['X'] > 5, 'X'] = 1
data.loc[data['Y'] <= 5, 'Y'] = 0
data.loc[data['Y'] > 5, 'Y'] = 1
data = data.groupby(['S', 'X', 'Y']).sum().reset_index()
For your example the output is:
S X Y V
0 0 0 0 3
1 1 0 0 8
2 1 1 0 1
I found this answer to be helpful.
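If more than two bins per axis are needed, the same idea can be written with pd.cut instead of one .loc line per bin. This is only a minimal sketch: it assumes the bin edges 0, 5 and 10 from the question and that a coordinate of exactly 0 belongs to the first bin (hence include_lowest=True). The names edges and binned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"S": [0, 0, 1, 1],
                   "X": [3, 4, 6, 3],
                   "Y": [3, 3, 0, 3],
                   "V": [1, 2, 1, 8]})

# Assumed bin edges: [0, 5] -> bin 0, (5, 10] -> bin 1.
edges = [0, 5, 10]

binned = df.copy()
for axis in ["X", "Y"]:
    # labels=False returns the integer bin index instead of an interval.
    binned[axis] = pd.cut(df[axis], bins=edges, labels=False, include_lowest=True)

# Sum V within each (class, X bin, Y bin) combination.
binned = binned.groupby(["S", "X", "Y"], as_index=False)["V"].sum()
print(binned)
```

Adding more values to edges gives a finer binning without touching the rest of the code.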

Related

group column values with difference of 3(say) digit in python

I am new to Python. The problem statement is as follows: we have the data below as a dataframe
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column into bins of width 3 (say), like 0-3, 3-6, 6-9 and >=9, and the value should be the count of each value in the bin.
The expected output looks like this:
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1
Example
The example code in the question does not run because x, y and z are unquoted. Anyone who wants to try it can use the following code:
import pandas as pd
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10],
                   'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1
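When all bins have the same width, the bin index can also be computed with integer division instead of pd.cut. The following is a sketch under that assumption, not a replacement for the answer above. The names width, last_bin and bin_index are illustrative, and the rows are labelled by bin number (0 to 3) rather than by strings such as '0-3'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Diff": [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10],
                   "value": "x,x,y,x,x,x,y,x,z,x,x,y,y,z".split(",")})

width = 3      # assumed bin width
last_bin = 3   # everything from 9 upwards shares this bin index

# Integer division gives the bin index; np.minimum caps it at the open-ended bin.
bin_index = np.minimum(df["Diff"] // width, last_bin).rename("bin")
counts = pd.crosstab(bin_index, df["value"])
print(counts)
```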

Python: Writing a matrix with different values depending of how far is from a point (i,j)

Help me out with this code! As input I have N: the matrix size; i: the point's row; j: the point's column; P: the point's magnitude. Each step away from the point (i,j) decreases the magnitude by 1. So if my input is N = 7, i = 3, j = 3, P = 3, my output would look like this:
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 2 2 2 1 0
0 1 2 3 2 1 0
0 1 2 2 2 1 0
0 1 1 1 1 1 0
0 0 0 0 0 0 0
I can't figure out how to write the correct value in each position :( help me out! Here's the code that I tried -->
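Not a beautiful or efficient solution, but it gets the job done:
N, i, j, P = 7, 3, 3, 3
# Use _ as the throwaway loop variable so the point's row i is never shadowed.
M = [[0] * N for _ in range(N)]
for row in range(N):
    for col in range(N):
        # Chebyshev distance from (i, j), clamped so values never drop below 0.
        M[row][col] = max(0, P - max(abs(row - i), abs(col - j)))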
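The same matrix can be built without explicit loops. This is a minimal NumPy sketch, assuming NumPy is available and that cells farther than P steps from the point should stay at 0 rather than go negative. The name distance is illustrative.

```python
import numpy as np

N, i, j, P = 7, 3, 3, 3

# Row and column index of every cell in the N x N grid.
rows, cols = np.indices((N, N))
# Chebyshev distance: number of "rings" between each cell and (i, j).
distance = np.maximum(np.abs(rows - i), np.abs(cols - j))
# Clamp at 0 so cells farther than P steps away do not go negative.
M = np.maximum(P - distance, 0)
print(M)
```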

convert values in a series to either one of two values

I have a series y which has values between -3 and 3.
I want to convert numbers that are above 0 to 1 and numbers that are less than or equal to zero to 0.
What is the best way to do this?
I wrote the code below. However it doesn't give me the expected output. The first line works. However after running the second line the values that were 1 change to something random, which I don't understand
import numpy as np
y_final = np.where(y > 0, 1, y).tolist()
y_final = np.where(y <= 0, 0, y).tolist()
I think you need Series.clip if the values are integers:
import pandas as pd
y = pd.Series(range(-3, 4))
print (y)
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
print (y.clip(lower=0, upper=1))
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int64
Your own solution can be simplified by setting 1 and 0 in a single np.where call:
y_final = np.where(y > 0, 1, 0)
print (y_final)
[0 0 0 0 1 1 1]
Or convert the boolean mask (values greater than 0) to integers:
y_final = y.gt(0).astype(int)
#alternative
#y_final = (y > 0).astype(int)
print (y_final)
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int32
You can also use simple map:
numbers = range(-3,4)
print(list(map(lambda n: 1 if n > 0 else 0, numbers)))
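As for why the two-line attempt in the question loses the 1s: the second np.where call starts again from the original y rather than from the result of the first call, so positive values keep their original magnitude. The sketch below illustrates this on an assumed integer series. The names second_only, step1 and chained are illustrative.

```python
import numpy as np
import pandas as pd

y = pd.Series([-3, -2, -1, 0, 1, 2, 3])

# The second call from the question starts again from y,
# so positive values keep their original magnitude.
second_only = np.where(y <= 0, 0, y).tolist()

# Chaining on the intermediate result keeps the 1s from the first step.
step1 = np.where(y > 0, 1, y)
chained = np.where(step1 <= 0, 0, step1).tolist()

print(second_only)
print(chained)
```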

Get the sum of rows that contain 0 as a value

I want to know how to write Python code for the following problem.
I have a dataframe that contains this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b in which each 0 is replaced by the length of the run of successive 0 values it belongs to, to get something like this:
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
    if x[i] != 0:
        result.append(x[i])
        i += 1
    else:
        # See how many times zero occurs in a row
        j = i
        n_zeros = 0
        while j < len(x) and x[j] == 0:
            n_zeros += 1
            j += 1
        result.extend([n_zeros] * n_zeros)
        i += n_zeros
result
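For longer columns, a loop-free variant may be preferable. This is a sketch of one common pandas idiom (labelling runs with a cumulative sum and then measuring each run), using the column from the question as assumed input. The names is_zero, run_id and run_length are illustrative.

```python
import pandas as pd

x = pd.Series([1, 0, 0, 0, 1, 1, 0, 0, 1])

is_zero = x.eq(0)
# A new run starts wherever is_zero changes value; cumsum numbers the runs.
run_id = is_zero.ne(is_zero.shift()).cumsum()
# Length of the run each element belongs to.
run_length = x.groupby(run_id).transform("size")
# Zeros take their run length; non-zero values are kept as they are.
result = x.where(~is_zero, run_length)
print(result.tolist())
```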

How to change values in a dataframe Python

I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. It doesn't raise an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new, but how do I do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y
This should work:
# df.columns is an attribute, not a method, so it is not called
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1
I think the simplest approach is to use replace with a dict:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
                  columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
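This should do (values missing from the dict would become NaN, so '?' is mapped explicitly and the keys are lowercase to match the data):
df['infants'] = df['infants'].map({'y': 1, 'n': 0, '?': 1})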
Maybe you can try apply:
import pandas as pd
# create the dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# define a function that converts the category to 0/1 (it receives one row at a time)
def tran_cat_to_num(row):
    if row['sex'] == 'male':
        return 1
    elif row['sex'] == 'female':
        return 0
# create sex_new
df_new['sex_new'] = df_new.apply(tran_cat_to_num, axis=1)
df_new
Raw data:
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
After using apply:
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1
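You can change the values using the map function.
Ex.:
# '?' is mapped to 1 as in the question; values missing from the dict would become NaN
x = {'y': 1, 'n': 0, '?': 1}
# df.columns is an attribute, not a method
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe.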
All the solutions above are correct, but you could also chain replace calls (with lowercase keys, to match the data):
df["infants"] = df["infants"].replace("y", 1).replace("n", 0).replace("?", 1)
which, now that I read more carefully, is very similar to using replace with a dict!
Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)
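On why the nested loop in the question leaves the dataframe unchanged: iterating over a DataFrame yields its column labels (strings), and rebinding the loop variable never writes anything back. The sketch below shows this on a small made-up frame and then converts every column at once. The column name water and the names labels, mapping and converted are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Small illustrative frame; "water" is a made-up second column.
df = pd.DataFrame({"infants": ["n", "y", "?"], "water": ["y", "n", "y"]})

# Iterating over a DataFrame yields column labels, not cell values.
labels = [col for col in df]
print(labels)

# Rebinding a loop variable never writes back into the DataFrame,
# so build a converted copy and assign it to a name instead.
mapping = {"y": 1, "n": 0, "?": 1}
converted = df.apply(lambda col: col.map(mapping))
print(converted)
```

Any column holding values outside mapping (for example a party-name column) would turn into NaN here, so such columns would need to be excluded first.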
