Group column values into bins of width 3 (say) in Python

I am new to Python. The problem is as follows: we have the data below as a DataFrame:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column into bins of width 3 (say), such as 0-3, 3-6, 6-9 and >=9, and count how often each value occurs in each bin.
Expected output:
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1

Example
The example code in the question does not run as written, because x, y and z are unquoted. Anyone who wants to try it can build the frame like this:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10],
'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1
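Since the question treats the width of 3 as adjustable, here is a minimal sketch that derives the bin edges and labels from a width instead of typing them by hand. The names width and last_edge are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Diff": [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10],
    "value": "x,x,y,x,x,x,y,x,z,x,x,y,y,z".split(","),
})

width = 3      # assumed bin width
last_edge = 9  # assumption: everything >= 9 falls into one open-ended bin

# Edges 0, 3, 6, 9 followed by infinity for the open-ended last bin.
edges = list(range(0, last_edge + 1, width)) + [float("inf")]
labels = [f"{lo}-{lo + width}" for lo in edges[:-2]] + [f">={last_edge}"]

# right=False closes each bin on the left: [0, 3), [3, 6), [6, 9), [9, inf).
binned = pd.cut(df["Diff"], bins=edges, right=False, labels=labels)

# crosstab counts each value per bin; combinations that never occur become 0.
counts = pd.crosstab(binned, df["value"])
print(counts)
```

Changing width and last_edge together (last_edge should be a multiple of width) regenerates both the edges and the labels.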

Related

Group By Sum Multiple Columns in Pandas (Ignoring duplicates)

I have the following dataframe, which contains three columns to be summed (plus someColumn), followed by my code:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem is that this code counts every instance of each value, so calling final results in:
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want each value to be counted at most once per row (i.e. the first row contains two X's, so only one X should be counted).
The desired output is therefore:
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.
Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
# axis=1 applies the lambda to each row, so a letter is kept at most once per row
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
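An alternative sketch that makes the per-row de-duplication explicit by reshaping to long form first. It assumes the same frame as in the question; the intermediate names long and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "toBeSummed":  ["X", "X", "Y", "Z"],
    "toBeSummed2": ["X", "Y", "Y", "Z"],
    "toBesummed3": ["Y", "Z", "Z", "Z"],
    "someColumn":  [float("nan")] * 4,
})
sum_cols = ["toBeSummed", "toBeSummed2", "toBesummed3"]

# stack() gives one entry per (row, column) pair, indexed by (row label, column name).
long = df[sum_cols].stack()

# Group by the original row label and keep each letter once per row.
per_row = long.groupby(level=0).unique().explode()

# Count in how many rows each letter appears.
out = per_row.value_counts().rename("Sum")
print(out.to_frame())
```

Because the grouping key is the row label, a letter repeated within one row contributes only once, regardless of how many columns are summed.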

Top N rows vs rows with Top N unique values using pandas

I have a pandas dataframe as shown below:
import pandas as pd
data ={'Count':[1,1,2,3,4,2,1,1,2,1,3,1,3,6,1,1,9,3,3,6,1,5,2,2,0,2,2,4,0,1,3,2,5,0,3,3,1,2,2,1,6,2,3,4,1,1,3,3,4,3,1,1,4,2,3,0,2,2,3,1,3,6,1,8,4,5,4,2,1,4,1,1,1,2,3,4,1,1,1,3,2,0,6,2,3,2,9,10,2,1,2,3,1,2,2,3,2,1,8,4,0,3,3,5,12,1,5,13,6,13,7,3,5,2,3,3,1,1,5,15,7,9,1,1,1,2,2,2,4,3,3,2,4,1,2,9,3,1,3,0,0,4,0,1,0,1,0]}
df = pd.DataFrame(data)
I would like to do the following:
a) Top 5 rows (this will return only 5 rows)
b) Rows with the top 5 unique values (this can return more than 5 rows if the top 5 values repeat; see my example screenshot, where we have 8 rows when selecting the top 5 unique values)
I am able to get the top 5 rows by using:
df.nlargest(5,['Count'])
However, when I try the below for b), I don't get the expected output
df.nlargest(5,['Count'],keep='all')
I expect my output to look like the screenshot.
Are you after the top 5 unique values or the five largest values? Note that head() and unique()[:5] below select by order of appearance, not by size.
import numpy as np
df = (df.assign(top5rows=np.where(df.index.isin(df.head(5).index), 'Y', 'N'),
                top5unique=np.where(df.index.isin(df.drop_duplicates(keep='first').head(5).index), 'Y', 'N')))
Or did you need:
df = (df.assign(top5rows=np.where(df.index.isin(df.head(5).index), 'Y', 'N'),
                top5unique=np.where(df['Count'].isin(list(df['Count'].unique()[:5])), 'Y', 'N')))
Count top5rows top5unique
0 1 Y Y
1 1 Y Y
2 2 Y Y
3 3 Y Y
4 4 Y Y
5 2 N Y
6 1 N Y
7 1 N Y
8 2 N Y
9 1 N Y
10 3 N Y
11 1 N Y
12 3 N Y
13 6 N Y
14 1 N Y
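If "top" is meant by size rather than by position, one possible reading of the question can be sketched as follows. The short Count sample is a hypothetical stand-in for the full data, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical shortened sample, not the full data from the question.
df = pd.DataFrame({"Count": [1, 1, 2, 3, 4, 2, 9, 15, 13, 13, 12, 10, 9, 8, 8]})

# a) The 5 rows with the largest Count (exactly 5 rows, ties cut off).
top5_rows = df.nlargest(5, "Count")

# b) The 5 largest distinct values, then every row carrying one of them.
top5_values = df["Count"].drop_duplicates().nlargest(5)
rows_with_top5_values = (
    df[df["Count"].isin(top5_values)]
    .sort_values("Count", ascending=False)
)

print(top5_rows)
print(rows_with_top5_values)
```

In this sample, a) returns 5 rows while b) returns 7, because 13 and 9 each occur twice among the five largest distinct values.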

Calculate a np.arange within a Panda dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass on its way to a certain goal. This should be stored as a list in a pandas DataFrame.
To start with I have this:
import numpy as np
import pandas as pd
cars = pd.DataFrame({'x_now': np.repeat(1, 5),
                     'y_now': np.arange(5, 0, -1),
                     'x_1_goal': np.repeat(1, 5),
                     'y_1_goal': np.repeat(10, 5)})
output would be:
x_now y_now x_1_goal y_1_goal
0 1 5 1 10
1 1 4 1 10
2 1 3 1 10
3 1 2 1 10
4 1 1 1 10
I have tried to add new columns like this, and it does not work
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index, 'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(cars.at[xy_index, 'x_now'].astype(int), (
            abs(cars.at[xy_index, 'y_now'].astype(int) - cars.at[xy_index, 'y_1_goal'].astype(int))))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index, 'x_now'], cars.at[xy_index, 'x_1_goal'],
                      (cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now']) / (
                          abs(cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now'])))
In the end I want the columns x_car_move_route and y_car_move_route, so that I can loop over the coordinates the cars need to pass. I will display the result with tkinter. I will also add more goals later, since this is only the first turn they need to make.
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result
cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
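The route() function above adds 1 to every coordinate, which fits the example because the cars move towards larger y. A hedged sketch of a variant that also handles movement towards smaller coordinates is shown here. The helper name axis_route is hypothetical, and the sketch assumes that a car moves along one axis at a time, as in the example.

```python
import numpy as np
import pandas as pd

def axis_route(now, goal, other_now, other_goal):
    """Coordinates visited on one axis: start excluded, goal included."""
    step = int(np.sign(goal - now))
    if step == 0:
        # No movement on this axis: repeat the coordinate once per step
        # taken on the other axis, so both routes have equal length.
        return [int(now)] * abs(int(other_goal) - int(other_now))
    return list(range(int(now) + step, int(goal) + step, step))

cars = pd.DataFrame({
    "x_now": np.repeat(1, 5),
    "y_now": np.arange(5, 0, -1),
    "x_1_goal": np.repeat(1, 5),
    "y_1_goal": np.repeat(10, 5),
})

coords = list(zip(cars["x_now"], cars["x_1_goal"], cars["y_now"], cars["y_1_goal"]))

# Wrapping in a Series keeps each list as a single cell of an object column.
cars["x_car_move_route"] = pd.Series(
    [axis_route(xn, xg, yn, yg) for xn, xg, yn, yg in coords], index=cars.index
)
cars["y_car_move_route"] = pd.Series(
    [axis_route(yn, yg, xn, xg) for xn, xg, yn, yg in coords], index=cars.index
)
print(cars)
```

Using the sign of the difference as the step means a car at y = 10 heading for y = 5 would get [9, 8, 7, 6, 5] rather than an upward-shifted range.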

Pandas Dataframe: Binning in Multiple Dimensions

Suppose I have a dataframe, df, consisting of a class of two objects, S, a set of co-ordinates associated with them, X and Y, and a value, V, that was measured there.
For example, the dataframe looks like this:
S X Y V
0 3 3 1
0 4 3 2
1 6 0 1
1 3 3 8
I would like to know the commands that allow me to group the X and Y coordinates associated with the class, S in a new binning. In this new picture, the new value of V should be the sum of the values in the bin for each class, S.
For example, suppose this coordinate system was initially binned between 0 and 10 in both X and Y. I would like to rebin it into two bins per axis, labelled 0 and 1. This means:
Values with 0 <= X <= 5 (and likewise 0 <= Y <= 5) in the old binning map to the value 0;
Values with 5 < X <= 10 (and likewise 5 < Y <= 10) in the old binning map to the value 1.
Edit:
As a further example, consider the dataframe df:
Row 1 has X = 3 and Y = 3. Since 0 <= X <= 5 and 0 <= Y <= 5, this falls into bin (0,0).
Row 2 has X = 4 and Y = 3. Since 0 <= X <= 5 and 0 <= Y <= 5, this also falls into bin (0,0).
Since rows 1 and 2 are observed in the same bin and are of the same class S, they are added along column V. This gives a combined row, X = 0, Y = 0, V = 1 + 2 = 3.
Row 3 has X = 6 and Y = 0. Since 5 < X <= 10 and 0 <= Y <= 5, this falls into bin (1,0).
Row 4 has X = 3 and Y = 3. Since 0 <= X <= 5 and 0 <= Y <= 5, this falls into bin (0,0). However, since the element is of class S = 1, it is not added to anything, because we only add within a shared class.
The output should then be:
S X Y V
0 0 0 3
1 1 0 1
1 0 0 8
What commands must I use to achieve this?
This should do the trick:
data.loc[data['X'] <= 5, 'X'] = 0
data.loc[data['X'] > 5, 'X'] = 1
data.loc[data['Y'] <= 5, 'Y'] = 0
data.loc[data['Y'] > 5, 'Y'] = 1
data = data.groupby(['S', 'X', 'Y']).sum().reset_index()
For your example the output is:
S X Y V
0 0 0 0 3
1 1 0 0 8
2 1 1 0 1
I found this answer to be helpful.
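For more than two bins per axis, the same idea can be sketched with pd.cut, which replaces the pairs of .loc assignments with a single list of edges. The edges list is an assumption taken from the 0 to 5 and 5 to 10 ranges in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "S": [0, 0, 1, 1],
    "X": [3, 4, 6, 3],
    "Y": [3, 3, 0, 3],
    "V": [1, 2, 1, 8],
})

edges = [0, 5, 10]  # assumed old-coordinate edges: [0, 5] -> 0 and (5, 10] -> 1

# labels=False returns the integer bin number; include_lowest=True keeps 0 in the first bin.
binned = df.assign(
    X=pd.cut(df["X"], bins=edges, labels=False, include_lowest=True),
    Y=pd.cut(df["Y"], bins=edges, labels=False, include_lowest=True),
)

# Sum V within each (class, X bin, Y bin) combination.
out = binned.groupby(["S", "X", "Y"], as_index=False)["V"].sum()
print(out)
```

Adding more entries to edges produces more bins without any further changes to the code.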

populate new column in a pandas dataframe which takes input from other columns

I have a function that should take x, y and z as input and return r as output.
For example, my_func(x, y, z) takes x = 10, y = 'apple' and z = 2 and returns the value for column r. Similarly, the function takes x = 20, y = 'orange' and z = 4 and populates column r. Any suggestions for efficient code to do this?
Before :
a x y z
5 10 'apple' 2
2 20 'orange' 4
0 4 'apple' 2
5 5 'pear' 6
After:
a x y z r
5 10 'apple' 2 x
2 20 'orange' 4 x
0 4 'apple' 2 x
5 5 'pear' 6 x
Depends on how complex your function is. In general you can use pandas.DataFrame.apply:
>>> def my_func(x):
... return '{0} - {1} - {2}'.format(x['y'],x['a'],x['x'])
...
>>> df['r'] = df.apply(my_func, axis=1)
>>> df
a x y z r
0 5 10 'apple' 2 'apple' - 5 - 10
1 2 20 'orange' 4 'orange' - 2 - 20
2 0 4 'apple' 2 'apple' - 0 - 4
3 5 5 'pear' 6 'pear' - 5 - 5
axis=1 makes your function work 'for each row' instead of 'for each column'. From the documentation:
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1)
But if it is a really simple function, like the one above, you can probably do it without a function at all, using vectorized operations.
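A minimal sketch of the vectorized version of the string-building example above. It assumes the y column holds plain strings, and it builds r from whole columns instead of calling a function for each row.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5, 2, 0, 5],
    "x": [10, 20, 4, 5],
    "y": ["apple", "orange", "apple", "pear"],
    "z": [2, 4, 2, 6],
})

# Convert the numeric columns to strings once, then concatenate column-wise.
df["r"] = df["y"] + " - " + df["a"].astype(str) + " - " + df["x"].astype(str)
print(df)
```

Column-wise operations like this avoid the per-row Python call that apply(axis=1) makes, which usually matters on large frames.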
