Fill all the columns of the dataframe with a condition - Python

I am trying to fill the dataframe based on a certain condition, but I cannot find the appropriate solution. My actual dataframe is a bit larger, but let's say that my pandas dataframe looks like this:
      0     1     2     3     4     5
0  0.32  0.40  0.60  1.20  3.40  0.00
1  0.17  0.12  0.00  1.30  2.42  0.00
2  0.31  0.90  0.80  1.24  4.35  0.00
3  0.39  0.00  0.90  1.50  1.40  0.00
And I want to update the values so that once 0.00 appears in a row (rows 2 and 4), every value from that point to the end of the row becomes 0.00. Something like this:
      0     1     2     3     4     5
0  0.32  0.40  0.60  1.20  3.40  0.00
1  0.17  0.12  0.00  0.00  0.00  0.00
2  0.31  0.90  0.80  1.24  4.35  0.00
3  0.39  0.00  0.00  0.00  0.00  0.00
I have tried with
for t in range(1, T-1):
    data = np.where(df[t-1] == 0, 0, df[t])
and several other ways, but I couldn't get what I want.
Thanks!

Try as follows:
Select from df with df[df.eq(0)]. This keeps all the zeros and turns every other value into NaN.
Now apply df.ffill along axis=1. This propagates each zero through to the end of its row.
Finally, change the dtype to bool by chaining df.astype(bool), turning all zeros into False and all NaN values into True.
We feed the result to df.where: wherever the mask is True, we keep the value from df itself; wherever it is False, we insert 0.
df = df.where(df[df.eq(0)].ffill(axis=1).astype(bool), 0)
print(df)
0 1 2 3 4 5
0 0.32 0.40 0.6 1.20 3.40 0.0
1 0.17 0.12 0.0 0.00 0.00 0.0
2 0.31 0.90 0.8 1.24 4.35 0.0
3 0.39 0.00 0.0 0.00 0.00 0.0
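An equivalent one-liner (a sketch of an alternative, not part of the answer above) builds the mask directly with cummax, which flags everything from the first zero in each row onward:

```python
import pandas as pd

df = pd.DataFrame([
    [0.32, 0.40, 0.60, 1.20, 3.40, 0.00],
    [0.17, 0.12, 0.00, 1.30, 2.42, 0.00],
    [0.31, 0.90, 0.80, 1.24, 4.35, 0.00],
    [0.39, 0.00, 0.90, 1.50, 1.40, 0.00],
])

# df.eq(0).cummax(axis=1) is True from the first zero in each row onward;
# df.mask then replaces those cells with 0.
result = df.mask(df.eq(0).cummax(axis=1), 0)
```

Both approaches give the same result; cummax avoids the intermediate NaN selection.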


How do I assign values from a list to another list of strings

I am pretty new to Python, so a few problems have come up.
I have an Excel sheet with different entries, and my goal is to read each entry and automatically assign it to its name. This is a simplified sheet and more values could be added, so I did not want to address each value one after another.
So far I have done this:
import pandas as pd
import numpy as np

df = pd.read_excel('koef.xlsx')
data_array = np.array(df)
XCoeff = []
YCoeff = []
NCoeff = []
VarName = []
for i in range(len(data_array)):
    XCoeff.append(data_array[i][1])
XCoeff.pop(0)
for i in range(len(data_array)):
    YCoeff.append(data_array[i][2])
YCoeff.pop(0)
for i in range(len(data_array)):
    NCoeff.append(data_array[i][3])
NCoeff.pop(0)
for i in range(len(data_array)):
    VarName.append(data_array[i][0])
VarName.pop(0)
s1 = "X"
s2 = "Y"
s3 = "N"
XName = [s1 + x for x in VarName]
YName = [s2 + x for x in VarName]
NName = [s3 + x for x in VarName]
In the end I want a list of variables for X, Y and N where, for example, the first entries of X would be: Xdel = 0.00, Xdel2 = 4.44, Xdel3 = -2.06 and so on. With these variables I need to do calculations.
The Excel Sheet:
Motion X Y N
0 zero 0.00 0 0.00
1 del 0.00 4.44 -2.06
2 del2 -2.09 -0.24 0.16
3 del3 0.00 -2.95 1.38
4 u -2.20 0 0.00
5 uu 1.50 X 0.00
6 uuu 0.00 0 0.00
7 udot -1.47 0 0.00
8 v 0.11 -24.1 -7.94
9 vv 2.74 2.23 -1.15
10 vvv 0.00 -74.7 2.79
11 vdot 0.00 -16.4 -0.47
12 r -0.07 4.24 -3.32
13 rr 0.58 0.56 -0.27
14 rrr 0.00 2.58 -1.25
15 rdot 0.00 -0.46 -0.75
16 vr 13.10 0 0.00
17 vrr 0.00 -40.3 8.08
18 vvr 0.00 -9.9 -3.37
19 udel 0.00 -4.56 2.32
20 vdel2 0.00 5.15 -1.17
21 vvdel 0.00 7.4 -3.41
22 rdel2 0.00 -0.51 -0.58
23 rrdel 0.00 -0.98 0.43
I hope the problem is stated clearly; if not, feel free to ask.
Thank you.
So far I have at least got the lists working, but I am struggling to merge them.
If you load the Excel sheet in a certain manner, accessing the cells by name can be simple. I think this will give you the cell access you need:
import pandas as pd
# Read the Excel sheet, index by the Motion column.
df = pd.read_excel('koef.xlsx', index_col='Motion')
print(df)
print(df.Y.del3) # specific cells
print(df.N.vvr)
Output:
X Y N
Motion
zero 0.00 0 0.00
del 0.00 4.44 -2.06
del2 -2.09 -0.24 0.16
del3 0.00 -2.95 1.38
u -2.20 0 0.00
uu 1.50 X 0.00
uuu 0.00 0 0.00
udot -1.47 0 0.00
v 0.11 -24.1 -7.94
vv 2.74 2.23 -1.15
vvv 0.00 -74.7 2.79
vdot 0.00 -16.4 -0.47
r -0.07 4.24 -3.32
rr 0.58 0.56 -0.27
rrr 0.00 2.58 -1.25
rdot 0.00 -0.46 -0.75
vr 13.10 0 0.00
vrr 0.00 -40.3 8.08
vvr 0.00 -9.9 -3.37
udel 0.00 -4.56 2.32
vdel2 0.00 5.15 -1.17
vvdel 0.00 7.4 -3.41
rdel2 0.00 -0.51 -0.58
rrdel 0.00 -0.98 0.43
-2.95
-3.37
Caveat: the column/row names also need to be valid Python identifiers for attribute access; if they aren't, you can use the df['Y']['del3'] syntax instead. Valid identifiers just make the access easier to type.
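To get back to the original goal of named coefficients like Xdel, a dictionary keyed by the prefixed names is usually safer than creating individual variables. A minimal sketch, assuming a frame indexed by Motion as above (the three-row df here is a hypothetical stand-in for the real sheet):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel('koef.xlsx', index_col='Motion')
df = pd.DataFrame(
    {"X": [0.00, 0.00, -2.09], "Y": [0.0, 4.44, -0.24], "N": [0.0, -2.06, 0.16]},
    index=pd.Index(["zero", "del", "del2"], name="Motion"),
)

# One flat dict mapping prefixed names to values:
# {'Xzero': 0.0, 'Xdel': 0.0, 'Xdel2': -2.09, 'Yzero': 0.0, ...}
coeffs = {f"{col}{motion}": val
          for col in df.columns
          for motion, val in df[col].items()}
```

Calculations can then look up coeffs["Ydel"] and so on, and new rows in the sheet are picked up automatically.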

How to get maximums of multiple groups based on grouping column?

I have an initial dataset data grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I split with train_test_split based on the groups:
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying the model on data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum in each of the initial groups is 1.00; otherwise, it is 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform.
id y
0 3 0.65
1 3 0.65
2 3 0.33
3 3 0.13
4 3 0.00
5 4 0.33
6 4 0.34
7 4 0.21
8 4 0.08
# Mark the max row in each group and cast the boolean to float
# (1.0 / 0.0 are just the binary flags as floats).
# transform('max') returns a column of the same length with each
# group's max repeated, so it can be compared row-wise against y.
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
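A quick runnable check of this transform idea against the simplified id_test / y_hat data from the question's edit:

```python
import pandas as pd

# The simplified id_test / y_hat data from the question
id_y = pd.DataFrame({
    "id": [3, 3, 3, 3, 4, 4, 4, 4],
    "y":  [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08],
})

# 1.0 where y equals its group's max, 0.0 otherwise
id_y["y_transform"] = (
    id_y["y"] == id_y.groupby("id")["y"].transform("max")
).astype(float)
```

Group sizes don't matter here, since transform broadcasts each group's max back to every row of that group.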

Sorting by data in another Dataframe

I've been stuck on an engineering problem that's Python/Pandas related. I'd appreciate any help.
I've simplified the numbers so I can explain myself better.
I have something similar to the following:
    positioning(x-axis)  Calculated difference
1                  0.25                   0.05
2                  0.75                   0.06
3                  1.25                   0.02
4                  0.25                   0.05
5                  0.75                   0.05
6                  1.25                   0.02
7                  0.25                   0.09
8                  0.75                   0.01
9                  1.25                   0.02
10                 0.25                   0.05
What I need to do is re-organise the calculated difference based on the x-axis positioning.
So it looks something like this:
(0.25)  (0.75)  (1.25)
  0.05    0       0
  0       0.06    0
  0       0       0.02
  0.05    0       0
  0       0.05    0
  0       0       0.02
  0.09    0       0
  0       0.01    0
  0       0       0.02
  0.05    0       0
As you can see, I need to organise everything based on the x-positioning.
What is the best approach to this problem? Keep in mind I have 2000+ rows, and the x positioning is dynamic but currently goes up to 50 (so a lot of columns).
I hope I've clarified the question.
Use pd.get_dummies:
In [10]: pd.get_dummies(df['positioning(x-axis)']).mul(df['Calculated difference'],axis=0)
Out[10]:
0.25 0.75 1.25
1 0.05 0.00 0.00
2 0.00 0.06 0.00
3 0.00 0.00 0.02
4 0.05 0.00 0.00
5 0.00 0.05 0.00
6 0.00 0.00 0.02
7 0.09 0.00 0.00
8 0.00 0.01 0.00
9 0.00 0.00 0.02
10 0.05 0.00 0.00
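A self-contained sketch of the get_dummies approach, using the first few rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "positioning(x-axis)": [0.25, 0.75, 1.25, 0.25],
    "Calculated difference": [0.05, 0.06, 0.02, 0.05],
})

# One indicator column per unique position, then scale each row's
# indicator by that row's calculated difference.
wide = (pd.get_dummies(df["positioning(x-axis)"])
        .mul(df["Calculated difference"], axis=0))
```

New positions automatically become new columns, which suits the "dynamic x positioning" requirement.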
Just use df.pivot:
df.pivot(columns='positioning(x-axis)',values='Calculated difference').fillna(0)
Out[363]:
positioning(x-axis)  0.25  0.75  1.25
0 0.05 0.00 0.00
1 0.00 0.06 0.00
2 0.00 0.00 0.02
3 0.05 0.00 0.00
4 0.00 0.05 0.00
5 0.00 0.00 0.02
6 0.09 0.00 0.00
7 0.00 0.01 0.00
8 0.00 0.00 0.02
9 0.05 0.00 0.00
factorize
import numpy as np
import pandas as pd

# Integer codes i for each row's position, and the unique positions p
i, p = pd.factorize(df['positioning(x-axis)'])
d = df['Calculated difference'].to_numpy()
# Zero array with one column per unique position
a = np.zeros_like(d, shape=(len(df), len(p)))
# Scatter each difference into its position's column
a[np.arange(len(df)), i] = d
pd.DataFrame(a, df.index, p)
0.25 0.75 1.25
0 0.05 0.00 0.00
1 0.00 0.06 0.00
2 0.00 0.00 0.02
3 0.05 0.00 0.00
4 0.00 0.05 0.00
5 0.00 0.00 0.02
6 0.09 0.00 0.00
7 0.00 0.01 0.00
8 0.00 0.00 0.02
9 0.05 0.00 0.00
One way to do this would be to use pandas' pivot and then to reset the index.
Given a data frame like this:
positioning(x-axis) Calculated difference
0 0.0 0.61
1 0.0 0.96
2 0.0 0.56
3 0.0 0.91
4 0.0 0.57
5 0.0 0.67
6 0.1 0.71
7 0.1 0.71
8 0.1 0.95
9 0.1 0.89
10 0.1 0.61
df.pivot(columns='positioning(x-axis)', values='Calculated difference').reset_index().drop(columns=['index']).fillna(0)
positioning(x-axis) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.00
3 0.00 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.00 0.00 0.00 0.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.85
8 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 0.00 0.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Normalising data for plotting

I am trying to plot the data shown below in a normalised way, in order to have the maximum value on the y-axis equal to 1.
Dataset:
%_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0
I tried as follows:
cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
but I am not completely sure about the output.
In fact, if I plot as follows
df.pivot_table(index='Label').plot.bar()
I get a different result. I think it is because the first snippet does not take the Label index into account.

There are multiple techniques to normalize; this one uses native pandas:
import io
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO(""" %_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0"""), sep="\s+")
fig, ax = plt.subplots(2, figsize=[10,6])
df2 = (df-df.min())/(df.max()-df.min())
df.plot(ax=ax[0], kind="line")
df2.plot(ax=ax[1], kind="line")
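Since the stated goal is a maximum of exactly 1 on the y-axis, dividing each column by its own max may be closer to what is wanted than min-max scaling. A sketch, assuming only the four percentage columns should be scaled (Label is left untouched); the three-row frame is a shortened stand-in for the dataset:

```python
import pandas as pd

# Shortened stand-in for the dataset in the question
df = pd.DataFrame({
    "%_F": [0.00, 0.01, 0.00],
    "%_M": [0.00, 0.01, 0.00],
    "%_C": [0.08, 0.07, 0.10],
    "%_D": [0.05, 0.05, 0.01],
    "Label": [0.0, 1.0, 1.0],
})

cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
# Scale each column so its maximum becomes exactly 1;
# Label is excluded so the grouping/plot index stays intact.
df[cols_to_norm] = df[cols_to_norm] / df[cols_to_norm].max()
```

After this, df.pivot_table(index='Label').plot.bar() aggregates the scaled values without Label itself having been rescaled.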

How can I divide an entire dataframe by certain rows?

I have a dataframe that looks like the one below, except much longer. Ultimately, Var, Type, and Level combined represent unique entries. I want to divide the other entries in the dataframe by the Unexposed entries, according to the appropriate grouping (e.g., 'Any All Exposed' would be divided by 'Any All Unexposed', whereas 'Any Existing Exposed' would be divided by 'Any Existing Unexposed').
Var Type Level Metric1 Metric2 Metric3
Any All Unexposed 34842 30783 -12
Any All Exposed 54167 54247 0.15
Any All LowExposure 20236 20311 0.37
Any All MediumExposure 15254 15388 0.87
Any All HighExposure 18677 18548 0.7
Any New Unexposed 0 23785 0
Any New Exposed 0 43030 0
Any New LowExposure 0 16356 0
Any New MediumExposure 0 12213 0
Any New HighExposure 0 14461 0
Any Existing Unexposed 34843 6998 -80
Any Existing Exposed 54167 11217 -80
Any Existing LowExposure 20236 3955 -81
Any Existing MediumExposure 15254 3175 -79
Any Existing HighExposure 18677 4087 -78
The most straightforward way to do this, I think, would be creating a multiindex, but I've tried a variety of methods to no avail (normally receiving an error that it can't divide on a non-unique index).
The expected result would be something like the following, where every row is divided by the Unexposed row with the matching Var and Type values.
Var Type Level Metric1 Metric2 Metric3 MP1 MP2 MP3
Any All Unexposed 34842 30783 -12 1.00 1.00 1.00
Any All Exposed 54167 54247 0.15 1.55 1.76 -0.01
Any All LowExposure 20236 20311 0.37 0.58 0.66 -0.03
Any All MediumExposure 15254 15388 0.87 0.44 0.50 -0.07
Any All HighExposure 18677 18548 0.7 0.54 0.60 -0.06
Any New Unexposed 0 23785 0 0.00 1.00 0.00
Any New Exposed 0 43030 0 0.00 1.81 0.00
Any New LowExposure 0 16356 0 0.00 0.69 0.00
Any New MediumExposure 0 12213 0 0.00 0.51 0.00
Any New HighExposure 0 14461 0 0.00 0.61 0.00
Any Existing Unexposed 34843 6998 -80 1.00 1.00 1.00
Any Existing Exposed 54167 11217 -80 1.55 1.60 1.00
Any Existing LowExposure 20236 3955 -81 0.58 0.57 1.01
Any Existing MediumExposure 15254 3175 -79 0.44 0.45 0.99
Any Existing HighExposure 18677 4087 -78 0.54 0.58 0.98
To divide every row in each Var/Type grouping by a specific Level, use groupby and divide.
For example, to divide by Unexposed, as in your example output:
def divide_by(g, denom_lvl):
    cols = ["Metric1", "Metric2", "Metric3"]
    num = g[cols]
    denom = g.loc[g.Level == denom_lvl, cols].iloc[0]
    return num.divide(denom).fillna(0).round(2)

df.groupby(['Var', 'Type']).apply(divide_by, denom_lvl='Unexposed')
Output:
Metric1 Metric2 Metric3
0 1.00 1.00 1.00
1 1.55 1.76 -0.01
2 0.58 0.66 -0.03
3 0.44 0.50 -0.07
4 0.54 0.60 -0.06
5 0.00 1.00 0.00
6 0.00 1.81 0.00
7 0.00 0.69 0.00
8 0.00 0.51 0.00
9 0.00 0.61 0.00
10 1.00 1.00 1.00
11 1.55 1.60 1.00
12 0.58 0.57 1.01
13 0.44 0.45 0.99
14 0.54 0.58 0.98
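A runnable check of this groupby-and-divide idea, on a hypothetical two-group slice of the frame from the question (only two metric columns, to keep it short):

```python
import pandas as pd

def divide_by(g, denom_lvl):
    # Divide every row of the group by its denom_lvl row;
    # 0/0 gives NaN, which fillna turns back into 0.
    cols = ["Metric1", "Metric2"]
    denom = g.loc[g.Level == denom_lvl, cols].iloc[0]
    return g[cols].divide(denom).fillna(0).round(2)

# Hypothetical two-group slice of the question's frame
df = pd.DataFrame({
    "Var":     ["Any"] * 4,
    "Type":    ["All", "All", "New", "New"],
    "Level":   ["Unexposed", "Exposed", "Unexposed", "Exposed"],
    "Metric1": [34842, 54167, 0, 0],
    "Metric2": [30783, 54247, 23785, 43030],
})

out = df.groupby(["Var", "Type"], group_keys=False).apply(divide_by,
                                                          denom_lvl="Unexposed")
```

group_keys=False keeps the original row index, so the ratios can be joined straight back onto df.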
I'm not sure if I got it correctly. Would something like this do the trick?
You can iterate over all unique Var/Type combinations and perform the division (selecting each single row with .iloc[0] so the division isn't misaligned by the rows' differing indices):
metrics = ['Metric1', 'Metric2', 'Metric3']
for i in df['Var'].unique():
    for j in df['Type'].unique():
        mask = (df['Var'] == i) & (df['Type'] == j)
        exposed = df.loc[mask & (df['Level'] == 'Exposed'), metrics].iloc[0]
        unexposed = df.loc[mask & (df['Level'] == 'Unexposed'), metrics].iloc[0]
        result = exposed / unexposed
...
