Efficient method to adjust column values where equal to x - Python

The following multiplies values in a column for rows that match a specific value. In the example below, where a row's Item is equal to Up, I want to multiply all other columns by 2. I'm currently doing this one column at a time. Is there a more efficient way to process this?
import pandas as pd
df = pd.DataFrame({
'Item' : ['Up','Up','Down','Up','Down','Up'],
'A' : [50, 50, 60, 60, 40, 30],
'B' : [60, 70, 60, 50, 50, 60],
})
df.loc[df['Item'] == 'Up', 'A'] = df['A'] * 2
df.loc[df['Item'] == 'Up', 'B'] = df['B'] * 2
Out:
Item A B
0 Up 100 120
1 Up 100 140
2 Down 60 60
3 Up 120 100
4 Down 40 50
5 Up 60 120

You can do:
df.loc[df['Item'] == 'Up', ['A','B']] *= 2
Output:
Item A B
0 Up 100 120
1 Up 100 140
2 Down 60 60
3 Up 120 100
4 Down 40 50
5 Up 60 120
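If the same doubling should apply to every numeric column rather than a hand-picked list, the column set can be built dynamically; a minimal sketch, assuming all numeric columns should be scaled:
cols = df.select_dtypes('number').columns   # pick numeric columns only
df.loc[df['Item'] == 'Up', cols] *= 2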

Related

PySpark self-dependent cumulative calculation without row-wise for loop or iterrows

I have a calculation that I need to make for a dataset that models a tank of liquids, and I really would like to do this without iterating manually over each row, but I just don't seem to be clever enough to figure it out.
The calculation is quite easy to do on a simple list of values, as shown:
import pandas as pd

inflow_1 = [100, 100, 90, 0, 20, 0, 20, 60, 30, 70]
inflow_2 = [0, 50, 30, 20, 50, 0, 90, 20, 70, 90]
outflow = [0, 10, 80, 70, 80, 50, 30, 100, 90, 10]

tank_volume1 = 0
tank_volume2 = 0
outflow_volume1 = 0
outflow_volume2 = 0
outflows_1 = []
outflows_2 = []

for in1, in2, out in zip(inflow_1, inflow_2, outflow):
    tank_volume1 += in1
    tank_volume2 += in2
    outflow_volume1 += out * (tank_volume1 / (tank_volume1 + tank_volume2))
    outflow_volume2 += out * (tank_volume2 / (tank_volume1 + tank_volume2))
    tank_volume1 -= outflow_volume1
    tank_volume2 -= outflow_volume2
    outflows_1.append(outflow_volume1)
    outflows_2.append(outflow_volume2)

df = pd.DataFrame({'inflow_1': inflow_1, 'inflow_2': inflow_2, 'outflow': outflow,
                   'outflow_1': outflows_1, 'outflow_2': outflows_2})
Which outputs:
inflow_1 inflow_2 outflow timestamp outflow_1 outflow_2
0 100 0 0 0 0.000000 0.000000
1 100 50 10 1 8.000000 2.000000
2 90 30 80 2 70.666667 19.333333
3 0 20 70 3 121.678161 38.321839
4 20 50 80 4 165.540230 74.459770
5 0 0 50 5 235.396552 54.603448
6 20 90 30 6 272.389498 47.610502
7 60 20 100 7 377.535391 42.464609
8 30 70 90 8 473.443834 36.556166
9 70 90 10 9 484.369943 35.630057
But I just don't see how to do this purely in PySpark, or even in pandas, without iterating through rows manually. I feel like it should be possible, since I basically just need access to the previously calculated value at each step, something similar to cumsum(), but no combination I can think of gets the job done.
If there's also just a better term for this type of calculation, I'd appreciate that input.
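This kind of calculation is usually called a recurrence (or a stateful scan): each row's result depends on the previous row's result, which is why plain column-wise operations like cumsum() cannot express it directly. One common compromise, sketched below, is to isolate the loop in a plain function over NumPy arrays and assign its results back to the DataFrame; the function name is just a placeholder and this is not a vectorized solution, only a tidier packaging of the same loop:
import numpy as np

def split_outflows(in1, in2, out):
    # Stateful recurrence: each step needs the previous tank volumes,
    # so it cannot be written as independent column-wise operations.
    n = len(out)
    res1 = np.empty(n)
    res2 = np.empty(n)
    tank1 = tank2 = out1 = out2 = 0.0
    for i in range(n):
        tank1 += in1[i]
        tank2 += in2[i]
        out1 += out[i] * (tank1 / (tank1 + tank2))
        out2 += out[i] * (tank2 / (tank1 + tank2))
        tank1 -= out1
        tank2 -= out2
        res1[i] = out1
        res2[i] = out2
    return res1, res2

df['outflow_1'], df['outflow_2'] = split_outflows(df['inflow_1'].to_numpy(),
                                                  df['inflow_2'].to_numpy(),
                                                  df['outflow'].to_numpy())
If raw speed is the concern, a loop written this way can usually be compiled with numba's @njit decorator; in PySpark a similar approach is often wrapped in a grouped pandas UDF (e.g. applyInPandas), though the row order within each group must still be guaranteed.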

How to use multiple conditions, including selecting on quantile in Python

Imagine the following dataset df:
Row  Population_density  Distance
1    400                 50
2    500                 30
3    300                 40
4    200                 120
5    500                 60
6    1000                50
7    3300                30
8    500                 90
9    700                 100
10   1000                110
11   900                 200
12   850                 30
How can I make a new dummy column that contains a 1 when df['Population_density'] is above the third quartile (the 75th percentile) AND df['Distance'] is < 100, and a 0 for the remainder of the data? Consequently, rows 6 and 7 should have a 1 while the other rows should have a 0.
Creating a dummy variable with only one criterion is fairly easy. For instance, the following works for creating a new dummy variable that contains a 1 when Distance is < 100 and a 0 otherwise: df['Distance_Below_100'] = np.where(df['Distance'] < 100, 1, 0). However, I do not know how to combine conditions when one of them involves a quantile selection (in this case, the upper 25% of the variable Population_density).
import pandas as pd
# assign column data as lists
data = {'Row': range(1, 13),
        'Population_density': [400, 500, 300, 200, 500, 1000, 3300, 500, 700, 1000, 900, 850],
        'Distance': [50, 30, 40, 120, 60, 50, 30, 90, 100, 110, 200, 30]}
# Create DataFrame
df = pd.DataFrame(data)
You can use & or | to join the conditions:
import numpy as np
df['Distance_Below_100'] = np.where(
    df['Population_density'].gt(df['Population_density'].quantile(0.75)) & df['Distance'].lt(100),
    1, 0)
print(df)
Row Population_density Distance Distance_Below_100
0 1 400 50 0
1 2 500 30 0
2 3 300 40 0
3 4 200 120 0
4 5 500 60 0
5 6 1000 50 1
6 7 3300 30 1
7 8 500 90 0
8 9 700 100 0
9 10 1000 110 0
10 11 900 200 0
11 12 850 30 0
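An equivalent way, if you prefer to skip np.where, is to build the boolean mask first and cast it to int; this is just a stylistic alternative with the same result:
q3 = df['Population_density'].quantile(0.75)          # third quartile (75th percentile)
df['Distance_Below_100'] = ((df['Population_density'] > q3) & (df['Distance'] < 100)).astype(int)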
Hey, to apply a function to a DataFrame I recommend using a lambda.
For example, say this is your function:
def myFunction(value):
    pass
To create a new column 'new_column' from it, where pick_cell is the column whose cell values you want to pass to the function (note the axis=1 so the lambda receives a row):
df['new_column'] = df.apply(lambda x: myFunction(x.pick_cell), axis=1)
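Applied to the df above, a concrete (hypothetical) example would be doubling each row's Distance; the column name Distance_doubled is made up for illustration, and for a simple case like this the vectorized form df['Distance'] * 2 is both shorter and much faster than apply:
df['Distance_doubled'] = df.apply(lambda row: row.Distance * 2, axis=1)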

my data cleaning script is slow, any ideas on how to improve?

I have data (CSV format) where the first column is an epoch timestamp (strictly increasing) and the other columns are cumulative values (increasing or staying equal).
A sample is below:
import pandas as pd
df = pd.DataFrame([[1515288240, 100, 50, 90, 70],
                   [1515288241, 101, 60, 95, 75],
                   [1515288242, 110, 70, 100, 80],
                   [1515288239, 110, 70, 110, 85],
                   [1515288241, 110, 75, 110, 85],
                   [1515288243, 110, 70, 110, 85]],
                  columns=['UNIX_TS', 'A', 'B', 'C', 'D'])
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
3 1515288239 110 70 110 85
4 1515288241 110 75 110 85
5 1515288243 110 70 110 85
def clean(df, column_name, equl):
    i = 0
    while df.shape[0] - 2 >= i:
        if df[column_name].iloc[i] > df[column_name].iloc[i + 1]:
            df.drop(df[column_name].iloc[[i + 1]].index, inplace=True)
            continue
        elif df[column_name].iloc[i] == df[column_name].iloc[i + 1] and equl == 1:
            df.drop(df[column_name].iloc[[i + 1]].index, inplace=True)
            continue
        i += 1

clean(df, 'UNIX_TS', 1)
for col in df.columns[1:]:
    clean(df, col, 0)
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
My script works as intended, but it's too slow; does anybody have ideas on how to improve its speed?
I wrote the script to remove all the invalid data based on two rules:
UNIX_TS must be strictly increasing (because it is a time, it cannot flow backwards or pause);
the other columns are non-decreasing: for example, if one row is 100, the next row can be >= 100 but not less.
Based on these rules, indexes 3 and 4 are invalid because their UNIX_TS values (1515288239 and 1515288241) are not greater than the timestamp at index 2 (1515288242).
Index 5 is invalid because the value of B decreased.
IIUC, you can use:
cols = ['A', 'B', 'C', 'D']
mask_1 = df['UNIX_TS'] > df['UNIX_TS'].cummax().shift().fillna(0)
mask_2 = (df[cols] >= df[cols].cummax().shift().fillna(0)).all(1)
df[mask_1 & mask_2]
Outputs
UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
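To keep the cleaned frame rather than just display it, assign the masked result back; the reset_index call is optional and only tidies the index:
df = df[mask_1 & mask_2].reset_index(drop=True)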

merge rows pandas dataframe based on condition

Hi, I have a dataframe df containing a set of events (rows).
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 7, 10],
                        [10, 22, 1, 30],
                        [30, 42, 2, 10],
                        [100, 142, 22, 1],
                        [143, 152, 2, 10],
                        [160, 162, 12, 11]],
                  columns=['Start', 'End', 'Value1', 'Value2'])
df
Out[15]:
Start End Value1 Value2
0 1 2 7 10
1 10 22 1 30
2 30 42 2 10
3 100 142 22 1
4 143 152 2 10
5 160 162 12 11
If two (or more) consecutive events are <= 10 apart, I would like to merge them (i.e. use the start of the first event and the end of the last, and sum the values in Value1 and Value2).
In the example above df becomes:
df
Out[15]:
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22
That's totally possible:
df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start':min, 'End':max, 'Value1':sum, 'Value2': sum})
Explanation:
start_end_differences = df.Start - df.End.shift(1)  # shift moves the series down one row
threshold_selector = start_end_differences > 10  # boolean series; True marks points where the gap is more than 10
groups = threshold_selector.cumsum()  # cumulative sum of the Trues creates an integer group id starting from 0
df.groupby(groups).agg({'Start': min})  # the aggregation is self-explanatory
Here is a generalized solution that remains agnostic of the other columns:
cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22
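If the value columns are known up front, named aggregation (available since pandas 0.25) produces the same result as the first answer without a separate join; a minimal sketch reusing grps from above:
out = df.groupby(grps).agg(Start=('Start', 'min'),
                           End=('End', 'max'),
                           Value1=('Value1', 'sum'),
                           Value2=('Value2', 'sum')).reset_index(drop=True)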

Adding column to pandas DataFrame containing list of other columns' values

I have a DataFrame to which I need to add a column. The column needs to be a list of two values:
Current table:
lat long other_value
0 50 50 x
1 60 50 y
2 70 50 z
3 80 50 a
Needed table:
lat long other_value new_column
0 50 50 x [50, 50]
1 60 50 y [60, 50]
2 70 50 z [70, 50]
3 80 50 a [80, 50]
I know this is super simple, but the documentation doesn't seem to cover this (at least not apparently).
One way is to use tolist():
>>> df['new_column'] = df[['lat', 'long']].values.tolist()
>>> df
lat long other_value new_column
0 50 50 x [50, 50]
1 60 50 y [60, 50]
2 70 50 z [70, 50]
3 80 50 a [80, 50]
In general though, I'd be very wary of using lists in DataFrames since they're more difficult to manipulate in columns and you don't get many of the performance benefits that come with integers/floats.
You could use zip:
df['new_column'] = list(zip(df.lat, df.long))
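Note that zip produces tuples rather than lists; if actual list objects are needed (to match the tolist() output exactly), a small comprehension does it:
df['new_column'] = [list(pair) for pair in zip(df.lat, df.long)]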
