I have two DataFrames as listed below:
plusMinusOne = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660],
                             1: [100, 90, 80, 14, 15, 16, 2]}, index=[3, 4, 5, 12, 13, 14, 27])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                     1: [10, 20, 30, 40, 50, 60, 70]})
I have to subtract bias's column 1 values from plusMinusOne's column 1 by matching bias's column 0 against plusMinusOne's column 0.
As 2459650 is not present in the bias DataFrame, I have to check for 2459651/2459649 in bias and subtract either one's value. I have to look 1 above or 1 below in bias and then subtract the value, for every row.
I was trying it like this:
for i in plusMinusOne[0]:
    if i + 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0] - 1).set_index('key')[1]), fill_value=0)
        break
    elif i - 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0] + 1).set_index('key')[1]), fill_value=0)
        break
My expected output is:
plusMinusOne
2459650 90
2459650 80
2459650 70
2459654 -26
2459654 -25
2459654 -24
2459660 -68
Row-wise solution using apply:
def bias_diff(row):
    value = 0
    if (row[0] == bias[0]).any():
        value = row[1] - bias[row[0] == bias[0]].iloc[0, 1]
    elif ((row[0] + 1) == bias[0]).any():
        value = row[1] - bias[(row[0] + 1) == bias[0]].iloc[0, 1]
    else:
        value = row[1] - bias[(row[0] - 1) == bias[0]].iloc[0, 1]
    return value

plusMinusOne[1] = plusMinusOne.apply(bias_diff, axis=1)
print(plusMinusOne)
Output
0 1
3 2459650 90
4 2459650 80
5 2459650 70
12 2459654 -26
13 2459654 -25
14 2459654 -24
27 2459660 -68
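If a truly vectorized version is wanted, here is a minimal sketch, assuming the same priority as the apply version above (exact match first, then +1, then -1): build one lookup Series and map it.

import pandas as pd

plusMinusOne = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660],
                             1: [100, 90, 80, 14, 15, 16, 2]}, index=[3, 4, 5, 12, 13, 14, 27])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                     1: [10, 20, 30, 40, 50, 60, 70]})

# One lookup Series covering exact, +1 and -1 matches; combine_first
# keeps the first non-missing value, so exact beats +1, which beats -1.
exact = bias.set_index(0)[1]
plus1 = bias.assign(key=bias[0] - 1).set_index('key')[1]   # date d matched to bias date d+1
minus1 = bias.assign(key=bias[0] + 1).set_index('key')[1]  # date d matched to bias date d-1
lookup = exact.combine_first(plus1).combine_first(minus1)

plusMinusOne[1] = plusMinusOne[1] - plusMinusOne[0].map(lookup)
print(plusMinusOne)  # same output as the apply version above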
This is not efficient code, but it works for your case. It will also work for whichever difference you want, by changing the diff variable's value.
import pandas as pd

df1 = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660],
                    1: [100, 90, 80, 14, 15, 16, 2]})
df2 = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                    1: [10, 20, 30, 40, 50, 60, 70]})
diff = 3

def data_process(df1, df2, i, diff):
    # return df1's value minus the matching df2 value, or None if no match
    data = None
    for j in range(len(df2)):
        if df1[0][i] == df2[0][j]:
            data = df1[1][i] - df2[1][j]
        elif df1[0][i] + diff == df2[0][j]:
            data = df1[1][i] - df2[1][j]
        elif df1[0][i] - diff == df2[0][j]:
            data = df1[1][i] - df2[1][j]
    return data

processed_data = []
for i in range(len(df1)):
    result = data_process(df1, df2, i, diff)
    # keep the original value when there is no +/- diff match
    processed_data.append(df1[1][i] if result is None else result)

df1[2] = processed_data
print(df1[[0, 2]])
The output dataframe for diff 1 is
0 2
0 2459650 90
1 2459650 80
2 2459650 70
3 2459654 -26
4 2459654 -25
5 2459654 -24
6 2459660 -68
The output dataframe for diff 3 is
0 2
0 2459650 70.0
1 2459650 60.0
2 2459650 50.0
3 2459654 4.0
4 2459654 5.0
5 2459654 6.0
6 2459660 2
2459660 has no +3 or -3 combinational value (i.e. 2459657 or 2459663) in the second dataframe, so I return the value as-is. Otherwise it would return a NaN value instead of 2.
Related
I have two dataframes (missingDate and bias) and one Series (missingDateUnique).
missingDateUnique = pd.Series({0: 2459650, 9: 2459654})
missingDate = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654],
                            1: [10, 10, 10, 14, 14, 14]}, index=[0, 1, 2, 9, 10, 11])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                     1: [11, 12, 13, 15, 16, 18, 19]})
As the missingDateUnique values are not in the bias DataFrame, I have to check for i+1 in the bias dataframe and subtract missingDate's value from bias's column 1 value.
I was doing it like this:
for i in missingDateUnique:
    if i + 1 in bias[0].values:
        missingDate[1] = missingDate[1].sub(missingDate[0].map(bias.set_index(0)[1]), fill_value=0)
The result should be like this:
In missingDate's 1st row, instead of 10 it should be 11 - 10 = 1
Full output:
2459650 1
2459650 1
2459650 1
2459654 1
2459654 1
2459654 1
For example, for 2459654 in missingDate I have to check for both 2459655 and 2459653 in bias and subtract using either one. If neither 2459655 nor 2459653 is present, then I have to check for 2459656 and 2459652, and so on.
You can subtract 1 from bias column 0 and map it to missingDate column 0:
missingDate[2] = missingDate[0].map(bias.assign(key=bias[0]-1).set_index('key')[1]) - missingDate[1]
print(missingDate)
0 1 2
0 2459650 10 1
1 2459650 10 1
2 2459650 10 1
9 2459654 14 1
10 2459654 14 1
11 2459654 14 1
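The question's widening search (check ±1, then ±2, and so on) is not covered by the single-shift map above. A possible sketch for that case uses pd.merge_asof with direction='nearest', which pairs each date with the closest bias date; since the question allows either neighbour when both exist, the tie-breaking choice does not matter here.

import pandas as pd

missingDate = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654],
                            1: [10, 10, 10, 14, 14, 14]}, index=[0, 1, 2, 9, 10, 11])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                     1: [11, 12, 13, 15, 16, 18, 19]})

# merge_asof needs both sides sorted by the join key; rename the integer
# column labels first so they do not clash in the merge
left = missingDate.rename(columns={0: 'date', 1: 'val'}).sort_values('date')
right = bias.rename(columns={0: 'date', 1: 'bias_val'}).sort_values('date')

merged = pd.merge_asof(left, right, on='date', direction='nearest')
merged['result'] = merged['bias_val'] - merged['val']  # e.g. 11 - 10 = 1
print(merged)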
I have a dataframe which contains some negative and positive values
I've used the following code to get pct_change on row values:
df_gp1 = df_gp1.pct_change(periods=4, axis=1) * 100
and now I want to assign a specific number, depending on how the value changes from negative to positive or vice versa. For example, if the value turns from:
positive to negative: return -100
negative to positive: return 100
negative to negative: return -100
positive to positive: ordinary pct_change
for example, my current dataframe could look like the following:

DATA   D-4   D-3   D-2   D-1   D-0
A      -20   -15   -13   -10    -5
B      -30   -15   -10    10    25
C       40    25    30    41    30
D       25    25    10    15   -10
I want a new output (dataframe) that gives me the following return:

DATA   D-0
A     -100
B      100
C      -25
D     -100
as you can see, the 4th period must provide pct_change (i.e. D-0 / D-4), but if it stays negative, return -100;
if it turns from positive to negative, still return -100;
if it turns from negative to positive, return 100;
if it's a change from a positive value to another positive value, then apply pct_change.
My original dataframe is about 4000 rows and 300 columns, so my desired output will have 4000 rows and 296 columns (since it eliminates columns D-4, D-3, D-2, D-1).
I tried to make a condition list and a choice list and use the np.select method, but I just don't know how to apply it across the whole dataframe and create a new one that returns the percentage changes.
Any help is deeply appreciated.
Use:
import numpy as np

#convert column DATA to index if necessary
df = df.set_index('DATA')
#compare for less than 0
m1 = df.lt(0)
#compare values shifted by 4 columns for less than 0
m2 = df.shift(4, axis=1).lt(0)
#pass to np.select
arr = np.select([m1, ~m1 & m2, ~m1 & ~m2],
                [-100, 100, df.pct_change(periods=4, axis=1) * 100])
#create DataFrame, remove first 4 columns
df = pd.DataFrame(arr, index=df.index, columns=df.columns).iloc[:, 4:].reset_index()
print (df)
DATA D-0
0 A -100.0
1 B 100.0
2 C -25.0
3 D -100.0
Given:
D-4 D-3 D-2 D-1 D-0
DATA
A -20 -15 -13 -10 -5
B -30 -15 -10 10 25
C 40 25 30 41 30
D 25 25 10 15 -10
Doing:
def stuff(row):
    if row['D-0'] < 0:
        return -100
    elif row['D-4'] < 0:
        return 100
    else:
        return (row.pct_change(periods=4) * 100)['D-0']

print(df.apply(stuff, axis=1))
Output:
A -100.0
B 100.0
C -25.0
D -100.0
dtype: float64
I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try 1:
g.columns = ['angle','Distance','count','Percentage Missed'] - result: no change in the column names
try 2:
When I print the columns using print(g.columns), it ends with the error AttributeError: 'Series' object has no attribute 'columns'
I want to rename column 0 as count and add a new column to the dataframe g, percent missed, which is calculated as 100 minus the value in column 0.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How to modify the code? I mean, instead of value_counts, is there any other function that can give the expected output?
2. How to get the expected output with the current method?
EDIT 1 (exceptional case)
data:

angle  distance  velocity
    0       124        -3
   50        24       -25
   50        34        25
expected output (count is calculated based on distance):

angle  distance  velocity  count  percent missed
    0       124        -3      1              99
   50        24       -25      1              99
   50        34        25      1              99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter renames column 0 to count. Then subtract from 100 into a new column with Series.rsub, which subtracts from the right side, like 100 - df['count']:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .set_axis(['angle', 'Distance', 'count'], axis=1)
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index())
df.columns = ['angle', 'Distance', 'count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
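For the EDIT 1 case (count calculated based on distance while keeping every column), a minimal sketch using groupby.transform, assuming the column names from the edit:

import pandas as pd

df = pd.DataFrame({'angle': [0, 50, 50],
                   'distance': [124, 24, 34],
                   'velocity': [-3, -25, 25]})

# count occurrences of each distance without collapsing rows
df['count'] = df.groupby('distance')['distance'].transform('size')
df['percent missed'] = df['count'].rsub(100)
print(df)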
I want this SAS format:

proc format;
value RNG
    low - 24   = '1'
    24< - 35   = '2'
    35< - 44   = '3'
    44< - high = '4';
I need this in python pandas.
If you are looking for an equivalent of the mapping function, you can use something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=5), columns=['score'])
print(df)
output:
score
0 73
1 90
2 83
3 40
4 76
Now let's apply the binning function to the score column in the dataframe and create a new column in the same dataframe.
def format_fn(x):
    # SAS ranges include the upper bound: low-24, 24<-35, 35<-44, 44<-high
    if x <= 24:
        return '1'
    elif x <= 35:
        return '2'
    elif x <= 44:
        return '3'
    else:
        return '4'

df['binned_score'] = df['score'].apply(format_fn)
print(df)
output:
score binned_score
0 73 4
1 90 4
2 83 4
3 40 3
4 76 4
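As an aside, pd.cut can express the same binning declaratively. A sketch assuming the SAS ranges are upper-inclusive (low-24, 24<-35, 35<-44, 44<-high):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=5), columns=['score'])

# right=True (the default) makes each bin include its upper edge,
# matching the SAS "24< - 35" style of range
df['binned_score'] = pd.cut(df['score'],
                            bins=[-np.inf, 24, 35, 44, np.inf],
                            labels=['1', '2', '3', '4'])
print(df)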
I'm working with a data frame like the one below, but bigger and with more zones. I am trying to sum the values of the rows by their zone names. The total sum of the R or C zones goes in the total column, while the total sum of the M zones goes in total1.
Input (total and total1 are the desired output columns):
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 total total1
1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
3 C2 40 4 C4 60 6 0 6 0 10 0
3 C1 100 8 0 0 0 0 100 0 8 0
5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
You can use filter to get DataFrames of the Zone and Value columns:
z = df.filter(like='Zone')
v = df.filter(like='Value')
Then create boolean DataFrames with str.contains via apply, if you want to check substrings:
m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))
#for check strings
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
Finally, filter v with where and sum per row:
df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)
print (df)
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 t t1
0 1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
2 3 C2 40 4 C4 60 6 0 6 0 10 0
3 3 C1 100 8 0 0 0 0 100 0 8 0
4 5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
Solution 1 (simpler code, but slower and less flexible)
total = []
total1 = []
for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    if "R2" in temp:
        total.append(temp[temp.index("R2") + 1])
    else:
        total.append(0)
    if ("C1" in temp) and ("C4" in temp):
        total1.append(temp[temp.index("C1") + 1] + temp[temp.index("C4") + 1])
    else:
        total1.append(0)
df["Total"] = total
df["Total1"] = total1
Solution 2 (faster than Solution 1 and easier to customize, but possibly memory intensive)
# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]

# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2  # "OR": 1, "AND": 2

original = df.copy()

# bucket1 check
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone) + 1]] = 0
original['Total'] = df[vals].sum(axis=1)
df = original.copy()

# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone) + 1]] = 0
df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)
When I expanded the original dataframe to 100k rows, Solution 1 took 11.4 s ± 82.1 ms per loop, while Solution 2 took 3.53 s ± 29.8 ms per loop. The difference is because Solution 2 does not loop over rows.
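For completeness, a hedged sketch of the same bucket-plus-threshold logic without any Python-level row loop, along the lines of the filter/where answer above (bucket1, bucket2, and thresh as defined in Solution 2; run it against the original df, since Solution 2 mutates its copy):

z = df.filter(like='Zone')
v = df.filter(like='Value')

# bucket1: sum Values wherever the paired Zone is in bucket1
m1 = z.isin(bucket1)
df['Total'] = v.where(m1.values).sum(axis=1).astype(int)

# bucket2: same masking, but zero out rows with fewer than `thresh` bucket2 zones
m2 = z.isin(bucket2)
total1 = v.where(m2.values).sum(axis=1)
df['Total1'] = total1.where(m2.sum(axis=1) >= thresh, 0).astype(int)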