I'm trying to create a new column based on the columns that are present in the dataframe. This is the sample data:
ORG CHAIN_NBR SEQ_NBR INT_STATUS BLOCK_CODE_1 BLOCK_CODE_2
0 523 1 0 A C A
1 523 2 1 I A D
2 521 3 1 A H F
3 513 4 1 D H Q
4 513 5 1 8 I M
This is the code that I'm executing:
df=pd.read_csv("rtl_one", sep="\x01")
def risk():
    if df['INT_STATUS'].isin(['B','C','F','H','P','R','T','X','Z','8','9']):
        df['rcut'] = '01'
    elif df['BLOCK_CODE_1'].isin(['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']):
        df['rcut'] = '02'
    elif df["BLOCK_CODE_2"].isin(['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']):
        df['rcut'] == '03'
    else:
        df['rcut'] = '00'
risk()
Output data should look like this:
ORG CHAIN_NBR SEQ_NBR INT_STATUS BLOCK_CODE_1 BLOCK_CODE_2 rcut
0 523 1 0 A C A 02
1 523 2 1 I A D 02
2 521 3 1 A H F 03
3 513 4 1 D H Q 00
4 513 5 1 8 I M 01
Use iterrows and store the results in a list that you can then add to the dataframe as a column:
rcut = []
for i, row in df.iterrows():
    if row['INT_STATUS'] in ['B','C','F','H','P','R','T','X','Z','8','9']:
        rcut.append('01')
    elif row['BLOCK_CODE_1'] in ['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']:
        rcut.append('02')
    elif row['BLOCK_CODE_1'] in ['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']:
        rcut.append('03')
    else:
        rcut.append('00')
df['rcut'] = rcut
(Note: your 2nd and 3rd conditions are the same; I've reused your code here, so you would have to change that.)
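For reference, a vectorized sketch of the same precedence is shown below. This is not from the original answer; it assumes the column names from the question and uses numpy.select, which picks the first matching choice per row, just like the if/elif chain:
import numpy as np

status_codes = ['B','C','F','H','P','R','T','X','Z','8','9']
block_codes = list('ABCDEFGIJKLMNOPRUWYZ')

conditions = [
    df['INT_STATUS'].isin(status_codes),    # checked first, like the if
    df['BLOCK_CODE_1'].isin(block_codes),   # then the first elif
    df['BLOCK_CODE_2'].isin(block_codes),   # then the second elif
]
df['rcut'] = np.select(conditions, ['01', '02', '03'], default='00')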
Use .index to add a new column, df['rcut'] = df.index,
and then use df.insert(index, 'rcut', value) inside your if/elif conditions.
I need some input from you. The idea is that I would like to see how long (in rows) it takes before you can see
a new value in column SUB_B1, and
a new value in SUB_B2,
i.e. how many steps there are between
SUB_A1 and SUB_B1, and
between SUB_A2 and SUB_B2.
I have structured the data something like this (I sort the index in descending order by the result column; after that I separate indexes A and B and place them in new columns):
df.sort_values(['A','result'], ascending=[True,False]).set_index(['A','B'])
                 result  SUB_A1  SUB_A2  SUB_B1  SUB_B2
A      B
10_125 10_173  0.903257      10     125      10     173
       10_332  0.847333      10     125      10     332
       10_243  0.842802      10     125      10     243
       10_522  0.836335      10     125      10     522
       58_941  0.810760      10     125      58     941
...                 ...     ...     ...     ...     ...
10_173 10_125  0.903257      10     173      10     125
       58_941  0.847333      10     173      58     941
       1_941   0.842802      10     173       1     941
       96_512  0.836335      10     173      96     512
       10_513  0.810760      10     173      10     513
This is what I have done so far (edit: I think I need to iterate over values[]; however, I haven't managed to loop beyond the first rows yet...):
def func(group):
    if group.SUB_A1.values[0] == group.SUB_B1.values[0]:
        group.R1.values[0] = 1
    else:
        group.R1.values[0] = 0
    if group.SUB_A1.values[0] == group.SUB_B1.values[1] and group.R1.values[0] == 1:
        group.R1.values[1] = 2
    else:
        group.R1.values[1] = 0

df['R1'] = 0
df.groupby('A').apply(func)
Expected outcome:
                 result  SUB_B1  SUB_B2  R1  R2
A      B
10_125 10_173  0.903257      10     173   1   0
       10_332  0.847333      10     332   2   0
       10_243  0.842802      10     243   3   0
       10_522  0.836335      10     522   4   0
       58_941  0.810760      58     941   0   0
...                 ...     ...     ...  ..  ..
Are you looking for something like this?
Sample dataframe:
df = pd.DataFrame(
{"SUB_A": [1, -1, -2, 3, 3, 4, 3, 6, 6, 6],
"SUB_B": [1, 2, 3, 3, 3, 3, 4, 6, 6, 6]},
index=pd.MultiIndex.from_product([range(1, 3), range(1, 6)], names=("A", "B"))
)
     SUB_A  SUB_B
A B
1 1      1      1
  2     -1      2
  3     -2      3
  4      3      3
  5      3      3
2 1      4      3
  2      3      4
  3      6      6
  4      6      6
  5      6      6
Now this
equal = df.SUB_A == df.SUB_B
df["R"] = equal.groupby(equal.groupby("A").diff().fillna(True).cumsum()).cumsum()
leads to (the inner groupby/diff/cumsum builds an id for each run of consecutive equal or unequal rows within an A group; the outer cumsum then counts the matches within each run):
     SUB_A  SUB_B  R
A B
1 1      1      1  1
  2     -1      2  0
  3     -2      3  0
  4      3      3  1
  5      3      3  2
2 1      4      3  0
  2      3      4  0
  3      6      6  1
  4      6      6  2
  5      6      6  3
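Adapted to the columns in the question, the same trick would look roughly like this (a sketch; it assumes the first index level is named A and that SUB_A1/SUB_A2 are constant within each A group, as in the sample, and it reproduces the R1/R2 values shown in the expected outcome above):
# R1: count consecutive rows where SUB_A1 matches SUB_B1 within each A group
equal1 = df.SUB_A1 == df.SUB_B1
runs1 = equal1.groupby("A").diff().fillna(True).cumsum()  # id per run of matches/mismatches
df["R1"] = equal1.groupby(runs1).cumsum()

# R2: same idea for the second pair of columns
equal2 = df.SUB_A2 == df.SUB_B2
runs2 = equal2.groupby("A").diff().fillna(True).cumsum()
df["R2"] = equal2.groupby(runs2).cumsum()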
Try using pandas.DataFrame.iterrows and pandas.DataFrame.shift.
You can iterate through the dataframe and compare the current row with the previous one, then add some condition:
df['SUB_A2_last'] = df['SUB_A2'].shift()
count = 0
# Fill column with zeros
df['count_series'] = 0

for index, row in df.iterrows():
    subA = row['SUB_A2']
    subA_last = row['SUB_A2_last']
    if subA == subA_last:
        count += 1
    else:
        count = 0
    df.loc[index, 'count_series'] = count
Then repeat for the B column. It is possible to get a better approach using pandas.DataFrame.apply and a custom function.
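For instance, a vectorized sketch of the same counter (my own variant built on shift/cumsum rather than apply; the SUB_A2 column name is taken from the snippet above):
s = df['SUB_A2']
runs = s.ne(s.shift()).cumsum()                  # new run id whenever the value changes
df['count_series'] = s.groupby(runs).cumcount()  # 0, 1, 2, ... within each run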
Puh! Super! Thanks for the input, you guys.
def func(group):
    for each in range(len(group)):
        if group.SUB_A1.values[0] == group.SUB_B1.values[each]:
            group.R1.values[each] = each + 1
            continue
        elif group.SUB_A1.values[0] == group.SUB_B1.values[each] and group.R1.values[each] == each + 1:
            group.R1.values[each] = each + 1
        else:
            group.R1.values[each] = 0
    return group

df['R1'] = 0
df.groupby('A').apply(func)
I have a dataframe with 2 columns in it, Column_A and Column_B, and an array of alphabets from A to P, which are as follows.
df = pd.DataFrame({
'Column_A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'Column_B':[]
})
The array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
Expected output is
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column_B changes as soon as the value in Column_A is 1, and the new value is taken from the given array 'label'.
I have tried using this for loop:
for row in df.index:
    try:
        if df.loc[row,'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row+4]
            print(label[row])
        else:
            df.ColumnB.fillna('ffill')
    except IndexError:
        row = (row+4)%4
        df.at[row, 'Coumn_B'] = label[row]
I also want to loop back to the start if it reaches the last value in the 'label' array.
A solution that should do the trick looks like this:
label=list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'Column_B': label
})
Not exactly sure what you intended with the fillna; I think you don't need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
    df.at[row, 'Column_B'] = label[lookup]
    print(label[row])
I also avoid exception handling in this case, because the "index overflow" can be handled without it.
Btw., if you have a large dataframe you can probably make the code faster by eliminating one lookup (but you'd need to verify whether it really runs faster). The solution would then look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
    df.at[row, 'Column_B'] = label[lookup]
    print(label[row])
Option 1
cond1 = df.Column_A == 1
cond2 = df.index == 0
mappr = lambda x: label[x]
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
    Column_A Column_B
0          0        A
1          0        A
2          0        A
3          0        A
4          0        A
5          1        F
6          0        F
7          0        F
8          0        F
9          0        F
10         1        K
11         0        K
12         0        K
13         0        K
14         0        K
15         1        P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))
b = df.Column_A.to_numpy().cumsum()
c = np.array(label)
df.assign(Column_B=c[a[b]])
    Column_A Column_B
0          0        A
1          0        A
2          0        A
3          0        A
4          0        A
5          1        F
6          0        F
7          0        F
8          0        F
9          0        F
10         1        K
11         0        K
12         0        K
13         0        K
14         0        K
15         1        P
Using groupby with transform, then map (group the rows by the running count of 1s in Column_A, take the first index of each group, and map that index to its label):
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
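To write the result back as Column_B, you could assign it like this (a small addition on my part; the chained expression itself is unchanged from the line above):
df['Column_B'] = (df.reset_index()
                    .groupby(df.Column_A.eq(1).cumsum())['index']
                    .transform('first')
                    .map(dict(enumerate(label)))
                    .to_numpy())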
I'm working with a data frame like this, but bigger and with more zones. I am trying to sum the values of the rows by their zone names. The total sum of the R or C zones goes in the total column, while the total sum of the M zones goes in total1.
Input:
total and total1 are the desired output columns.
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 total total1
1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
3 C2 40 4 C4 60 6 0 6 0 10 0
3 C1 100 8 0 0 0 0 100 0 8 0
5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
You can use filter to get separate DataFrames of the Zone and Value columns:
z = df.filter(like='Zone')
v = df.filter(like='Value')
Then create boolean DataFrames with str.contains via apply, if you want to check substrings:
m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))
#for check strings
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
Last, filter v with where and sum per row:
df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)
print (df)
ID Zone1 CHC1 Value1 Zone2 CHC2 Value2 Zone3 CHC3 Value3 t t1
0 1 R5B 100 10 C2 0 20 R10A 2 5 35 0
1 1 C2 95 20 M2-6 5 6 R5B 7 3 23 6
2 3 C2 40 4 C4 60 6 0 6 0 10 0
3 3 C1 100 8 0 0 0 0 100 0 8 0
4 5 M1-5 10 6 M2-6 86 15 0 0 0 0 21
Solution 1 (simpler code, but slower and less flexible)
total = []
total1 = []
for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    if "R2" in temp:
        total.append(temp[temp.index("R2")+1])
    else:
        total.append(0)
    if ("C1" in temp) & ("C4" in temp):
        total1.append(temp[temp.index("C1")+1] + temp[temp.index("C4")+1])
    else:
        total1.append(0)
df["Total"] = total
df["Total1"] = total1
Solution 2 (faster than Solution 1 and easier to customize, but possibly memory-intensive)
# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]
# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2 # "OR": 1, "AND": 2
original = df.copy()
# bucket1 check
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone)+1]] = 0
original['Total'] = df[vals].sum(axis=1)
df = original.copy()
# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone)+1]] = 0
df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)
When I expanded the original dataframe to 100k rows, solution 1 took 11.4 s ± 82.1 ms per loop, while solution 2 took 3.53 s ± 29.8 ms per loop. The difference is because solution 2 does not for-loop over the rows.
I have extracted some data in pandas format from a SQL server. The structure looks like this:
df = pd.DataFrame({'Day':(1,2,3,4,1,2,3,4),'State':('A','A','A','A','B','B','B','B'),'Direction':('N','S','N','S','N','S','N','S'),'values':(12,34,22,37,14,16,23,43)})
>>> df
Day Direction State values
0 1 N A 12
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Now I want to replace every value that has State == A with itself plus the value that has the same Day and the same Direction but State == B. For example, like this:
df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'] = df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'].values + df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'B'),'values'].values
>>> df
Day Direction State values
0 1 N A 26
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Notice the first row's value has been changed from 12 to 26 (12 + 14).
Since the values come from different rows, it seems difficult to use something like combine_first.
Right now I use two loops (over 'Day' and 'Direction') and the assignment above, which is extremely slow when the dataframe gets big. Do you have a smart and efficient way of doing this?
You can first define a function that adds the B value to the A value within the same group, then apply this function to each group.
def f(x):
    x.loc[x.State=='A','values'] += x.loc[x.State=='B','values'].iloc[0]
    return x
df.groupby(['Day','Direction']).apply(f)
Out[94]:
Day Direction State values
0 1 N A 26
1 2 S A 50
2 3 N A 45
3 4 S A 80
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
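If apply turns out to be slow on a big frame, here is a vectorized sketch of the same idea (my variant; it assumes every Day/Direction group contains exactly one B row):
# Broadcast each group's B value to every row of that group...
b_vals = (df['values'].where(df['State'] == 'B')
            .groupby([df['Day'], df['Direction']])
            .transform('first'))

# ...then add it to the A rows only, leaving the B rows untouched.
df['values'] = df['values'] + b_vals.where(df['State'] == 'A', 0).astype(int)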
I have some data that I have converted into a pandas dataframe:
import pandas as pd
d = [
(1,70399,0.988375133622),
(1,33919,0.981573492596),
(1,62461,0.981426807114),
(579,1,0.983018778374),
(745,1,0.995580488899),
(834,1,0.980942505189)
]
df_new = pd.DataFrame(e, columns=['source_target']).sort_values(['source_target'], ascending=[True])
and I need to build a series for mapping the source and target columns onto new values.
e = []
for x in d:
    e.append(x[0])
    e.append(x[1])
e = list(set(e))

df_new = pd.DataFrame(e, columns=['source_target'])
df_new.source_target = (df_new.source_target.diff() != 0).cumsum() - 1
new_ser = pd.Series(df_new.source_target.values, index=new_source_old).drop_duplicates()
so I get this series:
source_target
1 0
579 1
745 2
834 3
33919 4
62461 5
70399 6
dtype: int64
I have tried to change the dataframe df_beda based on the new_ser series using:
df_beda.target = df_beda.target.mask(df_beda.target.isin(new_ser), df_beda.target.map(new_ser)).astype(int)
df_beda.source = df_beda.source.mask(df_beda.source.isin(new_ser), df_beda.source.map(new_ser)).astype(int)
but the result is:
source target weight
0 0 70399 0.988375
1 0 33919 0.981573
2 0 62461 0.981427
3 579 0 0.983019
4 745 0 0.995580
5 834 0 0.980943
It's wrong; the ideal result is:
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
Maybe someone can help me see where my mistake is. Thanks!
If the order doesn't matter, you can do the following. Avoid for loops unless they're absolutely necessary.
import numpy as np

uniq_vals = np.unique(df_beda[['source','target']])
map_dict = dict(zip(uniq_vals, range(len(uniq_vals))))
df_beda[['source','target']] = df_beda[['source','target']].replace(map_dict)
print(df_beda)
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
If you want to roll back, you can create an inverse map from the original one, because it is guaranteed to be a 1-to-1 mapping.
inverse_map = {v: k for k, v in map_dict.items()}
df_beda[['source','target']] = df_beda[['source','target']].replace(inverse_map)
print(df_beda)
source target weight
0 1 70399 0.988375
1 1 33919 0.981573
2 1 62461 0.981427
3 579 1 0.983019
4 745 1 0.995580
5 834 1 0.980943
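For reference, an equivalent sketch using np.unique with return_inverse (my addition, assuming the same df_beda columns), which builds the mapping and applies it in one pass:
import numpy as np

vals = df_beda[['source', 'target']].to_numpy()
uniq, inv = np.unique(vals.ravel(), return_inverse=True)  # sorted unique values and their flat positions
df_beda[['source', 'target']] = inv.reshape(vals.shape)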