Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'], ['a', 4, 5, 6, 'F'],
                                 ['b', 7, 8, 9, 'T'], ['b', 10, 11, 12, 'T'],
                                 ['b', 13, 14, 15, 'F']]),
                  columns=['id', 'A', 'B', 'C', 'T/F'])
id A B C T/F
0 a 1 2 3 T
1 a 4 5 6 F
2 b 7 8 9 T
3 b 10 11 12 T
4 b 13 14 15 F
I want to apply a condition on the T/F column that, for each id, pairs every row of that id with each of its T-labeled rows as additional columns.
For example, I need the following result:
id A B C T/F A B C T/F
0 a 1 2 3 T 1 2 3 T
1 a 4 5 6 F 1 2 3 T
2 b 7 8 9 T 7 8 9 T
3 b 10 11 12 T 7 8 9 T
4 b 13 14 15 F 7 8 9 T
5 b 7 8 9 T 10 11 12 T
6 b 10 11 12 T 10 11 12 T
7 b 13 14 15 F 10 11 12 T
Here is my script:
n = np.array(df.groupby('id').size())
m = len(df.groupby('id'))
Cnt = 0
df4 = pd.DataFrame()
for prsnNo in range(m):
    for i in range(n[prsnNo]):
        v = df.iloc[Cnt: Cnt + n[prsnNo], :].groupby('id').cumcount() == i
        df1 = df.iloc[Cnt: Cnt + n[prsnNo], :].where(v)
        temp = df4
        df4 = df.iloc[Cnt: Cnt + n[prsnNo], :].merge(df1, on="id", how="left")
        df4 = pd.concat([temp, df4])
    Cnt += n[prsnNo]
I do not know how to add a condition that checks the value of the T/F column inside my loop. If I add an if condition, it raises an error:
for prsnNo in range(m):
    for i in range(n[prsnNo]):
        if df[df['T/F'] == 'T']:
            v = df.iloc[Cnt: Cnt + n[prsnNo], :].groupby('id').cumcount() == i
            df1 = df.iloc[Cnt: Cnt + n[prsnNo], :].where(v)
            temp = df4
            df4 = df.iloc[Cnt: Cnt + n[prsnNo], :].merge(df1, on="id", how="left")
            df4 = pd.concat([temp, df4])
    Cnt += n[prsnNo]
Thanks,
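A note on the error first: df[df['T/F'] == 'T'] is itself a DataFrame, and Python's if needs a single True/False value, so pandas raises "The truth value of a DataFrame is ambiguous". The usual fix is to filter with a boolean mask rather than testing the frame inside an if; a minimal sketch:

t_rows = df[df['T/F'] == 'T']  # keep only the rows labeled T
print(t_rows)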
If order doesn't matter, you can use groupby + first, and then perform a merge with df and the grouped result.
v = df.groupby(['id', df['T/F'].eq('T').cumsum()])\
.first().reset_index(level=1, drop=True)
df = df.merge(v, left_on='id', right_index=True)
df.columns = df.columns.str.split('_').str[0]
df
id A B C T/F A B C T/F
0 a 1 2 3 T 1 2 3 T
1 a 4 5 6 F 1 2 3 T
2 b 7 8 9 T 7 8 9 T
2 b 7 8 9 T 10 11 12 T
3 b 10 11 12 T 7 8 9 T
3 b 10 11 12 T 10 11 12 T
4 b 13 14 15 F 7 8 9 T
4 b 13 14 15 F 10 11 12 T
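For intuition, the helper grouping key df['T/F'].eq('T').cumsum() (evaluated on the original df, before the merge reassigns it) increments at every T row, so each T row and the F rows after it share one group value, and first then picks out that T row. This assumes each id starts with a T row, as in the example:

print(df['T/F'].eq('T').cumsum().tolist())  # [1, 1, 2, 3, 3]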
I am trying to use a function to create multiple outputs, using multiple columns as inputs. Here's my attempt:
df = pd.DataFrame(np.random.randint(0,10,size=(6, 4)), columns=list('ABCD'))
df.head()
A B C D
0 8 2 5 0
1 9 9 8 6
2 4 0 1 7
3 8 4 0 3
4 5 6 9 9
def some_func(a, b, c):
    return a+b, a+b+c
df['dd'], df['ee'] = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
df.head()
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
The outputs are all 0 for the first new column and all 1 for the second. I am interested in the correct solution, but I am also curious why my code behaved this way.
You can assign to the column subset df[['dd','ee']]:
def some_func(a, b, c):
    return a+b, a+b+c

df[['dd','ee']] = df.apply(lambda x: some_func(a = x['A'],
                                               b = x['B'],
                                               c = x['C']), axis=1, result_type='expand')
print (df)
A B C D dd ee
0 4 7 7 3 11 18
1 2 1 3 4 3 6
2 4 7 6 0 11 17
3 0 9 1 1 9 10
4 5 6 5 9 11 16
5 3 2 4 9 5 9
If possible, a better and faster approach is a vectorized solution:
df = df.assign(dd = df.A + df.B, ee = df.A + df.B + df.C)
Just to explain the 0, 1 part: 0 and 1 are actually the column names of
df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
That is
x = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
a, b = x
print(a) # first column name
print(b) # second column name
output:
0
1
Finally, your assignment effectively becomes
df['dd'], df['ee'] = 0, 1
which results in
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
Alternative way:
df['dd'], df['ee'] = zip(*df.apply(lambda x: some_func(x['A'], x['B'], x['C']), axis=1))
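This works because apply without result_type returns a Series of (a+b, a+b+c) tuples, and zip(*...) transposes them into one sequence per column; a tiny sketch of that mechanism:

pairs = pd.Series([(1, 2), (3, 4), (5, 6)])
left, right = zip(*pairs)  # transpose the tuples
print(left)   # (1, 3, 5)
print(right)  # (2, 4, 6)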
Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would pass pd.concat a generator of per-row frames:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1)}).assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
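The same idea can also be written directly with NumPy, repeating each name by its range length and concatenating the per-row ranges; a sketch equivalent to the repeat/cumcount approach above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Start': [1, 6], 'End': [4, 12]})
lengths = (df['End'] - df['Start'] + 1).to_numpy()
out = pd.DataFrame({
    'Name': np.repeat(df['Name'].to_numpy(), lengths),   # each name, once per value in its range
    'Number': np.concatenate([np.arange(s, e + 1)        # inclusive range per row
                              for s, e in zip(df['Start'], df['End'])]),
})
print(out)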
I'm trying to create a reusable function in Python 2.7 (pandas) to bin categorical variables, i.e. group low-frequency categories as 'other'. Can someone help me create a function for the below? col1, col2, etc. are different categorical columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
    a = df[c].value_counts()
    #get top n values of index
    vals = a[:n].index
    #assign column back
    df[c] = df[c].where(df[c].isin(vals), 'other')
    #rename processed column
    df = df.rename(columns={c: c + '_new'})
    return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
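Note that replace_under_top mutates the passed df in place (df[c] = ...), which is why df2 above already shows 'other' values in column A left over from the first call. A sketch of a non-mutating variant, if that side effect is unwanted (the name replace_under_top_copy is just an illustration):

def replace_under_top_copy(df, c, n):
    out = df.copy()  # work on a copy so the caller's frame is untouched
    vals = out[c].value_counts()[:n].index
    out[c] = out[c].where(out[c].isin(vals), 'other')
    return out.rename(columns={c: c + '_new'})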
I have not figured out how to solve the following problem.
Consider the following data set:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3], ['a', 4, 5, 6],
                                 ['b', 7, 8, 9], ['b', 10, 11, 12]]),
                  columns=['id', 'A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and, within each group, duplicate the first row as extra columns appended to every row, like the following data set:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I really appreciate your help.
I tried the following steps, but I could not generalize them:
df1 = df.loc[0:0, 'A':'C']
df3 = pd.concat([df, df1], axis=1)
Use groupby + first, and then concatenate df with this result:
v = df.groupby('id').transform('first')
pd.concat([df, v], axis=1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
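For intuition, transform('first') returns a frame aligned to df's original index, with every row replaced by its group's first row, which is what makes the column-wise concat line up:

print(df.groupby('id').transform('first'))
#    A  B  C
# 0  1  2  3
# 1  1  2  3
# 2  7  8  9
# 3  7  8  9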
cumcount + where + ffill
v = df.groupby('id').cumcount() == 0
pd.concat([df, df.iloc[:, 1:].where(v).ffill()], axis=1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('records')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('records')[0]
Out[10]: {'D': 1, 'E': 11}
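Both variants work because ** unpacks a mapping into keyword arguments for assign, and pandas broadcasts each scalar value down its new column; a tiny sketch of the mechanism (the dict(...) call is just for illustration):

row = Y.iloc[0]             # Series with index ['D', 'E']
print(dict(row))            # {'D': 1, 'E': 11}
print(X.assign(D=1, E=11))  # what assign effectively receives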
Here is another way:
Y2 = pd.concat([Y]*3, ignore_index=True)  # this duplicates the row three times
Which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X, Y2], axis=1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
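If Y ever contains more than one row, the same result generalizes to a cross join; a sketch using a temporary constant key (the 'key' column name is just an illustration):

X.assign(key=1).merge(Y.assign(key=1), on='key').drop(columns='key')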