Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'], ['a', 4, 5, 6, 'F'],
                                 ['b', 7, 8, 9, 'T'], ['b', 10, 11, 12, 'T'],
                                 ['b', 13, 14, 15, 'F']]),
                  columns=['id', 'A', 'B', 'C', 'T/F'])
id A B C T/F
0 a 1 2 3 T
1 a 4 5 6 F
2 b 7 8 9 T
3 b 10 11 12 T
4 b 13 14 15 F
I want to apply a condition on the T/F column that, for each id, pairs every row of that id with each of its T-labeled rows as additional columns.
For example, I need the following result:
id A B C T/F A B C T/F
0 a 1 2 3 T 1 2 3 T
1 a 4 5 6 F 1 2 3 T
2 b 7 8 9 T 7 8 9 T
3 b 10 11 12 T 7 8 9 T
4 b 13 14 15 F 7 8 9 T
5 b 7 8 9 T 10 11 12 T
6 b 10 11 12 T 10 11 12 T
7 b 13 14 15 F 10 11 12 T
Here is my script:
n = np.array(df.groupby('id').size())
m = len(df.groupby('id'))
Cnt = 0
df4 = pd.DataFrame()
for prsnNo in range(m):
    for i in range(n[prsnNo]):
        v = df.iloc[Cnt: Cnt + n[prsnNo], :].groupby('id').cumcount() == i
        df1 = df.iloc[Cnt: Cnt + n[prsnNo], :].where(v)
        temp = df4
        df4 = df.iloc[Cnt: Cnt + n[prsnNo], :].merge(df1, on="id", how="left")
        df4 = pd.concat([temp, df4])
    Cnt += n[prsnNo]
I do not know how to add a condition that checks the value of the T/F column inside my loop. If I add an if condition, it raises an error:
for prsnNo in range(m):
    for i in range(n[prsnNo]):
        if df[df['T/F'] == 'T']:
            v = df.iloc[Cnt: Cnt + n[prsnNo], :].groupby('id').cumcount() == i
            df1 = df.iloc[Cnt: Cnt + n[prsnNo], :].where(v)
            temp = df4
            df4 = df.iloc[Cnt: Cnt + n[prsnNo], :].merge(df1, on="id", how="left")
            df4 = pd.concat([temp, df4])
    Cnt += n[prsnNo]
Thanks,
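A note on the error first: df[df['T/F'] == 'T'] is itself a DataFrame, and Python's if needs a single True/False value, so pandas raises "The truth value of a DataFrame is ambiguous". The usual fix is to filter with a boolean mask rather than testing the frame inside an if; a minimal sketch:

t_rows = df[df['T/F'] == 'T']  # keep only the rows labeled T
print(t_rows)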
If order doesn't matter, you can use groupby + first, and then perform a merge with df and the grouped result.
v = df.groupby(['id', df['T/F'].eq('T').cumsum()])\
.first().reset_index(level=1, drop=True)
df = df.merge(v, left_on='id', right_index=True)
df.columns = df.columns.str.split('_').str[0]
df
id A B C T/F A B C T/F
0 a 1 2 3 T 1 2 3 T
1 a 4 5 6 F 1 2 3 T
2 b 7 8 9 T 7 8 9 T
2 b 7 8 9 T 10 11 12 T
3 b 10 11 12 T 7 8 9 T
3 b 10 11 12 T 10 11 12 T
4 b 13 14 15 F 7 8 9 T
4 b 13 14 15 F 10 11 12 T
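For intuition, the helper grouping key df['T/F'].eq('T').cumsum() (evaluated on the original df, before the merge reassigns it) increments at every T row, so each T row and the F rows after it share one group value, and first then picks out that T row. This assumes each id starts with a T row, as in the example:

print(df['T/F'].eq('T').cumsum().tolist())  # [1, 1, 2, 3, 3]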
I am trying to use a function to create multiple outputs, using multiple columns as inputs. Here's my attempt:
df = pd.DataFrame(np.random.randint(0,10,size=(6, 4)), columns=list('ABCD'))
df.head()
A B C D
0 8 2 5 0
1 9 9 8 6
2 4 0 1 7
3 8 4 0 3
4 5 6 9 9
def some_func(a, b, c):
    return a+b, a+b+c
df['dd'], df['ee'] = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
df.head()
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
The outputs are all 0 for the first new column and all 1 for the second. I am interested in the correct solution, but I am also curious why my code behaved this way.
You can assign to the column subset df[['dd','ee']]:
def some_func(a, b, c):
    return a+b, a+b+c

df[['dd','ee']] = df.apply(lambda x: some_func(a = x['A'],
                                               b = x['B'],
                                               c = x['C']), axis=1, result_type='expand')
print (df)
A B C D dd ee
0 4 7 7 3 11 18
1 2 1 3 4 3 6
2 4 7 6 0 11 17
3 0 9 1 1 9 10
4 5 6 5 9 11 16
5 3 2 4 9 5 9
If possible, a better and faster approach is a vectorized solution:
df = df.assign(dd = df.A + df.B, ee = df.A + df.B + df.C)
Just to explain the 0, 1 part: 0 and 1 are actually the column names of
df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
That is
x = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
a, b = x
print(a) # first column name
print(b) # second column name
output:
0
1
Finally, your assignment effectively becomes
df['dd'], df['ee'] = 0, 1
which results in
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
Alternative way:
df['dd'], df['ee'] = zip(*df.apply(lambda x: some_func(x['A'], x['B'], x['C']), axis=1))
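This works because apply without result_type returns a Series of (a+b, a+b+c) tuples, and zip(*...) transposes them into one sequence per column; a tiny sketch of that mechanism:

pairs = pd.Series([(1, 2), (3, 4), (5, 6)])
left, right = zip(*pairs)  # transpose the tuples
print(left)   # (1, 3, 5)
print(right)  # (2, 4, 6)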
Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would pass pd.concat a generator of per-row frames:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1)}).assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
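The same idea can also be written directly with NumPy, repeating each name by its range length and concatenating the per-row ranges; a sketch equivalent to the repeat/cumcount approach above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Start': [1, 6], 'End': [4, 12]})
lengths = (df['End'] - df['Start'] + 1).to_numpy()
out = pd.DataFrame({
    'Name': np.repeat(df['Name'].to_numpy(), lengths),   # each name, once per value in its range
    'Number': np.concatenate([np.arange(s, e + 1)        # inclusive range per row
                              for s, e in zip(df['Start'], df['End'])]),
})
print(out)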
I'm trying to create a reusable function in Python 2.7 (pandas) to bin categorical variables, i.e. group low-frequency categories as 'other'. Can someone help me create a function for the below? col1, col2, etc. are different categorical columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
    a = df[c].value_counts()
    #get top n values of index
    vals = a[:n].index
    #assign column back
    df[c] = df[c].where(df[c].isin(vals), 'other')
    #rename processed column
    df = df.rename(columns={c: c + '_new'})
    return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
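Note that replace_under_top mutates the passed df in place (df[c] = ...), which is why df2 above already shows 'other' values in column A left over from the first call. A sketch of a non-mutating variant, if that side effect is unwanted (the name replace_under_top_copy is just an illustration):

def replace_under_top_copy(df, c, n):
    out = df.copy()  # work on a copy so the caller's frame is untouched
    vals = out[c].value_counts()[:n].index
    out[c] = out[c].where(out[c].isin(vals), 'other')
    return out.rename(columns={c: c + '_new'})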
I have not figured out how to solve the following problem.
Consider the following data set:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3], ['a', 4, 5, 6],
                                 ['b', 7, 8, 9], ['b', 10, 11, 12]]),
                  columns=['id', 'A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and, within each group, duplicate the first row as extra columns appended to every row, like the following data set:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I really appreciate your help.
I tried the following steps, but I could not generalize them:
df1 = df.loc[0:0, 'A':'C']
df3 = pd.concat([df, df1], axis=1)
Use groupby + first, and then concatenate df with this result:
v = df.groupby('id').transform('first')
pd.concat([df, v], axis=1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
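For intuition, transform('first') returns a frame aligned to df's original index, with every row replaced by its group's first row, which is what makes the column-wise concat line up:

print(df.groupby('id').transform('first'))
#    A  B  C
# 0  1  2  3
# 1  1  2  3
# 2  7  8  9
# 3  7  8  9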
cumcount + where + ffill
v = df.groupby('id').cumcount() == 0
pd.concat([df, df.iloc[:, 1:].where(v).ffill()], axis=1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('records')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('records')[0]
Out[10]: {'D': 1, 'E': 11}
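Both variants work because ** unpacks a mapping into keyword arguments for assign, and pandas broadcasts each scalar value down its new column; a tiny sketch of the mechanism (the dict(...) call is just for illustration):

row = Y.iloc[0]             # Series with index ['D', 'E']
print(dict(row))            # {'D': 1, 'E': 11}
print(X.assign(D=1, E=11))  # what assign effectively receives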
Here is another way:
Y2 = pd.concat([Y]*3, ignore_index=True)  # this duplicates the row three times
Which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X, Y2], axis=1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
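If Y ever contains more than one row, the same result generalizes to a cross join; a sketch using a temporary constant key (the 'key' column name is just an illustration):

X.assign(key=1).merge(Y.assign(key=1), on='key').drop(columns='key')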