Count the number of specific values in multiple columns pandas - python

I have a data frame:
A B C D E
12 4.5 6.1 BUY NaN
12 BUY BUY 5.6 NaN
BUY 4.5 6.1 BUY NaN
12 4.5 6.1 0 NaN
I want to count the number of times 'BUY' appears in each row. Intended result:
A B C D E score
12 4.5 6.1 BUY NaN 1
12 BUY BUY 5.6 NaN 2
15 4.5 6.1 BUY NaN 1
12 4.5 6.1 0 NaN 0
I have tried the following but it simply gives 0 for all the rows:
df['score'] = df[df == 'BUY'].sum(axis=1)
Note that BUY can only appear in B, C, D, E columns.
I tried to find the solution online but shockingly found none.
Little help will be appreciated. THANKS!

You can compare and then sum:
df['score'] = (df[['B','C','D','E']] == 'BUY').sum(axis=1)
This sums up all the booleans and you get the correct result.
When you do df[df == 'BUY'], you are just replacing everything that is not 'BUY' with np.nan. Taking the sum over axis=1 then does not work: all that is left is np.nan and the string 'BUY', neither of which contributes to a numeric sum. Hence you get 0 for every row.
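As a runnable sketch, using the frame from the question and restricting the comparison to B, C, D, E as the question specifies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [12, 12, 'BUY', 12],
    'B': [4.5, 'BUY', 4.5, 4.5],
    'C': [6.1, 'BUY', 6.1, 6.1],
    'D': ['BUY', 5.6, 'BUY', 0],
    'E': [np.nan] * 4,
})

# The comparison yields a boolean frame; each True counts as 1 when summed per row.
df['score'] = (df[['B', 'C', 'D', 'E']] == 'BUY').sum(axis=1)
print(df['score'].tolist())  # [1, 2, 1, 0]
```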

Or you could use apply with list.count:
df['score'] = df.apply(lambda x: x.tolist().count('BUY'), axis=1)
print(df)
Output:
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0

Try using apply with a lambda over axis=1. This picks up one row at a time as a Series. You can use the boolean condition row == 'BUY' to filter the row and then count the 'BUY' entries with len():
df['score'] = df.apply(lambda row: len(row[row == 'BUY']), axis=1)
print(df)
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0

import numpy as np
df['score'] = np.count_nonzero(df == 'BUY', axis=1)
Output:
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0

Related

how to adjust subtotal columns in pandas using groupby?

I'm exporting data frames to Excel after joining them. However, when I calculate subtotals with groupby after the join, the "Subtotal" label ends up in the index column. Is there any way to move it into the code column and renumber the index?
Here is the code:
def subtotal(df__, str):
    container = []
    for key, group in df__.groupby(['key']):
        group.loc['subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
        container.append(group)
    df_subtotal = pd.concat(container)
    df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
    print(df_subtotal)
    return (df_subtotal.to_excel(writer, sheet_name=str))
Use np.where() to fill the NaN in the code column with the corresponding value from df.index, then assign a new index array to df.index.
import numpy as np
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
code key product quntity1 quntity2 quntity3
1 cs01767 a apple-a 10 0 10.0
2 Subtotal NaN NaN 10 0 10.0
3 cs0000 b bannana-a 50 10 40.0
4 cs0000 b bannana-b 0 0 0.0
5 cs0000 b bannana-c 0 0 0.0
6 cs0000 b bannana-d 80 20 60.0
7 cs0000 b bannana-e 0 0 0.0
8 cs01048 b bannana-f 0 0 NaN
9 cs01048 b bannana-g 0 0 0.0
10 Subtotal NaN NaN 130 30 100.0
11 cs99999 c melon-a 50 10 40.0
12 cs99999 c melon-b 20 20 0.0
13 cs01188 c melon-c 10 0 10.0
14 Subtotal NaN NaN 80 30 50.0
15 GrandTotal NaN NaN 220 60 160.0
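A minimal runnable sketch of the fix, using a small hypothetical frame that mimics the groupby output ("Subtotal" sits in the index, NaN in the code column):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the joined frame after the subtotal step.
df = pd.DataFrame({'code': ['cs01767', np.nan, 'cs0000'],
                   'quantity': [10, 10, 50]},
                  index=['0', 'Subtotal', '1'])

# Where code is NaN, pull the label ('Subtotal') down from the index,
# then replace the index with a clean 1-based range.
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
```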

pandas: filling NaN with previous row value multiplied by another column

I have a dataframe in which I want to fill NaN with the value from the previous row multiplied by the pct_change column:
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 nan 0.5
4 nan 1.3
5 nan 2
6 5 3
so for the 3rd row 10 * 0.5 = 5, and that filled value is then used to fill the next rows if they are also NaN:
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 5 0.5
4 6.5 1.3
5 13 2
6 5 3
I have used this
while df['col_to_fill'].isna().sum() > 0:
    df.loc[df['col_to_fill'].isna(), 'col_to_fill'] = df['col_to_fill'].shift(1) * df['pct_change']
but it is taking too much time, since each pass only fills rows whose previous row is non-NaN.
Try with cumprod after ffill
s = df.col_to_fill.ffill()*df.loc[df.col_to_fill.isna(),'pct_change'].cumprod()
df.col_to_fill.fillna(s, inplace=True)
df
Out[90]:
col_to_fill pct_change
0 1.0 NaN
1 2.0 1.0
2 10.0 0.5
3 5.0 0.5
4 6.5 1.3
5 13.0 2.0
6 5.0 3.0
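Why this works: within the NaN run, each missing value equals the last known value times the running product of pct_change, so ffill plus cumprod fills the whole run in one vectorised step. A runnable sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_to_fill': [1, 2, 10, np.nan, np.nan, np.nan, 5],
    'pct_change': [np.nan, 1.0, 0.5, 0.5, 1.3, 2, 3],
})

# ffill carries the last known value (10) into the gap; cumprod over the
# NaN rows' pct_change gives 0.5, 0.65, 1.3, so the fills are 5, 6.5, 13.
s = df.col_to_fill.ffill() * df.loc[df.col_to_fill.isna(), 'pct_change'].cumprod()
df['col_to_fill'] = df.col_to_fill.fillna(s)
print(df['col_to_fill'].tolist())
```

Note this relies on the NaNs forming a single contiguous run; with several separate gaps the cumprod would carry across gaps, and a grouped cumprod per gap would be needed.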

Groupby Row element and Transpose a Pandas Dataframe

In Python, I have the following Pandas dataframe:
Factor Value
0 a 1.2
1 b 3.4
2 b 4.5
3 b 5.6
4 c 1.3
5 d 4.6
I would like to organize this where:
unique row identifiers (the factor col) become columns
Their respective values remain under the created columns
The Factor values are not in any organized order.
Target:
A B C D
0 1.2 3.4 1.3 4.6
1 4.5
2 5.6
3
4
5
Use set_index and unstack with groupby:
df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
Output:
Factor a b c d
0 1.2 3.4 1.3 4.6
1 NaN 4.5 NaN NaN
2 NaN 5.6 NaN NaN
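A runnable version with the question's data; groupby('Factor').cumcount() numbers the repeats within each factor, and that number becomes the row position after unstacking:

```python
import pandas as pd

df = pd.DataFrame({'Factor': ['a', 'b', 'b', 'b', 'c', 'd'],
                   'Value': [1.2, 3.4, 4.5, 5.6, 1.3, 4.6]})

# (Factor, occurrence-number) is unique per value, so unstack(0)
# can pivot Factor into columns without losing duplicates.
out = df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
print(out)
```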

Pandas: Iterate by two columns for each iteration

Does anyone know how to iterate a pandas Dataframe with two columns for each iteration?
Say I have
a b c d
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
So something like
for x, y in ...:
correlation of x and y
So output will be
corr_ab corr_bc corr_cd
0.1 0.3 -0.4
You can use zip with slicing to form the column pairs, build a dictionary of one-element lists with Series.corr and f-strings for the column names, and pass it to the DataFrame constructor:
L = {f'corr_{col1}{col2}': [df[col1].corr(df[col2])]
     for col1, col2 in zip(df.columns, df.columns[1:])}
df = pd.DataFrame(L)
print (df)
corr_ab corr_bc corr_cd
0 0.860108 0.61333 0.888523
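A self-contained sketch with the question's data; zip(df.columns, df.columns[1:]) yields the adjacent pairs ('a','b'), ('b','c'), ('c','d'):

```python
import pandas as pd

df = pd.DataFrame({'a': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
                   'b': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
                   'c': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
                   'd': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4]})

# One Series.corr call per adjacent column pair, collected into a one-row frame.
L = {f'corr_{c1}{c2}': [df[c1].corr(df[c2])]
     for c1, c2 in zip(df.columns, df.columns[1:])}
res = pd.DataFrame(L)
print(res)
```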
You can use df.corr to get the correlations of the dataframe. You then use mask to drop the repeated correlations, and finally stack the result to make it more readable. Assuming you have data like this:
0 1 2 3 4
0 11 6 17 2 3
1 3 12 16 17 5
2 13 2 11 10 0
3 8 12 13 18 3
4 4 3 1 0 18
Finding the correlation,
dataCorr = data.corr(method='pearson')
We get,
0 1 2 3 4
0 1.000000 -0.446023 0.304108 -0.136610 -0.674082
1 -0.446023 1.000000 0.563112 0.773013 -0.258801
2 0.304108 0.563112 1.000000 0.494512 -0.823883
3 -0.136610 0.773013 0.494512 1.000000 -0.545530
4 -0.674082 -0.258801 -0.823883 -0.545530 1.000000
Masking out repeated correlations,
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
We get
0 1 2 3 4
0 NaN -0.446023 0.304108 -0.136610 -0.674082
1 NaN NaN 0.563112 0.773013 -0.258801
2 NaN NaN NaN 0.494512 -0.823883
3 NaN NaN NaN NaN -0.545530
4 NaN NaN NaN NaN NaN
Stacking the correlated data
dataCorr = dataCorr.stack().reset_index()
The stacked data will look as shown
level_0 level_1 0
0 0 1 -0.446023
1 0 2 0.304108
2 0 3 -0.136610
3 0 4 -0.674082
4 1 2 0.563112
5 1 3 0.773013
6 1 4 -0.258801
7 2 3 0.494512
8 2 4 -0.823883
9 3 4 -0.545530
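Putting the three steps together, a runnable sketch with the sample data above (the exact correlation values depend on the data):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[11, 6, 17, 2, 3],
                     [3, 12, 16, 17, 5],
                     [13, 2, 11, 10, 0],
                     [8, 12, 13, 18, 3],
                     [4, 3, 1, 0, 18]])

# Full 5x5 correlation matrix, then keep only the strict upper triangle
# so every pair of columns appears exactly once.
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
pairs = dataCorr.stack().reset_index()
print(pairs)
```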

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is,that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
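The same steps as a self-contained script:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   'B': [1, 4, 6, 2, 4, 3],
                   'C': [3, 2, 1, 0, 5, 3]})

# Naming the new index 'A' lets reset_index restore it as the original column.
new_index = pd.Index(np.arange(0, 5, 0.5), name='A')
out = df.set_index('A').reindex(new_index).reset_index()
print(out)
```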
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df \
        .merge(how='right', on=field,
               right=pd.DataFrame({field: np.arange(range_from, range_to, range_step)})) \
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (note that np.arange excludes its end value, so pass one step past the last value you need):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I generate a new dataframe covering the full A range, right-merge it with your original df on A, and then re-sort:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes start, end and step values; note I added 0.5 to the end because arange ranges are half-open (the end value is excluded).
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
df['A'] = df['index']
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but there is a simple point worth mentioning: you can write NumPy's NaN into a cell directly. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
