efficient way to calculate between columns with conditions - python

I have a dataframe that looks like this:

      Cnt_A  Cnt_B  Cnt_C  Cnt_D
ID_1      0      1      3      0
ID_2      1      0      0      0
ID_3      5      2      0      8
...
I'd like to count the columns that are not zero and put the result into a new column, like this:

      Total_Not_Zero_Cols  Cnt_A  Cnt_B  Cnt_C  Cnt_D
ID_1                    2      0      1      3      0
ID_2                    1      1      0      0      0
ID_3                    3      5      2      0      8
...
I used a loop to get the result, but it took a very long time (of course). I can't figure out the most efficient way to calculate across columns with a condition :(
Thank you in advance.

Check whether each value is not equal to 0, then sum along the columns axis:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)
print(df)
# Output
      Cnt_A  Cnt_B  Cnt_C  Cnt_D  Total_Not_Zero_Cols
ID_1      0      1      3      0                    2
ID_2      1      0      0      0                    1
ID_3      5      2      0      8                    3

Use ne to generate a DataFrame of booleans (True for non-zero values), then aggregate the rows as integers using sum:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)

A NumPy-based equivalent:
np.sum(df != 0, axis=1)
Output
ID_1 2
ID_2 1
ID_3 3
dtype: int64
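For purely numeric frames, np.count_nonzero produces the same counts without materializing the intermediate boolean DataFrame. A minimal sketch, assuming the frame contains no NaN values (count_nonzero would treat NaN as non-zero):
import numpy as np
# count non-zeros row by row on the underlying ndarray
df['Total_Not_Zero_Cols'] = np.count_nonzero(df.to_numpy(), axis=1)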

Related

Getting a slice from pandas dataframe by comparing contents of df1 with df2

I have two dataframes. For each id in df1 I need to pick the rows from df2 with the same id, keep only the rows where df2.ApplicationDate < df1.ApplicationDate, and count how many such rows exist.
Here is how I am doing it currently:
counts = []
for i, row in df1.iterrows():
    count = len(df2[(df2['PersonalId'] == row['PersonalId'])
                    & (df2['ApplicationDate'] < row['ApplicationDate'])])
    counts.append(count)
This approach works, but it's hellishly slow on large dataframes. Is there any way to speed it up?
Edit: added sample input with expected output
df1:
   Id ApplicationDate
0   1         5-12-20
1   2         6-12-20
2   3         7-12-20
3   4         8-12-20
4   5         9-12-20
5   6        10-12-20
df2:
   Id ApplicationDate
0   1         4-11-20
1   1         4-12-20
2   3         5-12-20
3   3         8-12-20
4   5         1-12-20
expected counts for each id: [2, 0, 1, 0, 1, 0]
You can left-join the two tables:
df3 = df1.merge(df2, on='Id', how='left')
Result:
   Id ApplicationDate_x ApplicationDate_y
0   1        2020-05-12        2020-04-11
1   1        2020-05-12        2020-04-12
2   2        2020-06-12               NaT
3   3        2020-07-12        2020-05-12
4   3        2020-07-12        2020-08-12
5   4        2020-08-12               NaT
6   5        2020-09-12        2020-01-12
7   6        2020-10-12               NaT
Then compare the dates, group by 'Id', and count the True values per group (comparisons against NaT evaluate to False, so the unmatched left-join rows contribute nothing to the sum):
df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
Result:
Id
1 2
2 0
3 1
4 0
5 1
6 0
df1.merge(df2, on="Id", how="left").assign(
    temp=lambda x: x.ApplicationDate_y.notna(),
    tempr=lambda x: x.ApplicationDate_x > x.ApplicationDate_y,
    counter=lambda x: x.temp & x.tempr,
).groupby("Id").counter.sum()
Id
1 2
2 0
3 1
4 0
5 1
6 0
Name: counter, dtype: int64
The code above merges the two dataframes, builds the comparison flags with assign, and then sums the counter column per group to get the count.
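Both answers assume the ApplicationDate columns already hold datetimes; with strings like those in the sample, a parsing step would come first. A sketch, assuming month-day-year format (which is what the merged output above suggests, '5-12-20' becoming 2020-05-12):
import pandas as pd
df1['ApplicationDate'] = pd.to_datetime(df1['ApplicationDate'], format='%m-%d-%y')
df2['ApplicationDate'] = pd.to_datetime(df2['ApplicationDate'], format='%m-%d-%y')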

Get consecutive occurrences of an event by group in pandas

I'm working with a DataFrame that has id, wage and date columns, like this:
id  wage    date
 1   100  201212
 1   100  201301
 1     0  201302
 1     0  201303
 1   120  201304
 1     0  201305
 ...
 2     0  201302
 2     0  201303
And I want to create an n_months_no_income column that counts how many consecutive months a given individual has had wage == 0, like this:
id  wage    date  n_months_no_income
 1   100  201212                   0
 1   100  201301                   0
 1     0  201302                   1
 1     0  201303                   2
 1   120  201304                   0
 1     0  201305                   1
 ...
 2     0  201302                   1
 2     0  201303                   2
I feel it's some mix of groupby('id'), cumcount(), maybe diff() or apply(), and then a fillna(0), but I can't find the right combination.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2],
                   'wage': [100, 100, 0, 0, 120, 0, 0, 0],
                   'date': [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303]})
Edit: Added code for ease of use.
In your case, use two groupbys with cumcount, creating the additional grouping key with cumsum:
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
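The inner key x.ne(0).cumsum() increments at every non-zero wage, so each run of zeros shares a single key and cumcount then numbers the zeros within that run. A quick way to see the intermediate values for id 1:
s = df.loc[df['id'] == 1, 'wage']           # [100, 100, 0, 0, 120, 0]
key = s.ne(0).cumsum()
print(key.tolist())                          # [1, 2, 2, 2, 3, 3]
print(s.groupby(key).cumcount().tolist())    # [0, 0, 1, 2, 0, 1]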

Populate a dataframe based on an aggregated dataframe

I have a dataframe like this, generated after some aggregations and conditions:
X  P          D1          D2
1  A  2016-06-02  2016-07-26
2  A  2016-10-04  2016-12-01
3  A  2016-12-13  2017-03-11
1  B  2017-03-04  2018-01-11
From this dataframe, I have to populate another dataframe that has one column per month in the range [201606, 201607, ..., 201801], which was made earlier; i.e. I already have a dataframe with those columns and want to populate it.
I want one row per record in the aggregated dataframe, where the combination of X and P is unique throughout.
For the first record, I want to fill the columns 201606 to 201607, i.e. from D1 to D2 (both inclusive), with 1; all other columns should be 0.
For the second row, I want to fill columns 201610 to 201612 with 1 and every other column with 0, and so on for every row in the aggregated dataframe.
How can I do this quickly and efficiently with pandas? I'd prefer not to loop through the dataframe, as my data will be huge.
If populating an existing dataframe is not ideal, generating a new dataframe as described above would serve my purpose just as well.
I cannot see how to avoid iterating entirely, but you can iterate over either the rows of the initial dataframe or the columns of the resulting one.
First, build the resulting dataframe with every value set to 0:
resul = pd.DataFrame(data=0, index=df.index,
                     columns=pd.period_range(df.D1.min(), df.D2.max(), freq='M'))
Iterating over the rows of the initial dataframe:
for ix, row in df.iterrows():
    resul.loc[ix, pd.period_range(row.D1, row.D2, freq='M')] = 1
Iterating over the columns of the result dataframe:
for c in resul.columns:
    resul[c] = np.where((c.end_time >= df.D1) & (c.start_time <= df.D2), 1, 0)
In both cases, with your sample data, this gives the expected result:
2016-06 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02 2017-03 2017-04 2017-05 2017-06 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 2018-01
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
Choose whichever method gives the shorter iteration: if the initial dataframe has fewer rows than the result dataframe has columns, use method 1; otherwise use method 2.
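If neither dimension is small, a fully vectorized sketch is also possible by broadcasting each row's [D1, D2] interval against the month boundaries in a single comparison (this assumes D1 and D2 are datetime columns):
import numpy as np
import pandas as pd

months = pd.period_range(df.D1.min(), df.D2.max(), freq='M')
starts = months.start_time.to_numpy()   # first instant of each month, shape (n_cols,)
ends = months.end_time.to_numpy()       # last instant of each month
d1 = df.D1.to_numpy()[:, None]          # shape (n_rows, 1) for broadcasting
d2 = df.D2.to_numpy()[:, None]
# a month gets 1 when it overlaps the row's [D1, D2] interval
resul = pd.DataFrame(((ends >= d1) & (starts <= d2)).astype(int),
                     index=df.index, columns=months)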

Fill values from one dataframe to another with matching IDs

I have two pandas dataframes. I want to get the sum of items_bought for each ID in DF1, then add a column to DF2 containing the sum for each matching ID, filled with 0 where there is no match. How can I do this in an elegant and efficient manner?
DF1
ID | items_bought
1 5
3 8
2 2
3 5
4 6
2 2
DF2
ID
1
2
8
3
2
Desired result: DF2 becomes
ID | items_bought
1 5
2 4
8 0
3 13
2 4
df1.groupby('ID').sum().loc[df2.ID].fillna(0).astype(int)
Out[104]:
items_bought
ID
1 5
2 4
8 0
3 13
2 4
Work on df1 to calculate the sum for each ID.
The resulting dataframe is indexed by ID, so you can select df2's IDs with loc.
Fill the gaps with fillna.
NaN values force a float dtype; now that they are gone, convert the column back to integer.
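Note that in recent pandas versions, .loc with a list containing labels missing from the index (like the 8 here) raises a KeyError, so the reindex-based answer below is the safer variant. Series.map is another sketch along the same lines:
sums = df1.groupby('ID')['items_bought'].sum()
# map looks up each ID and yields NaN for IDs absent from df1 (e.g. 8)
df2['items_bought'] = df2['ID'].map(sums).fillna(0).astype(int)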
A solution with groupby and sum, then reindex with fill_value=0, and finally reset_index:
df2 = df1.groupby('ID').items_bought.sum().reindex(df2.ID, fill_value=0).reset_index()
print (df2)
ID items_bought
0 1 5
1 2 4
2 8 0
3 3 13
4 2 4

How to convert pandas dataframe so that index is the unique set of values and data is the count of each value?

I have a dataframe from a multiple choice questions and it is formatted like so:
      Sex  Qu1  Qu2  Qu3
Name
Bob     M    1    2    1
John    M    3    3    5
Alex    M    4    1    2
Jen     F    3    2    4
Mary    F    4    3    4
The data is a rating from 1 to 5 for the 3 multiple-choice questions. I want to rearrange the data so that the index is range(1, 6), where 1='bad', 2='poor', 3='ok', 4='good', 5='excellent', the columns stay the same, and the data is the count of occurrences of each value (excluding the Sex column). This is basically a histogram with fixed bin sizes and an x-axis labeled with strings. I like the output of df.plot() much better than df.hist() for this, but I can't figure out how to rearrange the table to give me a histogram of the data. Also, how do you change the x-labels to be strings?
Series.value_counts gives you the histogram you're looking for:
In [9]: df['Qu1'].value_counts()
Out[9]:
4 2
3 2
1 1
So, apply this function to each of those 3 columns:
In [13]: table = df[['Qu1', 'Qu2', 'Qu3']].apply(lambda x: x.value_counts())
In [14]: table
Out[14]:
Qu1 Qu2 Qu3
1 1 1 1
2 NaN 2 1
3 2 2 NaN
4 2 NaN 2
5 NaN NaN 1
In [15]: table = table.fillna(0)
In [16]: table
Out[16]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
Using table.reindex or table.loc[some_array] you can rearrange the data (table.ix, used in older answers, has been removed from modern pandas).
To transform to strings, use table.rename:
In [17]: table.rename(index=str)
Out[17]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
In [18]: table.rename(index=str).index[0]
Out[18]: '1'
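From there, a sketch of the last two steps the question asks for: reindexing to the full 1-5 range and mapping the ratings to the string labels, which then appear on the plot's x-axis.
labels = {1: 'bad', 2: 'poor', 3: 'ok', 4: 'good', 5: 'excellent'}
table = table.reindex(range(1, 6), fill_value=0).rename(index=labels)
table.plot(kind='bar')   # the x-axis now shows the string labels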
