I have a dataframe like this, generated after some aggregations and conditions:
X P D1 D2
1 A 2016-06-02 2016-07-26
2 A 2016-10-04 2016-12-01
3 A 2016-12-13 2017-03-11
1 B 2017-03-04 2018-01-11
From this dataframe, I have to populate another dataframe that has one column per month in the range [201606, 201607, ..., 201801], which was built earlier. That is, I already have a dataframe with the columns mentioned above; I want to populate it.
I want to make one row for each record in the aggregated dataframe; the combination of X and P is unique throughout the aggregated dataframe.
For the first record, I want to fill the columns 201606 to 201607, i.e. from D1 to D2 (both inclusive), with 1. All other columns should be 0.
For the second row, I want to fill the columns 201610 to 201612 with 1 and every other column with 0, and so on for every row in the aggregated dataframe.
How can I do this quickly and efficiently using pandas? I prefer not to loop through the dataframe, as my data will be huge.
If populating an existing dataframe is not ideal, generating a dataframe as described above would also serve my purpose.
I cannot see how to avoid iterating entirely, but you can iterate over either the rows of the initial dataframe or the columns of the resulting dataframe.
First, build the resulting dataframe with all columns set to 0:
resul = pd.DataFrame(data=0,
                     columns=pd.period_range(df.D1.min(), df.D2.max(), freq='M'),
                     index=df.index)
Iterating rows of the initial dataframe:
for ix, row in df.iterrows():
    resul.loc[ix, pd.period_range(row.D1, row.D2, freq='M')] = 1
Iterating over the columns of the result dataframe:
for c in resul.columns:  # requires numpy imported as np
    resul[c] = np.where((c.end_time >= df.D1) & (c.start_time <= df.D2), 1, 0)
In both cases, with your sample data, it gives the expected result:
2016-06 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02 2017-03 2017-04 2017-05 2017-06 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 2018-01
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
The choice between the two methods comes down to the shorter iteration: if the initial dataframe has fewer rows than the resul dataframe has columns, choose method 1; otherwise choose method 2.
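For reference, here is a minimal self-contained sketch of method 1, with the sample data rebuilt inline; the names df and resul follow the snippets above, and the date columns are assumed to be datetimes:
import pandas as pd

# rebuild the sample aggregated dataframe from the question
df = pd.DataFrame({'X': [1, 2, 3, 1],
                   'P': ['A', 'A', 'A', 'B'],
                   'D1': pd.to_datetime(['2016-06-02', '2016-10-04', '2016-12-13', '2017-03-04']),
                   'D2': pd.to_datetime(['2016-07-26', '2016-12-01', '2017-03-11', '2018-01-11'])})

# one column per month covered by the data, all initialised to 0
resul = pd.DataFrame(0,
                     columns=pd.period_range(df.D1.min(), df.D2.max(), freq='M'),
                     index=df.index)

# method 1: one pass over the rows of the (small) aggregated dataframe
for ix, row in df.iterrows():
    resul.loc[ix, pd.period_range(row.D1, row.D2, freq='M')] = 1

print(resul)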
I have a data frame with header and row names.
Starting from the second column, I would like to count the different observations (in this case only 0 and 1) and divide each count by the number of rows multiplied by two...
I am able to do this using COUNTIF in Excel... but I was wondering if there is a way to do this in Python with a pandas equivalent of COUNTIF...
The data frame subset is as follows:
position 1X0812.0 1X0812.1 10192.0 10192.1 10316.0 10316.1 10349.0 10349.1 10418.0 10418.1 10482.0 10482.1
4035 0 0 0 0 0 0 0 1 0 0 0 0
4036 0 0 0 0 0 0 0 1 0 0 0 0
4083 0 0 0 0 1 0 0 1 0 0 0 0
4119 0 1 0 0 1 0 0 1 0 0 0 0
4164 0 1 0 0 1 0 0 1 0 0 0 0
4185 1 1 0 1 1 0 0 1 0 0 0 0
4379 1 1 0 1 0 0 0 1 0 0 0 0
The desired output would be something like this in a new file:
0 0.5714 0.8571 0.7142 0.5000 1.0000 1.0000
1 0.4285 0.1428 0.2857 0.5000 0.0000 0.0000
Looking over the internet, I could only find how to get the count of a single observation value (0 in this case) for the whole dataset...
import pandas as pd
df = pd.read_csv('query.tsv', sep='\t')
(df == 0).sum()
result:
position 0
10046.0 309027
10046.1 308117
10192.0 308117
10192.1 309027
...
9656.1 171617
9860.0 261170
9860.1 226217
9878.0 309027
9878.1 309027
Length: 1565, dtype: int64
thanks
Counting the 1s among the 0s is equivalent to summing. So we can group the dataframe (starting from the second column) by the first part of the column headers (before the .) and divide the sum by the count (alternatively, you can use mean directly for the same result).
g = df.iloc[:,1:].groupby(df.columns[1:].str.extract(r'(\d+)\.', expand=False), axis=1)
s = g.sum().sum() / g.count().sum() # or s = g.mean().mean()
s is an unnamed Series representing the share of 1s:
10046 0.428571
10192 0.142857
10316 0.285714
10349 0.500000
10418 0.000000
10482 0.000000
dtype: float64
From this Series we construct the desired resulting dataframe:
result = pd.concat([(1 - s).rename(0), s.rename(1)], axis=1).T
Result:
10046 10192 10316 10349 10418 10482
0 0.571429 0.857143 0.714286 0.5 1.0 1.0
1 0.428571 0.142857 0.285714 0.5 0.0 0.0
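For completeness, a self-contained sketch of the whole pipeline; the small frame below is made up to stand in for query.tsv, and the transpose is used instead of groupby(axis=1), which is deprecated in recent pandas versions:
import pandas as pd

# made-up stand-in for query.tsv; the real data has many more rows and columns
df = pd.DataFrame({'position': [4035, 4036, 4083],
                   '10046.0': [0, 0, 1], '10046.1': [0, 1, 1],
                   '10192.0': [0, 0, 0], '10192.1': [1, 1, 1]})

data = df.iloc[:, 1:]                                       # drop the position column
key = data.columns.str.extract(r'(\d+)\.', expand=False)    # sample id before the dot

# share of 1s per sample: group the columns (via the transpose) and take the mean
s = data.T.groupby(key).mean().mean(axis=1)

result = pd.concat([(1 - s).rename(0), s.rename(1)], axis=1).T
print(result)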
I have a dataframe that looks like
Cnt_A Cnt_B Cnt_C Cnt_D
ID_1 0 1 3 0
ID_2 1 0 0 0
ID_3 5 2 0 8
...
I'd like to count the columns that are not zero and put the result into a new column, like this:
Total_Not_Zero_Cols Cnt_A Cnt_B Cnt_C Cnt_D
ID_1 2 0 1 3 0
ID_2 1 1 0 0 0
ID_3 3 5 2 0 8
...
I used a loop to get the result, but it took a very long time (of course).
I can't figure out the most efficient way to do this calculation across columns with a condition :(
Thank you in advance
Check whether each value is not equal to 0, then sum along the columns axis:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)
print(df)
# Output
Cnt_A Cnt_B Cnt_C Cnt_D Total_Not_Zero_Cols
ID_1 0 1 3 0 2
ID_2 1 0 0 0 1
ID_3 5 2 0 8 3
Use ne to generate a DataFrame of booleans with True for non-zero values, then aggregate the rows as integers using sum:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)
Numpy based:
np.sum(df != 0, axis=1)
Output
ID_1 2
ID_2 1
ID_3 3
dtype: int64
I have two dataframes. For each id in df1 I need to pick the rows from df2 with the same id, keep only the rows where df2.ApplicationDate < df1.ApplicationDate, and count how many such rows exist.
Here is how I am doing it currently:
counts = []
for i, row in df1.iterrows():
    count = len(df2[(df2['PersonalId'] == row['PersonalId'])
                    & (df2['ApplicationDate'] < row['ApplicationDate'])])
    counts.append(count)
This approach works, but it's hellishly slow on large dataframes. Is there any way to speed it up?
Edit: added sample input with expected output
df1:
Id ApplicationDate
0 1 5-12-20
1 2 6-12-20
2 3 7-12-20
3 4 8-12-20
4 5 9-12-20
5 6 10-12-20
df2:
Id ApplicationDate
0 1 4-11-20
1 1 4-12-20
2 3 5-12-20
3 3 8-12-20
4 5 1-12-20
expected counts for each id: [2, 0, 1, 0, 1, 0]
You can left-join both tables:
df3 = df1.merge(df2, on='Id', how='left')
Result:
Id ApplicationDate_x ApplicationDate_y
0 1 2020-05-12 2020-04-11
1 1 2020-05-12 2020-04-12
2 2 2020-06-12 NaT
3 3 2020-07-12 2020-05-12
4 3 2020-07-12 2020-08-12
5 4 2020-08-12 NaT
6 5 2020-09-12 2020-01-12
7 6 2020-10-12 NaT
Then you can compare dates, group by 'Id' and count True values per group:
df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
Result:
Id
1 2
2 0
3 1
4 0
5 1
6 0
df1.merge(df2, on="Id", how="left").assign(
temp=lambda x: x.ApplicationDate_y.notna(),
tempr=lambda x: x.ApplicationDate_x > x.ApplicationDate_y,
counter=lambda x: x.temp & x.tempr,
).groupby("Id").counter.sum()
Id
1 2
2 0
3 1
4 0
5 1
6 0
Name: counter, dtype: int64
The code above merges the dataframes and then, within each Id group, sums the combined condition to get the count.
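Here is a self-contained sketch of the merge-based approach using the sample frames from the question; parsing the dates as day-month-year is an assumption for illustration (the counts come out the same with month-first parsing):
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'ApplicationDate': ['5-12-20', '6-12-20', '7-12-20',
                                        '8-12-20', '9-12-20', '10-12-20']})
df2 = pd.DataFrame({'Id': [1, 1, 3, 3, 5],
                    'ApplicationDate': ['4-11-20', '4-12-20', '5-12-20',
                                        '8-12-20', '1-12-20']})

# assumed day-month-year format
for d in (df1, df2):
    d['ApplicationDate'] = pd.to_datetime(d['ApplicationDate'], format='%d-%m-%y')

df3 = df1.merge(df2, on='Id', how='left')
# NaT comparisons are False, so ids with no match in df2 count as 0
counts = df3.ApplicationDate_x.gt(df3.ApplicationDate_y).groupby(df3.Id).sum()
print(counts.tolist())  # [2, 0, 1, 0, 1, 0]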
Having this dataframe:
provincia contagios defunciones fecha
0 distrito nacional 11 0 18/3/2020
1 azua 0 0 18/3/2020
2 baoruco 0 0 18/3/2020
3 dajabon 0 0 18/3/2020
4 barahona 0 0 18/3/2020
How can I have a new dataframe like this:
provincia contagios_from_march1_8 defunciones_from_march1_8
0 distrito nacional 11 0
1 azua 0 0
2 baoruco 0 0
3 dajabon 0 0
4 barahona 0 0
Where 'contagios_from_march1_8' and 'defunciones_from_march1_8' are the sums of 'contagios' and 'defunciones' over the date range 3/1/2020 to 3/8/2020.
Thanks.
You can sum over a boolean-filtered dataframe, e.g. (using the question's 'fecha' column, with month standing in for the upper bound of the range):
df[df["fecha"] < month]["contagios"].sum()
refer to this for extracting month out of date: Extracting just Month and Year separately from Pandas Datetime column
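A fuller sketch, under the assumption that 'fecha' holds day/month/year text and that the goal is a per-provincia sum over 1-8 March 2020; the rows below are made up for illustration and the renamed columns follow the desired output above:
import pandas as pd

# made-up rows for illustration; the real dataframe has one row per provincia per fecha
df = pd.DataFrame({'provincia': ['azua', 'azua', 'dajabon', 'dajabon'],
                   'contagios': [2, 3, 1, 4],
                   'defunciones': [0, 1, 0, 2],
                   'fecha': ['2/3/2020', '7/3/2020', '5/3/2020', '18/3/2020']})

df['fecha'] = pd.to_datetime(df['fecha'], format='%d/%m/%Y')

mask = df['fecha'].between('2020-03-01', '2020-03-08')      # 1-8 March 2020, inclusive
result = (df[mask]
          .groupby('provincia', as_index=False)[['contagios', 'defunciones']].sum()
          .rename(columns={'contagios': 'contagios_from_march1_8',
                           'defunciones': 'defunciones_from_march1_8'}))
print(result)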
I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create a n_months_no_income column that counts how many consecutive months a given individual has got wage==0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix of groupby('id'), cumcount(), maybe diff() or apply(), and then a fillna(0), but I'm not finding the right combination.
Do you have any ideas?
Here's code to build the example dataframe, for ease of replication:
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2], 'wage': [100, 100, 0, 0, 120, 0, 0, 0],
                   'date': [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303]})
Edit: Added code for ease of use.
In your case, use two groupbys with cumcount, creating the additional key with cumsum:
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
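One possible variant, shown here as an alternative sketch rather than the code above: if the count should also start at 1 for a group whose very first wage is already 0 (id 2 in the sample, whose expected values are 1 and 2), the zero flags can be cumulatively summed within each run of zeros and attached back as the n_months_no_income column:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2],
                   'wage': [100, 100, 0, 0, 120, 0, 0, 0],
                   'date': [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303]})

zero = df['wage'].eq(0)
run_id = (~zero).cumsum()                 # a new run starts after every non-zero wage
df['n_months_no_income'] = zero.astype(int).groupby([df['id'], run_id]).cumsum()
print(df)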