having this dataframe:
provincia contagios defunciones fecha
0 distrito nacional 11 0 18/3/2020
1 azua 0 0 18/3/2020
2 baoruco 0 0 18/3/2020
3 dajabon 0 0 18/3/2020
4 barahona 0 0 18/3/2020
How can I have a new dataframe like this:
provincia contagios_from_march1_8 defunciones_from_march1_8
0 distrito nacional 11 0
1 azua 0 0
2 baoruco 0 0
3 dajabon 0 0
4 barahona 0 0
Where the 'contagios_from_march1_8' and 'defunciones_from_march1_8' are the result of the sum of the 'contagios' and 'defunciones' in the date range 3/1/2020 to 3/8/2020.
Thanks.
You can call sum on a filtered selection, e.g. (with month standing for the cutoff you compare against):
df[df["fecha"] < month]["contagios"].sum()
Refer to this for extracting the month from a date: Extracting just Month and Year separately from Pandas Datetime column
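To get one row per provincia, the filter can be combined with a groupby. The following is a minimal sketch, assuming the fecha strings are in day/month/year order and that the range of interest is 1 to 8 March 2020; the sample rows are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample rows (not the asker's data), dates as day/month/year strings.
df = pd.DataFrame({
    "provincia": ["distrito nacional", "azua", "distrito nacional", "azua", "distrito nacional"],
    "contagios": [2, 0, 3, 1, 11],
    "defunciones": [0, 0, 1, 0, 0],
    "fecha": ["2/3/2020", "2/3/2020", "7/3/2020", "7/3/2020", "18/3/2020"],
})

# Parse the strings so that comparisons are done on real dates, not text.
df["fecha"] = pd.to_datetime(df["fecha"], format="%d/%m/%Y")

# Keep only rows inside the (inclusive) date range, then sum per provincia.
in_range = df["fecha"].between(pd.Timestamp("2020-03-01"), pd.Timestamp("2020-03-08"))
result = (
    df.loc[in_range]
    .groupby("provincia", as_index=False)[["contagios", "defunciones"]]
    .sum()
    .rename(columns={
        "contagios": "contagios_from_march1_8",
        "defunciones": "defunciones_from_march1_8",
    })
)
print(result)
```

Provincias with no rows in the range do not appear in result; if they are needed, reindexing against the full list of provincias with a fill value of 0 is one option.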
I have a dataframe looks like
Cnt_A Cnt_B Cnt_C Cnt_D
ID_1 0 1 3 0
ID_2 1 0 0 0
ID_3 5 2 0 8
...
I'd like to count columns that are not zero and put the result into new column like this,
Total_Not_Zero_Cols Cnt_A Cnt_B Cnt_C Cnt_D
ID_1 2 0 1 3 0
ID_2 1 1 0 0 0
ID_3 3 5 2 0 8
...
I used a loop to get the result, but it took a very long time (of course).
I can't figure out the most efficient way to do a conditional calculation across columns :(
Thank you in advance
Check whether each value is not equal to 0, then sum along the columns axis:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)
print(df)
# Output
Cnt_A Cnt_B Cnt_C Cnt_D Total_Not_Zero_Cols
ID_1 0 1 3 0 2
ID_2 1 0 0 0 1
ID_3 5 2 0 8 3
Use ne to generate a DataFrame of booleans with True for non-zero values, then aggregate the rows as integers using sum:
df['Total_Not_Zero_Cols'] = df.ne(0).sum(axis=1)
Numpy based -
Use -
np.sum(df!=0, axis=1)
Output
ID_1 2
ID_2 1
ID_3 3
dtype: int64
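One thing to watch with the approaches above: once Total_Not_Zero_Cols exists in df, running df.ne(0) over the whole frame again would count that column as well. A small sketch that restricts the count to the Cnt_ columns and puts the new column first, as in the desired output (selecting by the "Cnt_" prefix is an assumption based on the sample column names):

```python
import pandas as pd

df = pd.DataFrame(
    {"Cnt_A": [0, 1, 5], "Cnt_B": [1, 0, 2], "Cnt_C": [3, 0, 0], "Cnt_D": [0, 0, 8]},
    index=["ID_1", "ID_2", "ID_3"],
)

# Select only the count columns, so the total column is never counted itself.
cnt_cols = df.filter(like="Cnt_").columns

# True where a value is non-zero; summing booleans row-wise counts them.
df["Total_Not_Zero_Cols"] = df[cnt_cols].ne(0).sum(axis=1)

# Move the new column to the front.
df = df[["Total_Not_Zero_Cols", *cnt_cols]]
print(df)
```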
Please refer to the image attached
I have a data frame that has yearly revenue in columns (2020 to 2025). I want to shift the revenue in those columns by a given time delta(column Time Shift). The time delta I have is in terms of days. Is there an efficient way to make the shift?
E.G
What I want to achieve is to shift the yearly revenue in columns by the value of days in the Time Shift column i.e. 4 days of revenue to shift from column to column ( i.e. 1.27[116/365 * 4] should be shifted from 2022 to 2023 for the 1st row)
Thanks in Advance
Text Input data
Launch Date Launch Date Base Time Shift 2020 2021 2022 2023 2024 2025
2022-06-01 2022-06-01 4 0 0 115.98 122.93 119.22 35.31
2025-02-01 2025-02-01 4 0 0 0 0 0 66.18859318
2022-09-01 2022-09-01 4 49.42 254.86 191.12 248.80 206.53 98.22
2025-01-01 2025-01-01 4 0 0 0 0 14.47 54.24
2022-06-01 2022-06-01 4 0 0 50.25 53.26 51.65 15.30
2025-02-01 2025-02-01 4 0 0 0 0 0 28.67
2022-09-01 2022-09-01 4 148.20 758.22 535.45 676.73 545.42 251.83
2025-01-01 2025-01-01 4 0 0 0 0 38.23 139.07
2022-06-01 2022-06-01 4 0 0 140.78 144.88 136.41 39.23
You can figure out how much to shift per year, then subtract that amount from the current year and add it to the next year.
Get the column names of interest
ycols = [str(n) for n in range(2020,2026)]
calculate the amount that needs shifting, per year (to the next year):
shift_df = df[ycols].multiply(df['Time_Shift']/365.0, axis=0)
looks like this
2020 2021 2022 2023 2024 2025
-- -------- ------- -------- -------- -------- --------
0 0 0 1.27101 1.34718 1.30652 0.386959
1 0 0 0 0 0 0.725354
2 0.541589 2.79299 2.09447 2.72658 2.26334 1.07638
3 0 0 0 0 0.158575 0.594411
4 0 0 0.550685 0.583671 0.566027 0.167671
5 0 0 0 0 0 0.314192
6 1.62411 8.30926 5.86795 7.41622 5.97721 2.75978
7 0 0 0 0 0.418959 1.52405
8 0 0 1.54279 1.58773 1.4949 0.429918
Now create a copy of df (could use the original if you want of course) and apply the operations:
df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]
df2[ycols[1:]] =df2[ycols[1:]] + shift_df[ycols[:-1]].values
The slightly tricky bits are in the last line: we use the indexing [1:] and [:-1] to line up each year with the previous year's shift, and we use the .values attribute because otherwise the column labels do not match and pandas cannot do the addition as intended.
After this we get df2:
Launch_Date Launch_Date_Base Time_Shift 2020 2021 2022 2023 2024 2025
-- ------------- ------------------ ------------ -------- ------- -------- ------- -------- --------
0 2022-06-01 2022-06-01 4 0 0 114.709 122.854 119.261 36.2296
1 2025-02-01 2025-02-01 4 0 0 0 0 0 65.4632
2 2022-09-01 2022-09-01 4 48.8784 252.609 191.819 248.168 206.993 99.407
3 2025-01-01 2025-01-01 4 0 0 0 0 14.3114 53.8042
4 2022-06-01 2022-06-01 4 0 0 49.6993 53.227 51.6676 15.6984
5 2025-02-01 2025-02-01 4 0 0 0 0 0 28.3558
6 2022-09-01 2022-09-01 4 146.576 751.535 537.891 675.182 546.859 255.047
7 2025-01-01 2025-01-01 4 0 0 0 0 37.811 137.965
8 2022-06-01 2022-06-01 4 0 0 139.237 144.835 136.503 40.295
As you may notice, the amount shifted out of year 2025 is 'lost', i.e. we do not assign it to any new column.
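The same steps can be put together in one runnable sketch. It uses only the first and third rows of the input data, and it swaps the [1:] / [:-1] slicing with .values for a column-wise shift(axis=1) followed by fillna, which is an alternative way to line up the previous year's amount, not the code from the answer above.

```python
import pandas as pd

# Two rows taken from the question's input data.
df = pd.DataFrame({
    "Time_Shift": [4, 4],
    "2020": [0.0, 49.42],
    "2021": [0.0, 254.86],
    "2022": [115.98, 191.12],
    "2023": [122.93, 248.80],
    "2024": [119.22, 206.53],
    "2025": [35.31, 98.22],
})
ycols = [str(y) for y in range(2020, 2026)]

# Amount leaving each year: revenue * days / 365, row by row.
shift_df = df[ycols].multiply(df["Time_Shift"] / 365.0, axis=0)

# Amount arriving in each year: the previous year's outgoing amount.
# The first year receives nothing, hence the fill with 0.
incoming = shift_df.shift(1, axis=1).fillna(0.0)

df2 = df.copy()
df2[ycols] = df[ycols] - shift_df + incoming

# What leaves 2025 has no column to land in.
lost = shift_df["2025"]
print(df2.round(3))
```

Adding lost back to the row totals of df2 reproduces the row totals of df, which is a quick sanity check that nothing else went missing.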
I have a Pandas dataframe with the following columns:
SecId Date Sector Country
184149 2019-12-31 Utility USA
184150 2019-12-31 Banking USA
187194 2019-12-31 Aerospace FRA
...............
128502 2020-02-12 CommSvcs UK
...............
SecId & Date columns are the indices. What I want is the following..
SecId Date Aerospace Banking CommSvcs ........ Utility AFG CAN .. FRA .... UK USA ...
184149 2019-12-31 0 0 0 1 0 0 0 0 1
184150 2019-12-31 0 1 0 0 0 0 0 0 1
187194 2019-12-31 1 0 0 0 0 0 1 0 0
................
128502 2020-02-12 0 0 1 0 0 0 0 1 0
................
What is the efficient way to pivot this? The original data is denormalized for each day and can have millions of rows.
You can use get_dummies. You can cast as a categorical dtype beforehand to define what columns will be created.
code:
SECTORS = df.Sector.unique()
df["Sector"] = df.Sector.astype(pd.CategoricalDtype(SECTORS))
COUNTRIES = df.Country.unique()
df["Country"] = df.Country.astype(pd.CategoricalDtype(COUNTRIES))
df2 = pd.get_dummies(data=df, columns=["Sector", "Country"], prefix="", prefix_sep="")
output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
Try as @BEN_YO suggests:
pd.get_dummies(df,columns=['Sector', 'Country'], prefix='', prefix_sep='')
Output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
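The point of casting to a categorical dtype first is that get_dummies then emits a column for every declared category, including ones that do not occur in the current batch of rows. A minimal sketch, assuming the full category lists are known up front (the lists below are illustrative, and dtype="uint8" is an optional choice to keep the result small):

```python
import pandas as pd

df = pd.DataFrame({
    "SecId": [184149, 184150, 187194],
    "Date": pd.to_datetime(["2019-12-31"] * 3),
    "Sector": ["Utility", "Banking", "Aerospace"],
    "Country": ["USA", "USA", "FRA"],
})

# Assumed full category lists; CommSvcs and UK are absent from these three rows.
sector_dtype = pd.CategoricalDtype(["Aerospace", "Banking", "CommSvcs", "Utility"])
country_dtype = pd.CategoricalDtype(["FRA", "UK", "USA"])

df["Sector"] = df["Sector"].astype(sector_dtype)
df["Country"] = df["Country"].astype(country_dtype)

# One indicator column per declared category, with no prefix on the names.
dummies = pd.get_dummies(
    df, columns=["Sector", "Country"], prefix="", prefix_sep="", dtype="uint8"
)
print(dummies)
```

With an empty prefix, a value that appears in both Sector and Country would produce two columns with the same name, so the category lists should not overlap.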
I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create a n_months_no_income column that counts how many consecutive months a given individual has got wage==0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix between groupby('id') , cumcount(), maybe diff() or apply() and then a fillna(0), but I'm not finding the right one.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id':[1,1,1,1,1,1,2,2],'wage':[100,100,0,0,120,0,0,0],
'date':[201212,201301,201302,201303,201304,201305,201302,201303]})
Edit: Added code for ease of use.
In your case, use two groupby calls with cumcount, and create the additional grouping key with cumsum:
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
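Note that the one-liner above starts counting at 0 for an id whose first rows already have wage 0 (id 2 in the sample data), because those rows all share the initial cumsum key 0 and cumcount starts at 0. The following sketch is a variant, not the answer's code: it sums a zero-wage flag within each block instead of using cumcount, and it assumes the rows are already sorted by date within each id.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2],
    "wage": [100, 100, 0, 0, 120, 0, 0, 0],
    "date": [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303],
})

# 1 where the month has no income, 0 otherwise.
zero = df["wage"].eq(0).astype(int)

# Every non-zero wage starts a new block within its id.
blocks = df["wage"].ne(0).astype(int).groupby(df["id"]).cumsum()

# Running count of zero-wage months inside each (id, block) pair.
df["n_months_no_income"] = zero.groupby([df["id"], blocks]).cumsum()
print(df)
```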
I have a dataframe like this which is generated after some aggregations and conditions,
X P D1 D2
1 A 2016-06-02 2016-07-26
2 A 2016-10-04 2016-12-01
3 A 2016-12-13 2017-03-11
1 B 2017-03-04 2018-01-11
From this dataframe, I have to populate another dataframe that has n number of columns where each column is for a month in the range of [201606, 201607,......, 201801] which is made earlier. ie I already have a dataframe with columns as mentioned above. I want to populate that dataframe.
I want to make a row for each record in the aggregated dataframe, where the combination of X, P will be unique throughout the aggregated dataframe.
For the first record, I want to fill the columns 201606 to 201607 ie from D1 to D2 (both inclusive) with 1. All other columns should be 0
For the second row, I want to fill the columns 201610 to 201612 with 1 and 0 for every other column, and so on for every row in the aggregated dataframe.
How can I do this faster and efficiently using pandas? I prefer not to loop through the dataframe as my data will be huge.
If populating an existing dataframe is not ideal, generating a dataframe as I mentioned above can also serve my purpose.
I cannot imagine how to not iterate anything. But it is possible to iterate either the rows of the initial dataframe or the columns of the resulting dataframe:
First build the resulting dataframe with all columns set to 0:
resul = pd.DataFrame(data = 0, columns=pd.period_range(df.D1.min(), df.D2.max(), freq='M'),
index = df.index)
Iterating rows of the initial dataframe:
for ix, row in df.iterrows():
resul.loc[ix, pd.period_range(row.D1, row.D2, freq='M')] = 1
Iterating columns of the result dataframe
for c in resul.columns:
resul[c] = np.where((c.end_time>=df.D1)&(c.start_time <= df.D2), 1, 0)
In both cases, with your sample data, it gives the expected result:
2016-06 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02 2017-03 2017-04 2017-05 2017-06 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 2018-01
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
Choose whichever method has the shorter iteration: if the initial dataframe has fewer rows than the resul dataframe has columns, choose method 1, else method 2.
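If memory allows a full rows-by-months boolean array, the Python-level loop can also be replaced by NumPy broadcasting. This is a sketch of that alternative, not one of the two methods above; month_number is a hypothetical helper that maps a year and month to a single integer so that months can be compared directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "X": [1, 2, 3, 1],
    "P": ["A", "A", "A", "B"],
    "D1": pd.to_datetime(["2016-06-02", "2016-10-04", "2016-12-13", "2017-03-04"]),
    "D2": pd.to_datetime(["2016-07-26", "2016-12-01", "2017-03-11", "2018-01-11"]),
})

# One column per month between the earliest D1 and the latest D2.
months = pd.period_range(df["D1"].min(), df["D2"].max(), freq="M")


def month_number(year, month):
    """Hypothetical helper: encode (year, month) as one comparable integer."""
    return np.asarray(year) * 12 + np.asarray(month)


col_no = month_number(months.year, months.month)
start_no = month_number(df["D1"].dt.year, df["D1"].dt.month)
end_no = month_number(df["D2"].dt.year, df["D2"].dt.month)

# Broadcast rows against columns: True where the month lies within [D1, D2].
mask = (col_no[None, :] >= start_no[:, None]) & (col_no[None, :] <= end_no[:, None])

resul = pd.DataFrame(mask.astype(int), index=df.index, columns=months)
print(resul)
```

The intermediate mask holds one value per (row, month) pair, so for very large inputs it may be worth building the result in chunks of rows.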