I have following df. I am going to dynamically create new columns based on number of date (day_number=2), and conditionally fill them based on "code" and "count"
Current format:
code count
id date
ABC1 2019-04-04 1 76
2019-04-05 2 82
Desired matrix-like format:
code count code1_day1 code1_day1 code1_day2 code2_day2
id date
ABC1 2019-04-04 1 76 76 0 0 82
2019-04-05 2 82
I have done this but it fills the same for every column:
code=[1,2]
for date, new in df.groupby(level=[0]):
for col in range(day_number): # day_number=2
for lvl in code:
new[f"day{col+1}_code1"]=new['count'].where(new['code']==1)
new[f"day{col+1}_code2"]=new['count'].where(new['code']==2)
So many thanks for your help!
A biger example of the databse:
code count new-col1 new_col2 ......
id date
ABC1
2019-04-04 1 76 76 0 79 0 82 0 83 0 88 0 55 3 65 6
2019-04-05 1 79 79 0 82 0 83 0 88 0 55 3 65 6 101 10
2019-04-06 1 82 82 0 83 0 88 0 55 3 65 6 101 10 120 14
2019-04-07 2 83 83 0 88 0 55 3 65 6 101 10 120 14 0 0
2019-04-08 1 88 88 0 55 3 65 6 101 10 120 14 0 0 0 0
2019-04-09 1 55 55 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-09 2 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-10 1 65 101 10 120 14 0 0 0 0 10 0
2019-04-10 2 6 120 14 0 0 0 0 10 0
2019-04-11 1 101 0 0 0 0 10 0
your sample data is not so usable so I've simulated
considering differently, the data is grouped, hence groupby() ID in index and code
apply() after a groupby() gets passed a dataframe, build required columns on this dataframe
d = pd.date_range("01-jan-2021","03-jan-2021")
df = pd.concat([
pd.DataFrame({"ID":"ABC1","date":d,"code":1,"count":np.random.randint(20,50, len(d))}),
pd.DataFrame({"ID":"ABC1","date":d,"code":2,"count":np.random.randint(20,50, len(d))})
]).sort_values(["ID","date","code"], ascending=[True,False,True]).set_index(["ID","date"])
# pad an array with NaN to same length as second iterable
def nppad(a, s):
return np.pad(a.astype(float), (0,len(s)-len(a)), "constant", constant_values=np.nan)
df2 = df.groupby(["ID","code"]).apply(lambda dfa: dfa.assign(**{f"code{dfa.iloc[0,0]}_day{i+1}":
nppad(dfa["count"].values[i:],dfa)
for i in range(len(dfa))}))
output
code count code1_day1 code1_day2 code1_day3 code2_day1 code2_day2 code2_day3
ID date
ABC1 2021-01-03 1 40 40.0 38.0 46.0 NaN NaN NaN
2021-01-03 2 37 NaN NaN NaN 37.0 33.0 33.0
2021-01-02 1 38 38.0 46.0 NaN NaN NaN NaN
2021-01-02 2 33 NaN NaN NaN 33.0 33.0 NaN
2021-01-01 1 46 46.0 NaN NaN NaN NaN NaN
2021-01-01 2 33 NaN NaN NaN 33.0 NaN NaN
Related
I have 2 dataframes, i want to get sum value of every row based on groupby of unique id each previous 3rows & each row value should be multiply by other dataframe value
for example : dataframe A dataframe B
unique_id value out_value num_values
1 1 45 0.15
2 1 33 0.30
3 1 18 0.18
#4 1 26 20.7
5 2 66
6 2 44
7 2 22
#8 2 19. 28.3
expected output_value column
4th row = 18 * 0.15 + 33*0.30 + 45*0.18 = 2.7+9.9+8.1 = 20.7
8th row = 22 * 0.15 + 44*0.30 + 66*0.18 = 3.3+ 13.2 + 11.88= 28.3
based on Unique_id each value should calculate based previous 3values.
for every row there will be previous 3 rows available
import pandas as pd
import numpy as np
df_a = pd.DataFrame({
'uni_id':[1, 1, 1, 1, 2, 2, 2, 2, 152, 152, 152, 152, 152],
'value':[45,33,18,26,66,44,22,19,36,27,45,81,90]
}, index=range(1,14))
df_b = pd.DataFrame({
'num_values':[0.15,0.30,0.18]
})
df_a
###
uni_id value
1 1 45
2 1 33
3 1 18
4 1 26
5 2 66
6 2 44
7 2 22
8 2 19
9 152 36
10 152 27
11 152 45
12 152 81
13 152 90
df_b
###
num_values
0 0.15
1 0.30
2 0.18
# main calculation
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean
mask = df_a.groupby('uni_id').cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id value out_value
1 1 45 NaN
2 1 33 NaN
3 1 18 NaN
4 1 26 20.70
5 2 66 NaN
6 2 44 NaN
7 2 22 NaN
8 2 19 28.38
9 152 36 NaN
10 152 27 NaN
11 152 45 NaN
12 152 81 21.33
13 152 90 30.51
If you want to keep non-null values through filtrate:
df_a.query('out_value.notnull()')
###
uni_id value out_value
4 1 26 20.70
8 2 19 28.38
12 152 81 21.33
13 152 90 30.51
Group with metrics uni_id,Year_Month
Data preparation:
# create date range series with 7 days
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
rng.integers(10,100, 26)
date_range = pd.Series(pd.date_range(start='01.30.2020', periods=27, freq='5D')).dt.to_period('M')
df_a = pd.DataFrame({
'uni_id':[1, 1, 1, 1,1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 152, 152, 152, 152, 152,152, 152, 152, 152, 152],
'Year_Month':date_range,
'value':rng.integers(10,100, 26)
}, index=range(1,27))
df_b = pd.DataFrame({
'num_values':[0.15,0.30,0.18]
})
df_a
###
uni_id Year_Month value
1 1 2020-02 46
2 1 2020-02 84
3 1 2020-02 59
4 1 2020-02 49
5 1 2020-02 50
6 1 2020-02 30
7 1 2020-03 18
8 1 2020-03 59
9 2 2020-03 89
10 2 2020-03 15
11 2 2020-03 87
12 2 2020-03 84
13 2 2020-04 34
14 2 2020-04 66
15 2 2020-04 24
16 2 2020-04 78
17 152 2020-04 73
18 152 2020-04 41
19 152 2020-05 16
20 152 2020-05 97
21 152 2020-05 50
22 152 2020-05 90
23 152 2020-05 71
24 152 2020-05 80
25 152 2020-06 78
26 152 2020-06 27
Processing
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean
mask = df_a.groupby(['uni_id','Year_Month']).cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id Year_Month value out_value
1 1 2020-02 46 NaN
2 1 2020-02 84 NaN
3 1 2020-02 59 NaN
4 1 2020-02 49 40.17
5 1 2020-02 50 32.82
6 1 2020-02 30 28.32
7 1 2020-03 18 NaN
8 1 2020-03 59 NaN
9 2 2020-03 89 NaN
10 2 2020-03 15 NaN
11 2 2020-03 87 NaN
12 2 2020-03 84 41.4
13 2 2020-04 34 NaN
14 2 2020-04 66 NaN
15 2 2020-04 24 NaN
16 2 2020-04 78 30.78
17 152 2020-04 73 NaN
18 152 2020-04 41 NaN
19 152 2020-05 16 NaN
20 152 2020-05 97 NaN
21 152 2020-05 50 NaN
22 152 2020-05 90 45.96
23 152 2020-05 71 46.65
24 152 2020-05 80 49.5
25 152 2020-06 78 NaN
26 152 2020-06 27 NaN
I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataaset:
Basically, I need to sort values within a group, get cumulative sales and select those lines that compose 90% of sales.
to get first
and then, just select 90% of sales within each region
I have tried the following but the last line doesn't work. I returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method
I've tried apply also..
import pandas as pd
df = pd.DataFrame({'id':['id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8', 'id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8'],
'region':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,],
'sales':[54,34,23,56,78,98,76,34,27,89,76,54,34,45,56,54]})
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region','sales'], ascending=[True,False],inplace=True)
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
output
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
df
.sort_values(by='sales', ascending=False)
.groupby('region')
.apply(lambda x[x.sales > x.sales.quantile(.1)])
.reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because getting all values greater than 10% is essentially the same as getting the top 90%.
I have one data frame (df1) with 5 columns and another (df2) with 10 columns. I want to add columns from df2 to df1, but only columns names (without values). Also, I want to do the same with adding columns without values from df1 to df2.
Here are the data frames:
df1
A B C D E
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
df2
F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
And I want to get this:
df1
A B C D E F G H I J K L M N O
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
And I want to get this:
df2
A B C D E F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
I tried with df.columns.values and got the array of columns names, but then I have to apply them as data frame columns and give them empty values, and the way that I am doing now has too many lines of code, and I just wonder is it some easier way to do that?
I will appreciate any help.
Use Index.union with DataFrame.reindex:
cols = df1.columns.union(df2.columns)
#if order is important
#cols = df1.columns.append(df2.columns)
df1 = df1.reindex(columns=cols)
df2 = df2.reindex(columns=cols)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 45 21 36 5 65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 5 24 3 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
If same index values in both DataFrames is possible use DataFrame.align:
print (df1)
A B C D E
0 1 234 52 1 54
1 54 23 87 5 125
2 678 67 63 8 18
df1, df2 = df1.align(df2)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
The Scenario:
I have 2 dataframes fc0 and yc0. Where fc0 is a Cluster and yc0 is another dataframe which needs to be merged in fc0.
The Nature of data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and I have few hundered values in yc0. Now I need the yc0 to go into fc0
In haste of resolving it, I even tried yc0.reset_index(inplace=True) but wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but landed up inserting NaN values for 1st 16 Columns and rest of the data shifted by that many columns
Link2 Couldn't match column keys, besides I tried it for row.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with Join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's a MVCE example. Does this small sample data show the functionality that you are expecting?
df1 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
The goal here is to see how many unique values i have in my database. This is the code i have written:
apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])
apps[('appid')] = apps[('appid')].astype(int)
apps_list=apps['appid'].unique()
b = apps.groupby('appid').size()
blist = b.unique()
print len(apps_list), len(blist), len(set(b))
>>>7672 2164 2164
Why is there difference in those two methods?
Due to request i am posting some of my data:
Unnamed: 0 StudID No appid work work2
0 0 76561193665298433 0 10 nan 0
1 1 76561193665298433 1 20 nan 0
2 2 76561193665298433 2 30 nan 0
3 3 76561193665298433 3 40 nan 0
4 4 76561193665298433 4 50 nan 0
5 5 76561193665298433 5 60 nan 0
6 6 76561193665298433 6 70 nan 0
7 7 76561193665298433 7 80 nan 0
8 8 76561193665298433 8 100 nan 0
9 9 76561193665298433 9 130 nan 0
10 10 76561193665298433 10 220 nan 0
11 11 76561193665298433 11 240 nan 0
12 12 76561193665298433 12 280 nan 0
13 13 76561193665298433 13 300 nan 0
14 14 76561193665298433 14 320 nan 0
15 15 76561193665298433 15 340 nan 0
16 16 76561193665298433 16 360 nan 0
17 17 76561193665298433 17 380 nan 0
18 18 76561193665298433 18 400 nan 0
19 19 76561193665298433 19 420 nan 0
20 20 76561193665298433 20 500 nan 0
21 21 76561193665298433 21 550 nan 0
22 22 76561193665298433 22 620 6.0 3064
33 33 76561193665298434 0 10 nan 837
34 34 76561193665298434 1 20 nan 27
35 35 76561193665298434 2 30 nan 9
36 36 76561193665298434 3 40 nan 5
37 37 76561193665298434 4 50 nan 2
38 38 76561193665298434 5 60 nan 0
39 39 76561193665298434 6 70 nan 403
40 40 76561193665298434 7 130 nan 0
41 41 76561193665298434 8 80 nan 6
42 42 76561193665298434 9 100 nan 10
43 43 76561193665298434 10 220 nan 14
IIUC based on attached piece of the dataframe it seems that you should analyze b.index, not values of b. Just look:
b = apps.groupby('appid').size()
In [24]: b
Out[24]:
appid
10 2
20 2
30 2
40 2
50 2
60 2
70 2
80 2
100 2
130 2
220 2
240 1
280 1
300 1
320 1
340 1
360 1
380 1
400 1
420 1
500 1
550 1
620 1
dtype: int64
In [25]: set(b)
Out[25]: {1, 2}
But if you do it for b.index you'll get the same values for all 3 methods:
blist = b.index.unique()
In [30]: len(apps_list), len(blist), len(set(b.index))
Out[30]: (23, 23, 23)