Labeling by period - python

My dataset:
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values for these days, but there are too many unique day numbers, so I want to label them more coarsely and consistently.
Is there a way to do this labeling by cutting the days into 7-day (weekly) bins?
For example, days 1-7 = week 1, days 8-14 = week 2, and so on.
The output I want:
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
Thank you for reading.

Subtract 1, then use integer division by 7, and finally add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
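
For completeness, a minimal self-contained sketch that reproduces the example above (column names and values copied from the question):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'day': [7, 15, 21, 29, 21, 30, 35, 8, 16],
                   'value': [88, 101, 121, 56, 131, 78, 102, 80, 101]})

# shift days to 0-based, floor-divide into 7-day bins, then shift back to 1-based weeks
df['week'] = (df['day'] - 1) // 7 + 1
(For positive day numbers this is equivalent to np.ceil(df['day'] / 7).astype(int), with numpy imported as np.)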

How to swap many columns into rows while keeping rows grouped in pandas? [duplicate]

Let's say that these are my data
day region cars motorcycles bikes buses
1 A 0 1 1 2
2 A 4 0 6 8
3 A 2 9 8 0
1 B 6 12 34 82
2 B 13 92 76 1
3 B 23 87 98 9
1 C 29 200 31 45
2 C 54 80 23 89
3 C 129 90 231 56
How do I make the regions into columns, and the columns (except for the day column) into rows?
Basically, I want it to look like this :
day vehicle_type A B C
1 cars 0 6 29
2 cars 4 13 54
3 cars 2 23 129
1 motorcycles 1 12 200
2 motorcycles 0 92 80
3 motorcycles 9 87 90
1 bikes 1 34 31
2 bikes 6 76 23
3 bikes 8 98 231
1 buses 2 82 45
2 buses 8 1 89
3 buses 0 9 56
Use stack and unstack:
(
    df.set_index(["day", "region"])
      .rename_axis(columns="vehicle_type")  # name the column axis before stacking
      .stack()                              # vehicle columns become an inner index level
      .unstack(level=1)                     # move region out to the columns
      .rename_axis(columns=None)            # drop the "region" column-axis name
      .reset_index()
)
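
An equivalent way to express the same reshape, shown here only as a sketch, uses melt and pivot (it needs pandas >= 1.1, where pivot accepts a list of index columns) and should produce the same table, up to row and column ordering:
out = (
    df.melt(id_vars=["day", "region"], var_name="vehicle_type")
      .pivot(index=["day", "vehicle_type"], columns="region", values="value")
      .rename_axis(columns=None)
      .reset_index()
)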

Pandas custom groupby fill

I have this dataset:
menu alternative id varA varB varC
1 NaN A NaN NaN NaN
1 NaN A NaN NaN NaN
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 NaN A NaN NaN NaN
7 NaN A NaN NaN NaN
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 NaN B NaN NaN NaN
6 NaN B NaN NaN NaN
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 NaN C NaN NaN NaN
5 NaN C NaN NaN NaN
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
As you can see, I have some null values which I need to fill, but in a somewhat custom manner. For every id and for every menu, I need to fill the null values by randomly selecting a row with the same menu number from a different id where the values are non-null.
Example: menu 1 in id A has null values. I want to randomly select menu 1 from a different id with non-null values, say id B, and fill from there. For menu 7 in id A it could be menu 7 in id C, and so on.
It is somewhat similar to this question, but in my case the filling should happen within the same "subgroups", so to speak.
The final output should be something like this:
menu alternative id varA varB varC
1 82 A 21.34428845 9.901326583 1.053134597
1 91 A 19.04689216 16.29217346 29.56962312
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 17 A 4.319991082 21.29233667 3.516184987
7 8 A 24.09490443 9.507000131 14.93472971
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 57 B 1.068603487 27.95362014 1.334049372
6 100 B 26.31848796 6.757305213 4.742282633
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 48 C 5.591317152 25.17616679 24.30522374
5 16 C 23.85069753 23.12154586 0.781450997
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
Any guidance would be appreciated. Maybe there is even some groupby/apply logic that could help with this.
You can run fillna() row-wise in apply(), then fill with a random sample from the dataframe filtered by your conditions:
df.apply(lambda row: row.fillna(df[(df['menu'] == row['menu']) & (df['id'] != row['id'])].dropna().sample(n=1).iloc[0]), axis=1)
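
The same idea expanded into a small helper for readability (a sketch; like the one-liner, it draws a fresh donor row for every row with NaNs, sample(n=1) assumes at least one non-null donor exists for each menu, and you can pass random_state for reproducibility):
def fill_from_other_id(row):
    # donor pool: same menu, a different id, and fully non-null rows
    donors = df[(df['menu'] == row['menu']) & (df['id'] != row['id'])].dropna()
    # fill this row's NaNs from one randomly chosen donor row (aligned by column name)
    return row.fillna(donors.sample(n=1).iloc[0])

filled = df.apply(fill_from_other_id, axis=1)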

multiple cumulative sum based on grouped columns

I have a dataset where I would like to sum two columns and then perform a subtraction while displaying a cumulative sum.
Data
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2
a q122 4 1 5 50 25 20 55 21 1 1
a q222 1 1 2 50 25 20 57 22 0 0
a q322 0 0 0 50 25 20 57 22 5 5
b q122 5 5 10 100 30 40 110 27 4 4
b q222 2 2 4 100 30 70 114 29 5 1
b q322 3 4 7 100 30 70 121 33 0 1
Desired
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2 finalt1
a q122 4 1 5 50 25 20 55 21 1 1 28
a q222 1 1 2 50 25 20 57 22 0 0 29
a q322 0 0 0 50 25 20 57 22 5 5 24
b q122 5 5 10 100 30 40 110 27 4 4 31
b q222 2 2 4 100 30 70 114 29 5 1 28
b q322 3 4 7 100 30 70 121 33 0 1 31
Logic
Create the 'finalt1' column by summing 't1' and 'cur_t1' initially, and then subtracting 'de_t1' cumulatively, grouping by 'id' and 'date'.
Doing
df['finalt1'] = df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
I am still researching how to subtract the 'de_t1' column cumulatively.
I can't test right now, but logically:
(df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
             .sub(df.groupby('id')['de_t1'].cumsum())
)
Of note, there is also this option to avoid grouping twice (it computes both cumsums at once and takes the difference), but it is actually slower:
df['cur_t1'].add(df.groupby('id')[['de_t1', 't1']].cumsum().diff(axis=1)['t1'])
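
A self-contained sketch using the question's data (only the columns involved are included), assigning the result back; the values match the desired finalt1 column:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'date': ['q122', 'q222', 'q322', 'q122', 'q222', 'q322'],
                   't1': [4, 1, 0, 5, 2, 3],
                   'cur_t1': [25, 25, 25, 30, 30, 30],
                   'de_t1': [1, 0, 5, 4, 5, 0]})

df['finalt1'] = (df['cur_t1']
                 .add(df.groupby('id')['t1'].cumsum())
                 .sub(df.groupby('id')['de_t1'].cumsum()))
print(df['finalt1'].tolist())  # [28, 29, 24, 31, 28, 31]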

Merge dataframes including extreme values

I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes, but at the same time include the first and/or last value of the set in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the dataframes that coincides. Does anyone have an idea of how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind')
       .eval('Found=Ind == "both"')
       .groupby('A')['Found']
       .apply(lambda x: x.rolling(3, center=True, min_periods=2).max())
       .astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
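
The rolling window is what pulls in the neighbours: within each A group, rolling(3, center=True).max() marks a row whenever it, or the row directly before or after it, matched df2. A sketch that builds the question's frames (values copied from the question) so the filter above can be reproduced:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4],
                    'B': [11, 2, 32, 42, 54, 66, 16, 23, 13, 24, 35, 46, 51, 12, 28, 39, 49]})
df2 = pd.DataFrame({'B': [32, 42, 13, 24, 35, 39, 49]})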
pd.concat([df1.groupby('A').min().reset_index(),
           pd.merge(df1, df2, on="B"),
           df1.groupby('A').max().reset_index()]
          ).reset_index(drop=True).drop_duplicates().sort_values(['A', 'B'])
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])

insert dataframe into a dataframe - Python/Pandas

The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with the dataframe sliced by loc:
import numpy as np
import pandas as pd

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#insert df2 after index 4 (note that .loc slicing is inclusive on both ends, so row 4 is repeated at the end)
print (pd.concat([df1.loc[:4], df2, df1.loc[4:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
8 27 4 31 1 13 83
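
To avoid repeating the boundary row, slice by position instead. A small sketch of a generic helper (the name insert_at is made up here) that uses iloc; inserting between positions 10 and 11, as in the question, would then be insert_at(big_df, small_df, 11):
import pandas as pd

def insert_at(big, small, pos):
    # rows of `small` start at positional index `pos` of the result
    return pd.concat([big.iloc[:pos], small, big.iloc[pos:]], ignore_index=True)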
