I have a dataframe with profit values, IDs, and week values. It looks a little like this
ID
Week
Profit
A
1
2
A
2
2
A
3
0
A
4
0
I want to create two new columns called "Bi-Weekly" and "Monthly", so week 1 would be label 2, week 2 would also be label 2, but week 3 would be labeled 4, and week 4 would be labeled 4, and they would all be labeled month 1, so I could groupby weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions which work, but the weeks are going to go up to a year (52 weeks) so I was wondering if there's a more efficient way. My bi-weekly function below.
def biweek(prof_calc):
if (prof_calc['week']==2):
return 2
elif (prof_calc['week']==3):
return 2
elif (prof_calc['week']==4):
return 4
elif (prof_calc['week']==5):
return 4
elif (prof_calc['week']==6):
return 6
elif (prof_calc['week']==7):
return 6
elif (prof_calc['week']==8):
return 8
elif (prof_calc['week']==9):
return 8
elif (prof_calc['week']==10):
return 10
elif (prof_calc['week']==11):
return 10
prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)
IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
Related
How to iterate rows in pandas to perform the Manipulation in a format below
I have a csv file that contains a 365 column and 1152 rows(the rows index is divided like(1,48),(1,48)...), I need to select K maximum rows from every (1,48) row index and perform some manipulation.
Steps I took:
I used df.apply to do this.
Code I tried
def with_battery(val):
for i in range(d2i.shape[0]):
if i in [31,32,33,34,35,36]: #[31,32,33,34,35,36] should be replaced by top K max.
#batterysize = 50
if val.iloc[i]>batterysize:
val.iloc[i]=0
else:
val.iloc[i] -= batterysize
return val
D2j = D2i.apply(with_battery,axis=0)
How the data is:
**Input Dataframe**
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 32 32 32 32 32 32 32
4 21 21 21 21 21 21 21
5 42 42 42 42 42 42 42
6 34 34 34 34 34 34 34
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 54 54 54 54 54 54 54
4 45 45 45 45 45 45 45
5 43 43 43 43 43 43 43
6 42 42 42 42 42 42 42
> for K=3, the row (3,5,6) is max so I made the value less than 50 as Zero and value more than 50 as value - 50. Similarly in next chunk of rows (3,4,5) is top 3 max rows and I performed similar action as above
Output Dataframe
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 0 0 0 0 0 0 0
4 21 21 21 21 21 21 21
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 4 4 4 4 4 4 4
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 42 42 42 42 42 42 42
I am working on a dataframe and I want to group the data for an hour into 4 different slots of 15 mins,
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I am not even able to think, how to go forward with this
I tried extracting hours, minutes and seconds from the time, but what to do now?
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print (df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
I am working with python to create a new frame starting from two frame by using Pandas.
The first frame (called frame1) is composed by the following line:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
The second frame (called frame2) is:
A B C D E
19 19 19 19 19
24 24 24 24 24
29 29 29 29 29
34 34 34 34 34
39 39 39 39 39
44 44 44 44 44
49 49 49 49 49
54 54 54 54 54
59 59 59 59 59
64 64 64 64 64
69 69 69 69 69
74 74 74 74 74
79 79 79 79 79
84 84 84 84 84
89 89 89 89 89
94 94 94 94 94
99 99 99 99 99
Now i want to create a new dataset with this logic: starting from frame1 substitute every 5 row until the end of the frame1, the row of the frame1 with a random row of the frame2 (and remove the added row from frame2). A possible output should be:
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
59 59 59 59 59
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
29 29 29 29 29
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
84 84 84 84 84
How can i do this operation?
It's quite simple:
frame1.loc[4::5] = frame2.sample(frac=1).reset_index(drop=True)
where
df.loc[4::5] selects every fifth element, starting with the fifth one in df, and
df.sample(frac=1).reset_index(drop=True) shuffles a df around randomly
One way is to first obtain the indices where to update (we could also slice assign, but we'd have the problem of the end not being included), and then assign back taking a sample from df2 of the corresponding size:
ix = np.flatnonzero(np.diff(np.arange(df.shape[0]+1)//5))
df1.iloc[ix] = df2.sample(df1.shape[0]//5).to_numpy()
print(df1)
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 84 84 84 84 84
5 6 6 6 6 6
6 7 7 7 7 7
7 8 8 8 8 8
8 9 9 9 9 9
9 89 89 89 89 89
10 11 11 11 11 11
11 12 12 12 12 12
12 13 13 13 13 13
13 14 14 14 14 14
14 99 99 99 99 99
I have some code here:
for i in range(self.size):
print('{:6d}'.format(self.data[i], end=' '))
if (i + 1) % NUMBER_OF_COLUMNS == 0:
print()
Right now this prints as:
1
1
1
1
1
2
3
3
3
3
(whitespace)
3
3
3
etc.
It creates a new line when it hits 10 digits, but it doens't print the initial 10 in a row...
This is what I want-
1 1 1 1 1 1 1 2 2 3
3 3 3 3 3 4 4 4 4 5
However when it hits two digit numbers it gets messed up -
8 8 8 8 8 9 9 9 9 10
10 10 10 10 10 10 etc.
I want it to be right-aligned like this-
8 8 8 8 8 9
10 10 10 10 11 12 etc.
When I remove the format piece it will print the rows out, but there wont be the extra spacing in there of course!
You can align strings by "padding" values using a string's .rjust method. Using some dummy data:
NUMBER_OF_COLUMNS = 10
for i in range(100):
print("{}".format(i//2).rjust(3), end=' ')
#print("{:3}".format(i//2), end=' ') edit: this also works. Thanks AChampion
if (i + 1) % NUMBER_OF_COLUMNS == 0:
print()
#Output:
0 0 1 1 2 2 3 3 4 4
5 5 6 6 7 7 8 8 9 9
10 10 11 11 12 12 13 13 14 14
15 15 16 16 17 17 18 18 19 19
20 20 21 21 22 22 23 23 24 24
25 25 26 26 27 27 28 28 29 29
30 30 31 31 32 32 33 33 34 34
35 35 36 36 37 37 38 38 39 39
40 40 41 41 42 42 43 43 44 44
45 45 46 46 47 47 48 48 49 49
Another approach is to just chunk up the data into rows and print each row, e.g.:
def chunk(iterable, n):
return zip(*[iter(iterable)]*n)
for row in chunk(self.data, NUMBER_OF_COLUMNS):
print(' '.join(str(data).rjust(6) for data in row))
e.g:
In []:
for row in chunk(range(100), 10):
print(' '.join(str(data//2).rjust(3) for data in row))
Out[]:
0 0 1 1 2 2 3 3 4 4
5 5 6 6 7 7 8 8 9 9
10 10 11 11 12 12 13 13 14 14
15 15 16 16 17 17 18 18 19 19
20 20 21 21 22 22 23 23 24 24
25 25 26 26 27 27 28 28 29 29
30 30 31 31 32 32 33 33 34 34
35 35 36 36 37 37 38 38 39 39
40 40 41 41 42 42 43 43 44 44
45 45 46 46 47 47 48 48 49 49
I have this dataframe, I'm trying to create a new column where I want to store the difference of products sold based on code and date.
for example this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1= full_df1.sort(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
that gives me back this output for a single date 20150609 :
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
is there a better solution to have the same result in a more pythonic way?
I would like to create a new column [difference] and store the data there having as result 4 columns [date, code, sold, difference]
This exactly the kind of thing that panda's groupby functionality is built for, and I highly recommend reading and working through this documentation: panda's groupby documentation
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5