How to iterate rows in a pandas DataFrame to perform a manipulation - python

How can I iterate over rows in pandas to perform the manipulation shown below?
I have a CSV file with 365 columns and 1152 rows (the row index repeats in blocks of 1 to 48: (1,48), (1,48), ...). I need to select the K maximum rows from every (1,48) block and perform some manipulation on them.
Steps I took:
I used df.apply to do this.
Code I tried:
batterysize = 50

def with_battery(val):
    for i in range(D2i.shape[0]):
        if i in [31,32,33,34,35,36]:  # [31,32,33,34,35,36] should be replaced by top K max.
            # values above the battery size are reduced by it, all others become zero
            if val.iloc[i] > batterysize:
                val.iloc[i] -= batterysize
            else:
                val.iloc[i] = 0
    return val

D2j = D2i.apply(with_battery, axis=0)
How the data is:
**Input Dataframe**
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 32 32 32 32 32 32 32
4 21 21 21 21 21 21 21
5 42 42 42 42 42 42 42
6 34 34 34 34 34 34 34
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 54 54 54 54 54 54 54
4 45 45 45 45 45 45 45
5 43 43 43 43 43 43 43
6 42 42 42 42 42 42 42
> For K=3, rows 3, 5 and 6 are the top three rows of the first block, so I set values of 50 or less to zero and replaced values above 50 with value - 50. In the next block, rows 3, 4 and 5 are the top three rows, and I applied the same operation.
Output Dataframe
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 0 0 0 0 0 0 0
4 21 21 21 21 21 21 21
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 4 4 4 4 4 4 4
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 42 42 42 42 42 42 42
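
One possible way to get from the input to the output above without hard-coding row numbers is to walk over the frame block by block and pick the top K rows of each block. This is only a minimal sketch under stated assumptions: rows are ranked by their sum across all columns (which reproduces the rows chosen in the example), and the names `apply_battery`, `block_size`, `k` and `battery_size` are hypothetical. For the real data the block size would be 48 rather than 6.

```python
import numpy as np
import pandas as pd

def apply_battery(df, block_size, k, battery_size):
    """Reduce the top-k rows of every block by battery_size, flooring at zero."""
    values = df.to_numpy().copy()  # work on a copy so df is left untouched
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]  # a view into `values`
        # Assumption: rows are ranked by their sum across all columns.
        top = np.argsort(block.sum(axis=1))[-k:]
        # value - battery_size where value > battery_size, otherwise 0
        block[top] = np.clip(block[top] - battery_size, 0, None)
    return pd.DataFrame(values, index=df.index, columns=df.columns)

# Sample data from the question: two blocks of six rows each.
rows = [
    [10, 11, 34, 21, 23, 12, 10], [11] * 7, [32] * 7, [21] * 7, [42] * 7, [34] * 7,
    [21] * 7, [22] * 7, [54] * 7, [45] * 7, [43] * 7, [42] * 7,
]
D2i = pd.DataFrame(rows, index=[1, 2, 3, 4, 5, 6] * 2, columns=range(1, 8))
D2j = apply_battery(D2i, block_size=6, k=3, battery_size=50)
print(D2j)
```

Working on the underlying NumPy array avoids label alignment problems caused by the repeated row index (1 to 6 appears twice here, 1 to 48 many times in the real file).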

Related

Create bi-weekly and monthly labels with week numbers in pandas

I have a dataframe with profit values, IDs, and week values. It looks a little like this:
ID Week Profit
A 1 2
A 2 2
A 3 0
A 4 0
I want to create two new columns called "Bi-Weekly" and "Monthly". Weeks 1 and 2 would get the bi-weekly label 2, weeks 3 and 4 the label 4, and all four would be labeled month 1, so that I can group by weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions that work, but the weeks will go up to a full year (52 weeks), so I was wondering if there's a more efficient way. My bi-weekly function is below.
def biweek(prof_calc):
    if (prof_calc['week']==2):
        return 2
    elif (prof_calc['week']==3):
        return 2
    elif (prof_calc['week']==4):
        return 4
    elif (prof_calc['week']==5):
        return 4
    elif (prof_calc['week']==6):
        return 6
    elif (prof_calc['week']==7):
        return 6
    elif (prof_calc['week']==8):
        return 8
    elif (prof_calc['week']==9):
        return 8
    elif (prof_calc['week']==10):
        return 10
    elif (prof_calc['week']==11):
        return 10
prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)
IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
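
The labels in the answer above run 1, 2, 3, and so on. If the even-numbered labels described in the question are preferred (weeks 1 and 2 labeled 2, weeks 3 and 4 labeled 4), a small variation of the same integer-division idea might look like the sketch below. It uses the four sample rows from the question; the `groupby` at the end is only an illustration of how the new column could be used.

```python
import pandas as pd

# The four sample rows shown in the question.
df = pd.DataFrame({"ID": ["A"] * 4, "Week": [1, 2, 3, 4], "Profit": [2, 2, 0, 0]})

# Round each week up to the next even number: 1, 2 -> 2 and 3, 4 -> 4.
df["Biweekly"] = (df["Week"] + 1) // 2 * 2
# Same month formula as in the answer: weeks 1 to 4 -> month 1.
df["Monthly"] = (df["Week"] - 1) // 4 + 1

# Example use: total profit per ID and bi-weekly label.
biweekly_profit = df.groupby(["ID", "Biweekly"])["Profit"].sum()
print(df)
print(biweekly_profit)
```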

Categorise hour into four different slots of 15 mins

I am working on a dataframe and I want to group the data within each hour into four 15-minute slots:
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I can't work out how to go forward with this.
I tried extracting hours, minutes and seconds from the time, but what should I do next?
Use integer division by 15 and then add 1:
import pandas as pd
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print(df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
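
The answer above starts from a column of minutes. If the dataframe instead holds timestamps, the minute can be read through the `.dt` accessor before applying the same integer division. A minimal sketch, assuming a datetime column with the hypothetical name `time` and made-up sample values:

```python
import pandas as pd

# Hypothetical timestamps; the column name "time" is an assumption.
df = pd.DataFrame({"time": pd.to_datetime([
    "2021-01-01 10:00:00",
    "2021-01-01 10:14:59",
    "2021-01-01 10:15:00",
    "2021-01-01 10:44:10",
    "2021-01-01 10:59:59",
])})

# Minutes 0-14 -> slot 1, 15-29 -> slot 2, 30-44 -> slot 3, 45-59 -> slot 4.
df["slot"] = df["time"].dt.minute // 15 + 1
print(df)
```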

Pandas code to get the count of each values

Here I'm sharing some sample data (I'm dealing with big data). The "counts" value varies from 1 to 3000+, sometimes more than that.
The sample data looks like this:
ID counts
41 44 17 16 19 52 6
17 30 16 19 4
52 41 44 30 17 16 6
41 44 52 41 41 41 6
17 17 17 17 41 5
I tried splitting the "ID" column into multiple columns to get the counts:
# data = the DataFrame read from the CSV file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" "))) # separating columns
As I mentioned, I'm dealing with big data, so this method is not efficient enough, and I'm having trouble getting the "ID" counts.
I want to collect the total count of each ID value and map it to the corresponding "ID" row.
Expected output:
ID counts 16 17 19 30 41 44 52
41 41 17 16 19 52 6 1 1 1 0 2 0 1
17 30 16 19 4 1 1 1 1 0 0 0
52 41 44 30 17 16 6 1 1 0 1 1 1 1
41 44 52 41 41 41 6 0 0 0 0 4 1 1
17 17 17 17 41 5 0 4 0 0 1 0 0
If you have any ideas, please let me know.
Thank you
Use Counter to count the space-separated values in a list comprehension:
from collections import Counter
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
Another idea, but I guess slower:
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
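
A variant of the second idea that counts per row position rather than per "ID" string, so that two rows holding the same "ID" text are not merged into one set of counts. This is a sketch under stated assumptions, not a benchmarked solution: the sample rows are copied from the question, and the helper column names `row` and `value` are hypothetical.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "ID": ["41 44 17 16 19 52", "17 30 16 19", "52 41 44 30 17 16",
           "41 44 52 41 41 41", "17 17 17 17 41"],
    "counts": [6, 4, 6, 6, 5],
})

# One line per (row position, value) pair.
exploded = df["ID"].str.split().explode().astype(int).reset_index()
exploded.columns = ["row", "value"]

# Count how often each value occurs in each original row.
value_counts = pd.crosstab(exploded["row"], exploded["value"])
result = df.join(value_counts)
print(result)
```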

How do you correctly format multiple columns of integers in python?

I have some code here:
for i in range(self.size):
    print('{:6d}'.format(self.data[i], end=' '))
    if (i + 1) % NUMBER_OF_COLUMNS == 0:
        print()
Right now this prints as:
1
1
1
1
1
2
3
3
3
3
(whitespace)
3
3
3
etc.
It creates a new line when it hits 10 digits, but it doesn't print the initial 10 in a row...
This is what I want-
1 1 1 1 1 1 1 2 2 3
3 3 3 3 3 4 4 4 4 5
However when it hits two digit numbers it gets messed up -
8 8 8 8 8 9 9 9 9 10
10 10 10 10 10 10 etc.
I want it to be right-aligned like this-
8 8 8 8 8 9
10 10 10 10 11 12 etc.
When I remove the format piece it will print the rows out, but there wont be the extra spacing in there of course!
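
A likely cause of the one-number-per-line output, judging from the code in the question: `end=' '` sits inside the call to `format(...)` instead of the call to `print(...)`. `str.format` silently ignores keyword arguments that the template does not use, so `print` still ends every call with a newline. The sketch below moves the argument; the `data` list is a hypothetical stand-in for `self.data`, and the value of `NUMBER_OF_COLUMNS` is assumed.

```python
NUMBER_OF_COLUMNS = 10  # assumed value
data = [1, 1, 1, 2, 3, 8, 9, 10, 10, 11, 12, 12]  # hypothetical stand-in for self.data

# str.format() ignores keyword arguments the template does not reference,
# so the misplaced end=' ' has no effect on the formatted text.
cell = '{:6d}'.format(7, end=' ')

for i, value in enumerate(data):
    # end=' ' now belongs to print(), which keeps the output on the same line
    print('{:6d}'.format(value), end=' ')
    if (i + 1) % NUMBER_OF_COLUMNS == 0:
        print()
print()
```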
You can align strings by "padding" values using a string's .rjust method. Using some dummy data:
NUMBER_OF_COLUMNS = 10
for i in range(100):
    print("{}".format(i//2).rjust(3), end=' ')
    #print("{:3}".format(i//2), end=' ') edit: this also works. Thanks AChampion
    if (i + 1) % NUMBER_OF_COLUMNS == 0:
        print()
#Output:
  0   0   1   1   2   2   3   3   4   4
  5   5   6   6   7   7   8   8   9   9
 10  10  11  11  12  12  13  13  14  14
 15  15  16  16  17  17  18  18  19  19
 20  20  21  21  22  22  23  23  24  24
 25  25  26  26  27  27  28  28  29  29
 30  30  31  31  32  32  33  33  34  34
 35  35  36  36  37  37  38  38  39  39
 40  40  41  41  42  42  43  43  44  44
 45  45  46  46  47  47  48  48  49  49
Another approach is to just chunk up the data into rows and print each row, e.g.:
def chunk(iterable, n):
    return zip(*[iter(iterable)]*n)
for row in chunk(self.data, NUMBER_OF_COLUMNS):
    print(' '.join(str(data).rjust(6) for data in row))
e.g:
In []:
for row in chunk(range(100), 10):
    print(' '.join(str(data//2).rjust(3) for data in row))
Out[]:
  0   0   1   1   2   2   3   3   4   4
  5   5   6   6   7   7   8   8   9   9
 10  10  11  11  12  12  13  13  14  14
 15  15  16  16  17  17  18  18  19  19
 20  20  21  21  22  22  23  23  24  24
 25  25  26  26  27  27  28  28  29  29
 30  30  31  31  32  32  33  33  34  34
 35  35  36  36  37  37  38  38  39  39
 40  40  41  41  42  42  43  43  44  44
 45  45  46  46  47  47  48  48  49  49
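
One caveat with the `zip(*[iter(iterable)]*n)` idiom: `zip` stops at the shortest input, so a final row with fewer than `n` items is dropped. If the data length is not a multiple of the column count, a variant built on `itertools.islice` keeps the partial last row. A minimal sketch; the helper name `chunk_keep_tail` is hypothetical.

```python
from itertools import islice

def chunk_keep_tail(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        row = list(islice(it, n))
        if not row:
            return
        yield row

dropped = list(zip(*[iter(range(7))] * 3))  # the trailing 6 is lost
kept = list(chunk_keep_tail(range(7), 3))   # the trailing 6 survives

# Same right-aligned printing as above, with a partial last row.
for row in chunk_keep_tail(range(25), 10):
    print(' '.join(str(value).rjust(3) for value in row))
```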

performing differences between rows in pandas based on columns values

I have this dataframe and I'm trying to create a new column that stores the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values('code', ascending=False)  # DataFrame.sort() was removed in later pandas versions
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
that gives me back this output for a single date 20150609 :
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more pythonic way to get the same result?
I would like to create a new column [difference] and store the data there, so that the result has 4 columns: [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through this documentation: pandas' groupby documentation
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
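
Applied to data shaped like the question's, the same `groupby(...).diff()` call can be combined with a sort so that, within each date, codes run from highest to lowest as in the question's single-date attempt. A minimal sketch using six rows taken from the question's sample (codes 0, 12 and 4 for the first two dates); the sort order is an assumption based on that attempt.

```python
import pandas as pd

# Six rows from the question's sample data.
full_df = pd.DataFrame({
    "date": ["20150521"] * 3 + ["20150522"] * 3,
    "code": [0, 12, 4, 0, 12, 4],
    "sold": [47, 39, 43, 47, 37, 42],
})

# Within each date, order codes from highest to lowest (as in the question's attempt).
full_df = full_df.sort_values(["date", "code"], ascending=[True, False])

# Difference to the previous row of the same date; the first row per date is NaN.
full_df["difference"] = full_df.groupby("date")["sold"].diff()
print(full_df)
```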
