Pandas code to get the count of each value - python

Here is some sample data (the real dataset is much larger). The "counts" value ranges from 1 to 3000 or more.
The sample data looks like this:
ID                 counts
41 44 17 16 19 52  6
17 30 16 19        4
52 41 44 30 17 16  6
41 44 52 41 41 41  6
17 17 17 17 41     5
I tried splitting the "ID" column into multiple columns and counting from there:
# data = DataFrame read from the CSV file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" "))) # separating columns
Because the dataset is so large, this method is not efficient, and I still cannot get the per-ID counts from it.
I want to count how many times each ID occurs in a row and add those counts as columns next to the corresponding "ID" value.
Expected output:
ID                 counts  16  17  19  30  41  44  52
41 41 17 16 19 52  6        1   1   1   0   2   0   1
17 30 16 19        4        1   1   1   1   0   0   0
52 41 44 30 17 16  6        1   1   0   1   1   1   1
41 44 52 41 41 41  6        0   0   0   0   4   1   1
17 17 17 17 41     5        0   4   0   0   1   0   0
If you have any ideas, please let me know.
Thank you

Use Counter in a list comprehension to count the space-separated values in each row:
from collections import Counter
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
                  ID  counts  16  17  19  30  41  44  52
0  41 44 17 16 19 52       6   1   1   1   0   1   1   1
1        17 30 16 19       4   1   1   1   1   0   0   0
2  52 41 44 30 17 16       6   1   1   0   1   1   1   1
3  41 44 52 41 41 41       6   0   0   0   0   4   1   1
4     17 17 17 17 41       5   0   4   0   0   1   0   0
Another idea, though I expect it to be slower:
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
                  ID  counts  16  17  19  30  41  44  52
0  41 44 17 16 19 52       6   1   1   1   0   1   1   1
1        17 30 16 19       4   1   1   1   1   0   0   0
2  52 41 44 30 17 16       6   1   1   0   1   1   1   1
3  41 44 52 41 41 41       6   0   0   0   0   4   1   1
4     17 17 17 17 41       5   0   4   0   0   1   0   0
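For anyone who wants to try the Counter approach end to end, here is a minimal self-contained sketch. The helper name count_ids and the hard-coded sample frame are only illustrative assumptions; the real data would come from the CSV file.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question (real data would be read from a CSV).
df = pd.DataFrame({
    "ID": [
        "41 44 17 16 19 52",
        "17 30 16 19",
        "52 41 44 30 17 16",
        "41 44 52 41 41 41",
        "17 17 17 17 41",
    ],
    "counts": [6, 4, 6, 6, 5],
})


def count_ids(frame, column="ID"):
    """Hypothetical helper: add one integer-named column per distinct ID."""
    # One plain dict of {id: occurrences} per row.
    per_row = [
        dict(Counter(int(token) for token in str(value).split()))
        for value in frame[column]
    ]
    # Missing IDs become NaN, so fill with 0 and sort the ID columns.
    wide = (
        pd.DataFrame(per_row, index=frame.index)
        .fillna(0)
        .astype(int)
        .sort_index(axis=1)
    )
    return frame.join(wide)


result = count_ids(df)
print(result)
```

The new column labels are integers (16, 17, ...), so a count is read with result[41] rather than result["41"].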

Related

How to iterate rows in pandas Dataframe to perform the Manipulation

How can I iterate over the rows of a pandas DataFrame to perform the manipulation shown below?
I have a CSV file with 365 columns and 1152 rows. The row index repeats in blocks of 1 to 48 ((1,48), (1,48), ...). I need to select the K maximum rows from every block of 48 rows and perform some manipulation on them.
Steps I took:
I used df.apply to do this.
Code I tried:
batterysize = 50

def with_battery(val):
    # val is one column of D2i
    for i in range(D2i.shape[0]):
        if i in [31,32,33,34,35,36]:  # [31,32,33,34,35,36] should be replaced by top K max.
            if val.iloc[i] > batterysize:
                val.iloc[i] -= batterysize
            else:
                val.iloc[i] = 0
    return val

D2j = D2i.apply(with_battery, axis=0)
How the data is:
**Input Dataframe**
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 32 32 32 32 32 32 32
4 21 21 21 21 21 21 21
5 42 42 42 42 42 42 42
6 34 34 34 34 34 34 34
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 54 54 54 54 54 54 54
4 45 45 45 45 45 45 45
5 43 43 43 43 43 43 43
6 42 42 42 42 42 42 42
> For K=3, rows (3, 5, 6) are the top three in the first chunk, so in those rows I set values below 50 to zero and replaced values above 50 with value - 50. In the next chunk, rows (3, 4, 5) are the top three, and I applied the same rule.
Output Dataframe
1 2 3 4 5 6 7
1 10 11 34 21 23 12 10
2 11 11 11 11 11 11 11
3 0 0 0 0 0 0 0
4 21 21 21 21 21 21 21
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
1 21 21 21 21 21 21 21
2 22 22 22 22 22 22 22
3 4 4 4 4 4 4 4
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 42 42 42 42 42 42 42
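No answer was posted for this one, so here is a possible sketch of the rule described above. It rests on several assumptions: the "maximum" rows are ranked by their row sum (which matches the example), chunks are consecutive blocks of chunk_size rows (6 here, 48 in the real file), and the function name apply_battery and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd


def apply_battery(df, chunk_size, k, battery_size=50):
    """Hypothetical helper: in the k largest rows of each chunk, values
    above battery_size are reduced by it and all other values become 0."""
    values = df.to_numpy(copy=True)
    # Positional chunk label for every row: 0, 0, ..., 1, 1, ...
    chunk_ids = np.arange(len(df)) // chunk_size
    # Assumption: a row's size is measured by the sum of its values.
    totals = pd.Series(values.sum(axis=1))
    ranks = totals.groupby(chunk_ids).rank(method="first", ascending=False)
    top = (ranks <= k).to_numpy()
    values[top] = np.where(
        values[top] > battery_size, values[top] - battery_size, 0
    )
    return pd.DataFrame(values, index=df.index, columns=df.columns)


rows = [
    [10, 11, 34, 21, 23, 12, 10],
    [11] * 7, [32] * 7, [21] * 7, [42] * 7, [34] * 7,
    [21] * 7, [22] * 7, [54] * 7, [45] * 7, [43] * 7, [42] * 7,
]
D2i = pd.DataFrame(rows, index=[1, 2, 3, 4, 5, 6] * 2, columns=range(1, 8))
D2j = apply_battery(D2i, chunk_size=6, k=3)
print(D2j)
```

This avoids the Python-level loop inside apply by ranking all rows at once and editing the selected rows with a boolean mask.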

Categorise hour into four different slots of 15 mins

I am working on a dataframe and I want to group the data for each hour into 4 slots of 15 minutes:
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I am not sure how to approach this.
I tried extracting the hours, minutes and seconds from the time, but I don't know what to do next.
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print (df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
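The same integer division can be applied directly to a datetime column through the .dt accessor. A small sketch, assuming the timestamps live in a column called "time" (that name and the sample values are made up):

```python
import pandas as pd

# Hypothetical timestamps; the column name "time" is an assumption.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-01-01 10:00:00",
        "2021-01-01 10:14:59",
        "2021-01-01 10:15:00",
        "2021-01-01 10:44:10",
        "2021-01-01 10:59:59",
    ])
})

# Minutes 0-14 -> slot 1, 15-29 -> slot 2, 30-44 -> slot 3, 45-59 -> slot 4.
df["slot"] = df["time"].dt.minute // 15 + 1
print(df)
```

To aggregate per hour and slot, one option is to group by both df["time"].dt.floor("h") and df["slot"].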

Get next value from range after reaching specific multiples

I have a variable i that iterates over the hours in a year (8760), starting at 1. The variable hour increments by 1 each hour until it reaches 24, then restarts. The variable year_day increments by 1 after every 24 hours. E.g.
i hour year_day
1 1 1
2 2 1
3 3 1
...
23 23 1
24 1 2
25 2 2
...
47 24 2
48 1 3
49 2 3
I'm struggling to make it so that when i = 24, hour is also 24 and year_day remains at 1. Then, for the value of i directly after a multiple of 24, hour restarts at 1 and year_day increments by 1. In other words, every time it reaches midnight, hour = 24 and year_day is still the previous day. E.g.
i hour year_day
23 23 1
24 24 1
25 1 2
...
47 23 2
48 24 2
49 1 3
Here is the code:
hour = 0
year_day = 1
for i in range(1, 8761):
    hour = hour + 1
    if i % 24 == 0:
        hour = 1
        year_day = year_day + 1
    print(i, hour, year_day)
Your code is ok, you just need to start with hour=1 and print before the if statement. Try the following:
hour = 1
year_day = 1
for i in range(1, 8761):
    print(i, hour, year_day)
    hour += 1
    if i % 24 == 0:
        hour = 1
        year_day = year_day + 1
Output:
...
21 21 1
22 22 1
23 23 1
24 24 1
25 1 2
26 2 2
27 3 2
...
I used a pandas approach to this question. The code is as follows:
import pandas as pd
i = list(range(1, 50))
df = pd.DataFrame(i, columns=["i"])
df["hours"] = df["i"] % 24
df.loc[df["hours"] == 0, "hours"] = 24  # a remainder of 0 means hour 24
df["days"] = (df["i"] - 1) // 24 + 1  # hour 24 stays on the same day
print(df)
The output is:
i hours days
0 1 1 1
1 2 2 1
2 3 3 1
3 4 4 1
4 5 5 1
5 6 6 1
6 7 7 1
7 8 8 1
8 9 9 1
9 10 10 1
10 11 11 1
11 12 12 1
12 13 13 1
13 14 14 1
14 15 15 1
15 16 16 1
16 17 17 1
17 18 18 1
18 19 19 1
19 20 20 1
20 21 21 1
21 22 22 1
22 23 23 1
23 24 24 1
24 25 1 2
25 26 2 2
26 27 3 2
27 28 4 2
28 29 5 2
29 30 6 2
30 31 7 2
31 32 8 2
32 33 9 2
33 34 10 2
34 35 11 2
35 36 12 2
36 37 13 2
37 38 14 2
38 39 15 2
39 40 16 2
40 41 17 2
41 42 18 2
42 43 19 2
43 44 20 2
44 45 21 2
45 46 22 2
46 47 23 2
47 48 24 2
48 49 1 3
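Alternatively, keep hour = 0 as the start value and reset it on the hour directly after each multiple of 24:
hour = 0
year_day = 1
for i in range(1, 8761):
    if i % 24 == 1 and i > 1:
        hour = 0
        year_day += 1
    hour += 1
    print(i, hour, year_day)
Returns:
20 20 1
. . .
24 24 1
25 1 2
. . .
47 23 2
48 24 2
49 1 3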
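The same mapping can also be computed without any running state, by shifting i to start at zero before dividing. A small sketch (the function name hour_and_day is just for illustration):

```python
def hour_and_day(i):
    """Map a 1-based hour-of-year index to (hour 1..24, year_day 1..365)."""
    # Shift to 0-based so that i = 24 still belongs to day 1.
    day_index, hour_index = divmod(i - 1, 24)
    return hour_index + 1, day_index + 1


for i in (23, 24, 25, 47, 48, 49, 8760):
    hour, year_day = hour_and_day(i)
    print(i, hour, year_day)
```

Because each value depends only on i, this form works equally well inside a loop or on a whole pandas column.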

Defining Target based on two column values

I am new to Python and am having trouble solving the following problem.
I have the following dataframe:
SoldDate CountSoldperMonth
2019-06-01 20
5
10
12
33
16
50
27
2019-05-01 2
5
11
13
2019-04-01 32
35
39
42
47
55
61
80
I need to add a Target column such that for the top 5 values in 'CountSoldperMonth' for a particular SoldDate, target should be 1 else 0. If the number of rows in 'CountSoldperMonth' for a particular 'SoldDate' is less than 5 then only the row with highest count will be marked as 1 in the Target and rest as 0. The resulting dataframe should look as below.
SoldDate CountSoldperMonth Target
2019-06-01 20 1
5 0
10 0
12 0
33 1
16 1
50 1
27 1
2019-05-01 2 0
5 0
11 0
13 1
2019-04-01 32 0
35 0
39 0
42 1
47 1
55 1
61 1
80 1
How do I do this?
In your case, use groupby and express your rules with an if...else inside apply:
df.groupby('SoldDate').CountSoldperMonth.\
    apply(lambda x: x == max(x) if len(x) < 5 else x.isin(sorted(x)[-5:])).astype(int)
Out[346]:
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 1
16 1
17 1
18 1
19 1
Name: CountSoldperMonth, dtype: int32
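A variant of the same idea that writes the result straight into a Target column is sketched below. It assumes SoldDate is filled in on every row (the question shows it only on the first row of each group), and the helper name mark_top is made up.

```python
import pandas as pd

# Data from the question, with SoldDate repeated on every row (assumption).
df = pd.DataFrame({
    "SoldDate": ["2019-06-01"] * 8 + ["2019-05-01"] * 4 + ["2019-04-01"] * 8,
    "CountSoldperMonth": [20, 5, 10, 12, 33, 16, 50, 27,
                          2, 5, 11, 13,
                          32, 35, 39, 42, 47, 55, 61, 80],
})


def mark_top(counts, n=5):
    """Hypothetical helper: 1 for the top n rows, or only the maximum
    row when the group has fewer than n rows."""
    keep = n if len(counts) >= n else 1
    top_index = counts.nlargest(keep).index
    flags = counts.index.isin(top_index).astype(int)
    return pd.Series(flags, index=counts.index)


df["Target"] = df.groupby("SoldDate")["CountSoldperMonth"].transform(mark_top)
print(df)
```

One difference from the isin version: nlargest marks exactly n rows even when values are tied, whereas isin marks every row that shares a top value.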

Pandas - Replace values based on index

If I create a dataframe like so:
import pandas as pd, numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
How would I change the entries in column A to the number 16 for rows 0-15, for example? In other words, how do I replace cells based purely on index?
Use loc:
df.loc[0:15,'A'] = 16
print (df)
A B
0 16 45
1 16 5
2 16 97
3 16 58
4 16 26
5 16 87
6 16 51
7 16 17
8 16 39
9 16 73
10 16 94
11 16 69
12 16 57
13 16 24
14 16 43
15 16 77
16 41 0
17 3 21
18 0 98
19 45 39
20 66 62
21 8 53
22 69 47
23 48 53
Solution with ix is deprecated.
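One detail worth spelling out: a loc slice is label-based and includes its end label, while an iloc slice is position-based and excludes its end. A short sketch comparing the two on a frame with the default integer index (the seeded random data is only for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 2)), columns=list("AB"))

# Label-based: rows labelled 0 through 15, end label included.
by_label = df.copy()
by_label.loc[0:15, "A"] = 16

# Position-based: positions 0 through 15, so the slice must stop at 16.
by_position = df.copy()
by_position.iloc[0:16, by_position.columns.get_loc("A")] = 16

print(by_label.equals(by_position))
```

The two only coincide because the index here is 0, 1, 2, ...; with any other index, loc and iloc would select different rows.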
In addition to the other answers, here is what you can do if you have a list of individual indices:
indices = [0,1,3,6,10,15]
df.loc[indices,'A'] = 16
print(df.head(16))
Output:
A B
0 16 4
1 16 4
2 4 3
3 16 4
4 1 1
5 3 0
6 16 4
7 2 1
8 4 4
9 3 4
10 16 0
11 3 1
12 4 2
13 2 2
14 2 1
15 16 1
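One more solution is at, although at is documented for accessing a single cell, so whether it accepts a slice may depend on the pandas version; loc is the safer choice for a range of rows:
df.at[0:15, 'A']=16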
print(df.head(20))
OUTPUT:
A B
0 16 44
1 16 86
2 16 97
3 16 79
4 16 94
5 16 24
6 16 88
7 16 43
8 16 64
9 16 39
10 16 84
11 16 42
12 16 8
13 16 72
14 16 23
15 16 28
16 18 11
17 76 15
18 12 38
19 91 6
An interesting observation: the code below does change the value in the original dataframe
df.loc[0:15,'A'] = 16
But very similar code like this
df.loc[0:15]['A'] = 16
is chained indexing: the assignment may land on a temporary copy, so the original df object is not reliably changed.
Hope this saves some time for someone dealing with this issue.
Could you, instead of 16, update the value of that column to -1.0? For me, it returns 255 instead of -1.0.
>>> effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0
Rent city_SF city_Seattle
0 3999 1 0
1 4000 1 0
2 4001 1 0
3 3499 255 255
4 3500 255 255
5 3501 255 255
6 2499 0 1
7 2500 0 1
8 2501 0 1
To Mad Physicist: you first need to change the column data types from integer to float. A value of 255 suggests the columns are unsigned 8-bit integers (uint8), so your -1.0 was cast to that type and wrapped around to 255.
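A minimal sketch of that suggestion, assuming the two city columns start out as uint8. The original values in rows 3-5 are not shown in the thread, so the ones below are made up.

```python
import numpy as np
import pandas as pd

# Reconstructed frame; the city values for rows 3-5 are placeholders.
effect_df = pd.DataFrame({
    "Rent": [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501],
    "city_SF": np.array([1, 1, 1, 0, 0, 0, 0, 0, 0], dtype="uint8"),
    "city_Seattle": np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype="uint8"),
})

city_cols = ["city_SF", "city_Seattle"]

# Convert to float first so that -1.0 is representable in the columns.
effect_df = effect_df.astype({col: "float64" for col in city_cols})
effect_df.loc[3:5, city_cols] = -1.0
print(effect_df)
```

Checking effect_df.dtypes before assigning is a quick way to spot columns that cannot hold the new value.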
