I am new to Python and was facing an issue solving the following problem.
I have the following dataframe:
SoldDate CountSoldperMonth
2019-06-01 20
5
10
12
33
16
50
27
2019-05-01 2
5
11
13
2019-04-01 32
35
39
42
47
55
61
80
I need to add a Target column such that, for each SoldDate, the top 5 values in 'CountSoldperMonth' get Target = 1 and the rest get 0. If a 'SoldDate' has fewer than 5 rows in 'CountSoldperMonth', only the row with the highest count is marked 1 and the rest 0. The resulting dataframe should look as below.
SoldDate CountSoldperMonth Target
2019-06-01 20 1
5 0
10 0
12 0
33 1
16 1
50 1
27 1
2019-05-01 2 0
5 0
11 0
13 1
2019-04-01 32 0
35 0
39 0
42 1
47 1
55 1
61 1
80 1
How do I do this?
In your case, use groupby and chain your rules inside apply with an if...else:
df.groupby('SoldDate').CountSoldperMonth.\
    apply(lambda x: x == max(x) if len(x) < 5 else x.isin(sorted(x)[-5:])).astype(int)
Out[346]:
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 1
16 1
17 1
18 1
19 1
Name: CountSoldperMonth, dtype: int32
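To attach this back onto the dataframe as a Target column, a minimal follow-up sketch (the apply above keeps the original row index, as the output shows, so plain assignment aligns):
df['Target'] = df.groupby('SoldDate').CountSoldperMonth.\
    apply(lambda x: x == max(x) if len(x) < 5 else x.isin(sorted(x)[-5:])).astype(int)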
I have two dataframes:
df2022fl: one with 24 rows
df: one with a single row of values
Each has 1759 columns.
For every row in the 24-row dataframe, I want to count how many columns are above the corresponding column in the one-row df.
I used the code below, but keep getting the error shown under it:
( df2022fl > df.T[df2022fl.columns].values ).sum(axis=1)
KeyError: "None of [Index(['id', 'table_position', 'performance_rank', 'risk', 'competition_id',\n 'suspended_matches', 'homeAttackAdvantage', 'homeDefenceAdvantage',\n 'homeOverallAdvantage', 'seasonGoals_overall',\n ...\n 'freekicks_total_over275_away', 'freekicks_total_over285_overall',\n 'freekicks_total_over285_home', 'freekicks_total_over285_away',\n 'freekicks_total_over295_overall', 'freekicks_total_over295_home',\n 'freekicks_total_over295_away', 'freekicks_total_over305_overall',\n 'freekicks_total_over305_home', 'freekicks_total_over305_away'],\n dtype='object', length=1759)] are in the [columns]"
I have no idea why this is happening, as I also removed everything of type object so that there are only float64 dtypes.
Any ideas to help, please?
df in text format (this is the dataframe with one row):
234 5 5 42 32 0 4 -33 -2 54 30 84 55 29 54 31 19 30 20 10 35 31 34 56 49 58 74 71 71 3 4 -4 16 8 7 13 5 6 7 3 4 38 19 19 4 3 3 1 13 5 5 28 26 21 22 10 9 48 50 39 10 23 9 13 2 3 19 50 42 42 9 10 18 10 6 47 42 32 6 2 2 13 9 9 1 1 1 2 2 1 1 1 1 0 0 0 35 35 30 27 26 25 18 13 21 10 6 2 21 26 8 8 8 17 35 33 39 2 3 8 9 16 17 51 26 17 1 1 0 0 0 0 0 0 0 0 0 0 20 12 7 16 7 5 37 19 14 -8 -2 -9 0 3 5 14 27 34 0 8 13 37 60 81 96 85 67 44 26 4 37 35 32 21 11 2 0 8 25 48 67 79 0 2 5 11 16 18 92 78 65 37 16 0 18 16 14 7 3 0 0 0 0 11 50 83 0 0 0 2 11 16 92 83 67 48 21 4 19 19 16 11 5 1 1 8 24 3 17 52 4 25 48 1 6 11 0 9 57 0 2 12 38 19 19 25 25 22 15 14 9 5 3 2 66 64 49 39 36 19 13 8 4 12 12 9 6 5 3 1 1 1 63 63 47 32 26 16 5 5 5 13 13 12 9 9 4 3 2 0 68 63 50 46 39 17 13 9 0 31 24 19 13 8 5 2 82 63 50 34 21 13 5 16 14 11 6 3 2 1 84 74 56 32 16 11 4 15 10 8 7 5 2 1 78 53 42 37 26 9 5 26 21 15 10 5 3 2 57 47 32 21 11 6 4 12 9 3 2 1 0 0 52 41 14 9 5 0 0 14 12 9 6 2 1 0 61 52 38 25 8 4 0 37 34 25 18 12 4 2 0 0 94 81 57 46 28 8 4 0 0 18 15 11 8 5 1 1 0 0 88 74 46 39 21 4 4 0 0 19 19 14 10 5 3 1 0 0 96 83 63 42 26 13 4 0 0 29 19 7 2 0 0 0 75 40 15 4 0 0 0 12 8 3 2 0 0 0 63 33 13 8 0 0 0 17 10 4 0 0 0 0 83 42 17 0 0 0 0 33 21 11 5 0 0 0 77 55 25 11 0 0 0 17 10 4 2 0 0 0 71 46 17 8 0 0 0 16 11 4 1 0 0 0 83 57 21 4 0 0 0 5 6 2 7 7 14 30 29 30 176 91 85 66 27 35 8 8 9 4 4 4 161 63 94 3 3 3 10 0 4 0 1 1 1 374 229 145 9 12 7 177 88 75 197 127 70 4 4 3 5 6 3 48 51 46 9 9 8 377 182 195 151 66 79 62 28 31 32 16 16 3 2 3 1 1 1 32 31 27 19 12 6 4 96 78 59 41 26 13 9 16 15 12 7 5 1 1 96 78 52 30 22 4 4 16 16 14 10 7 4 1 96 75 57 46 28 17 4 30 18 4 1 0 0 0 78 38 9 2 0 0 0 16 9 0 0 0 0 0 78 39 0 0 0 0 0 14 9 4 1 0 0 0 75 38 17 4 0 0 0 8 4 3 17 17 13 38 19 19 6 4 1 13 17 5 7 3 3 15 13 13 1 1 0 3 5 0 1 1 0 0 0 0 0 0 0 46 31 15 14 9 5 32 19 10 4 10 26 11 26 67 3 7 14 13 37 70 0 3 12 0 16 63 57 33 24 1 1 1 14 8 5 30 35 22 14 5 9 33 26 39 5 2 3 13 11 13 0 2 0 6 4 2 16 21 9 6 2 2 13 9 9 18 9 9 39 39 39 17 6 10 38 26 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 3 3 1 3 4 4 3 6 3 5 1 5 4 8 2 5 5 5 7 7 10 7 10 11 9 9 16 13 12 16 3 4 5 7 9 11 5 3 6 5 3 5 1 1 0 2 3 2 4 3 4 1 1 2 4 5 6 0 0 0 1 0 2 0 2 1 2 1 2 2 1 3 1 2 0 4 6 7 5 6 5 3 2 9 9 9 9 0 0 0 0 0 2 0 1 2 0 0 3 2 1 3 1 2 0 2 1 1 0 0 1 2 1 2 2 1 1 2 2 1 3 1 3 2 2 6 3 3 7 4 3 7 42 0 0 0 0 0 0 -2 -1 -1 -2 -1 -1 1 5 17 28 2 11 37 67 1 3 9 15 4 13 39 65 0 1 6 12 0 5 28 56 0 1 5 24 0 3 13 63 0 1 5 13 0 4 21 54 0 0 0 10 0 0 0 53 37 18 19 38 19 19 44 21 21 92 44 48 19 8 7 40 19 20 22 11 11 47 23 22 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 2 2 2 50 52 48 23 22 25 0 0 0 27 27 21 40 28 39 15 9 18 14 11 9 56 52 61 34 36 25 50 43 56 66 27 35 73 38 34 1 1 1 1 1 1 139 66 73 3 2 3 4 1 1 3 1 1 3 1 1 0 0 0 38 19 19 10 11 9 3 5 3 2 3 6 4 1 12 8 7 4 8 5 5 3 4 2 2 1 31 21 16 9 42 25 23 14 17 9 9 4 19 16 13 5 4 3 11 9 4 42 36 28 23 18 14 57 47 21 19 13 11 8 4 4 13 11 9 7 7 7 6 2 2 1 1 1 50 34 28 21 11 11 61 56 47 37 37 37 32 11 9 5 5 5 28 18 12 6 11 7 5 2 13 7 3 1 62 40 31 13 50 32 23 9 68 37 16 5 1 0 1 32 15 17 1 0 0 2 0 4 78 74 74 2 0 0 4 1 2 0 0 0 1 0 0 8 4 11 0 0 0 2 0 0 32 16 16 350 172 178 10 10 11 23 27 20 6 9 21 13 8 13 26 17 4 2 5 6 7 2 2 0 0 0 1 1 3 2 0 0 0 1 4 3 1 0 0 0 0 38 19 19 10 8 1 1 1 0 26 35 5 2 4 0 0 0 0 3 1 0 26 11 13 6 4 1 7 3 3 10 7 3 12 3 5 15 13 13 22 13 25 13 21 0 0 0 7 4 0 57 48 54 15 17 4 30 6 6 6 2 2 3 9 10 7 18 12 6 0 0 0 0 0 0 41 52 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 2 7 2 3 17 8 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
df2022fl is the dataframe with 24 rows.
Compare both dataframes using
df2022fl.ge(df.iloc[0]).sum()
This gives the number of values in df2022fl that are greater than or equal to the corresponding value in df.
Output:
id 24
table_position 20
performance_rank 20
risk 23
competition_id 24
..
freekicks_total_over295_home 24
freekicks_total_over295_away 24
freekicks_total_over305_overall 24
freekicks_total_over305_home 24
freekicks_total_over305_away 24
Length: 1759, dtype: int64
To get, for each row, the number of columns that came out greater than or equal to the values in dataframe df, you can use axis=1.
df2022fl['stats'] = df2022fl.ge(df.iloc[0]).sum(axis=1)
This gives you the expected output:
id table_position ... freekicks_total_over305_away stats
1 234.0 6.0 ... 0.0 1688
2 235.0 18.0 ... 0.0 1529
3 236.0 16.0 ... 0.0 1565
4 237.0 24.0 ... 0.0 1409
5 242.0 3.0 ... 0.0 1566
6 244.0 4.0 ... 0.0 1681
7 246.0 23.0 ... 0.0 1607
8 247.0 5.0 ... 0.0 1642
9 248.0 14.0 ... 0.0 1603
10 253.0 15.0 ... 0.0 1575
11 254.0 12.0 ... 0.0 1554
12 255.0 13.0 ... 0.0 1593
13 257.0 20.0 ... 0.0 1533
14 258.0 21.0 ... 0.0 1537
15 259.0 9.0 ... 0.0 1585
16 262.0 17.0 ... 0.0 1488
17 265.0 11.0 ... 0.0 1647
18 267.0 7.0 ... 0.0 1628
19 268.0 2.0 ... 0.0 1615
20 1020.0 1.0 ... 0.0 1601
21 1827.0 8.0 ... 0.0 1603
22 1833.0 22.0 ... 0.0 1587
23 3124.0 19.0 ... 0.0 1594
24 3141.0 10.0 ... 0.0 1623
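As an aside, the original attempt fails because transposing df moves its column names into the row index: the columns of df.T are df's row labels, so df.T[df2022fl.columns] looks the 1759 feature names up in the wrong place and raises the KeyError. A minimal sketch of the aligned comparison, using hypothetical miniature frames:
import pandas as pd

# Hypothetical miniatures: many rows vs. one row, shared column names.
df2022fl = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 10]})
df = pd.DataFrame({'a': [4], 'b': [3]})

# df.T[df2022fl.columns] would raise KeyError here too: df.T's columns are
# df's row labels (just 0), not 'a' and 'b'.
per_row = df2022fl.ge(df.iloc[0]).sum(axis=1)  # per-row count of columns >= the one-row frame
print(per_row)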
Here I'm sharing some sample data (I'm dealing with big data); the "counts" value varies from 1 to 3000+, sometimes more than that.
Sample data looks like :
ID counts
41 44 17 16 19 52 6
17 30 16 19 4
52 41 44 30 17 16 6
41 44 52 41 41 41 6
17 17 17 17 41 5
I was trying to split the "ID" column into multiple columns and get those counts:
data = pd.read_csv("csv_file")  # placeholder for the real file path
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" ")))  # separating columns
As I mentioned, I'm dealing with big data, so this method is not very effective, and I'm having trouble getting the "ID" counts.
I want to collect the total counts of each ID and map them to the corresponding ID columns.
Expected output:
ID counts 16 17 19 30 41 44 52
41 41 17 16 19 52 6 1 1 1 0 2 0 1
17 30 16 19 4 1 1 1 1 0 0 0
52 41 44 30 17 16 6 1 1 0 1 1 1 1
41 44 52 41 41 41 6 0 0 0 0 4 1 1
17 17 17 17 41 5 0 4 0 0 1 0 0
If you have any ideas, please let me know.
Thank you.
Use Counter to get the counts of the values split by whitespace, in a list comprehension:
from collections import Counter
import pandas as pd

L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print(df)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
Another idea, but probably slower:
df1 = df.assign(a=df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print(df1)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
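A third variant in the same spirit (a sketch, assuming pandas 0.25+ for explode): explode the space-separated tokens to one per row, then pivot the per-row counts wide.
import pandas as pd

df = pd.DataFrame({'ID': ['41 44 17 16 19 52', '17 30 16 19'],
                   'counts': [6, 4]})

# One token per row, keyed by the original row index, then pivot to wide counts.
tokens = df['ID'].str.split().explode().astype(int)
wide = tokens.groupby(level=0).value_counts().unstack(fill_value=0)
print(df.join(wide))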
I have a pandas dataframe like this:
pd.DataFrame({'week': ['2019-w01', '2019-w02', '2019-w03', '2019-w04',
                       '2019-w05', '2019-w06', '2019-w07', '2019-w08',
                       '2019-w9', '2019-w10', '2019-w11', '2019-w12'],
              'value': [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14]})
week value
0 2019-w01 11
1 2019-w02 22
2 2019-w03 33
3 2019-w04 34
4 2019-w05 57
5 2019-w06 88
6 2019-w07 2
7 2019-w08 9
8 2019-w9 10
9 2019-w10 1
10 2019-w11 76
11 2019-w12 14
What I need is shown below: I would like to assign a period ID to every 4-week interval.
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
What is the best way to achieve that? Thanks.
Try with:
# week number // 4, shifted down one row so the boundary weeks (4, 8, ...)
# stay in the previous period, then offset by 1
df['period'] = (pd.to_numeric(df['week'].str.split('-').str[-1]
                .str.replace('w', '')) // 4).shift(fill_value=0).add(1)
print(df)
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
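The shift trick above relies on the rows being in week order; an order-independent equivalent (a sketch) maps each week number straight to its 4-week bucket:
# weeks 1-4 -> 1, 5-8 -> 2, 9-12 -> 3, regardless of row order
week_num = df['week'].str.extract(r'w(\d+)', expand=False).astype(int)
df['period'] = (week_num - 1) // 4 + 1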
I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes but at the same time include the row immediately before and/or after each matched value within its group in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the dataframes that coincides. Does someone have an idea of how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    .groupby('A')['Found']
    .apply(lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
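A shift-based equivalent may read more plainly: flag the exact matches, then also keep each match's immediate neighbours without crossing group borders (a sketch against the question's df1/df2; it reproduces the output above):
found = df1['B'].isin(df2['B'])

# immediate neighbours of a match, computed per group in column A
prev_hit = found.groupby(df1['A']).shift(1, fill_value=False)
next_hit = found.groupby(df1['A']).shift(-1, fill_value=False)

df3 = df1[found | prev_hit | next_hit]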
pd.concat([df1.groupby('A').min().reset_index(),
           pd.merge(df1, df2, on='B'),
           df1.groupby('A').max().reset_index()]
          ).reset_index(drop=True).drop_duplicates().sort_values(['A', 'B'])
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])
I have a dataframe that holds the column numbers of another dataframe. Is there a way to return the values from the other dataframe instead of just having the column indices?
I basically want to match up the index between the Push and df dataframes. The values in the Push dataframe say which column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
You can do it with np.take; however, this function works on the flattened array, so push must be shifted like this:
In [285]: push1 = push.values + np.arange(0, 25, 5)[:, None]  # flat offset: row * 5
In [229]: pd.DataFrame(df.values.take(push1))
EDIT
Actually, I just reinvented np.choose:
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T,df).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
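With plain NumPy fancy indexing, the same row-wise lookup needs neither flattening nor choose (a sketch against the question's df and push):
import numpy as np
import pandas as pd

# out[i, k] = df.iat[i, push.iat[i, k]] : in each row i of df, pick the
# columns listed in row i of push.
rows = np.arange(len(df))[:, None]   # shape (n, 1), broadcasts against push
out = pd.DataFrame(df.values[rows, push.values],
                   index=push.index, columns=push.columns)
print(out)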
Using melt and then replace; note that df1 is your push and df2 is your df:
df1.astype(str).replace(df2.melt().drop_duplicates().set_index('variable').value.to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22