Adding column to dataframe based on another dataframe using pandas - python

I need to create a new column in dataframe based on intervals from another dataframe.
For example, I have a dataframe where in the time column I have values ​​and I want to create column in another dataframe based on the intervals in that time column.
I think a practical example is simpler to understand:
Dataframe with intervals
df1
time value var2
0 1.0 34.0 35.0
1 4.0 754.0 755.0
2 9.0 768.0 769.0
3 12.0 65.0 66.0
Dataframe that I need to filter
df2
time value var2
0 1.0 23.0 23.0
1 2.0 43.0 43.0
2 3.0 76.0 12.0
3 4.0 88.0 22.0
4 5.0 64.0 45.0
5 6.0 98.0 33.0
6 7.0 76.0 11.0
7 8.0 56.0 44.0
8 9.0 23.0 22.0
9 10.0 54.0 44.0
10 11.0 65.0 22.0
11 12.0 25.0 25.0
should result
df3
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
EDIT: As Shubham Sharma said, it's not a filter, I want to add a new column based on intervals in other dataframe.

You can use pd.cut to categorize the time in df2 into discrete intervals based on the time in df1 then use Series.factorize to obtain a numeric array identifying distinct ordered values.
df2['interval'] = pd.cut(df2['time'], df1['time'], include_lowest=True)\
.factorize(sort=True)[0] + 1
Result:
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3

Related

How to do similar to conditional countifs on a dataframe

I am trying to replicate countifs in excel to get a rank between two unique values that are listed in my dataframe. I have attached the expected output calculated in excel using countif and let/rank functions.
I am trying to generate "average rank of gas and coal plants" that takes the number from the "average rank column" and then ranks the two unique types from technology (CCGT or COAL) into two new ranks (Gas or Coal) so then I can get the relavant quantiles for this. In case you are wondering why I would need to do this seeing as there are only two coal plants, well when I run this model on a larger dataset it will be useful to know how to do this in code and not manually on my dataset.
Ideally the output will return two ranks 1-47 for all units with technology == CCGT and 1-2 for all units with technology == COAL.
This is the column I am looking to make
Unit ID
Technology
03/01/2022
04/01/2022
05/01/2022
06/01/2022
07/01/2022
08/01/2022
Average Rank
Unit Rank
Avg Rank of Gas & Coal plants
Gas Quintiles
Coal Quintiles
Quintiles
FAWN-1
CCGT
1.0
5.0
1.0
5.0
2.0
1.0
2.5
1
1
1
0
Gas_1
GRAI-6
CCGT
4.0
18.0
2.0
4.0
3.0
3.0
5.7
2
2
1
0
Gas_1
EECL-1
CCGT
5.0
29.0
4.0
1.0
1.0
2.0
7.0
3
3
1
0
Gas_1
PEMB-21
CCGT
7.0
1.0
6.0
13.0
8.0
8.0
7.2
4
4
1
0
Gas_1
PEMB-51
CCGT
3.0
3.0
3.0
11.0
16.0
7.2
5
5
1
0
Gas_1
PEMB-41
CCGT
9.0
4.0
7.0
7.0
10.0
13.0
8.3
6
6
1
0
Gas_1
WBURB-1
CCGT
6.0
9.0
22.0
2.0
7.0
5.0
8.5
7
7
1
0
Gas_1
PEMB-31
CCGT
14.0
6.0
13.0
6.0
4.0
9.0
8.7
8
8
1
0
Gas_1
GRMO-1
CCGT
2.0
7.0
10.0
24.0
11.0
6.0
10.0
9
9
1
0
Gas_1
PEMB-11
CCGT
21.0
2.0
9.0
10.0
9.0
14.0
10.8
10
10
2
0
Gas_2
STAY-1
CCGT
19.0
12.0
5.0
23.0
6.0
7.0
12.0
11
11
2
0
Gas_2
GRAI-7
CCGT
10.0
27.0
15.0
9.0
15.0
11.0
14.5
12
12
2
0
Gas_2
DIDCB6
CCGT
28.0
11.0
11.0
8.0
19.0
15.0
15.3
13
13
2
0
Gas_2
SCCL-3
CCGT
17.0
16.0
31.0
3.0
18.0
10.0
15.8
14
14
2
0
Gas_2
STAY-4
CCGT
12.0
8.0
20.0
18.0
14.0
23.0
15.8
14
14
2
0
Gas_2
CDCL-1
CCGT
13.0
22.0
8.0
25.0
12.0
16.0
16.0
16
16
2
0
Gas_2
STAY-3
CCGT
8.0
17.0
17.0
20.0
13.0
22.0
16.2
17
17
2
0
Gas_2
MRWD-1
CCGT
19.0
26.0
5.0
19.0
17.3
18
18
2
0
Gas_2
WBURB-3
CCGT
24.0
14.0
17.0
17.0
18.0
19
19
3
0
Gas_3
WBURB-2
CCGT
14.0
21.0
12.0
31.0
18.0
19.2
20
20
3
0
Gas_3
GYAR-1
CCGT
26.0
14.0
17.0
20.0
21.0
19.6
21
21
3
0
Gas_3
STAY-2
CCGT
18.0
20.0
18.0
21.0
24.0
20.0
20.2
22
22
3
0
Gas_3
KLYN-A-1
CCGT
24.0
12.0
19.0
27.0
20.5
23
23
3
0
Gas_3
SHOS-1
CCGT
16.0
15.0
28.0
15.0
29.0
27.0
21.7
24
24
3
0
Gas_3
DIDCB5
CCGT
10.0
35.0
22.0
22.3
25
25
3
0
Gas_3
CARR-1
CCGT
33.0
26.0
27.0
22.0
4.0
22.4
26
26
3
0
Gas_3
LAGA-1
CCGT
15.0
13.0
29.0
32.0
23.0
24.0
22.7
27
27
3
0
Gas_3
CARR-2
CCGT
24.0
25.0
27.0
29.0
21.0
12.0
23.0
28
28
3
0
Gas_3
GRAI-8
CCGT
11.0
28.0
36.0
16.0
26.0
25.0
23.7
29
29
4
0
Gas_4
SCCL-2
CCGT
29.0
16.0
28.0
25.0
24.5
30
30
4
0
Gas_4
LBAR-1
CCGT
19.0
25.0
31.0
28.0
25.8
31
31
4
0
Gas_4
CNQPS-2
CCGT
20.0
32.0
32.0
26.0
27.5
32
32
4
0
Gas_4
SPLN-1
CCGT
23.0
30.0
30.0
27.7
33
33
4
0
Gas_4
DAMC-1
CCGT
23.0
21.0
38.0
34.0
29.0
34
34
4
0
Gas_4
KEAD-2
CCGT
30.0
30.0
35
35
4
0
Gas_4
SHBA-1
CCGT
26.0
23.0
35.0
37.0
30.3
36
36
4
0
Gas_4
HUMR-1
CCGT
22.0
30.0
37.0
37.0
33.0
28.0
31.2
37
37
4
0
Gas_4
CNQPS-4
CCGT
27.0
33.0
35.0
30.0
31.3
38
38
5
0
Gas_5
CNQPS-1
CCGT
25.0
40.0
33.0
32.7
39
39
5
0
Gas_5
SEAB-1
CCGT
32.0
34.0
36.0
29.0
32.8
40
40
5
0
Gas_5
PETEM1
CCGT
35.0
35.0
41
41
5
0
Gas_5
ROCK-1
CCGT
31.0
34.0
38.0
38.0
35.3
42
42
5
0
Gas_5
SEAB-2
CCGT
31.0
39.0
39.0
34.0
35.8
43
43
5
0
Gas_5
WBURB-43
COAL
32.0
37.0
40.0
39.0
31.0
35.8
44
1
0
1
Coal_1
FDUNT-1
CCGT
36.0
36.0
45
44
5
0
Gas_5
COSO-1
CCGT
30.0
42.0
36.0
36.0
45
44
5
0
Gas_5
WBURB-41
COAL
33.0
38.0
41.0
40.0
32.0
36.8
47
2
0
1
Coal_1
FELL-1
CCGT
34.0
39.0
43.0
41.0
33.0
38.0
48
46
5
0
Gas_5
KEAD-1
CCGT
43.0
43.0
49
47
5
0
Gas_5
I have tried to do it the same way I got average rank, which is a rank of the average of inputs in the dataframe but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts()[0]) # This is how you access count of CCGT
display(df['Technology'].value_counts()[1])
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to calculate quantiles. You don't have to manually define what a quantile is.
Refer to the documentation and other websites:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method = "dense", ascending = True)
df
method{‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.

Is there a way to replace a whole pandas dataframe row using ffill, if one value of a specific column is NaN?

I am trying to sort a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this although i feel like it's a mismatch of a few commands
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to filter the NaNs I fill using ffill if one of the columns is NaN. i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row IFF the value of A is NaN, whilst leaving C,0 and D,0 as NaN. Giving the below dataframe
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify, the ONLY rows that get replaced with ffill are 3,8 and the reason is because the value of column A in rows 3 and 8 are NaN
Thanks
---Update---
When I'm debugging and evaluate the expression : df.loc[df['A'].isna(), :]
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume whats happening here is, I then attempt ffill on this new dataframe only containing 3 and 8 and obviously i cant ffill NaNs with NaNs.
Change values only to those row that start with nan
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. The take those same rows from the forward filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
you just want to fill (DataFrame.ffill ) where(DataFrame.where) df['A'] is nan and the rest leave it as before (df):
df=df.ffill().where(df['A'].isna(),df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0

How to prioritise columns based on overall significance level of values?

How should I select columns based on significance level and which factors should I choose to decide the priority of the column of the given DataFrame
A B C D E F G H
5 61.0 77.0 40.0 46.0 60.0 73.0 66.0 1.0
4 60.0 51.0 49.0 59.0 59.0 67.0 69.0 3.0
3 35.0 32.0 48.0 57.0 43.0 34.0 34.0 4.0
2 17.0 16.0 22.0 12.0 15.0 5.0 5.0 1.0
1 10.0 7.0 11.0 3.0 4.0 5.0 8.0 4.0
At first, each index is a rating by the users for different functionalities (columns). I concatenated the count_values of each column and now trying to understand which functionalities(columns) are more relevant to the users. From the above df, I have to select the top three columns.
If anyone needs further clarification, please let me know.

Subtract a batch of columns in pandas

I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively a element-wise matrix subtraction.
Just for reference, this used to be my numpy solution for this
def subtract_baseline(data, baseline_columns, features_columns):
"""Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
assert len(features_columns)%len(baseline_columns)==0, "The number of feature columns is not divisible by baseline columns"
num_blocks = len(features_columns)/len(baseline_columns)
block_size = len(baseline_columns)
for i in range(num_blocks):
#Grab each feature block and subract the baseline
init_col = block_size*i+features_columns[0]
final_col = init_col+block_size
data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:,baseline_columns])
return data
To ilustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 10.0 11.0 12.0 13.0 1.0 10.0
1 20.0 21.0 22.0 23.0 1.0 10.0
2 30.0 31.0 32.0 33.0 1.0 10.0
3 40.0 41.0 42.0 43.0 1.0 10.0
4 50.0 51.0 52.0 53.0 1.0 10.0
5 60.0 61.0 62.0 63.0 1.0 10.0
The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting G1P1 & G1P2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what my for loop does in the original function).
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 11.0 3.0 1.0 10.0
1 19.0 11.0 21.0 13.0 1.0 10.0
2 29.0 21.0 31.0 23.0 1.0 10.0
3 39.0 31.0 41.0 33.0 1.0 10.0
4 49.0 41.0 51.0 43.0 1.0 10.0
5 59.0 51.0 61.0 53.0 1.0 10.0
Note that labels for the dataframe should not change, and ideally I'd want a method that relies on the columns indexes, not labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original function in numpy worked, the parameters baseline_columns and features_columns were just lists of the columns indexes.
After this the baseline columns would be deleted all together from the dataframe, as their function has already been fulfilled.
I tried doing this for just 1 batch using iloc but I get Nan values
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
L1P1 L1P2 L2P1 L2P2 G1P1 G1P2
0 NaN NaN 12.0 13.0 1.0 10.0
1 NaN NaN 22.0 23.0 1.0 10.0
2 NaN NaN 32.0 33.0 1.0 10.0
3 NaN NaN 42.0 43.0 1.0 10.0
4 NaN NaN 52.0 53.0 1.0 10.0
5 NaN NaN 62.0 63.0 1.0 10.0
Adding .values at the end , pandas dataframe will search the column and index match to do the subtract , since the column is not match for 0,1 and 4,5 it will return NaN
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 12.0 13.0 1.0 10.0
1 19.0 11.0 22.0 23.0 1.0 10.0
2 29.0 21.0 32.0 33.0 1.0 10.0
3 39.0 31.0 42.0 43.0 1.0 10.0
4 49.0 41.0 52.0 53.0 1.0 10.0
5 59.0 51.0 62.0 63.0 1.0 10.0
Is there a reason you want to do it in one line? I.e. would it be okay for your purposes to do it with two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.

How to calculate counts on pandas pivot_table

I have data something like this
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values to be output. The values in the table I show are what I want but I don't want MaritalStatus to be what is counted. If there is a Nan in that column, e.g. change the column definition to
'MaritalStatus':[random.choice(['Not Married', 'Married'])
for i in range(299)].append(np.NaN)
This is the output (both with and without values = 'MaritalStatus',)
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0
You can fill the nan values with 0 and then find the len i.e
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
df = df.fillna(0)
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
values='MaritalStatus',
aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0
If you cut the dataframe down to just the columns that are to be part of the final index counting rows works without having to refer to another column.
pd.pivot_table(testdata[['JobCategory', 'Region']],
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len)
Output is the same as in the question except the line with "MaritialStatus" is not present.
The len aggregation function counts the number of times a value of MaritalStatus appears along a particular combination of JobCategory - Region. Thus you're counting the number of JobCategory - Region instances, which is what you're expecting I guess.
EDIT
We can assign key value to each records and count or size that value.
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
print(pd.pivot_table(df.assign(key=1),
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64

Categories

Resources