How to prioritise columns based on overall significance level of values? - python

How should I select columns based on significance level, and which factors should I use to decide the priority of each column of the given DataFrame?
A B C D E F G H
5 61.0 77.0 40.0 46.0 60.0 73.0 66.0 1.0
4 60.0 51.0 49.0 59.0 59.0 67.0 69.0 3.0
3 35.0 32.0 48.0 57.0 43.0 34.0 34.0 4.0
2 17.0 16.0 22.0 12.0 15.0 5.0 5.0 1.0
1 10.0 7.0 11.0 3.0 4.0 5.0 8.0 4.0
Each index value is a rating given by the users for different functionalities (columns). I concatenated the value_counts of each column and am now trying to understand which functionalities (columns) are most relevant to the users. From the above df, I have to select the top three columns.
If anyone needs further clarification, please let me know.
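To make the question concrete, here is a minimal sketch of one possible ranking (not necessarily the right significance measure for your use case), assuming the index holds the rating (1-5) and each cell holds the count of users who gave that rating: weight each count by its rating and rank columns by the resulting mean rating.
import pandas as pd

data = {'A': [61, 60, 35, 17, 10], 'B': [77, 51, 32, 16, 7],
        'C': [40, 49, 48, 22, 11], 'D': [46, 59, 57, 12, 3],
        'E': [60, 59, 43, 15, 4],  'F': [73, 67, 34, 5, 5],
        'G': [66, 69, 34, 5, 8],   'H': [1, 3, 4, 1, 4]}
df = pd.DataFrame(data, index=[5, 4, 3, 2, 1])       # index = rating

weighted_sum = df.mul(df.index, axis=0).sum()        # sum of rating * count per column
mean_rating = weighted_sum / df.sum()                # average rating per column
print(mean_rating.nlargest(3).index.tolist())        # top three functionalities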

Related

how to ffill a multi-index dataframe based on a first level mask

I have a multi-index DataFrame.
import numpy as np
import pandas as pd
from itertools import product
arrays = [['bar', 'baz','foo'], range(4)]
tuples = list(product(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
multi_ind=pd.DataFrame(np.random.randn(6, len(tuples)), index=range(6), columns=index)
Some values are nans:
multi_ind.loc[3,('bar',2)]=np.nan
multi_ind.loc[3,('bar',3)]=np.nan
multi_ind.loc[4,('bar',1)]=np.nan
For 'bar' I would like to fill all NaNs except the last, as described in:
Forward fill all except last value in python pandas dataframe
mask=multi_ind['bar']
last_valid_column_per_row = mask.apply(pd.Series.last_valid_index,axis=1)
mask=mask.apply(lambda series:series[:int(last_valid_column_per_row.loc[series.name])].ffill(),axis=1)
Then I would like to ffill() also the other first levels (e.g. baz, foo), using the same logic as for bar (up to the last valid index from df['bar']), and I would also like to set to NaN any value that was still NaN in bar.
How can I achieve that efficiently?
Now I am doing the following, but it is very slow...
df_as_dict = {}
df = df.ffill(axis=1)  # start by ffilling
for first_level, gr in df.groupby(level=0, axis=1):
    gr[first_level][mask.isnull()] = np.nan  # then remove the nans (they should only be at the end)
    df_as_dict[first_level] = gr[first_level]
The code based on last_valid_index (in the indicated post) actually fills NaNs along the given axis:
without initial NaN cells (ffill has no previous value to take as the source),
without trailing NaN cells (whatever their number), simply because last_valid_index terminates the action just before the trailing continuous run of NaNs,
but if you are happy with this arrangement, let it be.
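A minimal illustration of that behaviour on a single Series (hypothetical data; with a default RangeIndex the label returned by last_valid_index() equals the position, so the "+ 1" makes the slice include the last valid value):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan, np.nan])
filled = s.iloc[: s.last_valid_index() + 1].ffill()
# 0    NaN   <- initial NaN stays (nothing before it to fill from)
# 1    1.0
# 2    1.0   <- filled from the previous value
# 3    2.0   <- last valid value, kept thanks to the "+ 1"
# trailing NaNs (positions 4 and 5) are cut off entirely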
I created the test DataFrame in the following, more concise way:
arrays = [['bar', 'baz','foo'], range(4)]
cols = pd.MultiIndex.from_product(arrays, names=['first', 'second'])
np.random.seed(2)
arr = np.arange(1, 6 * 12 + 1, dtype=float).reshape(6, -1)
# Where to put NaN (x / y)
ind = (np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5, 5]),
       np.array([1, 2, 6, 1, 3, 5, 10, 2, 3, 10, 11]))
arr[ind] = np.nan
multi_ind = pd.DataFrame(arr, columns=cols)
so that it contains:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 NaN NaN 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 NaN 20.0 21.0 22.0 23.0 24.0
2 25.0 NaN 27.0 NaN 29.0 NaN 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 NaN 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
To get your result, run:
result = multi_ind.stack(level=0).apply(
    lambda row: row[: row.last_valid_index() + 1].ffill(), axis=1)\
    .unstack(level=1).swaplevel(axis=1).reindex(columns=multi_ind.columns)
Note that your last_valid_column_per_row is not needed.
It is enough to pass axis=1 to operate on rows, instead of
columns (like in the indicated post).
The result is:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 18.0 20.0 21.0 22.0 23.0 24.0
2 25.0 25.0 27.0 NaN 29.0 29.0 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 46.0 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
Details:
stack(level=0) - put bar, baz and foo "fragments"
in consecutive rows.
apply(….ffill(), axis=1) - fill each row, without the trailing
sequence of NaN (if any). Note that I added + 1 in order to
include the last non-NaN value in the result. Otherwise the last
column would have been dropped.
unstack(level=1) - restore the previous ("wide") arrangement,
but unfortunately the order of column MultiIndex levels is reversed.
swaplevel(axis=1) - restore the original order of column levels,
but unfortunately the order of column names is wrong.
reindex(…) - restore the original column order.
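The question also asked to set to NaN, in the other first levels, every cell whose bar counterpart is still NaN. A possible follow-up on result, assuming the three first-level blocks are equally wide and appear in the order bar, baz, foo:
import numpy as np

bar_nan = result['bar'].isna().to_numpy()    # where 'bar' is still NaN
result = result.mask(np.tile(bar_nan, 3))    # repeat the mask for the bar, baz and foo blocks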

Adding column to dataframe based on another dataframe using pandas

I need to create a new column in dataframe based on intervals from another dataframe.
For example, I have a dataframe whose time column contains the interval boundaries, and I want to create a column in another dataframe based on the intervals defined by that time column.
I think a practical example is simpler to understand:
Dataframe with intervals
df1
time value var2
0 1.0 34.0 35.0
1 4.0 754.0 755.0
2 9.0 768.0 769.0
3 12.0 65.0 66.0
Dataframe that I need to filter
df2
time value var2
0 1.0 23.0 23.0
1 2.0 43.0 43.0
2 3.0 76.0 12.0
3 4.0 88.0 22.0
4 5.0 64.0 45.0
5 6.0 98.0 33.0
6 7.0 76.0 11.0
7 8.0 56.0 44.0
8 9.0 23.0 22.0
9 10.0 54.0 44.0
10 11.0 65.0 22.0
11 12.0 25.0 25.0
should result
df3
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
EDIT: As Shubham Sharma said, it's not a filter; I want to add a new column based on the intervals in the other dataframe.
You can use pd.cut to categorize the time in df2 into discrete intervals based on the time in df1, then use Series.factorize to obtain a numeric array identifying the distinct ordered values.
df2['interval'] = pd.cut(df2['time'], df1['time'], include_lowest=True)\
                    .factorize(sort=True)[0] + 1
Result:
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
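A roughly equivalent variant, assuming the edges in df1['time'] are already sorted, is to let pd.cut return the integer bin codes directly instead of factorizing afterwards:
# labels=False makes pd.cut return the zero-based bin number for each row
df2['interval'] = pd.cut(df2['time'], bins=df1['time'],
                         labels=False, include_lowest=True) + 1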

Updating row values in a dataframe by finding similar rows based on a defined number of similar column values

I am trying to update rows in my dataframe to account for missing data by using a similarity threshold: for each row with missing values I compare how many of its values match those of other rows. Below is what I am trying, but it is not updating the rows despite identifying the correct rows to fill from. The current threshold is that over half of the values must be the same, so in this example any row with 3 or more matching values qualifies, and I only want to fill in values that already exist within the dataframe.
threshold = .5
for index1, row1 in df.iterrows():
    if row1.isnull().values.any():
        for index2, row2 in df.iterrows():
            count = 0
            for col in df.columns:
                print(col)
                if row1[col] == row2[col] and index1 != index2:
                    count = count + 1
                else:
                    count = count
            if count > threshold*len(df.columns) and count < len(df.columns):
                row1.at[index1] = index2
                break
My input dataframe looks like this, so an example of what I am looking for is that row 2 should have the NaN replaced with the value of the column from row 1:
CODE B2004 B2014 C2100 X3200 X1300
ID
20326 40.0 40.0 29.0 39.0 49.0
20338 40.0 NaN 29.0 39.0 49.0
20361 40.0 40.0 NaN 59.0 89.0
20381 40.0 40.0 NaN 59.0 NaN
20384 40.0 40.0 49.0 59.0 89.0
12385 40.0 40.0 29.0 29.0 55.0
12485 40.0 NaN NaN NaN 49.0
12492 35.0 35.0 NaN NaN 49.0
12685 35.0 35.0 29.0 39.0 49.0
12687 40.0 NaN 29.0 29.0 55.0
The expected dataframe would be this:
CODE B2004 B2014 C2100 X3200 X1300
ID
20326 40.0 40.0 29.0 39.0 49.0
20338 40.0 40.0 29.0 39.0 49.0
20361 40.0 40.0 49.0 59.0 89.0
20381 40.0 40.0 49.0 59.0 89.0
20384 40.0 40.0 49.0 59.0 89.0
12385 40.0 40.0 29.0 29.0 55.0
12485 40.0 NaN NaN NaN 49.0
12492 35.0 35.0 29.0 29.0 49.0
12685 35.0 35.0 29.0 39.0 49.0
12687 40.0 40.0 29.0 29.0 55.0
Any thoughts or ideas are appreciated!
I figured out what was wrong. Since row1 is only a copy of the data in the df, assigning to it did not actually change the dataframe. By changing the second-to-last line to
df.loc[index1] = row2
I was able to solve the issue.
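For reference, a sketch of the loop with that fix folded in (same variable names as in the question; NaN never compares equal, so missing cells never count as matches):
threshold = 0.5
for index1, row1 in df.iterrows():
    if row1.isnull().values.any():
        for index2, row2 in df.iterrows():
            if index1 == index2:
                continue
            count = (row1 == row2).sum()                      # number of matching columns
            if threshold * len(df.columns) < count < len(df.columns):
                df.loc[index1] = row2                         # assign to df, not to the row copy
                break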

Subtract a batch of columns in pandas

I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively an element-wise matrix subtraction.
Just for reference, this used to be my numpy solution for this
def subtract_baseline(data, baseline_columns, features_columns):
    """Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
    assert len(features_columns) % len(baseline_columns) == 0, "The number of feature columns is not divisible by baseline columns"
    num_blocks = len(features_columns) // len(baseline_columns)  # integer division, so range() below gets an int
    block_size = len(baseline_columns)
    for i in range(num_blocks):
        # Grab each feature block and subtract the baseline
        init_col = block_size * i + features_columns[0]
        final_col = init_col + block_size
        data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:, baseline_columns])
    return data
To illustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 10.0 11.0 12.0 13.0 1.0 10.0
1 20.0 21.0 22.0 23.0 1.0 10.0
2 30.0 31.0 32.0 33.0 1.0 10.0
3 40.0 41.0 42.0 43.0 1.0 10.0
4 50.0 51.0 52.0 53.0 1.0 10.0
5 60.0 61.0 62.0 63.0 1.0 10.0
The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting BP1 & BP2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what the for loop in my original function does).
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 11.0 3.0 1.0 10.0
1 19.0 11.0 21.0 13.0 1.0 10.0
2 29.0 21.0 31.0 23.0 1.0 10.0
3 39.0 31.0 41.0 33.0 1.0 10.0
4 49.0 41.0 51.0 43.0 1.0 10.0
5 59.0 51.0 61.0 53.0 1.0 10.0
Note that the labels of the dataframe should not change, and ideally I'd want a method that relies on the column indexes, not the labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original numpy function worked; the parameters baseline_columns and features_columns were just lists of column indexes.
After this, the baseline columns would be deleted altogether from the dataframe, as their function has already been fulfilled.
I tried doing this for just one batch using iloc, but I get NaN values:
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 NaN NaN 12.0 13.0 1.0 10.0
1 NaN NaN 22.0 23.0 1.0 10.0
2 NaN NaN 32.0 33.0 1.0 10.0
3 NaN NaN 42.0 43.0 1.0 10.0
4 NaN NaN 52.0 53.0 1.0 10.0
5 NaN NaN 62.0 63.0 1.0 10.0
Add .values at the end: without it, pandas aligns on column and index labels before subtracting, and since the column labels at positions 0, 1 do not match those at positions 4, 5, the result is NaN.
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 12.0 13.0 1.0 10.0
1 19.0 11.0 22.0 23.0 1.0 10.0
2 29.0 21.0 32.0 33.0 1.0 10.0
3 39.0 31.0 42.0 43.0 1.0 10.0
4 49.0 41.0 52.0 53.0 1.0 10.0
5 59.0 51.0 62.0 63.0 1.0 10.0
Is there a reason you want to do it in one line? I.e. would it be okay for your purposes to do it with two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.
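Generalising to any number of blocks while staying position-based (as in the original numpy function), a possible sketch using the toy DataFrame's column positions and the same .values/.to_numpy() trick, followed by dropping the baseline columns:
baseline_cols = [4, 5]                     # positions of BP1, BP2
feature_cols = [0, 1, 2, 3]                # positions of all feature columns
block = len(baseline_cols)
baseline = df.iloc[:, baseline_cols].to_numpy()
for start in range(feature_cols[0], feature_cols[0] + len(feature_cols), block):
    # .to_numpy() avoids label alignment, exactly like the .values trick above
    df.iloc[:, start:start + block] = df.iloc[:, start:start + block].to_numpy() - baseline
df = df.drop(columns=df.columns[baseline_cols])   # baseline columns no longer needed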

How to calculate counts on pandas pivot_table

I have data something like this
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values in the output. The values in the table I show are what I want, but I don't want MaritalStatus to be what is counted. If there is a NaN in that column, e.g. change the column definition to
'MaritalStatus': [random.choice(['Not Married', 'Married'])
                  for i in range(299)].append(np.NaN)
This is the output (both with and without values = 'MaritalStatus',)
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0
You can fill the NaN values with 0 and then find the len, i.e.
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
df = df.fillna(0)
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     values='MaritalStatus',
                     aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0
If you cut the dataframe down to just the columns that are to be part of the final index, counting rows works without having to refer to another column.
pd.pivot_table(df[['JobCategory', 'Region']],
               index='JobCategory',
               columns='Region',
               margins=True,
               aggfunc=len)
Output is the same as in the question except the line with "MaritalStatus" is not present.
The len aggregation function counts the number of times a value of MaritalStatus appears along a particular combination of JobCategory - Region. Thus you're counting the number of JobCategory - Region instances, which is what you're expecting I guess.
EDIT
We can assign a key value to each record and count (or size) that value.
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
print(pd.pivot_table(df.assign(key=1),
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len,
                     values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
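For completeness, pd.crosstab builds the same count table directly, without needing a values column at all:
# counts JobCategory-Region combinations, with row/column totals
print(pd.crosstab(df['JobCategory'], df['Region'], margins=True))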
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter; it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len,
                     values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64
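If the wide layout with margins is wanted from Option 2, the groupby result can be reshaped, for example (a sketch):
counts = df.groupby(['JobCategory', 'Region']).size().unstack(fill_value=0)
counts['All'] = counts.sum(axis=1)      # row totals
counts.loc['All'] = counts.sum()        # column totals (grand total ends up in 'All'/'All')
print(counts)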
