I am transitioning to using pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively a element-wise matrix subtraction.
Just for reference, this used to be my numpy solution for this
def subtract_baseline(data, baseline_columns, features_columns):
"""Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
assert len(features_columns)%len(baseline_columns)==0, "The number of feature columns is not divisible by baseline columns"
num_blocks = len(features_columns)/len(baseline_columns)
block_size = len(baseline_columns)
for i in range(num_blocks):
#Grab each feature block and subract the baseline
init_col = block_size*i+features_columns[0]
final_col = init_col+block_size
data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:,baseline_columns])
return data
To ilustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 10.0 11.0 12.0 13.0 1.0 10.0
1 20.0 21.0 22.0 23.0 1.0 10.0
2 30.0 31.0 32.0 33.0 1.0 10.0
3 40.0 41.0 42.0 43.0 1.0 10.0
4 50.0 51.0 52.0 53.0 1.0 10.0
5 60.0 61.0 62.0 63.0 1.0 10.0
The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting G1P1 & G1P2 (AKA the baseline), then doing it again for L2P1, L2P2 and any other columns there might be (this is what my for loop does in the original function).
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 11.0 3.0 1.0 10.0
1 19.0 11.0 21.0 13.0 1.0 10.0
2 29.0 21.0 31.0 23.0 1.0 10.0
3 39.0 31.0 41.0 33.0 1.0 10.0
4 49.0 41.0 51.0 43.0 1.0 10.0
5 59.0 51.0 61.0 53.0 1.0 10.0
Note that labels for the dataframe should not change, and ideally I'd want a method that relies on the columns indexes, not labels, because the actual data block is 30 columns, not 2 like in this example. This is how my original function in numpy worked, the parameters baseline_columns and features_columns were just lists of the columns indexes.
After this the baseline columns would be deleted all together from the dataframe, as their function has already been fulfilled.
I tried doing this for just 1 batch using iloc but I get Nan values
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
L1P1 L1P2 L2P1 L2P2 G1P1 G1P2
0 NaN NaN 12.0 13.0 1.0 10.0
1 NaN NaN 22.0 23.0 1.0 10.0
2 NaN NaN 32.0 33.0 1.0 10.0
3 NaN NaN 42.0 43.0 1.0 10.0
4 NaN NaN 52.0 53.0 1.0 10.0
5 NaN NaN 62.0 63.0 1.0 10.0
Adding .values at the end , pandas dataframe will search the column and index match to do the subtract , since the column is not match for 0,1 and 4,5 it will return NaN
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 12.0 13.0 1.0 10.0
1 19.0 11.0 22.0 23.0 1.0 10.0
2 29.0 21.0 32.0 33.0 1.0 10.0
3 39.0 31.0 42.0 43.0 1.0 10.0
4 49.0 41.0 52.0 53.0 1.0 10.0
5 59.0 51.0 62.0 63.0 1.0 10.0
Is there a reason you want to do it in one line? I.e. would it be okay for your purposes to do it with two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.
Related
Now, I have two dataframe. I have use groupby. and count() function to export this dataframe(df1). When I used groupby. to count the total number of each category. It filtered out the category which the count is 0. How can I use Python to get the outcome?
However,I will like to have a dataframe which also required categories.
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
However,I will like to have a dataframe which also required categories.
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR)
Expected dataframe (The count number of 'LLP' and 'SPR' is 0)
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
One way you could easily do is to fill NaN value with 0 'before' doing a groupby function. All zero data (previously NaN value) will be totally be counted as zero.
df.fillna(0)
I have a multi-index dataframes.
import pandas as pd
from itertools import product
arrays = [['bar', 'baz','foo'], range(4)]
tuples = list(product(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
multi_ind=pd.DataFrame(np.random.randn(6, len(tuples)), index=range(6), columns=index)
Some values are nans:
multi_ind.loc[3,('bar',2)]=np.nan
multi_ind.loc[3,('bar',3)]=np.nan
multi_ind.loc[4,('bar',1)]=np.nan
For 'bar' I would like to fill all nans expect last, as described in:
Forward fill all except last value in python pandas dataframe
mask=multi_ind['bar']
last_valid_column_per_row = mask.apply(pd.Series.last_valid_index,axis=1)
mask=mask.apply(lambda series:series[:int(last_valid_column_per_row.loc[series.name])].ffill(),axis=1)
Then I would like to ffill() also the other first levels (.e.g baz,foo), using the same logic as for bar (up to last valid index from df['bar']), and I would like to set a nan also any value which was still nan in bar
How to achieve that in an efficient way?
Now I am doing the following, but it is very slow...
df_as_dict={}
df=df.ffill(axis=1) # start by ffilling
for first_level,gr in df.groupby(level=0,axis=1):
gr[first_level][(mask.isnull())]=np.nan # then remove the nans (they should be only at the end)
df_as_dict[first_level]=gr[first_level]
The code based on last_valid_index (in the indicated post) actually
fills NaN along the given axis:
without initial NaN cells (ffill has no previous value to
take as source),
without trailing NaN cells (whatever their number), just
because of last_valid_index, terminating the action just
before the trailing continuous sequence of NaNs,
but if you are happy with this arrangement, let it be.
I created the test DataFrame the following, more concise way:
arrays = [['bar', 'baz','foo'], range(4)]
cols = pd.MultiIndex.from_product(arrays, names=['first', 'second'])
np.random.seed(2)
arr = np.arange(1, 6 * 12 + 1, dtype=float).reshape(6, -1)
# Where to put NaN (x / y)
ind = (np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5, 5]),
np.array([1, 2, 6, 1, 3, 5,10, 2, 3,10,11]))
arr[ind] = np.nan
multi_ind = pd.DataFrame(arr, columns=cols)
so that it contains:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 NaN NaN 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 NaN 20.0 21.0 22.0 23.0 24.0
2 25.0 NaN 27.0 NaN 29.0 NaN 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 NaN 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
To get your result, run:
result = multi_ind.stack(level=0).apply(
lambda row: row[: row.last_valid_index() + 1].ffill(), axis=1)\
.unstack(level=1).swaplevel(axis=1).reindex(columns=multi_ind.columns)
Note that your last_valid_column_per_row is not needed.
It is enough to pass axis=1 to operate on rows, instead of
columns (like in the indicated post).
The result is:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 18.0 20.0 21.0 22.0 23.0 24.0
2 25.0 25.0 27.0 NaN 29.0 29.0 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 46.0 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
Details:
stack(level=0) - put bar, baz and foo "fragments"
in consecutive rows.
apply(….ffill(), axis=1) - fill each row, without the trailing
sequence of NaN (if any). Note that I added + 1 in order to
include the last non-NaN value in the result. Otherwise the last
column would have been dropped.
unstack(level=1) - restore the previous ("wide") arangement,
but unfortunately the order of column MultiIndex levels is reversed.
swaplevel(axis=1) - restore the original order of column levels,
but unfortunately the order of column names is wrong.
reindex(…) - restore the original column order.
I need to create a new column in dataframe based on intervals from another dataframe.
For example, I have a dataframe where in the time column I have values and I want to create column in another dataframe based on the intervals in that time column.
I think a practical example is simpler to understand:
Dataframe with intervals
df1
time value var2
0 1.0 34.0 35.0
1 4.0 754.0 755.0
2 9.0 768.0 769.0
3 12.0 65.0 66.0
Dataframe that I need to filter
df2
time value var2
0 1.0 23.0 23.0
1 2.0 43.0 43.0
2 3.0 76.0 12.0
3 4.0 88.0 22.0
4 5.0 64.0 45.0
5 6.0 98.0 33.0
6 7.0 76.0 11.0
7 8.0 56.0 44.0
8 9.0 23.0 22.0
9 10.0 54.0 44.0
10 11.0 65.0 22.0
11 12.0 25.0 25.0
should result
df3
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
EDIT: As Shubham Sharma said, it's not a filter, I want to add a new column based on intervals in other dataframe.
You can use pd.cut to categorize the time in df2 into discrete intervals based on the time in df1 then use Series.factorize to obtain a numeric array identifying distinct ordered values.
df2['interval'] = pd.cut(df2['time'], df1['time'], include_lowest=True)\
.factorize(sort=True)[0] + 1
Result:
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
How should I select columns based on significance level and which factors should I choose to decide the priority of the column of the given DataFrame
A B C D E F G H
5 61.0 77.0 40.0 46.0 60.0 73.0 66.0 1.0
4 60.0 51.0 49.0 59.0 59.0 67.0 69.0 3.0
3 35.0 32.0 48.0 57.0 43.0 34.0 34.0 4.0
2 17.0 16.0 22.0 12.0 15.0 5.0 5.0 1.0
1 10.0 7.0 11.0 3.0 4.0 5.0 8.0 4.0
At first, each index is a rating by the users for different functionalities (columns). I concatenated the count_values of each column and now trying to understand which functionalities(columns) are more relevant to the users. From the above df, I have to select the top three columns.
If anyone needs further clarification, please let me know.
I have data something like this
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values to be output. The values in the table I show are what I want but I don't want MaritalStatus to be what is counted. If there is a Nan in that column, e.g. change the column definition to
'MaritalStatus':[random.choice(['Not Married', 'Married'])
for i in range(299)].append(np.NaN)
This is the output (both with and without values = 'MaritalStatus',)
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0
You can fill the nan values with 0 and then find the len i.e
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
df = df.fillna(0)
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
values='MaritalStatus',
aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0
If you cut the dataframe down to just the columns that are to be part of the final index counting rows works without having to refer to another column.
pd.pivot_table(testdata[['JobCategory', 'Region']],
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len)
Output is the same as in the question except the line with "MaritialStatus" is not present.
The len aggregation function counts the number of times a value of MaritalStatus appears along a particular combination of JobCategory - Region. Thus you're counting the number of JobCategory - Region instances, which is what you're expecting I guess.
EDIT
We can assign key value to each records and count or size that value.
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)].append(np.NaN)})
print(pd.pivot_table(df.assign(key=1),
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64