How to join two dataframes with the same categories? - python

Now, I have two dataframes. I used groupby and count() to export this dataframe (df1), but when I used groupby to count the total number of each category, it filtered out every category whose count is 0. How can I use Python to get the expected outcome below?
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
However, I would like a dataframe that also includes all of the required categories
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR).
Expected dataframe (the counts for 'LLP' and 'SPR' are 0):
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN

>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
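
An equivalent sketch, assuming the same df and categories list as above: set 'Cat' as the index and reindex over the full category list, which inserts all-NaN rows for the missing categories and keeps the list's order.
result = df.set_index('Cat').reindex(categories).reset_index()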

One easy approach is to fill the NaN values with 0 before doing the groupby, so the previously-missing entries are counted as zeros. Note that fillna returns a new dataframe unless you assign the result back:
df = df.fillna(0)

Related

Convert two pandas rows into one

I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should merge into one row, as shown above.
You are looking for a pivot, which will end up giving you a MultiIndex on the columns. You'll need to join those column levels to get the suffixes you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
# flatten the (value, TYPE) column MultiIndex into 'TYPE_value' labels
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
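
For reference, a minimal sketch (a hypothetical reconstruction of the input frame above) to try the pivot end to end:
import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 2, 2, 3, 4, 5, 5],
    'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
    'A':    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    'B':    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})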

Replace row in pandas with next row only when the entire row (each column) has NaN values

I have a DataFrame where I want to replace only the rows that are NaN in every column with the row below them. I tried solutions from multiple threads and used ffill, but that resulted in filling a few cells and not the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by flagging rows that have at least one value, taking the cumulative sum in reversed row order (so each all-NaN row lands in the same group as the next non-empty row), and passing the result to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
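
A shorter alternative sketch, assuming the same df and a monotonic index (reindex with a fill method requires one): drop the all-NaN rows, then reindex back to the original labels with backfilling, which copies the next surviving row into each gap.
df = df.dropna(how='all').reindex(df.index, method='bfill')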

How to append to individual columns in a Pandas DataFrame

So I want to add/append data to a specific pandas DataFrame column without causing NaN values in the remaining columns.
I.e.:
DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range(3):  # range(3) produces the output shown below
    DataFrame = DataFrame.append({"column1": int(i)}, ignore_index=True)
    DataFrame = DataFrame.append({"column2": float(i * 2)}, ignore_index=True)
    DataFrame = DataFrame.append({"column3": int(i * 5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know that in this case I could use one .append call for all the columns. But I have cases where the data to be appended varies based on multiple conditions, so I'd like to know whether it's possible to append to a single column of a dataframe without producing NaN values in the remaining columns, so that I can avoid writing hundreds of if/else statements.
Alternatively, does anyone have a good idea for 'collapsing' the NaN values (removing a NaN value without removing the entire row, so that if there is a NaN at index 0 in column 3 and an integer 5 at index 1 in the same column, the 5 moves up to index 0)?
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2','column3']]=DataFrame[['column2','column3']].bfill()
Output (shown here for a loop over range(5)):
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN 6.0 10.0
9 3.0 6.0 15.0
10 NaN 6.0 15.0
11 NaN 8.0 15.0
12 4.0 8.0 20.0
13 NaN 8.0 20.0
14 NaN NaN 20.0
Then remove the remaining NaN rows:
DataFrame.dropna(inplace=True)
Output:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
9 3.0 6.0 15.0
12 4.0 8.0 20.0
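
A sketch of an append-free pattern that avoids the intermediate NaN values entirely (same loop logic as the question; DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this is also more future-proof): collect the values per column in plain lists, then build the frame once.
import pandas as pd

cols = {"column1": [], "column2": [], "column3": []}
for i in range(3):
    cols["column1"].append(int(i))
    cols["column2"].append(float(i * 2))
    cols["column3"].append(int(i * 5))
DataFrame = pd.DataFrame(cols)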

Subtract a batch of columns in pandas

I am transitioning to pandas for handling my csv datasets. I am currently trying to do in pandas what I was already doing very easily in numpy: subtract a group of columns from another group several times. This is effectively an element-wise matrix subtraction.
Just for reference, this was my numpy solution:
import numpy

def subtract_baseline(data, baseline_columns, features_columns):
    """Takes in a list of baseline columns and feature columns, and subtracts the baseline values from all features"""
    assert len(features_columns) % len(baseline_columns) == 0, "The number of feature columns is not divisible by baseline columns"
    num_blocks = len(features_columns) // len(baseline_columns)  # integer division, so range() below accepts it
    block_size = len(baseline_columns)
    for i in range(num_blocks):
        # Grab each feature block and subtract the baseline
        init_col = block_size * i + features_columns[0]
        final_col = init_col + block_size
        data[:, init_col:final_col] = numpy.subtract(data[:, init_col:final_col], data[:, baseline_columns])
    return data
To illustrate better, we can create the following toy dataset:
data = [[10,11,12,13,1,10],[20,21,22,23,1,10],[30,31,32,33,1,10],[40,41,42,43,1,10],[50,51,52,53,1,10],[60,61,62,63,1,10]]
df = pd.DataFrame(data,columns=['L1P1','L1P2','L2P1','L2P2','BP1','BP2'],dtype=float)
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 10.0 11.0 12.0 13.0 1.0 10.0
1 20.0 21.0 22.0 23.0 1.0 10.0
2 30.0 31.0 32.0 33.0 1.0 10.0
3 40.0 41.0 42.0 43.0 1.0 10.0
4 50.0 51.0 52.0 53.0 1.0 10.0
5 60.0 61.0 62.0 63.0 1.0 10.0
The correct output would be the result of grabbing the values in L1P1 & L1P2 and subtracting BP1 & BP2 (AKA the baseline), then doing the same for L2P1 & L2P2 and any other feature columns there might be (this is what the for loop in the original function does).
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 11.0 3.0 1.0 10.0
1 19.0 11.0 21.0 13.0 1.0 10.0
2 29.0 21.0 31.0 23.0 1.0 10.0
3 39.0 31.0 41.0 33.0 1.0 10.0
4 49.0 41.0 51.0 43.0 1.0 10.0
5 59.0 51.0 61.0 53.0 1.0 10.0
Note that the dataframe's labels should not change, and ideally I'd want a method that relies on the column indexes, not the labels, because the actual data block is 30 columns, not 2 as in this example. This is how my original numpy function worked: the parameters baseline_columns and features_columns were just lists of column indexes.
After this the baseline columns would be deleted all together from the dataframe, as their function has already been fulfilled.
I tried doing this for just one block using iloc, but I get NaN values:
df.iloc[:,[0,1]] = df.iloc[:,[0,1]] - df.iloc[:,[4,5]]
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 NaN NaN 12.0 13.0 1.0 10.0
1 NaN NaN 22.0 23.0 1.0 10.0
2 NaN NaN 32.0 33.0 1.0 10.0
3 NaN NaN 42.0 43.0 1.0 10.0
4 NaN NaN 52.0 53.0 1.0 10.0
5 NaN NaN 62.0 63.0 1.0 10.0
Adding .values at the end fixes this: when subtracting one DataFrame from another, pandas aligns on column and index labels, and since the column labels at positions 0,1 do not match those at 4,5, the aligned result is NaN. Using .values bypasses the alignment and subtracts positionally:
df.iloc[:,[0,1]]=df.iloc[:,[0,1]].values - df.iloc[:,[4,5]].values
df
Out[176]:
L1P1 L1P2 L2P1 L2P2 BP1 BP2
0 9.0 1.0 12.0 13.0 1.0 10.0
1 19.0 11.0 22.0 23.0 1.0 10.0
2 29.0 21.0 32.0 33.0 1.0 10.0
3 39.0 31.0 42.0 43.0 1.0 10.0
4 49.0 41.0 52.0 53.0 1.0 10.0
5 59.0 51.0 62.0 63.0 1.0 10.0
Is there a reason you want to do it in one line? I.e., would it be okay for your purposes to do it in two lines:
df.iloc[:,0] = df.iloc[:,0] - df.iloc[:,4]
df.iloc[:,1] = df.iloc[:,1] - df.iloc[:,5]
These two lines achieve what I think is your intent.
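
For the full batch case, a hypothetical pandas port of the numpy helper above, on the assumption that baseline_cols and feature_cols are positional column indexes as in the original function:
def subtract_baseline_pd(df, baseline_cols, feature_cols):
    # Subtract the baseline block from each same-sized feature block, positionally.
    block = len(baseline_cols)
    assert len(feature_cols) % block == 0, "feature columns not divisible into baseline-sized blocks"
    baseline = df.iloc[:, baseline_cols].values
    for start in range(0, len(feature_cols), block):
        idx = feature_cols[start:start + block]
        df.iloc[:, idx] = df.iloc[:, idx].values - baseline
    return df

For the toy dataset this would be called as subtract_baseline_pd(df, [4, 5], [0, 1, 2, 3]), after which the baseline columns can be dropped with df.drop(columns=df.columns[[4, 5]]).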

Convert specific rows of a pandas dataframe into a MultiIndex

here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows into the dataframe index, with 'ID' as level=0 and 'Year' as level=1. I tried using stack() but still cannot figure it out.
Edited: my desired output should look like below:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']  # grab the ID/Year values
dfs = df1.unstack()                                  # one row per group, columns ID and Year
dfi = df1.index
dfn = df.drop(dfi).drop('2', axis=1).unstack()       # data rows only, inner level moved to columns
dfn.set_index([dfs.ID, dfs.Year]).stack()            # index by ID/Year, restore the inner level
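
To test this, a hypothetical reconstruction of the input frame above (assuming string column labels, which is why the answer selects and drops '2' rather than 2):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 'ID'), (0, 'Year'),
     (1, 0), (1, 1), (1, 2), (1, 3), (1, 'ID'), (1, 'Year')])
df = pd.DataFrame({
    '0': [0, 1, 2, None, None, 3, 4, 5, 6, None, None],
    '1': [20, 21, 22, None, None, 23, 24, 25, 26, None, None],
    '2': [None, None, None, 11111, 2011, None, None, None, None, 11111, 2012]},
    index=idx)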
