I am trying to subset a dataframe, but I want the new dataframe to have the same size as the original dataframe.
Here are the input, my current output, and the expected output:
df_input = pd.DataFrame([[1,2,3,4,5], [2,1,4,7,6], [5,6,3,7,0]], columns=["A", "B","C","D","E"])
df_output=pd.DataFrame(df_input.iloc[1:2,:])
df_expected_output=pd.DataFrame([[0,0,0,0,0], [2,1,4,7,6], [0,0,0,0,0]], columns=["A", "B","C","D","E"])
Please suggest the way forward.
After subsetting, reindex back to the original index with reindex. This sets all values for the newly reintroduced rows to NaN, which you can replace with 0 via fillna. Since NaN is a float type, you can convert everything back to int with astype.
df_input.iloc[1:2,:].reindex(df_input.index).fillna(0).astype(int)
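For the example frame above, this sketch (the same one-liner, just assigned to a name) reproduces the expected output:
df_output = df_input.iloc[1:2, :].reindex(df_input.index).fillna(0).astype(int)
print(df_output)
#    A  B  C  D  E
# 0  0  0  0  0  0
# 1  2  1  4  7  6
# 2  0  0  0  0  0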
Setup
df = pd.DataFrame([[1,2,3,4,5], [2,1,4,7,6], [5,6,3,7,0]], columns=["A", "B","C","D","E"])
output = df.iloc[1:2,:]
You can create a mask and use multiplication:
m = df.index.isin(output.index)
m[:, None] * df
A B C D E
0 0 0 0 0 0
1 2 1 4 7 6
2 0 0 0 0 0
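An equivalent spelling that avoids raw NumPy broadcasting (a sketch reusing the same mask m) is to multiply along the index:
df.mul(m, axis=0)  # rows outside the mask are multiplied by False, i.e. zeroed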
I would use where + between: rows whose index falls between 1 and 1 keep their values, and everything else is replaced with other=0.
df_input.where(df_input.index.to_series().between(1,1),other=0)
Out[611]:
A B C D E
0 0 0 0 0 0
1 2 1 4 7 6
2 0 0 0 0 0
One more option would be to create a DataFrame of zeros and then update it with the df_input slice:
df_output = pd.DataFrame(0, index=df_input.index, columns = df_input.columns)
df_output.update(df_input.iloc[1:2,:])
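Note that update may upcast the integer columns to float, since the slice is aligned to the full index (introducing NaN) internally. If that happens, a final cast, sketched here, restores the ints:
df_output = df_output.astype(int)  # only needed if update upcast to float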
I have a large dataframe ('data') made up of one column. Each row in the column contains a string of comma-separated categories. I wish to one-hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index  A  B  C
0      1  1  1
1      0  1  1
2      0  0  0
How can I do this?
Note that you're not really dealing with one-hot encodings here: a row can contain multiple 1s (or none at all), so these are dummy/indicator variables rather than true OHEs.
str.split + stack + get_dummies + sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .groupby(level=0).sum())  # .sum(level=0) on older pandas
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply + value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, axis=1)
   .iloc[:, 1:]  # drop the column counting empty strings
   .fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]  # iloc drops the '' column
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
I figured there is a simpler answer, or at least this feels simpler than chaining multiple operations:
Make sure the column's values are separated by commas (strip any spaces first).
Use get_dummies' built-in sep parameter to specify the comma as the separator; the default is the pipe character.
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh=sof_df.mesh.str.replace(' ','')
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If the categories are controlled (you know how many there are and what they are), the best answer is by #Tejeshar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [(s.str.contains(cat_, regex=False) == True).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
   B  C
0  1  1
1  1  1
2  0  0
3  0  0
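For what it's worth, a shorter sketch (assuming the same s and cat_list) builds all the dummies and then keeps only the categories of interest; reindex fills any category that never occurs with 0:
s.str.get_dummies(sep=',').reindex(columns=cat_list, fill_value=0)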
I have a pandas dataframe with several columns. The bulk of the column names follow a pattern and can be generated in a loop, so I have built a list of them like this:
ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)]
Now I want to make a new pandas dataframe with only these columns having the index of the parent dataframe. How to do this?
OK, so you want to create a new dataframe with new column names, keeping the existing index of the original dataframe.
For some dataframe:
old_df = pd.DataFrame({'x':[0,1,2,3],'y':[10,9,8,7]})
>>>
x y
0 0 10
1 1 9
2 2 8
3 3 7
columns = list(old_df)
>>>
['x', 'y']
You can specify your new columns by doing:
y_cols = ['x_pred','y_pred']
>>> ['x_pred','y_pred']
Here, y_cols is the list of your new column names. In your code, you would replace this step with ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)].
To get the new columns, create them with a placeholder value (here 0, as it looks like you are using numeric data) and the same index as your old dataframe:
# Iterate over all column names in y_cols
for i in y_cols:
    old_df[i] = 0
>>> old_df:
x y x_pred y_pred
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
Finally, slice your dataframe to get your new dataframe with new column names, maintaining the index of the old dataframe.
df_new = old_df[y_cols]
>>>
x_pred y_pred
0 0 0
1 0 0
2 0 0
3 0 0
This works even if you have a named index:
x y x_pred y_pred
Date
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
df_new = old_df[y_cols]
x_pred y_pred
Date
0 0 0
1 0 0
2 0 0
3 0 0
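As an aside, if the new columns should simply start out as zeros, a one-step sketch (not part of the approach above) builds the new frame directly on the old index:
df_new = pd.DataFrame(0, index=old_df.index, columns=y_cols)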
My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to append the maximum value of each row as a new column. I would also like to add another column containing the name of the column where that maximum value is located.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
import pandas as pd

# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the "add another column containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda that masks the column names and joins the matches:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
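If apply turns out to be slow on a large frame, a vectorized sketch (my own variant, not from the answer above) matrix-multiplies the boolean mask with the column names and strips the trailing separator:
mask = df.loc[:, :'c'].eq(df['maxval'], axis=0)
df['maxcol'] = mask.dot(df.columns[:3] + ',').str.rstrip(',')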
Below is my dataframe with two-level indexing. I want only the outer index level to be transposed into columns. My desired output would be a 2x2 dataframe instead of the current 4x1 one. Can any of you please help?
        0
0 0   232
  1  3453
1 0   443
  1  3241
Given that you have the MultiIndex, you can use unstack() on level 0.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
print(df.unstack(level=0))
   0
   0  1
0  1  3
1  2  4
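If the leftover outer column level (the original column label 0) is unwanted, a small follow-up sketch drops it:
res = df.unstack(level=0)
res.columns = res.columns.droplevel(0)  # keep only the unstacked level
print(res)
#    0  1
# 0  1  3
# 1  2  4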
One way to do this would be to reset the index and then pivot the table, using level_1 of the index as the new index, level_0 as the columns, and 0 as the values column. Example -
df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Demo -
In [66]: index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
In [67]: df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
In [68]: df
Out[68]:
     0
0 0  1
  1  2
1 0  3
  1  4
In [69]: df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Out[69]:
level_0 0 1
level_1
0 1 3
1 2 4
Later on, if you want, you can set the .name attribute of the index and of the columns to an empty string (or whatever you want) if you don't want the level_* labels there.
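For instance, a small sketch of that cleanup:
out = df.reset_index().pivot(index='level_1', columns='level_0', values=0)
out.index.name = None    # removes 'level_1'
out.columns.name = None  # removes 'level_0'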
I got lost in the pandas docs and features trying to figure out a way to group a DataFrame by the sums of its columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any ideas to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You then group the columns (axis=1) and take the sum of each group.
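On recent pandas versions (axis=1 in groupby is deprecated as of 2.1), an equivalent sketch transposes, groups the rows by the same sums, and transposes back:
df.T.groupby(df.sum()).sum().T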
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem that way:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(axis=1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4