I have the foll. dataframe:
c3ann c3nfx c3per c4ann c4per pastr primf
c3ann 1 0 1 0 1 0 1
c3nfx 1 0 1 0 1 0 1
c3per 1 0 1 0 1 0 1
c4ann 1 0 1 0 1 0 1
c4per 1 0 1 0 1 0 1
pastr 1 0 1 0 1 0 1
primf 1 0 1 0 1 0 1
I would like to reorder the rows and columns so that the order is this:
primf pastr c3ann c3nfx c3per c4ann c4per
I can do this for just the columns like this:
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']
df = df[cols]
How do I do this such that the row headers are also changed appropriately?
You can use reindex to reorder both the columns and index at the same time.
df = df.reindex(index=cols, columns=cols)
Related
I have two dataframes X_dummy and X_var, where X_dummy contains dummies and looks like this:
dummy1 dummy2
1 0
0 1
1 0
The X_var dataframe looks contains variables and looks like this:
var1 var2
4 2
10 5
1 1
Now I want to create a dataframe containing the cellwise product of every column from X_dummy with the complete X_var dataframe. Hence, my resulting dataframe should look like, X_result:
var1dummy1 var2dummy1 var1dummy2 var2dummy2
4 2 0 0
0 0 10 5
1 1 0 0
Does anyone know how to do this without using multiple for loops?
Something like numpy broadcast
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:,None]).T)
new
Out[161]:
0 1 2 3
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
##new.columns = pd.MultiIndex.from_product([df1.columns,df2.columns]).map('_'.join)
Try:
pd.concat([(df1[i]*df2[j]).rename(f'{i}{j}') for i in df1 for j in df2], axis=1)
Output:
dummy1var1 dummy1var2 dummy2var1 dummy2var2
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
You can definitely do it with one loop:
dummies = X_dummy.astype(bool)
pd.concat([X_var.loc[dummies[c]] for c in dummies], axis=1).fillna(0).astype(int)
# var1 var2 var1 var2
#0 4 2 0 0
#1 0 0 10 5
#2 1 1 0 0
Note that because one of your dataframes contains dummies, you do not need multiplication at all.
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
So reading down vertically per column, values in the columns all begin at 0 on the first row, once they change they can never change back and can either become a 1 or a -1. I would like to re arrange the dataframe columns so that the columns in this order:
Order columns that hit 1 in the earliest row as possible
Order columns that hit -1 in the earliest row as possible
Finally the remaining rows that never changed values and remained as zero (if there are even any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
The my main data frame is 3000 rows and 61 columns long, is there any way of doing this quickly?
We have to handle the positive and negative values seperately. One way is take sum of the columns , then using sort_values , we can adjust the ordering:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
Try with pd.Series.first_valid_index
s = df.where(df.ne(0))
s1 = s.apply(pd.Series.first_valid_index)
s2 = s.bfill().iloc[0]
out = df.loc[:,pd.concat([s2,s1],axis=1,keys=[0,1]).sort_values([0,1],ascending=[False,True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can create a column to sum all the values in the rows with the following code:
df2=df.sum(axis=1)
df2
And I can get a count of the zeros in a column:
df.loc[df.a==1].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
For count matched values is possible use sum of Trues of boolean mask.
If need new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
a b c sum of 1
0 0 0 2 0
1 1 0 1 2
2 0 0 0 0
3 2 1 2 1
4 2 2 1 1
5 0 0 0 0
6 0 2 0 0
7 1 1 1 3
If need new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
a b c
0 0 0 2
1 1 0 1
2 0 0 0
3 2 1 2
4 2 2 1
5 0 0 0
6 0 2 0
7 1 1 1
sum of 1 2 2 3
Ok, I admit, I had troubles to really formulate a good header for that. So I will try to make give an example.
This is my sample dataframe:
df = pd.DataFrame([
(1,"a","good"),
(1,"a","good"),
(1,"b","good"),
(1,"c","bad"),
(2,"a","good"),
(2,"b","bad"),
(3,"a","none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although what I will need later on is that e.g. the id would be repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (last column was added) or value_counts with unstack:
df1 = df.groupby(["id", "type", 'eval'])
.size()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = df.groupby(["id", "type"])[ 'eval']
.value_counts()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But for write to excel get:
df1.to_excel('file.xlsx')
So need reset_index last.
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forget for id column, but it is duplicate column name, so need id1:
df1.insert(0, 'id1', df1.sum(axis=1))
I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
global id
df['ID'] = id
id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe...This is definitely not the way to do it..I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all unique values
Thank you.
I think this is what you need :
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0