I have a data frame with many rows, for illustration I'll use the following sample:
df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])
This results in:
A B C D
0 2 1 3 3
1 2 3 3 4
2 4 1 3 2
I would like to get a new dataframe consisting of the pair-wise equality results between the original dataframe's rows.
I expect to get the following result:
A B C D
0 1 0 1 0
1 0 1 1 0
2 0 0 1 0
as:
index 0- is row 0 vs row 1,
index 1- is row 0 vs row 2,
index 2- is row 1 vs row 2
A naive way to implement this would be:
new_df = pd.DataFrame()
for i in range(0, len(df)-1):
    for j in range(i+1, len(df)):
        new_df = new_df.append(df.iloc[i,:] == df.iloc[j,:], ignore_index=True)
Is there any efficient way to implement this operation?
This will do what you want:
import pandas as pd
from itertools import combinations
df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])
combos = list(combinations(df.index, 2))
newData = {'{} v {}'.format(*combo): (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict() for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
# A C B D
#0 v 1 1 1 0 0
#0 v 2 0 1 1 0
#1 v 2 0 1 0 0
So it takes the unique combinations of index values in pairs of two, then builds each row from the element-wise comparison of that pair.
And if you wish to reuse this data, use the following, as it makes the resulting df easier to query:
newData = {combo: (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict() for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
# A C B D
#0 1 1 1 0 0
# 2 0 1 1 0
#1 2 0 1 0 0
And to get the result in accordance with your latest request, use:
newData = [(df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict() for combo in combos]
pd.DataFrame(newData)
# A B C D
#0 1 0 1 0
#1 0 1 1 0
#2 0 0 1 0
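If performance matters, the whole comparison can also be vectorised with NumPy broadcasting instead of looping over combinations in Python. This is a sketch, not part of the original answer; np.triu_indices generates every i < j index pair at once:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D'])
arr = df.to_numpy()
i, j = np.triu_indices(len(arr), k=1)  # all row-index pairs with i < j
result = pd.DataFrame((arr[i] == arr[j]).astype(int), columns=df.columns)
print(result)
#   A  B  C  D
#0  1  0  1  0
#1  0  1  1  0
#2  0  0  1  0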
I have a Dataset with several columns and a row named "Total" that stores values between 1 and 4.
I want to iterate over each column and, based on the number stored in the row "Total", add a new row with "yes" or "no".
I also have a list "columns" for iteration.
All data are float64
I'm new to Python and I don't know if I'm doing this the right way, because I'm getting all "yes".
for c in columns:
    if dados_por_periodo.at['Total',c] < 4:
        dados_por_periodo.loc['VA'] = "yes"
    else:
        dados_por_periodo.loc['VA'] = "no"
My dataset: (posted as a screenshot, not reproduced here)
Thanks.
You can try this, I hope it works for you. Note that dados_por_periodo.loc['VA'] = "yes" assigns that value to every column of row 'VA' at once, so each loop iteration overwrites the whole row; that is why you were getting all "yes".
import pandas as pd
import numpy as np
#creation of a dummy df
columns='A B C D E F G H I'.split()
data=[np.random.choice(2, len(columns)).tolist() for col in range(3)]
data.append([1,8,1,1,2,4,1,4,1]) #not real sum of values, just dummy values for testing
index=['Otono','Inverno', 'Primavera','Totals']
df=pd.DataFrame(data, columns=columns, index=index)
df.index.name='periodo' #just adding index name
print(df)
#### Addition of the new 'Yes/No' row
new_row = pd.DataFrame([np.where(df.iloc[-1].lt(4), 'Yes', 'No')],
                       columns=df.columns, index=['VA'])
df = pd.concat([df, new_row])
df.index.name='periodo' #just adding index name
print(df)
Output:
df
A B C D E F G H I
periodo
Otono 1 0 0 1 1 0 1 1 0
Inverno 0 1 1 1 0 1 1 1 1
Primavera 1 1 0 0 1 1 1 1 0
Totals 1 8 1 1 2 4 1 4 1
df(with added row)
A B C D E F G H I
periodo
Otono 1 0 0 1 1 0 1 1 0
Inverno 0 1 1 1 0 1 1 1 1
Primavera 1 1 0 0 1 1 1 1 0
Totals 1 8 1 1 2 4 1 4 1
VA Yes No Yes Yes Yes No Yes No Yes
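A shorter route, as a sketch reusing the dummy df built above: assigning to a new index label with .loc appends the row in place, so the concat isn't needed.
df.loc['VA'] = np.where(df.loc['Totals'].lt(4), 'Yes', 'No')
print(df)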
Also, try to post a data sample next time instead of images of the dataset, so others can help you in a better way :)
I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can create a column to sum all the values in the rows with the following code:
df2=df.sum(axis=1)
df2
And I can get a count of the rows where a column matches a given value:
df.loc[df.a==1].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
To count matched values, you can sum the Trues of a boolean mask.
If need new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
a b c sum of 1
0 0 0 2 0
1 1 0 1 2
2 0 0 0 0
3 2 1 2 1
4 2 2 1 1
5 0 0 0 0
6 0 2 0 0
7 1 1 1 3
If need new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
a b c
0 0 0 2
1 1 0 1
2 0 0 0
3 2 1 2
4 2 2 1
5 0 0 0
6 0 2 0
7 1 1 1
sum of 1 2 2 3
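The same boolean-mask pattern answers the question as literally asked, counting zeros per row (a sketch on the same seeded sample):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)), columns=list('abc'))
df['zero count'] = df.eq(0).sum(axis=1)
print(df)
#   a  b  c  zero count
#0  0  0  2           2
#1  1  0  1           1
#2  0  0  0           3
#3  2  1  2           0
#4  2  2  1           0
#5  0  0  0           3
#6  0  2  0           2
#7  1  1  1           0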
I am trying to get the frequency distribution of a column containing lists of words, broken down by the class labels.
Label Numbers
0 [(a,b,c)]
0 [(d)]
0 [(e,f,g)]
1 [(a,z)]
1 [(d,x,y)]
The output should be:
0 1
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
The lists of tuples in the 'Numbers' column make manipulating the DataFrame as-is very difficult (this is not tidy data). The solution is to expand out the DataFrame so that you only have one word in the 'Numbers' column corresponding to one value in the 'Label' column. Assuming your data is in a DataFrame called df, the following code performs that operation:
rows_list = []
for index, row in df.iterrows():
    for element in row['Numbers'][0]:
        dict1 = {}
        dict1.update(key=row['Label'], value=element)
        rows_list.append(dict1)
new_df = pd.DataFrame(rows_list)
new_df.columns = ['Label', 'Numbers']
The result is
Label Numbers
0 0 a
1 0 b
2 0 c
3 0 d
4 0 e
5 0 f
6 0 g
7 1 a
8 1 z
9 1 d
10 1 x
11 1 y
Now it's a matter of pivoting:
print(new_df.pivot_table(index='Numbers', columns='Label', aggfunc=len,
fill_value=0))
The result is
Label 0 1
Numbers
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
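On pandas 0.25 or newer, explode plus crosstab reaches the same table with less code. This is a sketch, assuming 'Numbers' holds a one-element list containing a tuple, as in the sample:
import pandas as pd
df = pd.DataFrame({'Label': [0, 0, 0, 1, 1],
                   'Numbers': [[('a','b','c')], [('d',)], [('e','f','g')],
                               [('a','z')], [('d','x','y')]]})
# First explode unpacks the outer list, second unpacks the tuple into single words.
words = df.explode('Numbers').explode('Numbers')
print(pd.crosstab(words['Numbers'], words['Label']))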
I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe... this is definitely not the way to do it. I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
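For completeness, groupby(...).ngroup() does the whole job in one line. A sketch on the question's data; sort=False numbers the groups in order of first appearance, which reproduces the expected ID column:
import pandas as pd
df = pd.DataFrame({'A': [0]*6, 'B': [1]*6,
                   'C': [1, 2, 1, 3, 2, 1],
                   'D': [0, 0, 1, 0, 1, 2],
                   'E': [1, 1, 1, 1, 0, 1]})
# ngroup() hands out one number per unique (A, B, C) combination.
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()
print(df)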
I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But this doesn't look at the 'previous value of that same column'.
If you try to write df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because column 'c' does not exist yet.
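Because each value of c depends on the previous value of c, the Excel formula is recursive, and a plain row loop is the simplest faithful translation. This is a sketch on the question's sample; the string columns are cast to int first:
import pandas as pd
data = pd.DataFrame({'a': ['1','1','1','1','1'], 'b': ['0','0','1','0','0']})
a = data['a'].astype(int)
b = data['b'].astype(int)
c = []
for i in range(len(data)):
    if i > 0 and a.iloc[i] == a.iloc[i - 1]:
        c.append(min(b.iloc[i] + c[i - 1], 1))  # =MIN(P70+Q69, 1)
    else:
        c.append(b.iloc[i])                     # =P70
data['c'] = c
print(data)
#   a  b  c
#0  1  0  0
#1  1  0  0
#2  1  1  1
#3  1  0  1
#4  1  0  1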