I am trying to get the frequency distribution of a column, which is a list of words, against the class labels.
Label  Numbers
0      [(a,b,c)]
0      [(d)]
0      [(e,f,g)]
1      [(a,z)]
1      [(d,x,y)]
The output should be:
   0  1
a  1  1
b  1  0
c  1  0
d  1  1
e  1  0
f  1  0
g  1  0
x  0  1
y  0  1
z  0  1
The list of tuples in the 'Numbers' column makes manipulating the DataFrame as-is very difficult (this is not tidy data). The solution is to expand the DataFrame so that each row pairs a single word in the 'Numbers' column with one value in the 'Label' column. Assuming your data is in a DataFrame called df, the following code performs that operation:
rows_list = []
for index, row in df.iterrows():
    # row['Numbers'] is a list holding a single tuple; unwrap it and
    # emit one record per word.
    for element in row['Numbers'][0]:
        rows_list.append({'Label': row['Label'], 'Numbers': element})

new_df = pd.DataFrame(rows_list)
The result is
    Label Numbers
0       0       a
1       0       b
2       0       c
3       0       d
4       0       e
5       0       f
6       0       g
7       1       a
8       1       z
9       1       d
10      1       x
11      1       y
Now it's a matter of pivoting:
print(new_df.pivot_table(index='Numbers', columns='Label', aggfunc=len,
                         fill_value=0))
The result is
Label    0  1
Numbers
a        1  1
b        1  0
c        1  0
d        1  1
e        1  0
f        1  0
g        1  0
x        0  1
y        0  1
z        0  1
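As an aside, on newer pandas versions (0.25+) the reshape-and-count can be written more compactly with explode and crosstab. A minimal sketch, assuming the 'Numbers' column really holds a one-element list containing a tuple, as shown above:

import pandas as pd

df = pd.DataFrame({'Label': [0, 0, 0, 1, 1],
                   'Numbers': [[('a', 'b', 'c')], [('d',)], [('e', 'f', 'g')],
                               [('a', 'z')], [('d', 'x', 'y')]]})

# Unwrap the single tuple in each list, give each word its own row,
# then cross-tabulate words against labels.
long = df.assign(Numbers=df['Numbers'].str[0]).explode('Numbers')
print(pd.crosstab(long['Numbers'], long['Label']))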
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
   A  B  C  D
0  0  0  0  0
1  0 -1  1  0
2  1 -1  1  0
3  1 -1  1  0
So, reading down each column: every value starts at 0 in the first row; once it changes it can never change back, and it can only become 1 or -1. I would like to rearrange the DataFrame columns into this order:
Columns that hit 1, ordered by how early (which row) they do so
Columns that hit -1, ordered by how early they do so
Finally, any remaining columns that never changed and stayed at zero (if there are even any left)
Desired Output:
   C  A  B  D
0  0  0  0  0
1  1  0 -1  0
2  1  1 -1  0
3  1  1 -1  0
My main data frame is 3000 rows by 61 columns; is there any way of doing this quickly?
We have to handle the positive and negative values separately. One way is to take the sum of each column and then adjust the ordering with sort_values. Because a value never reverts once it changes, a column that hits 1 earlier accumulates a larger positive sum (and a column that hits -1 earlier a more negative one), so the sums encode exactly the ordering we need:
a = df.sum().sort_values(ascending=False)   # column sums, largest first
# positives (descending), then negatives (most negative first), then zeros
b = pd.concat((a[a.gt(0)], a[a.lt(0)].sort_values(), a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
   C  A  B  D
0  0  0  0  0
1  1  0 -1  0
2  1  1 -1  0
3  1  1 -1  0
Or try pd.Series.first_valid_index, which finds the first row where each column is nonzero; pairing it with the sign of that first value gives the sort key directly:
s = df.where(df.ne(0))                      # mask zeros as NaN
s1 = s.apply(pd.Series.first_valid_index)   # first row with a nonzero value
s2 = s.bfill().iloc[0]                      # the first nonzero value itself
key = pd.concat([s2, s1], axis=1, keys=[0, 1])
out = df.loc[:, key.sort_values([0, 1], ascending=[False, True]).index]
out
Out[35]:
   C  A  B  D
0  0  0  0  0
1  1  0 -1  0
2  1  1 -1  0
3  1  1 -1  0
I have a dataset with several columns and a row named "Total" that stores values between 1 and 4.
I want to iterate over each column and, based on the number stored in the row "Total", add a new row with "yes" or "no".
I also have a list "columns" for iteration.
All data are float64.
I'm new to Python and I don't know if I'm doing this the right way, because I'm getting all "yes":
for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA'] = "yes"
    else:
        dados_por_periodo.loc['VA'] = "no"
My dataset: (posted as an image, not reproduced here)
Thanks.
You can try this, I hope it works for you:
import pandas as pd
import numpy as np

# Creation of a dummy df
columns = 'A B C D E F G H I'.split()
data = [np.random.choice(2, len(columns)).tolist() for col in range(3)]
data.append([1, 8, 1, 1, 2, 4, 1, 4, 1])  # not a real sum of values, just dummy values for testing
index = ['Otono', 'Inverno', 'Primavera', 'Totals']
df = pd.DataFrame(data, columns=columns, index=index)
df.index.name = 'periodo'  # just adding an index name
print(df)

# Addition of the new 'Yes'/'No' row
va = pd.DataFrame([np.where(df.iloc[-1].lt(4), 'Yes', 'No')],
                  columns=df.columns, index=['VA'])
df = pd.concat([df, va])
df.index.name = 'periodo'  # concat drops the index name, so set it again
print(df)
Output:
df
           A  B  C  D  E  F  G  H  I
periodo
Otono      1  0  0  1  1  0  1  1  0
Inverno    0  1  1  1  0  1  1  1  1
Primavera  1  1  0  0  1  1  1  1  0
Totals     1  8  1  1  2  4  1  4  1
df (with the added row)
             A   B    C    D    E   F    G   H    I
periodo
Otono        1   0    0    1    1   0    1   1    0
Inverno      0   1    1    1    0   1    1   1    1
Primavera    1   1    0    0    1   1    1   1    0
Totals       1   8    1    1    2   4    1   4    1
VA         Yes  No  Yes  Yes  Yes  No  Yes  No  Yes
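For what it's worth, the reason the original loop produced all "yes" is that dados_por_periodo.loc['VA'] = "yes" assigns the whole 'VA' row on every iteration, so whichever branch runs last wins for every column. Indexing the individual cell fixes the loop; a sketch using the row and column names from the question:

import numpy as np

for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA', c] = 'yes'   # set only this column's cell
    else:
        dados_por_periodo.loc['VA', c] = 'no'

# or, without a loop, assign the whole row in one vectorized step:
dados_por_periodo.loc['VA'] = np.where(dados_por_periodo.loc['Total'].lt(4), 'yes', 'no')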
Also, next time try to include a data sample instead of images of the dataset, so others can help you more easily :)
I have a data frame with many rows; for illustration I'll use the following sample:
df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])
This results in:
   A  B  C  D
0  2  1  3  3
1  2  3  3  4
2  4  1  3  2
I would like to get a new dataframe consisting of the pair-wise equality results between the original dataframe's rows.
I expect the following result:
   A  B  C  D
0  1  0  1  0
1  0  1  1  0
2  0  0  1  0
where:
index 0 is row 0 vs row 1,
index 1 is row 0 vs row 2,
index 2 is row 1 vs row 2
A naive way to implement this would be:
new_df = pd.DataFrame()
for i in range(0, len(df)-1):
    for j in range(i+1, len(df)):
        new_df = new_df.append(df.iloc[i,:] == df.iloc[j,:], ignore_index=True)
Is there any efficient way to implement this operation?
This will do what you want:
import pandas as pd
from itertools import combinations

df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D'])

# All unique pairs of row indices: (0, 1), (0, 2), (1, 2)
combos = list(combinations(df.index, 2))
newData = {'{} v {}'.format(*combo):
               (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict()
           for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
#        A  C  B  D
# 0 v 1  1  1  0  0
# 0 v 2  0  1  1  0
# 1 v 2  0  1  0  0
It takes the unique combinations of the index values, paired two at a time, and builds one output row per pair.
If you wish to reuse this data, the following makes the resulting df easier to query, since the tuple keys become a MultiIndex:
newData = {combo: (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict() for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
#      A  C  B  D
# 0 1  1  1  0  0
#   2  0  1  1  0
# 1 2  0  1  0  0
And to get the result in accordance with your latest request use:
newData = [(df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict() for combo in combos]
pd.DataFrame(newData)
#    A  B  C  D
# 0  1  0  1  0
# 1  0  1  1  0
# 2  0  0  1  0
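If performance matters, the same pairwise comparison can be done in a single vectorized NumPy operation instead of per-pair .iloc lookups; a sketch built on the same combinations idea:

import numpy as np
from itertools import combinations

# Index arrays for the left and right rows of every pair.
pairs = list(combinations(range(len(df)), 2))
i, j = map(list, zip(*pairs))

# One elementwise comparison over all pairs at once.
out = pd.DataFrame((df.values[i] == df.values[j]).astype(int),
                   columns=df.columns)
print(out)
#    A  B  C  D
# 0  1  0  1  0
# 1  0  1  1  0
# 2  0  0  1  0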
I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a null DataFrame... This is definitely not the way to do it; I am new to pandas. The above should probably use df[['A','B','C']].drop_duplicates() to get all the unique values first.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates()        # get the unique values of A, B, C
df2 = df2.reset_index(drop=True).reset_index()   # reset the index to create a column named 'index'
df2 = df2.rename(columns={'index': 'ID'})        # rename 'index' to 'ID'
df = pd.merge(df, df2, on=['A','B','C'], how='left')  # append the ID column with a merge
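If your pandas version has it (0.20.2+), groupby().ngroup() collapses those four lines into one; with sort=False the groups are numbered in order of first appearance, which matches the desired output:

# Label each unique (A, B, C) combination, numbered by first appearance.
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()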
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
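Note that the sort-and-cumsum approach numbers the keys in sorted order rather than in order of first appearance (the two happen to coincide in this example). If appearance order is what you need, pd.factorize gives it directly; a small sketch:

# Build one hashable key per row, then factorize: codes are assigned
# in order of first appearance, so no sorting is needed.
key = df[['A', 'B', 'C']].apply(tuple, axis=1)
df['ID'] = pd.factorize(key)[0]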
I have seen similar questions, but nothing that really matches my problem. If I have a table of values such as:
value
a
b
b
c
I want to use pandas to add columns to the table to show, for example:
value a b
a 1 0
b 0 1
c 0 0
I have tried the following:
df['a'] = 0
def string_count(indicator):
    if indicator == 'a':
        df['a'] == 1
df['a'].apply(string_count)
But this produces:
0 None
1 None
2 None
3 None
I would like to at least get to the point where the choices are hardcoded (i.e. I already know that a, b and c appear), but it would be even better if I could take the column of strings and insert a column for each unique string it contains.
Am I approaching this the wrong way?
dummies = pd.get_dummies(df.value)
a b c
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
If you only want to display unique occurrences, you can add:
dummies.index = df.value
dummies.drop_duplicates()
a b c
value
a 1 0 0
b 0 1 0
c 0 0 1
Alternatively:
df = df.join(pd.get_dummies(df.value))
value a b c
0 a 1 0 0
1 b 0 1 0
2 b 0 1 0
3 c 0 0 1
where you could again call .drop_duplicates() to see only the unique entries from the value column.
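And if you only want one row per unique string, as in the expected output above, you can also aggregate the dummies directly; a small sketch of the same idea:

# Collapse the dummies to one row per unique value.
print(pd.get_dummies(df['value']).groupby(df['value']).max())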