Problems iterating over columns on a condition - Python

I have a dataset with several columns and a row named "Total" that stores values between 1 and 4.
I want to iterate over each column and, based on the number stored in the "Total" row, add a new row with "yes" or "no".
I also have a list "columns" for the iteration.
All data are float64.
I'm new to Python and I don't know if I'm doing this the right way, because I'm getting all "yes":
for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA'] = "yes"
    else:
        dados_por_periodo.loc['VA'] = "no"
My dataset:
Thanks.

You can try this, I hope it works for you:
import pandas as pd
import numpy as np

# Creation of a dummy df
columns = 'A B C D E F G H I'.split()
data = [np.random.choice(2, len(columns)).tolist() for col in range(3)]
data.append([1, 8, 1, 1, 2, 4, 1, 4, 1])  # not the real sums, just dummy values for testing
index = ['Otono', 'Inverno', 'Primavera', 'Totals']
df = pd.DataFrame(data, columns=columns, index=index)
df.index.name = 'periodo'  # just adding an index name
print(df)

# Addition of the new 'yes'/'no' row
df = pd.concat([df, pd.DataFrame([np.where(df.iloc[-1, :].lt(4), 'Yes', 'No')],
                                 columns=df.columns, index=['VA'])])
df.index.name = 'periodo'
print(df)
Output:
df
           A  B  C  D  E  F  G  H  I
periodo
Otono      1  0  0  1  1  0  1  1  0
Inverno    0  1  1  1  0  1  1  1  1
Primavera  1  1  0  0  1  1  1  1  0
Totals     1  8  1  1  2  4  1  4  1

df (with the added row)
             A    B    C    D    E    F    G    H    I
periodo
Otono        1    0    0    1    1    0    1    1    0
Inverno      0    1    1    1    0    1    1    1    1
Primavera    1    1    0    0    1    1    1    1    0
Totals       1    8    1    1    2    4    1    4    1
VA         Yes   No  Yes  Yes  Yes   No  Yes   No  Yes
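As a side note on the original loop: dados_por_periodo.loc['VA'] = "yes" assigns that single string to every cell of the 'VA' row, so each pass through the loop overwrites the whole row, and only the last column's condition survives, which is likely why everything comes out "yes". A minimal sketch of the loop with a per-cell assignment instead (same variable names as the question):
for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA', c] = "yes"  # .loc with a (row, column) pair writes one cell
    else:
        dados_por_periodo.loc['VA', c] = "no"
The same row can also be added in one vectorized step, e.g. df.loc['VA'] = np.where(df.loc['Totals'].lt(4), 'Yes', 'No') for the dummy df above.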
Also, try to include a data sample next time instead of images of the dataset, so people can help you more effectively :)

Related

How to iterate over a pandas dataframe column and delete specific rows?

This is my csv file:
A B C D
0 1 5 5
1 0 3 0
0 0 0 0
2 1 3 4
I want to check the B column: if I find a 0, I delete the whole row, so this is what I need as output:
A B C D
0 1 5 5
2 1 3 4
I tried this code:
import pandas as pd

df = pd.read_csv('Book1.csv', sep=',', error_bad_lines=False, dtype='unicode')
for index, row in df.iterrows():
    if row['B'] == 0:
        df.drop(row, index=False)
df.to_csv('hello.csv')
It returns:
A B C D
0 0 1 5 5
1 1 0 3 0
2 0 0 0 0
3 2 1 3 4
It did not delete anything and I don't know where the problem is.
Any help please!
You could check which rows in B are not equal to 0, and perform boolean indexing with the result:
df[df.B.ne(0)]
A B C D
0 0 1 5 5
3 2 1 3 4
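For completeness, a minimal end-to-end sketch under the question's setup (a file named Book1.csv with these four columns); note it omits dtype='unicode' so that B is read as a numeric column:
import pandas as pd

df = pd.read_csv('Book1.csv')        # let pandas infer numeric dtypes
df = df[df.B.ne(0)]                  # keep only the rows where B is not 0
df.to_csv('hello.csv', index=False)  # write without the extra index column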
Note that in your approach, in order to drop a given row, you need to specify the index, so you should be doing something like:
for index, row in df.iterrows():
    if row['B'] == 0:
        df.drop(index, inplace=True)
df.to_csv('hello.csv')
Also, don't forget to reassign the result after dropping rows; either set inplace=True or assign the result back to df. One more caveat: since the file was read with dtype='unicode', the values in B are strings, so the comparison row['B'] == 0 is never True. Compare against '0' instead, or drop the dtype argument so pandas infers numeric types.

Frequency distribution of list against another column

I am trying to get the frequency distribution of a column that contains lists of words, against the class labels.
Label Numbers
0 [(a,b,c)]
0 [(d)]
0 [(e,f,g)]
1 [(a,z)]
1 [(d,x,y)]
The output should be:
0 1
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
The list of tuples in the 'Numbers' column makes manipulating the DataFrame as-is very difficult (this is not tidy data). The solution is to expand the DataFrame so that each row holds a single value in the 'Numbers' column paired with one value in the 'Label' column. Assuming your data is in a DataFrame called df, the following code performs that operation:
rows_list = []
for index, row in df.iterrows():
    for element in row['Numbers'][0]:
        dict1 = {}
        dict1.update(key=row['Label'], value=element)
        rows_list.append(dict1)

new_df = pd.DataFrame(rows_list)
new_df.columns = ['Label', 'Numbers']
The result is
Label Numbers
0 0 a
1 0 b
2 0 c
3 0 d
4 0 e
5 0 f
6 0 g
7 1 a
8 1 z
9 1 d
10 1 x
11 1 y
Now it's a matter of pivoting:
print(new_df.pivot_table(index='Numbers', columns='Label', aggfunc=len,
                         fill_value=0))
The result is
Label 0 1
Numbers
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
See the first answer here for the last line of code.
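As an alternative sketch on more recent pandas (Series.explode was added in 0.25), you can unwrap the tuple inside each list, explode to one word per row, and cross-tabulate; the literal data below reconstructs the question's table:
import pandas as pd

df = pd.DataFrame({'Label': [0, 0, 0, 1, 1],
                   'Numbers': [[('a', 'b', 'c')], [('d',)], [('e', 'f', 'g')],
                               [('a', 'z')], [('d', 'x', 'y')]]})

# .str[0] pulls the tuple out of each single-element list; explode then yields one word per row.
s = df.assign(Numbers=df['Numbers'].str[0]).explode('Numbers')
print(pd.crosstab(s['Numbers'], s['Label']))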

pair-wise equality of pandas dataframe rows

I have a data frame with many rows, for illustration I'll use the following sample:
df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])
This results in:
A B C D
0 2 1 3 3
1 2 3 3 4
2 4 1 3 2
I would like to get a new dataframe consisting of the pair-wise equality results between the original dataframe's rows.
I expect to get the following result:
A B C D
0 1 0 1 0
1 0 1 1 0
2 0 0 1 0
where:
index 0 is row 0 vs row 1,
index 1 is row 0 vs row 2,
index 2 is row 1 vs row 2
A naive way to implement this would be:
new_df = pd.DataFrame()
for i in range(0, len(df) - 1):
    for j in range(i + 1, len(df)):
        new_df = new_df.append(df.iloc[i, :] == df.iloc[j, :], ignore_index=True)
Is there any efficient way to implement this operation?
This will do what you want:
import pandas as pd
from itertools import combinations

df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D'])
combos = list(combinations(df.index, 2))
newData = {'{} v {}'.format(*combo): (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict()
           for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
#        A  C  B  D
# 0 v 1  1  1  0  0
# 0 v 2  0  1  1  0
# 1 v 2  0  1  0  0
It takes the unique pairs of index values (combinations of two), then builds one row of equality results per pair.
And if you wish to reuse this data, the following keeps the raw index tuples, which makes the resulting df easier to query:
newData = {combo: (df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict()
           for combo in combos}
pd.DataFrame.from_dict(newData, orient='index')
#      A  C  B  D
# 0 1  1  1  0  0
#   2  0  1  1  0
# 1 2  0  1  0  0
And to get the result in accordance with your latest request use:
newData = [(df.iloc[combo[0]] == df.iloc[combo[1]]).astype(int).to_dict()
           for combo in combos]
pd.DataFrame(newData)
#    A  B  C  D
# 0  1  0  1  0
# 1  0  1  1  0
# 2  0  0  1  0
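For larger frames, here is a vectorized sketch that compares all the row pairs in a single NumPy step instead of building one pair at a time (same df as above):
import numpy as np
import pandas as pd
from itertools import combinations

df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D'])

vals = df.to_numpy()
i, j = zip(*combinations(range(len(df)), 2))
# Fancy-index both halves of every pair, then compare elementwise in one shot.
result = pd.DataFrame((vals[list(i)] == vals[list(j)]).astype(int), columns=df.columns)
print(result)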

Pandas DataFrame groupby to find unique rows and assign an increasing ID up to the number of groups

I have a DataFrame where a combination of column values (A, B, C) identifies a unique address. I would like to identify all such rows and assign them a unique identifier that increments per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0

def set_id(df):
    global id
    df['ID'] = id
    id += 1

df.groupby(['A','B','C']).transform(set_id)
This returns a null dataframe, and it is definitely not the way to do it; I am new to pandas. The approach should probably use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates()             # get the unique values of ABC
df2 = df2.reset_index(drop=True).reset_index()        # reset the index to create a column named 'index'
df2 = df2.rename(columns={'index': 'ID'})             # rename 'index' to 'ID'
df = pd.merge(df, df2, on=['A','B','C'], how='left')  # append the ID column with a merge
An alternative builds a sort key from the three columns and counts the changes in it:
# Create a tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort the dataframe on the new `key` column.
df.sort_values('key', inplace=True)
# A cumulative sum over changes in the key value gives a running group number.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
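On reasonably recent pandas (0.20.2 or later), groupby(...).ngroup() produces this labelling directly; a minimal sketch with the question's data:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 0],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': [1, 2, 1, 3, 2, 1],
                   'D': [0, 0, 1, 0, 1, 2],
                   'E': [1, 1, 1, 1, 0, 1]})

# ngroup() numbers each (A, B, C) group; sort=False keeps first-appearance order.
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()
print(df)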

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks whether (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this Excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But note that it doesn't look at the 'previous value of that same column'.
If you try to reference df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because the column does not exist yet.
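Since each value of c depends on the previous value of c, one option is a plain row-by-row loop; a minimal sketch of the Excel formula =IF(A70=A69,MIN((P70+Q69),1),P70) using the question's original data. Note the example columns hold strings, so they are converted to integers first:
import pandas as pd

data = pd.DataFrame({'a': ['1', '1', '1', '1', '1'],
                     'b': ['0', '0', '1', '0', '0']}).astype(int)

c = []
for i in range(len(data)):
    if i > 0 and data['a'].iat[i] == data['a'].iat[i - 1]:
        c.append(min(data['b'].iat[i] + c[-1], 1))  # same 'a' as the row above: accumulate, cap at 1
    else:
        c.append(data['b'].iat[i])                  # new 'a' value: restart from b
data['c'] = c
print(data)
This produces c = [0, 0, 1, 1, 1], matching the expected output in the question.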
