How to delete rows when iterating over pandas DataFrame columns? - python

This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time I need to select two columns and check this condition: if both values are 0, I delete the row. So, for example, if I select A and B:
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
And then I select A and C, and so on.
I used this code for A and B, but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a = df['A']
b = df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a == 0) & (b == 0)]:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)
Any help please!

First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7

Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
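Reconstructing the sample CSV from the question, the boolean-mask approach can be tried end to end (the DataFrame literal below is assumed to match Book1.csv):

```python
import pandas as pd

# Rebuild the question's sample data in code (assumed to match Book1.csv)
df = pd.DataFrame({'A': [0, 0, 1, 0, 0],
                   'B': [1, 0, 1, 0, 0],
                   'C': [0, 0, 1, 0, 7],
                   'D': [0, 0, 0, 0, 0],
                   'J': [0, 0, 0, 0, 7]})

cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)  # True where at least one of A, B is nonzero
result = df.loc[m]
print(result)
```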
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7

Related

Write value to column if row string contain column name

I have a dataframe which contains many pre-defined column names. One column of this dataframe contains the name of these columns.
I want to write the value 1 where the string name is equal to the column name.
For example, I have this current situation:
df = pd.DataFrame(0,index=[0,1,2,3],columns = ["string","a","b","c","d"])
df["string"] = ["b", "b", "c", "a"]
string a b c d
------------------------------
b 0 0 0 0
b 0 0 0 0
c 0 0 0 0
a 0 0 0 0
And this is what I would like the desired result to be like:
string a b c d
------------------------------
b 0 1 0 0
b 0 1 0 0
c 0 0 1 0
a 1 0 0 0
You can use get_dummies on df['string'] and update the DataFrame in place:
df.update(pd.get_dummies(df['string']))
updated df:
string a b c d
0 b 0 1 0 0
1 b 0 1 0 0
2 c 0 0 1 0
3 a 1 0 0 0
You can also use this pattern:
df.loc[df["column_name"] == "some_value", "column_name"] = "value"
In your case
df.loc[ df["string"] == "b", "b"] = 1
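To cover every column rather than just "b", the same .loc assignment can be wrapped in a loop over the value columns (a minimal sketch using the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame(0, index=[0, 1, 2, 3], columns=["string", "a", "b", "c", "d"])
df["string"] = ["b", "b", "c", "a"]

# Apply the .loc assignment once per value column
for col in ["a", "b", "c", "d"]:
    df.loc[df["string"] == col, col] = 1
print(df)
```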

Create new column from existing column and add values at a certain interval

I am trying to create a new column based on my first column. For example,
I have a list a = ["A", "B", "C"] and an existing dataframe
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add values to it based on the W, B, H interval, so that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The actual data is longer than this.
I have tried all possible ways and I couldn't come up with working code. I will be glad if someone can help and also explain the process. Thanks
Here is what you can do:
Use a loop to build a repeating list for the new column:
for i in range(len(dataframe['Race'])):
    # append the appropriate label to the list here
Once you have that list you can add it to the DataFrame by using:
dataframe['New Column'] = values_list
maybe this works..
labels = ['A', 'B', 'C', ...]
i = -1
for idx, row in dataframe.iterrows():
    if row['Race'] == 'W':
        i += 1
    dataframe.loc[idx, 'new column'] = labels[i]
also, if the new-column list is too long to type out, you can use a list comprehension:
labels = [x for x in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
If your W, B, H values are in this exact order and form complete intervals, you may use np.repeat. As in your comment, np.repeat alone would suffice.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
    (df['Race'] == 'W')             # True (1) for W; False (0) otherwise
    .cumsum()                       # increments each time you hit True (1)
    .map({1: 'A', 2: 'B', 3: 'C'})  # 1 -> A, 2 -> B, ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
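If the data runs past three intervals, the hard-coded map above can be swapped for a lookup built from the alphabet (a sketch; assumes at most 26 intervals):

```python
import string
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 4})

# Map interval number 1 -> 'A', 2 -> 'B', ... without typing the dict by hand
labels = dict(enumerate(string.ascii_uppercase, start=1))
df['New Col'] = (df['Race'] == 'W').cumsum().map(labels)
print(df)
```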
There are multiple ways to solve this problem. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
# setting up the DataFrame referred to in the example
import pandas as pd
df = pd.DataFrame({'Race': ['W', 'B', 'H', 'W', 'B', 'H', 'W', 'B', 'H'],
                   'Boy':  [0, 1, 1, 1, 0, 0, 1, 1, 0],
                   'Girl': [1, 0, 1, 0, 0, 1, 0, 1, 1]})
# if you have 3 values to assign, create a list, say A, B, C
# by creating a list, you only have to manage the list and the frequency
a = ['A', 'B', 'C']
# iterate through the dataframe and assign the values in batches
for i, row in df.iterrows():          # the trick is to assign to loc[i]
    df.loc[i, 'New'] = a[int(i / 3)]  # i is the index; dividing by 3 distributes equally
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 17,
                   'Boy':  [0, 1, 1] * 17,
                   'Girl': [1, 0, 1] * 17})
# in the DataFrame, you can define the Boy and Girl values
# the Race values repeat, so I just repeated them 17 times
# define a variable from A through Z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for i, row in df.iterrows():
    df.loc[i, 'New'] = a[int(i / 3)]  # still dividing into equal batches of 3
print(df)
I didn't print all 17 sets; the output below shows the first 7, but the result is the same throughout.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
The old Pythonic fashion: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    race_col = iter(race_col)
    abc = iter(abc)
    new_col = []
    while True:
        try:
            race = next(race_col)
        except StopIteration:
            break
        if race == 'W':
            abc_value = next(abc)
        new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C

Convert python dictionary to dataframe with dict values(list) as columns and 1,0 if that column is in dict list

I want to create a dataframe from a dictionary which is of the format
Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}
I am using
df = pd.DataFrame.from_dict(Dictionary_, orient ='index')
But it creates columns up to the max length of the value lists and puts the dictionary values themselves into the DataFrame.
I want a df with keys as rows and values as columns like
a b c d e f m n
Key 1 1 1 1 1 0 0 0 0
Key 2 0 0 0 1 0 1 0 0
Key 3 1 0 1 0 0 0 1 1
I can do it by collecting all the dict values, creating an empty DataFrame with the dict keys as rows and the values as columns, and then iterating over each row to put a 1 wherever the value matches the column. But this will be too slow, since my data has 200,000 rows and .loc is slow. I feel I can use pandas dummies somehow, but I don't know how to apply it here.
I feel there will be a smarter way to do this.
If performance is important, use MultiLabelBinarizer and pass keys and values:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_,
                  index=Dictionary_.keys())
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative, but slower: create a Series, then use str.join to build strings, and last call str.get_dummies:
df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative with an input DataFrame - use pandas.get_dummies, but then it is necessary to aggregate the max per column:
df1 = pd.DataFrame.from_dict(Dictionary_, orient ='index')
df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print (df)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
Use get_dummies:
>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
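On newer pandas (0.25+), Series.explode plus crosstab is another compact option; this is a sketch, not one of the answers above:

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],
               'Key2': ['d', 'f'],
               'Key3': ['a', 'c', 'm', 'n']}

# One row per (key, value) pair, then tabulate into an indicator matrix
s = pd.Series(Dictionary_).explode()
df = pd.crosstab(s.index, s)
print(df)
```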

Pandas select on multiple columns then replace

I am trying to do a multiple column select then replace in pandas
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
Select rows where any of a, b, c, d are nonzero:
i, j = np.where(df)
s = pd.Series(dict(zip(zip(i, j),
                       df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the lengths of the series are different to the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for a product with the column names, then rstrip the trailing separator, and last use numpy.where to replace empty strings with None:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df == 1).apply(lambda x: df.columns[x], axis=1)\
                   .str.join(",").replace('', 'none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
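For completeness, here is a self-contained run of the dot-product approach, rebuilding the question's frame in code (the 0/1 values are assumed from the example; multiplying them by the column-name strings concatenates the matching names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0],
                   'b': [1, 0, 0, 0],
                   'c': [1, 0, 0, 0],
                   'd': [0, 1, 0, 0]})

# 1 * 'b, ' -> 'b, ', 0 * 'b, ' -> '', and dot sums the pieces per row
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print(df)
```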

Changing values in multiple columns of a pandas DataFrame using known column values

Suppose I have a dataframe like this:
Knownvalue A B C D E F G H
17.3413 0 0 0 0 0 0 0 0
33.4534 0 0 0 0 0 0 0 0
What I want to do is: when Knownvalue is between 0-10, A is changed from 0 to 1; when Knownvalue is between 10-20, B is changed from 0 to 1; and so forth.
It should be like this after changing:
Knownvalue A B C D E F G H
17.3413 0 1 0 0 0 0 0 0
33.4534 0 0 0 1 0 0 0 0
Anyone know how to apply a method to change it?
I first bucket the Knownvalue Series into a list of integers equal to its truncated value divided by ten (e.g. 27.87 // 10 = 2). These buckets represent the integer for the desired column location. Because the Knownvalue is in the first column, I add one to these values.
Next, I enumerate through these bin values, which effectively gives me tuple pairs of row and column integer indices. I use iat to set the value of these locations to 1.
import pandas as pd
import numpy as np
# Create some sample data.
df_vals = pd.DataFrame({'Knownvalue': np.random.random(5) * 50})
df = pd.concat([df_vals, pd.DataFrame(np.zeros((5, 5)), columns=list('ABCDE'))], axis=1)
# Create desired column locations based on the `Knownvalue`.
bins = (df.Knownvalue // 10).astype('int').tolist()
>>> bins
[4, 3, 0, 1, 0]
# Set these locations equal to 1.
for idx, col in enumerate(bins):
df.iat[idx, col + 1] = 1 # The first column is the `Knownvalue`, hence col + 1
>>> df
Knownvalue A B C D E
0 47.353937 0 0 0 0 1
1 37.460338 0 0 0 1 0
2 3.797964 1 0 0 0 0
3 18.323131 0 1 0 0 0
4 7.927030 1 0 0 0 0
A different approach would be to reconstruct the frame from the Knownvalue column using get_dummies:
>>> import string
>>> new_cols = pd.get_dummies(df["Knownvalue"]//10).loc[:,range(8)].fillna(0)
>>> new_cols.columns = list(string.ascii_uppercase)[:len(new_cols.columns)]
>>> pd.concat([df[["Knownvalue"]], new_cols], axis=1)
Knownvalue A B C D E F G H
0 17.3413 0 1 0 0 0 0 0 0
1 33.4534 0 0 0 1 0 0 0 0
get_dummies does the hard work:
>>> (df.Knownvalue//10)
0 1
1 3
Name: Knownvalue, dtype: float64
>>> pd.get_dummies((df.Knownvalue//10))
1 3
0 1 0
1 0 1
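The two-row example can also be handled end to end with get_dummies plus reindex, which guarantees all eight columns A-H exist even when some buckets are empty (a sketch; reindex fills the missing buckets with 0):

```python
import string
import pandas as pd

df = pd.DataFrame({'Knownvalue': [17.3413, 33.4534]})

bins = (df['Knownvalue'] // 10).astype(int)          # 17.34 -> 1, 33.45 -> 3
dummies = pd.get_dummies(bins).reindex(columns=range(8), fill_value=0)
dummies.columns = list(string.ascii_uppercase[:8])   # buckets 0..7 -> A..H
out = pd.concat([df, dummies], axis=1)
print(out)
```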
