I have a CSV file with many columns in it. Let me give you an example.
A B C D
1 1 0
1 1 1
0 0 0
I want to do this:
if col-A first-row value == 1 AND col-B first-row value == 1 AND col-C first-row value == 1:
    then put "FIC" in the first row of col-D
else:
    put "PI" in the first row of col-D
I am using pandas.
There are more than 1500 rows, and I want to do this for every row. How can I do this? Please help.
If you need to test whether all values in the filtered columns are 1, use:
df['D'] = np.where(df[['A','B','C']].eq(1).all(axis=1), 'FIC','PI')
Or, if the filtered columns contain only 0/1 values:
df['D'] = np.where(df[['A','B','C']].all(axis=1), 'FIC','PI')
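For reference, a minimal self-contained sketch using the sample data from the question (in practice the frame would come from pd.read_csv):
import numpy as np
import pandas as pd

# sample data taken from the question's example
df = pd.DataFrame({'A': [1, 1, 0], 'B': [1, 1, 0], 'C': [0, 1, 0]})

# 'FIC' where A, B and C are all 1, otherwise 'PI'
df['D'] = np.where(df[['A', 'B', 'C']].eq(1).all(axis=1), 'FIC', 'PI')
print(df)
#    A  B  C    D
# 0  1  1  0   PI
# 1  1  1  1  FIC
# 2  0  0  0   PI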
EDIT:
print (df)
A B C D
0 1 1 NaN NaN
1 1 1 1.0 NaN
2 0 0 0.0 NaN
m1 = df[['A','B','C']].all(axis=1)
m2 = df[['A','B','C']].isna().any(axis=1)
df['D'] = np.select([m2, m1], ['ZD', 'FIC'],'PI')
print (df)
A B C D
0 1 1 NaN ZD
1 1 1 1.0 FIC
2 0 0 0.0 PI
Without numpy you can use:
df['D'] = df[['A', 'B', 'C']].astype(bool).all(axis=1).replace({True: 'FIC', False: 'PI'})
print(df)
# Output
A B C D
0 1 1 0 PI
1 1 1 1 FIC
2 0 0 0 PI
Related
How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
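As an aside (not part of the original answer), newer pandas versions also accept a dtype argument if you prefer plain 0/1 integers over floats:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1, 2, 3]})

# dtype=int yields 0/1 instead of 0.0/1.0 (available in newer pandas versions)
pd.get_dummies(data=df, columns=['A', 'B'], dtype=int)
#    C  A_a  A_b  B_b  B_c
# 0  1    1    0    0    1
# 1  2    0    1    0    1
# 2  3    1    0    1    0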
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assume you have a dataframe named df with columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
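For completeness, a runnable sketch of the patsy approach (the Name/Year values here are made up). Note that by default patsy adds an Intercept column and uses treatment coding, so the first level of each factor is dropped:
import pandas as pd
import patsy  # requires the patsy package

df = pd.DataFrame({'Name': ['a', 'b', 'a'], 'Year': [2001, 2002, 2001]})

# treatment coding: one column per non-reference level, plus an Intercept
patsy.dmatrix('~ C(Name) + C(Year)', df, return_type='dataframe')
#    Intercept  C(Name)[T.b]  C(Year)[T.2002]
# 0        1.0           0.0              0.0
# 1        1.0           1.0              1.0
# 2        1.0           0.0              0.0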
Unless I misunderstand the question, this is supported natively in get_dummies by passing the columns argument.
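For example (column names borrowed from the answer above, the data itself is made up):
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b', 'a'],
                   'Year': [2001, 2002, 2001],
                   'Value': [10, 20, 30]})

# only 'Name' and 'Year' are encoded; 'Value' is passed through unchanged
pd.get_dummies(df, columns=['Name', 'Year'])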
The simple trick I am currently using is a for loop. First separate the categorical data from the DataFrame using select_dtypes(include="object"), then apply get_dummies to each column iteratively in a for loop, as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
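A more compact variant of the same idea (a sketch with made-up data, not part of the original code) is to pass the selected column names straight to get_dummies:
import pandas as pd

# hypothetical training data; only the object (categorical) columns get encoded
train_data = pd.DataFrame({'color': ['red', 'blue', 'red'],
                           'size': ['S', 'M', 'L'],
                           'price': [10, 20, 30]})

cat_cols = train_data.select_dtypes(include='object').columns
train_encoded = pd.get_dummies(train_data, columns=cat_cols)
print(train_encoded)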
I have a DataFrame
df = columnA=[1,2,3,4,5,6]
columnB=['Apple AA','Banana BB',NaN,'Strawberry DD',NaN,'Blueberry EE']
I want to create a new column that indicates whether columnB contains a value:
df = columnA=[1,2,3,4,5,6]
columnB=['Apple AA','Banana BB',NaN,'Strawberry DD',NaN,'Blueberry EE']
columnC=[1,1,0,1,0,1]
My code:
df[columnC] = df[columnB].map(lambda x: 1 if len(x) > 0 else 0 if len(x) == 0)
Or
columnC = np.repeat(0, df.shape[0])
for i in df:
    if len(df[columnB]) > 0:
        df[columnC] = 1
Neither is working.
You can use .notnull() to test whether your values are not NaN:
import pandas as pd
import numpy as np
df = pd.DataFrame({'columnA':[1,2,3,4,5,6],
                   'columnB':['Apple AA','Banana BB',np.nan,'Strawberry DD',np.nan,'Blueberry EE']})
df['columnC'] = df['columnB'].notnull()*1
The multiplication by 1 is used to convert booleans to binary values.
Also be careful not to forget quotes around your column names.
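Equivalently, if you prefer an explicit cast over multiplying by 1, astype(int) does the same thing (continuing from the df defined above):
df['columnC'] = df['columnB'].notnull().astype(int)
print(df)
#    columnA        columnB  columnC
# 0        1       Apple AA        1
# 1        2      Banana BB        1
# 2        3            NaN        0
# 3        4  Strawberry DD        1
# 4        5            NaN        0
# 5        6   Blueberry EE        1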
Here is the code:
df = pd.DataFrame({'A': [1,2,3,4,5,6], 'B': ['Apple', 'Ba', np.nan, 'St', np.nan, 'e']})
df['C'] = df['B'].isna()
A B C
0 1 Apple False
1 2 Ba False
2 3 NaN True
3 4 St False
4 5 NaN True
5 6 e False
Then convert the boolean values to 0/1:
df['C'] = df['C'].apply(lambda x: 1 if not x else 0)
A B C
0 1 Apple 1
1 2 Ba 1
2 3 NaN 0
3 4 St 1
4 5 NaN 0
5 6 e 1
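The same two steps can also be collapsed into one line (a minor variation on the answer above, continuing from the same df):
df['C'] = (~df['B'].isna()).astype(int)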
You could try the following, using np.where from numpy:
df['C']=np.where(df['B'].notnull(),1,0)
The output will be as follows:
   A              B  C
0  1       Apple AA  1
1  2      Banana BB  1
2  3            NaN  0
3  4  Strawberry DD  1
4  5            NaN  0
5  6   Blueberry EE  1
This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time I need to select two columns and check this condition: if both values are 0, I delete the row. For example, I select A and B.
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
And then I select A and C, and so on.
I used this code for A and B, but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a=df['A']
b=df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a==0) & (b==0)]:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)
Any help please!
First we build the desired combinations of column A with each of the other columns, then we use iloc to select the correct rows for each column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
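The same mask-based idea generalizes to any pair of columns; a small helper function (hypothetical, not from the original answer) makes that explicit:
def keep_nonzero(df, cols):
    """Keep the rows where at least one of the given columns is non-zero."""
    return df.loc[df[cols].ne(0).any(axis=1)]

keep_nonzero(df, ['A', 'B'])
keep_nonzero(df, ['A', 'C'])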
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in list(combinations(df.columns, 2))}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following, but it puts each repeating value in its own column:
df.groupby('data')
This seems like a pivot problem, but you are missing both the column key (created by cumcount) and the index key (created by factorize), which makes it harder to see at first:
pd.crosstab(pd.factorize(df.data)[0],df.groupby('data').cumcount(),df.data,aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b
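The same idea can also be written as a pivot built from the factorize/cumcount keys (a sketch, equivalent to the crosstab above, continuing from the same df):
out = (df.assign(row=pd.factorize(df['data'])[0],
                 col=df.groupby('data').cumcount())
         .pivot(index='row', columns='col', values='data'))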
Something like
index = ((df['data'] != df['data'].shift()).cumsum() - 1).rename('')
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
I have a dataframe that looks like this one:
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A','B','C'])
df.iloc[0,0] = 'a'
df.iloc[1,0] = 'b'
df.iloc[1,1] = 'c'
df.iloc[2,0] = 'b'
df.iloc[3,0] = 'c'
df.iloc[3,1] = 'b'
df.iloc[3,2] = 'd'
df
Out:
   A    B    C
0  a  NaN  NaN
1  b    c  NaN
2  b  NaN  NaN
3  c    b    d
And I would like to add new columns to it whose names are the values inside the dataframe (here 'a', 'b', 'c', and 'd'). Those columns are binary and reflect whether the values 'a', 'b', 'c', and 'd' appear in the row.
In one picture, the output I'd like is:
A B C a b c d
0 a NaN NaN 1 0 0 0
1 b c NaN 0 1 1 0
2 b NaN NaN 0 1 0 0
3 c b d 0 1 1 1
To do this I first create the columns filled with zeros:
cols = pd.Series(df.values.ravel()).value_counts().index
for col in cols:
    df[col] = 0
(It doesn't create the columns in the right order, but that doesn't matter)
Then I...use a loop over the rows and columns...
for row in df.index:
    for col in cols:
        if col in df.loc[row].values:
            df.loc[row, col] = 1
You'll see why I'm looking for another way to do it: even though my dataframe is relatively small (76k rows), this still takes around 8 minutes, which is far too long.
Any idea?
You're looking for get_dummies. Here I choose to use the .str version:
df.fillna('', inplace=True)
(df.A + '|' + df.B + '|' + df.C).str.get_dummies()
Output:
a b c d
0 1 0 0 0
1 0 1 1 0
2 0 1 0 0
3 0 1 1 1
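If you want those indicator columns attached to the original frame, as in the desired output from the question, you can join the result back (a small follow-up, not part of the original answer):
dummies = (df.A + '|' + df.B + '|' + df.C).str.get_dummies()
df = df.join(dummies)
print(df)
#    A  B  C  a  b  c  d
# 0  a        1  0  0  0
# 1  b  c     0  1  1  0
# 2  b        0  1  0  0
# 3  c  b  d  0  1  1  1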