I have a pandas DataFrame that looks like the following example:
tags tag1 tag2 tag3
0 [a,b,c] 0 0 0
1 [a,b] 0 0 0
2 [b,d] 0 0 0
...
n [a,b,d] 0 0 0
I want to encode the tags as 1s in the tag1, tag2, tag3 columns if they are present in the tags array for that row index.
However, I can't quite figure out how to iterate over it properly; my idea so far is as follows:
for i, row in dataset.iterrows():
    for tag in row[0]:
        for column in range(1, 4):
            if dataset.iloc[:, column].index == tag:
                dataset.set_value(i, column, 1)
However, when the dataset is returned from this method, the columns are still all 0.
Thank you!
It seems you need:
astype to convert the column of lists to strings
str.strip to remove the []
str.get_dummies to create the indicator columns
df1 = df['tags'].astype(str).str.strip('[]').str.get_dummies(', ')
print (df1)
'a' 'b' 'c' 'd'
0 1 1 1 0
1 1 1 0 0
2 0 1 0 1
3 1 1 0 1
Finally, add df1 to the original DataFrame with concat:
df = pd.concat([df,df1], axis=1)
print (df)
tags tag1 tag2 tag3 'a' 'b' 'c' 'd'
0 [a, b, c] 0 0 0 1 1 1 0
1 [a, b] 0 0 0 1 1 0 0
2 [b, d] 0 0 0 0 1 0 1
3 [a, b, d] 0 0 0 1 1 0 1
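For the original goal of filling indicator columns in place rather than creating new ones, a minimal runnable sketch (assuming the lists contain the same strings used as column names) avoids iterrows entirely:

```python
import pandas as pd

df = pd.DataFrame({'tags': [['a', 'b', 'c'], ['a', 'b'], ['b', 'd'], ['a', 'b', 'd']]})

# one indicator column per tag; int(...) turns the membership test into a 0/1 flag
for tag in ['a', 'b', 'd']:
    df[tag] = df['tags'].apply(lambda tags: int(tag in tags))
print(df)
```

This is slower than str.get_dummies on large data, but it writes directly into named columns, which is what the question's loop was trying to do.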
Related
I have a dataframe which contains many pre-defined column names. One column of this dataframe contains the names of these columns.
I want to write the value 1 where the string name is equal to the column name.
For example, I have this current situation:
df = pd.DataFrame(0,index=[0,1,2,3],columns = ["string","a","b","c","d"])
df["string"] = ["b", "b", "c", "a"]
string a b c d
------------------------------
b 0 0 0 0
b 0 0 0 0
c 0 0 0 0
a 0 0 0 0
And this is what I would like the desired result to be like:
string a b c d
------------------------------
b 0 1 0 0
b 0 1 0 0
c 0 0 1 0
a 1 0 0 0
You can use get_dummies on df['string'] and update the DataFrame in place:
df.update(pd.get_dummies(df['string']))
updated df:
string a b c d
0 b 0 1 0 0
1 b 0 1 0 0
2 c 0 0 1 0
3 a 1 0 0 0
You can also use this pattern:
df.loc[df["column_name"] == "some_value", "column_name"] = "value"
In your case
df.loc[ df["string"] == "b", "b"] = 1
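A vectorized alternative (a sketch, assuming the candidate columns are known up front) compares the string column against all column names at once with NumPy broadcasting:

```python
import pandas as pd

df = pd.DataFrame(0, index=[0, 1, 2, 3], columns=["string", "a", "b", "c", "d"])
df["string"] = ["b", "b", "c", "a"]

cols = ["a", "b", "c", "d"]
# broadcast-compare each row's string against every candidate column name;
# the (n_rows, n_cols) boolean result becomes the 0/1 indicator block
df[cols] = (df["string"].to_numpy()[:, None] == pd.Index(cols).to_numpy()).astype(int)
print(df)
```

This sets exactly one 1 per row and avoids building an intermediate dummies frame.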
This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time, I need to select two columns and check this condition: if both values are 0, I delete the row. For example, I select A and B:
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
And then I select A and C, and so on.
I used this code for A and B, but it returns errors:
import pandas as pd

df = pd.read_csv('Book1.csv')
a = df['A']
b = df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a==0) & (b==0)]:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)
Any help please!
First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
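Putting the combinations approach together as a self-contained sketch (recreating the sample frame from the question):

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 0, 0],
                   'B': [1, 0, 1, 0, 0],
                   'C': [0, 0, 1, 0, 7],
                   'D': [0, 0, 0, 0, 0],
                   'J': [0, 0, 0, 0, 7]})

# for every pair of columns, keep only the rows where at least one value is non-zero
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
print(d[('A', 'B')])
```

Each entry of d is the filtered two-column sub-frame, so d[('A', 'B')], d[('A', 'C')], etc. answer every pairwise selection in one pass.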
I want to create a dataframe from a dictionary which is of the format
Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}
I am using
df = pd.DataFrame.from_dict(Dictionary_, orient ='index')
But this creates positional columns up to the length of the longest value list, with the dictionary values as cell contents.
I want a df with keys as rows and values as columns like
a b c d e f m n
Key 1 1 1 1 1 0 0 0 0
Key 2 0 0 0 1 0 1 0 0
Key 3 1 0 1 0 0 0 1 1
I can do it by collecting all values of the dict, creating an empty dataframe with the dict keys as rows and the values as columns, and then iterating over each row to put a 1 wherever a value matches a column. But that would be too slow, as my data has 200,000 rows and .loc is slow. I feel I can use pandas dummies somehow, but I don't know how to apply them here.
I feel there will be a smarter way to do this.
If performance is important, use MultiLabelBinarizer and pass keys and values:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_,
                  index=Dictionary_.keys())
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative, but slower, approach is to create a Series, join the lists into strings with str.join, and finally call str.get_dummies:
df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative starting from the input DataFrame: use pandas.get_dummies, but then it is necessary to aggregate with max per column:
df1 = pd.DataFrame.from_dict(Dictionary_, orient ='index')
df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print (df)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
Use get_dummies:
>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
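Another option, if explode is available (pandas 0.25+), is to flatten the dictionary into one key/value pair per row and cross-tabulate; a sketch:

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],
               'Key2': ['d', 'f'],
               'Key3': ['a', 'c', 'm', 'n']}

# one row per (key, value) pair, then count occurrences into a key x value table
s = pd.Series(Dictionary_).explode()
df = pd.crosstab(s.index, s)
print(df)
```

crosstab counts occurrences, so repeated values per key would show counts greater than 1; clip to 1 if a pure indicator is needed.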
>>>
I have a dataframe that looks like this :
A B C
1 0 0
1 1 0
0 1 0
0 0 1
I want to replace all values with the respective column name, so that the data looks like:
A B C
A 0 0
A B 0
0 B 0
0 0 C
Afterwards, I want to create a column that is a list of all column values like so:
A B C D
A 0 0 ['A','0','0']
A B 0 ['A','B','0']
0 B 0 ['0','B','0']
0 0 C ['0','0','C']
Finally, I want to group by column D and count the number of occurrences for each pattern.
You can do this with mul:
df.mul(df.columns).replace('',0)
Out[63]:
A B C
0 A 0 0
1 A B 0
2 0 B 0
3 0 0 C
#df['D']=df.mul(df.columns).replace('',0).values.tolist()
There must be cleaner ways to achieve this, but you can use:
for column in df:
    df[column] = df[column].astype(str).replace("1", column)
df["D"] = df.values.tolist()
Output:
A B C D
0 A 0 0 [A, 0, 0]
1 A B 0 [A, B, 0]
2 0 B 0 [0, B, 0]
3 0 0 C [0, 0, C]
PS: W-B's answer is the cleaner way.
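Neither answer finishes the final step from the question, grouping by column D to count pattern occurrences. Since lists are unhashable, one sketch maps D to tuples before counting:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 0],
                   'B': [0, 1, 1, 0],
                   'C': [0, 0, 0, 1]})

# replace 1s with the column name, then collect each row into a list column
labelled = df.mul(df.columns).replace('', 0)
df['D'] = labelled.values.tolist()
# lists are unhashable, so count occurrences on a tuple view of column D
counts = df['D'].map(tuple).value_counts()
print(counts)
```

value_counts here plays the role of groupby-plus-size; each tuple key is one row pattern.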
I am trying to do a multiple column select then replace in pandas
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
Select rows where any (or all) of a, b, c, d are non-zero:
i, j = np.where(df)
s = pd.Series(dict(zip(zip(i, j),
                       df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the length of the series is different from the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for the product with the column names, strip the trailing separator with rstrip, and finally use numpy.where to replace empty strings with None:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df==1).apply(lambda x: df.columns[x], axis=1)\
                 .str.join(",").replace('', 'none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
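For completeness, the dot approach as a self-contained sketch (recreating the sample frame; the trick is that the matrix product of a 0/1 row with the string column names concatenates the names of the active columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0],
                   'b': [1, 0, 0, 0],
                   'c': [1, 0, 0, 0],
                   'd': [0, 1, 0, 0]})

# 0/1 rows times 'name, ' strings sums to the concatenation of active names
e = df.dot(df.columns + ', ').str.rstrip(', ')
# empty strings mark all-zero rows; map them to None
df['e'] = np.where(e.astype(bool), e, None)
print(df)
```

Unlike the apply-based answer, this stays fully vectorized, which matters on wide or long frames.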