I want to load columns with specific prefixes into separate DataFrames.
The columns I want to group share specific prefixes, e.g.:
A_1  A_2  B_1  B_2  C_1  C_2
  1    0    0    0    0    0
  1    0    0    1    1    1
  0    1    1    1    1    0
I have a list of all the prefixes:
prefixes = ["A", "B", "C"]
I want to do something like this:
for prefix in prefixes:
    f"df_{prefix}" = pd.read_csv("my_file.csv",
                                 usecols=[f"{prefix}_1",
                                          f"{prefix}_2",
                                          f"{prefix}_3"])
So each DataFrame has the prefix in the name, but I'm not quite sure of the best way to do this or the syntax required.
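One way to avoid dynamic variable names is a dictionary keyed by prefix. A minimal sketch using the callable form of usecols (assuming the exact column names per prefix are not known in advance):
import pandas as pd

prefixes = ["A", "B", "C"]

# One read per prefix; usecols also accepts a callable that receives each
# column name and keeps the column when it returns True.
dfs = {prefix: pd.read_csv("my_file.csv",
                           usecols=lambda c, p=prefix: c.startswith(f"{p}_"))
       for prefix in prefixes}

print(dfs["A"])  # only the A_* columns
The p=prefix default argument pins the prefix for each lambda; without it, every lambda would see the last loop value.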
You could try a different approach: load the full CSV once, then create three DataFrames from it by dropping the columns that don't match each prefix.
x = pd.read_csv("my_file.csv")
# Columns that do NOT start with each prefix
notA = [c for c in x.columns if not c.startswith('A')]
notB = [c for c in x.columns if not c.startswith('B')]
notC = [c for c in x.columns if not c.startswith('C')]
a = x.drop(columns=notA)
b = x.drop(columns=notB)
c = x.drop(columns=notC)
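As a side note, DataFrame.filter can express the same selection more directly; a minimal sketch with the same x:
a = x.filter(regex='^A_')  # keep only columns whose names start with "A_"
b = x.filter(regex='^B_')
c = x.filter(regex='^C_')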
Say you have a DataFrame like this:
In [1341]: df
Out[1341]:
A_1 A_2 B_1 B_2 C_1 C_2
0 1 0 0 0 0 0
1 1 0 0 1 1 1
2 0 1 1 1 1 0
Have a master list of prefixes:
In [1374]: master_list = ['A','B','C']
Create an empty dictionary to hold the multiple subsets of the dataframe:
In [1377]: dct = {}
Loop through the master list and store the column names in the above dict:
In [1378]: for i in master_list:
      ...:     dct['{}_list'.format(i)] = [e for e in df.columns if e.startswith(i)]
Now dct has the following keys and values:
A_list : ['A_1', 'A_2']
B_list : ['B_1', 'B_2']
C_list : ['C_1', 'C_2']
Then, subset your dataframes like below:
In [1381]: for k in dct:
      ...:     dct[k] = df[dct[k]]
Now the dictionary holds an actual sub-DataFrame against every key:
In [1384]: for k in dct:
      ...:     print(dct[k])
In [1347]: dct['A_list']
Out[1347]:
   A_1  A_2
0    1    0
1    1    0
2    0    1
In [1350]: dct['B_list']
Out[1350]:
   B_1  B_2
0    0    0
1    0    1
2    1    1
In [1355]: dct['C_list']
Out[1355]:
   C_1  C_2
0    0    0
1    1    1
2    1    0
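Both loops can also be collapsed into a single dict comprehension; a sketch of the same idea:
dct = {'{}_list'.format(p): df[[c for c in df.columns if c.startswith(p)]]
       for p in master_list}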
First remove the non-matching columns with str.startswith and boolean indexing via loc (loc, because we are filtering columns):
print (df)
   A_1  A_2  B_1  B_2  C_1  D_2
0    1    0    0    0    0    0
1    1    0    0    1    1    1
2    0    1    1    1    1    0
prefixes = ["A", "B", "C"]
df = df.loc[:, df.columns.str.startswith(tuple(prefixes))]
print (df)
   A_1  A_2  B_1  B_2  C_1
0    1    0    0    0    0
1    1    0    0    1    1
2    0    1    1    1    1
Then create a MultiIndex by splitting the column names, and build a dictionary of DataFrames with groupby:
df.columns = df.columns.str.split('_', expand=True)
print (df)
   A     B     C
   1  2  1  2  1
0  1  0  0  0  0
1  1  0  0  1  1
2  0  1  1  1  1
d = {k: v[k] for k, v in df.groupby(level=0, axis=1)}
print (d['A'])
   1  2
0  1  0
1  1  0
2  0  1
Or use a lambda function with split:
d = {k: v for k, v in df.groupby(lambda x: x.split('_')[0], axis=1)}
print (d['A'])
   A_1  A_2
0    1    0
1    1    0
2    0    1
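Note that groupby with axis=1 is deprecated in recent pandas (2.x). A version-agnostic sketch that builds the same dictionary directly from the flat column names:
d = {p: df.loc[:, df.columns.str.startswith(p)] for p in prefixes}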
I have a dataframe with many pre-defined column names. One column of this dataframe contains, in each row, one of these column names.
I want to write the value 1 where that string is equal to the column name.
For example, I have this current situation:
df = pd.DataFrame(0,index=[0,1,2,3],columns = ["string","a","b","c","d"])
df["string"] = ["b", "b", "c", "a"]
string  a  b  c  d
------------------
b       0  0  0  0
b       0  0  0  0
c       0  0  0  0
a       0  0  0  0
And this is what I would like the desired result to be like:
string  a  b  c  d
------------------
b       0  1  0  0
b       0  1  0  0
c       0  0  1  0
a       1  0  0  0
You can use get_dummies on df['string'] and update the DataFrame in place:
df.update(pd.get_dummies(df['string']))
updated df:
string a b c d
0 b 0 1 0 0
1 b 0 1 0 0
2 c 0 0 1 0
3 a 1 0 0 0
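A compact alternative is a broadcasted comparison of the string column against the remaining column names (a sketch, assuming the dummy columns are everything after string):
import numpy as np

cols = df.columns[1:]  # the candidate dummy columns: a, b, c, d
# Compare each row's string against every column name, yielding a 0/1 matrix.
df[cols] = (df["string"].to_numpy()[:, None] == cols.to_numpy()).astype(int)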
You can also use this pattern:
df.loc[df["column_name"] == "some_value", "column_name"] = "value"
In your case:
df.loc[df["string"] == "b", "b"] = 1
I have a DataFrame with both categorical and non-categorical data that I would like to dummy-encode, but not all of the category values I know are possible actually appear in the data.
For example let's use the following DataFrame:
>>> df = pd.DataFrame({"a": [1,2,3], "b": ["x", "y", "x"], "c": ["h", "h", "i"]})
>>> df
a b c
0 1 x h
1 2 y h
2 3 x i
Column a holds non-categorical values, while columns b and c are categorical.
Now let's say column b can contain the categories x, y and z, and column c the categories h, i, j and k:
>>> dummy_map = {"b": ["x", "y", "z"], "c": ["h", "i", "j", "k"]}
I want to encode it so that the resulting dataframe is as follows:
>>> df_encoded
a b_x b_y b_z c_h c_i c_j c_k
0 1 1 0 0 1 0 0 0
1 2 0 1 0 1 0 0 0
2 3 1 0 0 0 1 0 0
My current solution is as follows:
df_encoded = pd.get_dummies(df)
for k, v in dummy_map.items():
    for cat in v:
        name = k + "_" + cat
        if name not in df_encoded:
            df_encoded[name] = 0
But it seems to me a bit inefficient and inelegant.
So is there a better solution for this?
Use Index.union with the expected column names generated by a list comprehension with f-strings, then DataFrame.reindex:
c = [f'{k}_{x}' for k, v in dummy_map.items() for x in v]
print (c)
['b_x', 'b_y', 'b_z', 'c_h', 'c_i', 'c_j', 'c_k']
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c, sort=False)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
a b_x b_y c_h c_i b_z c_j c_k
0 1 1 0 1 0 0 0 0
1 2 0 1 1 0 0 0 0
2 3 1 0 0 1 0 0 0
If the values should be sorted in the union, omit sort=False:
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
a b_x b_y b_z c_h c_i c_j c_k
0 1 1 0 0 1 0 0 0
1 2 0 1 0 1 0 0 0
2 3 1 0 0 0 1 0 0
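Another option is to declare the expected categories up front with pd.Categorical; get_dummies then emits a column for every declared category, including unseen ones. A minimal sketch, reusing df and dummy_map from the question:
for col, cats in dummy_map.items():
    df[col] = pd.Categorical(df[col], categories=cats)

# Unseen categories (b_z, c_j, c_k) come out as all-zero columns.
df_encoded = pd.get_dummies(df)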
I want to create a dataframe from a dictionary which is of the format
Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}
I am using
df = pd.DataFrame.from_dict(Dictionary_, orient ='index')
But that creates positional columns up to the longest value list and puts the dictionary values themselves into the cells.
I want a df with keys as rows and values as columns, like:
      a  b  c  d  e  f  m  n
Key1  1  1  1  1  0  0  0  0
Key2  0  0  0  1  0  1  0  0
Key3  1  0  1  0  0  0  1  1
I can do it by collecting all the values of the dict, creating an empty dataframe with the dict keys as rows and the values as columns, and then iterating over each row to put a 1 wherever the value matches the column. But that would be too slow, since my data has 200,000 rows and .loc is slow. I feel I could use pandas dummies somehow, but I don't know how to apply them here.
I feel there must be a smarter way to do this.
If performance is important, use MultiLabelBinarizer and pass keys and values:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_,
                  index=Dictionary_.keys())
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative, but slower: create a Series, join the lists into strings with str.join, and finally call str.get_dummies:
df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
a b c d f m n
Key1 1 1 1 1 0 0 0
Key2 0 0 0 1 1 0 0
Key3 1 0 1 0 0 1 1
An alternative with the input DataFrame: use pandas.get_dummies, but then it is necessary to aggregate with max per column level:
df1 = pd.DataFrame.from_dict(Dictionary_, orient ='index')
df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print (df)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
Use get_dummies:
>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
a d b c f m n
Key1 1 1 1 1 0 0 0
Key2 0 1 0 0 1 0 0
Key3 1 0 0 1 0 1 1
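A pure-pandas alternative (a sketch; Series.explode needs pandas >= 0.25): explode the dictionary into key/value pairs, then cross-tabulate:
s = pd.Series(Dictionary_).explode()  # one row per (key, value) pair
df = pd.crosstab(s.index, s)          # keys as rows, values as 0/1 columns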
I am trying to do a multiple column select then replace in pandas
df:
a  b  c  d  e
0  1  1  0  none
0  0  0  1  none
1  0  0  0  none
0  0  0  0  none
Select the column names where any of a, b, c, d are non-zero:
i, j = np.where(df[['a', 'b', 'c', 'd']])
s = pd.Series(dict(zip(zip(i, j),
                       df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the length of the series is different from the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for a product with the column names, strip the trailing separator with rstrip, and finally use numpy.where to replace empty strings with None:
e = df[['a', 'b', 'c', 'd']].dot(df.columns[:4] + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
   a  b  c  d     e
0  0  1  1  0  b, c
1  0  0  0  1     d
2  1  0  0  0     a
3  0  0  0  0  None
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df==1).apply(lambda x: df.columns[x], axis=1)\
                 .str.join(",").replace('', 'none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
I have a DataFrame like this
id val1 val2
0 A B
1 B B
2 A A
3 A A
And I would like to swap the values, like this:
id val1 val2
0 B A
1 A A
2 B B
3 B B
I need to consider that the df could have other columns that I would like to keep unchanged.
You can use pd.DataFrame.applymap with a dictionary:
d = {'B': 'A', 'A': 'B'}
df = df.applymap(d.get).fillna(df)
print(df)
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
For performance, in particular memory usage, you may wish to use categorical data:
for col in df.columns[1:]:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.rename_categories(d)
Try stacking, mapping, and then unstacking:
df[['val1', 'val2']] = (
    df[['val1', 'val2']].stack().map({'B': 'A', 'A': 'B'}).unstack())
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
For a (much) faster solution, use a nested list comprehension.
mapping = {'B': 'A', 'A': 'B'}
df[['val1', 'val2']] = [
    [mapping.get(x, x) for x in row] for row in df[['val1', 'val2']].values]
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
You can swap two values efficiently using numpy.where. However, if there are more than two values, this method stops working.
a = df[['val1', 'val2']].values
df[['val1', 'val2']] = np.where(a=='A', 'B', 'A')
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
To adapt this to keep other values the same, you can use np.select:
c1 = a == 'A'
c2 = a == 'B'
df[['val1', 'val2']] = np.select([c1, c2], ['B', 'A'], a)
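For instance, a self-contained sketch (hypothetical data) where a third value C passes through untouched:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(3), 'val1': list('ABC'), 'val2': list('BCA')})
a = df[['val1', 'val2']].to_numpy()
# Swap A and B; the default branch leaves every other value (here C) as-is.
df[['val1', 'val2']] = np.select([a == 'A', a == 'B'], ['B', 'A'], a)
print(df)
#    id val1 val2
# 0   0    B    A
# 1   1    A    C
# 2   2    C    B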
Use factorize and roll the corresponding values:
def swaparoo(col):
    # Factorize the column, then shift every code by one so each value
    # maps to the next unique value; for two values this is a swap.
    i, r = col.factorize()
    return pd.Series(r[(i + 1) % len(r)], col.index)
df[['id']].join(df[['val1', 'val2']].apply(swaparoo))
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
Alternative gymnastics using the same function. This incorporates the whole dataframe into the factorization.
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index()
Examples
df = pd.DataFrame(dict(id=range(4), val1=[*'ABAA'], val2=[*'BBAA']))
print(
df,
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
sep='\n\n'
)
id val1 val2
0 0 A B
1 1 B B
2 2 A A
3 3 A A
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB']))
print(
df,
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
sep='\n\n'
)
id val1 val2
0 0 A B
1 1 A B
2 2 A B
3 3 A B
id val1 val2
0 0 B A
1 1 B A
2 2 B A
3 3 B A
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB'], val3=[*'CCCC']))
print(
df,
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 A B C
2 2 A B C
3 3 A B C
id val1 val2 val3
0 0 B C A
1 1 B C A
2 2 B C A
3 3 B C A
df = pd.DataFrame(dict(id=range(4), val1=[*'ABCD'], val2=[*'BCDA'], val3=[*'CDAB']))
print(
df,
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 B C D
2 2 C D A
3 3 D A B
id val1 val2 val3
0 0 B C D
1 1 C D A
2 2 D A B
3 3 A B C
Using replace. Why do we need a C here? In the pandas version used, replace chained these replacements ('A' -> 'B' would then be caught by 'B' -> 'A'), so routing one value through a temporary C keeps the swap correct:
df[['val1','val2']].replace({'A':'C','B':'A','C':'B'})
Out[263]:
val1 val2
0 B A
1 A A
2 B B
3 B B
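For reference, recent pandas versions apply a dict replace independently per key, so a plain two-key swap works without the intermediate value (a sketch; verify on your version):
df[['val1', 'val2']] = df[['val1', 'val2']].replace({'A': 'B', 'B': 'A'})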