I have created a data frame as follows:
import numpy as np
import pandas as pd

np.random.seed(0)
lvl0 = ['A', 'B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0, 2, size=df.shape)
for v in lvl0:
    df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)
Which gives
c1 c2
A mode NaN NaN
x 0 1
y 1 0
z 1 1
B mode NaN NaN
x 1 1
y 1 1
z 1 0
I then calculate the modes for groups A and B for both columns as follows:
for c in cols:
    modes = df.groupby(level=0)[c].agg(pd.Series.mode)
    df.loc[(lvl0, 'mode'), c] = modes.values
which results in
c1 c2
A mode 1 1
x 0 1
y 1 0
z 1 1
B mode 1 1
x 1 1
y 1 1
z 1 0
I now want to remove all rows where the values are equal to the mode of the corresponding first-level group (A or B), which in this case should result in
c1 c2
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
I could achieve the desired result by looping over lvl0, but I was wondering if there was a more elegant solution, perhaps using groupby.
Additionally, I wonder if there is a way to add the mode rows when the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'mode'), c] = modes.values gives me a KeyError on 'mode'.
Use:
np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)
df.sort_index(inplace=True)
# get the first mode per group
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])
# compare rows against their group's mode (dropping the second index level),
# keep rows that differ, and prepend the mode rows
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True),
                 df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
        .sort_index())
print (df)
print (df)
c1 c2
tmp
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
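For the "more elegant solution, perhaps using groupby" part of the question, a variant that broadcasts the modes back to the original shape with groupby.transform may also work. This is a sketch, assuming the same seeded data as above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(0, 2, size=(6, 2)),
                  index=idx, columns=['c1', 'c2'])

# broadcast each group's per-column mode back to the original shape
modes = df.groupby(level=0).transform(lambda x: x.mode().iat[0])
# keep only rows that differ from their group's mode in at least one column
kept = df[df.ne(modes).any(axis=1)]

# build the 'mode' rows and prepend them, answering the second part
# of the question (no empty NaN rows needed beforehand)
mode_rows = df.groupby(level=0).agg(lambda x: x.mode().iat[0])
mode_rows.index = pd.MultiIndex.from_product([mode_rows.index, ['mode']])
out = pd.concat([mode_rows, kept]).sort_index()
print(out)
```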
Let's say I have the following df:
data = [{'c1': 'a', 'c2': 'x'}, {'c1': 'b', 'c2': 'y'}, {'c1': 'c', 'c2': 'z'}]
df = pd.DataFrame(data)
Output:
c1 c2
0 a x
1 b y
2 c z
Now I want to use pd.get_dummies() to one-hot encode the two categorical columns c1 and c2 and drop the first category of each column: pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True). How can I decide which category to drop without knowing the rows' order? Is there a command I missed?
EDIT:
So my goal would be, e.g., to drop category b from c1 and z from c2.
Output:
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
One trick is to replace the values you want to drop with NaN; here one value is removed per column:
# values to avoid, one per column
d = {'c1': 'b', 'c2': 'z'}
d1 = {k: {v: np.nan} for k, v in d.items()}
df = pd.get_dummies(df.replace(d1), columns = ['c1', 'c2'], prefix='', prefix_sep='')
print (df)
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
If you need to remove multiple values per column, use lists:
d = {'c1':['b','c'], 'c2':['z']}
d1 = {k:{x: np.nan for x in v} for k, v in d.items()}
print (d1)
{'c1': {'b': nan, 'c': nan}, 'c2': {'z': nan}}
df = pd.get_dummies(df.replace(d1), columns = ['c1', 'c2'], prefix='', prefix_sep='')
print (df)
a x y
0 1 1 0
1 0 0 1
2 0 0 0
EDIT:
If the values are unique across columns, it is simpler to drop them in a last step:
df = (pd.get_dummies(df, columns=['c1', 'c2'], prefix='', prefix_sep='')
        .drop(['b', 'z'], axis=1))
print (df)
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
I'd highly recommend using sklearn instead! https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
You can view the categories by accessing the <your_fitted_instance_name>.categories_ attribute after you've fitted the one-hot encoder, and it also has an inverse_transform() method to reverse the one-hot encoding!
As for column dropping: the default is not to drop any, but you can use OneHotEncoder(drop='first') in order to drop one.
Edit: Also note that sklearn offers Pipelines, which can help you ensure consistent pre-processing throughout your project!
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
I've found how to remove columns that are zero for all rows using the command df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row number.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
Give me the columns with non-zeros for the row 0 and I would expect the result:
a b
0 1 1
And for the row 1 get:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need to visualize its result better.
Below is pseudo-code showing what I need:
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
If you need a new column with the non-zero column names joined by ", ", compare the values with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x!=0].to_frame().T)

df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(get, axis=1)
print(df)
Also, if you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
Output:
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df != 0
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
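The same selection can be written more compactly with boolean column indexing. A sketch, using the example df from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

given_row = 1
# select the given row, keeping only the columns where it is non-zero
out = df.loc[[given_row], df.loc[given_row].ne(0)]
print(out)
```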
So in Pandas I have the following dataframe
A B C D
0 X
1 Y
0 Y
1 Y
0 X
1 X
I want to move the value in A to either C or D depending on B. The output should be something like this;
A B C D
0 X 0
1 Y 1
0 Y 0
1 Y 1
0 X 0
1 X 1
I've tried using multiple where statements like
df['C'] = np.where(str(df.B).find('X'), df.A, '')
df['D'] = np.where(str(df.B).find('Y'), df.A, '')
But that results in;
A B C D
0 X 0 0
1 Y 1 1
0 Y 0 0
1 Y 1 1
0 X 0 0
1 X 1 1
So I guess it's checking if the value exists in the column at all, which makes sense. Do I need to iterate row by row?
Don't convert to str with find: it operates on the string representation of the whole Series and returns a single scalar, and any non-zero integer is truthy (only 0 is falsy), so the condition is the same for every row:
print (str(df.B).find('X'))
5
Simplest is to compare the values to get a boolean Series:
print (df.B == 'X')
0 True
1 False
2 False
3 False
4 True
5 True
Name: B, dtype: bool
df['C'] = np.where(df.B == 'X', df.A, '')
df['D'] = np.where(df.B == 'Y', df.A, '')
Another solution with assign + where:
df = df.assign(C=df.A.where(df.B == 'X', ''),
D=df.A.where(df.B == 'Y', ''))
And if you need to check substrings, use str.contains:
df['C'] = np.where(df.B.str.contains('X'), df.A, '')
df['D'] = np.where(df.B.str.contains('Y'), df.A, '')
Or:
df['C'] = df.A.where(df.B.str.contains('X'), '')
df['D'] = df.A.where(df.B.str.contains('Y'), '')
All return:
print (df)
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
Using slice assignment
n = len(df)
# factorize maps each B label to an integer in order of first appearance:
# here 'X' -> 0 and 'Y' -> 1, i.e. the target column C or D
f, u = pd.factorize(df.B.values)
# empty object array pre-filled with empty strings
a = np.empty((n, 2), dtype=object)
a.fill('')
# place each A value in the column selected by its B label
a[np.arange(n), f] = df.A.values
df.loc[:, ['C', 'D']] = a
df
df
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
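Another way to sketch this, assuming the same df, is pivot, which places each A value under the column named by its B label; renaming X/Y to C/D and filling the gaps gives the same layout (the moved values appear as floats, since the intermediate NaNs force a float dtype):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 1, 0, 1],
                   'B': ['X', 'Y', 'Y', 'Y', 'X', 'X']})

# pivot spreads df.A into one column per distinct df.B value, NaN elsewhere
wide = df.pivot(columns='B', values='A')
out = df.join(wide.rename(columns={'X': 'C', 'Y': 'D'}).fillna(''))
print(out)
```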
Consider the following HDFStore and DataFrames df and df2
import pandas as pd
store = pd.HDFStore('test.h5')
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
df
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)
df2
C
A B
0 V 0
W 1
X 2
1 V 3
W 4
X 5
I want to first write df to the store.
store.append('df', df)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows with the same index values as are in my new dataframe while keeping the old ones.
When I do
store.append('df', df2)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
0 V 0
W 1
X 2
1 V 3
W 4
X 5
This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot of data, where this wouldn't be feasible.
How do I update the store to get?
C
A B
0 V 0
W 1
X 2
Y 1
Z 2
1 V 3
W 4
X 5
Y 4
Z 5
You'll see that for each level of 'A', 'Y' and 'Z' are the same, 'V' and 'W' are new, and 'X' is updated.
What is the correct way to do this?
Idea: remove matching rows (with matching index values) from the HDF first and then append df2 to HDFStore.
Problem: I couldn't find a way to use where="index in df2.index" for multi-index indexes.
Solution: first convert the MultiIndexes to normal ones:
df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)
this yields:
In [348]: df
Out[348]:
C
0_X 0
0_Y 1
0_Z 2
1_X 3
1_Y 4
1_Z 5
In [349]: df2
Out[349]:
C
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
Make sure that you use format='t' and data_columns=True (this will save the index and index all columns in the HDF5 file, allowing us to use them in the where clause) when you create/append to HDF5 files:
store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()
now we can first remove those rows from the HDFStore with matching indexes:
store = pd.HDFStore('d:/temp/test1.h5')
In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2
and append df2:
In [346]: store.append('df', df2, format='t', data_columns=True, append=True)
Result:
In [347]: store.get('df')
Out[347]:
C
0_Y 1
0_Z 2
1_Y 4
1_Z 5
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
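If the combined data fits in memory, the "manipulate the combined dataframe and overwrite" route the question mentions can be sketched with combine_first, which prefers df2's values on shared index labels and falls back to df for the rest; the store key is then rewritten wholesale:

```python
import pandas as pd

midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)

# df2 wins on shared index labels; df fills in the rest
combined = df2.combine_first(df)
print(combined)
# then overwrite the old key, e.g.: store.put('df', combined, format='t')
```

Note that combine_first returns float values here, since the index alignment introduces NaNs internally.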
In pandas I have a dataframe of two columns and x rows, and I want to add a column containing the rolling count of times that the value in col1 appears in either column from the first row up to row x-1.
The df is like this:
col1 col2
0 B A
1 B C
2 A B
3 A B
4 A C
5 B A
The desired output is
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3 #A appears 3 times in the two columns from row 0 to 3
5 B A 4 #B appears 4 times in the two columns from row 0 to 4
Let's use some dataframe reshaping, groupby and cumcount:
dfs = df.stack()
df['freq'] = dfs.groupby(dfs).cumcount().unstack()['col1']
print(df)
Output:
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3
5 B A 4
This will work irrespective of the number of columns in the df:
import pandas as pd
import numpy as np

def add(d1, d2):
    # merge two count dictionaries
    for i in d2.keys():
        if i in d1.keys():
            d1[i] = d1[i] + d2[i]
        else:
            d1[i] = d2[i]
    return d1

if __name__ == '__main__':
    counts = {}
    df = pd.DataFrame({"a": [1, 2, 3, 1, 2], "b": [2, 1, 2, 3, 1]})
    col = list(df)
    for ind, it in df.iterrows():
        unique, count = np.unique(it, return_counts=True)
        unique_dict = dict(zip(unique, count))
        counts = add(counts, unique_dict)
        df.loc[ind, "freq"] = counts[it[col[0]]]
    df["freq"] = df["freq"] - 1
from collections import defaultdict

def fn():
    d1, d2 = defaultdict(int), defaultdict(int)
    x = yield
    while True:
        # count previous occurrences of x.col1 in both columns,
        # then record the current row before yielding
        out = d1[x.col1] + d2[x.col1]
        d1[x.col1] += 1
        d2[x.col2] += 1
        x = yield out

f = fn()
next(f)
df['freq'] = df[['col1', 'col2']].apply(lambda x: f.send(x), axis=1)
print(df)
print(df)
Prints:
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3
5 B A 4
EDIT (solution for an arbitrary number of columns):
from collections import defaultdict

def fn(cols):
    dd = [defaultdict(int) for _ in cols]
    x = yield
    while True:
        # count previous occurrences of the first column's value,
        # then record the current row's values
        out = sum(d[x.iloc[0]] for d in dd)
        for i, d in enumerate(dd):
            d[x.iloc[i]] += 1
        x = yield out

cols = ['col1', 'col2']
f = fn(cols)
next(f)
df['freq'] = df[cols].apply(lambda x: f.send(x), axis=1)
print(df)
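The stack/cumcount approach from the first answer also extends to an arbitrary list of columns, which may be a simpler alternative to the generator. A sketch, assuming cols[0] is the column whose values are counted:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('BBAAAB'), 'col2': list('ACBBCA')})
cols = ['col1', 'col2']

# stacking walks the frame row by row, column by column, so the cumcount of
# the value at (row i, col1) counts its occurrences over all earlier rows
dfs = df[cols].stack()
df['freq'] = dfs.groupby(dfs).cumcount().unstack()[cols[0]]
print(df)
```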