I have a DataFrame like this:
id val1 val2
0 A B
1 B B
2 A A
3 A A
And I would like to swap the values, like this:
id val1 val2
0 B A
1 A A
2 B B
3 B B
I need to consider that the df could have other columns that I would like to keep unchanged.
You can use pd.DataFrame.applymap with a dictionary:
d = {'B': 'A', 'A': 'B'}
df = df.applymap(d.get).fillna(df)
print(df)
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
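Note that in pandas 2.1+ applymap is deprecated in favour of DataFrame.map; as a sketch, the same idea on newer versions reads:
df = df.map(d.get).fillna(df)  # pandas >= 2.1 spelling of applymap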
For performance, in particular memory usage, you may wish to use categorical data:
for col in df.columns[1:]:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.rename_categories(d)
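If you want to verify the savings on your own data, DataFrame.memory_usage with deep=True counts the Python string objects held in object columns, so you can compare the footprint before and after the cast:
print(df.memory_usage(deep=True))  # run before and after converting to category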
Try stacking, mapping, and then unstacking:
df[['val1', 'val2']] = (
    df[['val1', 'val2']].stack().map({'B': 'A', 'A': 'B'}).unstack())
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
For a (much) faster solution, use a nested list comprehension.
mapping = {'B': 'A', 'A': 'B'}
df[['val1', 'val2']] = [
    [mapping.get(x, x) for x in row] for row in df[['val1', 'val2']].values]
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
You can swap two values efficiently using numpy.where. However, if there are more than two values, this method stops working.
import numpy as np

a = df[['val1', 'val2']].values
df[['val1', 'val2']] = np.where(a=='A', 'B', 'A')
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
To adapt this to keep other values the same, you can use np.select:
c1 = a=='A'
c2 = a=='B'
np.select([c1, c2], ['B', 'A'], a)
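For completeness, a sketch writing the result back (a, c1 and c2 as defined above; anything that is neither 'A' nor 'B' falls through to the default and stays unchanged):
df[['val1', 'val2']] = np.select([c1, c2], ['B', 'A'], default=a)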
Use factorize and roll the corresponding values:
def swaparoo(col):
    i, r = col.factorize()
    return pd.Series(r[(i + 1) % len(r)], col.index)
df[['id']].join(df[['val1', 'val2']].apply(swaparoo))
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
Alternative gymnastics using the same function. This incorporates the whole dataframe into the factorization.
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index()
Examples
df = pd.DataFrame(dict(id=range(4), val1=[*'ABAA'], val2=[*'BBAA']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2
0 0 A B
1 1 B B
2 2 A A
3 3 A A
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2
0 0 A B
1 1 A B
2 2 A B
3 3 A B
id val1 val2
0 0 B A
1 1 B A
2 2 B A
3 3 B A
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB'], val3=[*'CCCC']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 A B C
2 2 A B C
3 3 A B C
id val1 val2 val3
0 0 B C A
1 1 B C A
2 2 B C A
3 3 B C A
df = pd.DataFrame(dict(id=range(4), val1=[*'ABCD'], val2=[*'BCDA'], val3=[*'CDAB']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 B C D
2 2 C D A
3 3 D A B
id val1 val2 val3
0 0 B C D
1 1 C D A
2 2 D A B
3 3 A B C
Using replace. Why do we need a 'C' here? It keeps the A and B replacements from colliding with each other; check this:
df[['val1','val2']].replace({'A':'C','B':'A','C':'B'})
Out[263]:
val1 val2
0 B A
1 A A
2 B B
3 B B
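For completeness, assigning the result back so the other columns stay unchanged:
df[['val1', 'val2']] = df[['val1', 'val2']].replace({'A': 'C', 'B': 'A', 'C': 'B'})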
Related
I have the following example data set:
A    B  C  D
foo  0  1  1
bar  0  0  1
baz  1  1  0
How could I extract the column names of each 1 occurrence in a row and put them into another column E, so that I get the following table:
A    B  C  D  E
foo  0  1  1  C, D
bar  0  0  1  D
baz  1  1  0  B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
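How the trick works: multiplying a string by 0 or 1 yields '' or the string itself, and the row-wise dot product concatenates the surviving pieces. A self-contained sketch, rebuilding the example frame from the question:
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'baz'],
                   'B': [0, 0, 1],
                   'C': [1, 0, 1],
                   'D': [1, 1, 0]})

# each label survives the multiplication only where its column holds a 1
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
print(df)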
Another way: convert each row to boolean and use it as a selection mask to filter the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
It has been a long time since I dealt with the pandas library. I searched but could not come up with an efficient way; there might be an existing function in the library for this.
Let's say I have the dataframe below:
df1 = pd.DataFrame({'V1': ['A', 'A', 'B'],
                    'V2': ['B', 'C', 'C'],
                    'Value': [4, 1, 5]})
df1
And I would like to extend this dataset so that it contains every combination of the categories, each carrying exactly the same value:
df2 = pd.DataFrame({'V1': ['A', 'B', 'A', 'C', 'B', 'C'],
                    'V2': ['B', 'A', 'C', 'A', 'C', 'B'],
                    'Value': [4, 4, 1, 1, 5, 5]})
df2
In other words, in df1, A and B have a Value of 4, and I also want a row saying that B and A have a Value of 4 in the second dataframe. It is very similar to melting. I also do not want to use a for loop; I am looking for a more efficient way.
Use:
df = pd.concat([df1, df1.rename(columns={'V2':'V1', 'V1':'V2'})]).sort_index().reset_index(drop=True)
Output:
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
Or np.vstack:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns)
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
For correct order:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index()
V1 V2 Value
0 A B 4
0 B A 4
1 A C 1
1 C A 1
2 B C 5
2 C B 5
And index reset:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index().reset_index(drop=True)
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
You can use the assign and append methods:
df1.append(df1.assign(V1=df1.V2, V2=df1.V1), ignore_index=True)
Output:
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
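Note that DataFrame.append was removed in pandas 2.0; one way to spell the same thing with pd.concat:
out = pd.concat([df1, df1.assign(V1=df1.V2, V2=df1.V1)], ignore_index=True)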
I want to load columns with specific prefixes into separate DataFrames.
The columns I want have specific prefixes i.e.
A_1 A_2 B_1 B_2 C_1 C_2
1 0 0 0 0 0
1 0 0 1 1 1
0 1 1 1 1 0
I have a list of all the prefixes:
prefixes = ["A", "B", "C"]
I want to do something like this:
for prefix in prefixes:
    f"df_{prefix}" = pd.read_csv("my_file.csv",
                                 usecols=[f"{prefix}_1",
                                          f"{prefix}_2",
                                          f"{prefix}_3"])
So each DataFrame has the prefix in the name, but I'm not quite sure of the best way to do this or the syntax required.
You could try it with a different approach: load the full csv once, then create three dfs out of it by dropping the columns that don't match each prefix.
x = pd.read_csv("my_file.csv")
notA = [c for c in x.columns if 'A' not in c]
notB = [c for c in x.columns if 'B' not in c]
notC = [c for c in x.columns if 'C' not in c]
a = x.drop(columns=notA)
b = x.drop(columns=notB)
c = x.drop(columns=notC)
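A tighter variant of the same idea is DataFrame.filter with an anchored regex, which avoids the pitfall that 'A' not in c would also drop unrelated columns merely containing that letter. A sketch, using a dict to hold the sub-frames:
x = pd.read_csv("my_file.csv")
# one sub-frame per prefix, keyed by the prefix itself
dfs = {p: x.filter(regex=f'^{p}_') for p in ["A", "B", "C"]}
a, b, c = dfs["A"], dfs["B"], dfs["C"]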
Considering you have a big dataframe like this:
In [1341]: df
Out[1341]:
A_1 A_2 B_1 B_2 C_1 C_2
0 1 0 0 0 0 0
1 1 0 0 1 1 1
2 0 1 1 1 1 0
Have a master list of prefixes:
In [1374]: master_list = ['A','B','C']
Create an empty dictionary to hold multiple subsets of dataframe:
In [1377]: dct = {}
Loop through the master list and store the column names in the above dict:
In [1378]: for i in master_list:
      ...:     dct['{}_list'.format(i)] = [e for e in df.columns if e.startswith(i)]
Now, the dct has below keys with values:
A_list : ['A_1', 'A_2']
B_list : ['B_1', 'B_2']
C_list : ['C_1', 'C_2']
Then, subset your dataframes like below:
In [1381]: for k in dct:
      ...:     dct[k] = df[dct[k]]
Now, the dictionary has actual rows of dataframe against every key:
In [1384]: for k in dct:
      ...:     print(dct[k])
   A_1  A_2
0    1    0
1    1    0
2    0    1
   B_1  B_2
0    0    0
1    0    1
2    1    1
   C_1  C_2
0    0    0
1    1    1
2    1    0
First filter out the columns that don't match, using startswith with boolean indexing and loc to select the matching columns:
print (df)
A_1 A_2 B_1 B_2 C_1 D_2
0 1 0 0 0 0 0
1 1 0 0 1 1 1
2 0 1 1 1 1 0
prefixes = ["A", "B", "C"]
df = df.loc[:, df.columns.str.startswith(tuple(prefixes))]
print (df)
A_1 A_2 B_1 B_2 C_1
0 1 0 0 0 0
1 1 0 0 1 1
2 0 1 1 1 1
Then create a MultiIndex by splitting the column names, and build a dictionary of DataFrames with groupby:
df.columns = df.columns.str.split('_', expand=True)
print (df)
A B C
1 2 1 2 1
0 1 0 0 0 0
1 1 0 0 1 1
2 0 1 1 1 1
d = {k: v[k] for k, v in df.groupby(level=0, axis=1)}
print (d['A'])
1 2
0 1 0
1 1 0
2 0 1
Or use a lambda function with split:
d = {k: v for k, v in df.groupby(lambda x: x.split('_')[0], axis=1)}
print (d['A'])
A_1 A_2
0 1 0
1 1 0
2 0 1
I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following but it puts each repeating value in its own column
df.groupby('data')
This looks like a pivot problem, but the column key (created by cumcount) and the index key (created by factorize) are missing, so it is hard to see at first:
pd.crosstab(pd.factorize(df.data)[0], df.groupby('data').cumcount(), df.data, aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b
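The same reshape can be spelled with pivot once both keys are built explicitly; a sketch of the equivalent steps:
out = (df.assign(row=pd.factorize(df['data'])[0],          # index key
                 col=df.groupby('data').cumcount())        # column key
         .pivot(index='row', columns='col', values='data'))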
Something like the following, which builds a group index from the points where the value changes:
index = (df['data'] != df['data'].shift()).cumsum() - 1
index.name = None  # clear the inherited 'data' name so it doesn't clash with the column
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
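An equivalent spelling, as a sketch, uses groupby.ngroup, which numbers the groups directly (sort=False keeps first-appearance order, matching factorize):
df = df.set_index(df.groupby('data', sort=False).ngroup().rename('key'))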
I have the following data frame:
1 A a
1 A b
2 B c
1 A d
How do I combine the values of all rows that share the same first two values into a single row, like this:
1 A a, b, d
2 B c
You can use groupby and apply the join function:
df.columns = ['a','b','c']
print (df)
a b c
0 1 A a
1 1 A b
2 2 B c
3 1 A d
print (df.groupby(['a', 'b'])['c'].apply(', '.join).reset_index())
a b c
0 1 A a, b, d
1 2 B c
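The same thing with agg, keeping the grouping keys as regular columns:
out = df.groupby(['a', 'b'], as_index=False)['c'].agg(', '.join)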
Or if the first column is the index:
df.columns = ['a','b']
print (df)
a b
1 A a
1 A b
2 B c
1 A d
df1 = df.b.groupby([df.index, df.a]).apply(', '.join).reset_index(name='c')
df1.columns = ['a','b','c']
print (df1)
a b c
0 1 A a, b, d
1 2 B c