Pandas groupby and replace duplicates with empty string - python

I have a dataframe like the following:
import pandas as pd
d = {'one':[1,1,1,1,2, 2, 2, 2],
'two':['a','a','a','b', 'a','a','b','b'],
'letter':['a','b','c','a', 'a', 'b', 'a', 'b']}
df = pd.DataFrame(d)
> one two letter
0 1 a a
1 1 a b
2 1 a c
3 1 b a
4 2 a a
5 2 a b
6 2 b a
7 2 b b
And I am trying to convert it to a dataframe like the following, where empty cells are filled with empty string '':
one two letter
1   a   a
        b
        c
    b   a
2   a   a
        b
    b   a
        b
When I perform groupby with all columns I get a series object that is basically exactly what I am looking for, but not a dataframe:
df.groupby(df.columns.tolist()).size()
1  a  a    1
      b    1
      c    1
   b  a    1
2  a  a    1
      b    1
   b  a    1
      b    1
How can I get the desired dataframe?

You can build a mask that is True wherever a value differs from the value in the row above it, then use where to replace everything else with an empty string:
df[['one','two']] = df[['one','two']].where(df[['one', 'two']].apply(lambda x: x != x.shift()), '')
>>> df
  one two letter
0   1   a      a
1              b
2              c
3       b      a
4   2   a      a
5              b
6       b      a
7              b
Some explanation: the mask looks like this:
>>> df[['one', 'two']].apply(lambda x: x != x.shift())
one two
0 True True
1 False False
2 False False
3 False True
4 True True
5 False False
6 False True
7 False False
All that where does is keep the values where the mask is True and replace the rest with ''.
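One caveat with comparing each column on its own: if 'one' changes while 'two' happens to keep the same value across the boundary, the 'two' cell is blanked even though a new group has started. A minimal sketch of one way around this, assuming you want a cell shown whenever it, or the column to its left, changes (the names new_one, new_two and out are illustrative and not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'one': [1, 1, 1, 1, 2, 2, 2, 2],
    'two': ['a', 'a', 'a', 'b', 'a', 'a', 'b', 'b'],
    'letter': ['a', 'b', 'c', 'a', 'a', 'b', 'a', 'b'],
})

# True where 'one' differs from the row above (the first row always counts as new)
new_one = df['one'].ne(df['one'].shift())
# 'two' starts a new block when it changes OR when 'one' changed
new_two = new_one | df['two'].ne(df['two'].shift())

# Work on an object-dtype copy so that '' can sit next to the integers
out = df.astype(object)
out['one'] = out['one'].where(new_one, '')
out['two'] = out['two'].where(new_two, '')
print(out)
```

On the sample data this gives the same result as the one-liner; the two only differ when a value repeats across a group boundary.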

A solution to the original problem is to find the duplicated cells in each of the first two columns and set them to an empty string:
df.loc[df.duplicated(subset=['one', 'two']), 'two'] = ''
df.loc[df.duplicated(subset=['one']), 'one'] = ''
However, the purpose of this transformation is unclear. Perhaps you are trying to solve the wrong problem.

Related

How to swap two columns and flip a third in a pandas data frame?

I'm running an experiment (using Python 2.7, pandas 0.23.4) with three levels of a stimulus {a, b, c}. I present every combination to participants, who have to choose which stimulus was rougher (example: Stimulus 1 = a, Stimulus 2 = b, participant chooses 1, indicating that stimulus 1 was rougher).
After the experiment, I have a data frame with three columns like this:
import pandas as pd
d = {'Stim1': ['a', 'b', 'a', 'c', 'b', 'c'],
'Stim2': ['b', 'a', 'c', 'a', 'c', 'b'],
'Answer': [1, 2, 2, 1, 2, 1]}
df = pd.DataFrame(d)
Stim1 Stim2 Answer
0 a b 1
1 b a 2
2 a c 2
3 c a 1
4 b c 2
5 c b 1
For my analysis, the order in which the stimuli were presented doesn't matter: Stim1 = a, Stim2 = b is the same as Stim1 = b, Stim2 = a. I'm trying to figure out how I can swap Stim1 and Stim2 and flip the Answer so that the data looks like this:
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
I read that np.where can be used, but it would do one thing at a time, where I want to do two (swap and flip).
Is there some way to use another function to do swap and flip at the same time?
Can you try if this works for you?
import pandas as pd
import numpy as np
df = pd.DataFrame(d)
# keep a copy of the original Stim1 column
s = df['Stim1'].copy()
# sort the values
df[['Stim1', 'Stim2']] = np.sort(df[['Stim1', 'Stim2']].values)
# exchange the Answer if the order has changed
df['Answer'] = df['Answer'].where(df['Stim1'] == s, df['Answer'].replace({1:2,2:1}))
output:
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
You can start by building a boolean series that indicates which rows should be swapped or not:
>>> swap = df['Stim1'] > df['Stim2']
>>> swap
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
Then build the fully swapped dataframe as follows:
>>> swapped_df = pd.concat([
... df['Stim1'].rename('Stim2'),
... df['Stim2'].rename('Stim1'),
... 3 - df['Answer'],
... ], axis='columns')
>>> swapped_df
Stim2 Stim1 Answer
0 a b 2
1 b a 1
2 a c 1
3 c a 2
4 b c 1
5 c b 2
Finally, use .mask() to select initial rows or swapped rows:
>>> df.mask(swap, swapped_df)
Stim1 Stim2 Answer
0 a b 1
1 a b 1
2 a c 2
3 a c 2
4 b c 2
5 b c 2
NB: .mask is the mirror image of .where: it replaces the rows where the series is True, whereas .where keeps them. The following gives the same values, only with the columns in swapped_df's order:
>>> swapped_df.where(swap, df)
Stim2 Stim1 Answer
0 b a 1
1 b a 1
2 c a 2
3 c a 2
4 c b 2
5 c b 2
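As an alternative, the swap and the flip can both be driven by the same boolean series using .loc assignment. This is only a sketch on the question's sample data; it assumes Answer only ever holds 1 or 2, so that 3 - Answer flips it:

```python
import pandas as pd

df = pd.DataFrame({
    'Stim1': ['a', 'b', 'a', 'c', 'b', 'c'],
    'Stim2': ['b', 'a', 'c', 'a', 'c', 'b'],
    'Answer': [1, 2, 2, 1, 2, 1],
})

# Rows where the pair is not in alphabetical order
swap = df['Stim1'] > df['Stim2']

# .to_numpy() drops the column labels; without it pandas would align
# on the labels and put every value back where it came from
df.loc[swap, ['Stim1', 'Stim2']] = df.loc[swap, ['Stim2', 'Stim1']].to_numpy()

# Flip 1 <-> 2 on the same rows
df.loc[swap, 'Answer'] = 3 - df.loc[swap, 'Answer']
print(df)
```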

How do you split a column whose characters are not separated at all, in data loaded with pd.read_table?

I loaded some data with pd.read_table:
df = pd.read_table('Test_Data.txt', delim_whitespace=True, names=('A', 'B'))
and the data is:
A B
0 AAABBABAABBAAABBBBAABBBABAAABAAAAABBBABBBAAABB... True
1 AABAABABBBABAAAAABAAABBAABAABBABABBAAABABBBBAB... True
2 BAAABBBBABABABBBABBAAABAAAAAAABBBBAABABABBBAAB... True
3 BAABBABBABBAAAABABBBAAAAAAAABAAABBAAAABBAABBAA... True
4 ABBABBBABBAABAABABBAAABAAAAABABABAABBAABBBAABA... True
Each value in column A is a string of 100 letters. I want to split each string into separate columns, so that I end up with 100 single-letter columns and column B unchanged. How can I do that?
Thank you!
# for example
df = pd.DataFrame({"A": ["ABB"]*5, "B": [True]*5})
print(df)
A B
0 ABB True
1 ABB True
2 ABB True
3 ABB True
4 ABB True
# split string
df["A"] = df["A"].apply(list)
print(df)
A B
0 [A, B, B] True
1 [A, B, B] True
2 [A, B, B] True
3 [A, B, B] True
4 [A, B, B] True
# new columns' names, here is 3, you could set 100
col_names = list(range(3))
df = pd.concat([df['A'].apply(pd.Series, index=col_names), df["B"]], axis=1)
print(df)
0 1 2 B
0 A B B True
1 A B B True
2 A B B True
3 A B B True
4 A B B True
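If you would rather not hard-code the number of columns, a possible variant is to build the letter columns directly from a list of lists. This is a sketch on made-up three-letter strings, and it assumes every string in A has the same length:

```python
import pandas as pd

# Hypothetical small sample; the real column A holds 100 letters per row
df = pd.DataFrame({"A": ["ABB", "BAB", "AAB"], "B": [True, False, True]})

# One list of characters per row, expanded into one column per character
letters = pd.DataFrame(df["A"].map(list).tolist(), index=df.index)

# Put the untouched B column back next to the letter columns
result = pd.concat([letters, df["B"]], axis=1)
print(result)
```

The letter columns are named 0, 1, 2, ... automatically, so the same code should work unchanged for 100 letters.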
If you can open the file in Excel, you can use its fixed-width Text to Columns option:
Data > Text to Columns > Fixed width
Excel will let you place break lines where you want the splits to be applied.

How to delete a row and the row after it in a pandas df

I am trying to remove a row in a pandas df plus the following row. For the df below I want to remove the row when the value in Code is equal to X. But I also want to remove the subsequent row as well.
import pandas as pd
d = ({
'Code' : ['A','A','B','C','X','A','B','A'],
'Int' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
If I use this code it removes the desired row. But I can't use the same for value A as there are other rows that contain A, which are required.
df = df[df.Code != 'X']
So my intended output is:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
4 B 4
5 A 5
I need something like df = df[df.Code != 'X'] +1
Using shift
df.loc[(df.Code!='X')&(df.Code.shift()!='X'),]
Out[99]:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
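The same shift idea can be spelled out step by step. The sketch below also resets the index so that the result matches the intended output in the question; the names is_x, to_drop and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Code': ['A', 'A', 'B', 'C', 'X', 'A', 'B', 'A'],
    'Int': [0, 1, 1, 2, 3, 3, 4, 5],
})

# True on the 'X' rows
is_x = df['Code'].eq('X')
# Shifting the flags down by one marks the row after each 'X' as well
to_drop = is_x | is_x.shift(fill_value=False)

# Keep everything else and renumber the rows 0..n-1
result = df[~to_drop].reset_index(drop=True)
print(result)
```

Because it works on a boolean mask rather than on positions, this should also cope with several 'X' rows.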
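You need to find the index of the element you want to delete, and then you can simply drop at that position twice, because after the first drop the following row moves up into the same position. Note that this relies on the default integer index and on 'X' occurring only once:
>>> i = df[df.Code == 'X'].index
>>> df.drop(df.index[i], inplace=True)  # drops the 'X' row
>>> df.drop(df.index[i], inplace=True, errors='ignore')  # drops the row that followed it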
>>> df
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5

How can I convert ranked table with pandas?

Let me simplify my problem for easy explanation.
I have a pandas DataFrame table with the below format:
a b c
0 1 3 2
1 3 1 2
2 3 2 1
The numbers in each row represent the ranks of the columns.
For example, the order for the first row is {a, c, b}.
How can I convert the table above into the one below?
1 2 3
0 a c b
1 c a b
2 c b a
I googled all day long. But I couldn't find any solutions until now.
Looks like you are just mapping one value to another and renaming the columns, e.g.:
>>> df = pd.DataFrame({'a':[1,3,3], 'b':[3,1,2], 'c':[2,2,1]})
>>> df = df.applymap(lambda x: df.columns[x-1])
>>> df.columns = [1,2,3]
>>> df
1 2 3
0 a c b
1 c a b
2 c b a
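Note that applymap is deprecated in recent pandas versions (since 2.1, in favour of DataFrame.map). A version-independent sketch of the same mapping, using NumPy indexing on the column names and assuming the values are integers from 1 to the number of columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 3], 'b': [3, 1, 2], 'c': [2, 2, 1]})

# Column names as an array, so they can be indexed with a whole 2-D array at once
names = df.columns.to_numpy()

# Each value k is replaced by the name of the k-th column (1-based)
result = pd.DataFrame(names[df.to_numpy() - 1], index=df.index, columns=[1, 2, 3])
print(result)
```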

pandas merge columns to a single time series

I have a data frame with 3 boolean columns:
A B C
0 True False False
1 False True False
2 True NaN False
3 False False True
...
Only one column is True in each row, but there can be NaN values.
I would like to get a list of column names where the name is chosen based on the boolean. So for the example above:
['A', 'B', 'A', 'C']
it's a simple matrix operation, not sure how to map it to pandas...
You can use the mul operator between the dataframe and its column names. The True cells then contain the column name and the False cells are empty strings. Finally, you can just sum each row:
df.mul(df.columns).sum(axis=1)
Out[44]:
0 A
1 B
2 A
3 C
You can index columns names, i.e. df.columns, with proper indexes:
>>> import numpy as np
>>> df.columns[(df * np.arange(df.values.shape[1])).sum(axis=1)]
Index([u'A', u'B', u'A', u'C'], dtype=object)
Explanation: the expression
>>> df * np.arange(df.values.shape[1])
A B C
0 0 0 0
1 0 1 0
2 0 0 0
3 0 0 2
computes, for each True cell, the positional index of its column (and 0 elsewhere); the matrix is then summed row-wise with
>>> (df * np.arange(df.values.shape[1])).sum(axis=1)
0 0
1 1
2 0
3 2
dtype: int32
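Both answers above rely on arithmetic with the boolean values, which can be fragile when NaN is present. A sketch that sidesteps the arithmetic by comparing against True first; it assumes, as stated in the question, that each row has exactly one True (a row with no True at all would silently map to the first column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [True, False, True, False],
    'B': [False, True, np.nan, False],
    'C': [False, False, False, True],
})

# NaN == True is False, so this yields a clean boolean array
mask = df.eq(True).to_numpy()

# argmax returns the position of the first True in each row
names = df.columns[mask.argmax(axis=1)].tolist()
print(names)
```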
