Only allow one to one mapping between two columns in pandas dataframe - python

I have a two-column dataframe df in which each row is distinct, but an element in one column can map to one or more elements in the other column. I want to filter OUT those multi-mapping elements, so that in the final dataframe each element in one column maps to a unique element in the other column.
What I am doing is to groupby one column, count the duplicates, and then remove rows with counts of more than 1, and then do the same for the other column. I am wondering if there is a better, simpler way.
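For concreteness, here is a minimal sketch of what I am doing now (the column names A and B and the sample data are illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2], 'B': [4, 3, 4]})
# drop rows whose 'A' value occurs more than once...
df = df[df.groupby('A')['B'].transform('count') == 1]
# ...then repeat the same filter for column 'B'
df = df[df.groupby('B')['A'].transform('count') == 1]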
Thanks
edit1: I just realized my solution is INCORRECT: removing multi-mapping elements in column A changes which elements are multi-mapping in column B. Consider the following example:
A  B
1  4
1  3
2  4
In column A, 1 maps to both 3 and 4, so the first two rows should be removed; in column B, 4 maps to both 1 and 2, so the third row should be removed as well. The final table should be empty. However, my solution keeps the last row.
Can anyone provide a fast and simple solution? Thanks

Well, you could do something like the following:
>>> df
   A  B
0  1  4
1  1  3
2  2  4
3  3  5
You only want to keep a row if no other row has that row's value of 'A' and no other row has that row's value of 'B'. Only the row at index 3 meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone, on=['A','B'], how='inner')
   A  B
0  3  5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
   A  B
2  2  4
3  3  5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
   A  B
1  1  3
3  3  5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two then leaves only the rows that meet both conditions:
>>> Aone.merge(Bone, on=['A','B'], how='inner')
Note, you could also do a similar thing using groupby/transform. But transform tends to be slow, so I didn't include it as an alternative.
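For reference, a minimal sketch of that transform-based alternative (same logic, not benchmarked):
>>> # keep rows whose 'A' value and 'B' value each occur exactly once
>>> mask = ((df.groupby('A')['B'].transform('size') == 1) &
...         (df.groupby('B')['A'].transform('size') == 1))
>>> df[mask]
   A  B
3  3  5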

Related

How to add string values of columns with a specific condition in a new column

So I have a dataframe with a couple of columns and a lot of rows.
Now I want to create a new column (C) that joins the string values of another column (A) together whenever a third column (B) is identical.
So, in the end, each 'group' (rows that are identical in B) should have a string in C that is different from the other groups'.
A       B   New Column C
First   1   First_Third
Second  22  Second_Fourth
Third   1   First_Third
Fourth  22  Second_Fourth
Something like this pseudo code:
for x in df['B']:
    if (x "is identical to" x "of another row"):
        df['C'] = df['C'].cat(df['A'])
How do I code an algorithm that can do this?
You can use:
df['C'] = df.groupby('B')['A'].transform('_'.join)
Or, if you want to keep only unique values:
df['C'] = df.groupby('B')['A'].transform(lambda x: '_'.join(x.unique()))
Output:
        A   B              C
0   First   1    First_Third
1  Second  22  Second_Fourth
2   Third   1    First_Third
3  Fourth  22  Second_Fourth
Try this:
df['C'] = df.groupby('B')['A'].transform(lambda x: '_'.join(x))
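Putting it together, a self-contained sketch (the dataframe construction is a reconstruction of the table above):
import pandas as pd

df = pd.DataFrame({'A': ['First', 'Second', 'Third', 'Fourth'],
                   'B': [1, 22, 1, 22]})
# concatenate the 'A' strings within each 'B' group, joined by underscores
df['C'] = df.groupby('B')['A'].transform('_'.join)
print(df)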

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data into multiple columns in a dataframe, based on certain values/conditions.
Here is the code to generate the input dataframe:
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65, 'Gender',
                                 '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data is a single column arranged as generated above.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns, with values arranged in this fashion.
What I would like to do is turn the values which start with non-digit characters into new columns of a dataframe. It can be a new dataframe.
I attempted to write a for loop, but it failed with an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    #str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first character is a digit: if yes, retain it as a value (e.g. 1, 2, 3, etc.), and if it is a character (e.g. Gender, Ethnicity, etc.), create a new column. But I guess this is an incorrect and lengthy approach.
For example, in the above data, the columns would be studyid, age_interview, Gender and Ethnicity.
The final output would have those as column headers, with their values listed below each one.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist()).set_index(0).T)
print(new_df.rename_axis(None, axis=1))

  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())

0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4
Then we group VARIABLE by this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)

VARIABLE
1                                 [studyid, 1]
2                          [age_interview, 65]
3                   [Gender, 1.Male, 2.Female]
4    [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe from these lists, set the first column as the index, and transpose to get the desired output.
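Broken into steps, the final reshaping looks roughly like this (a sketch assuming df1 and m from above):
groups = df1.groupby(m.cumsum()).VARIABLE.apply(list)  # one list per block
wide = pd.DataFrame(groups.values.tolist())            # each list becomes a row
wide = wide.set_index(0).T                             # first entry becomes the column name
print(wide.rename_axis(None, axis=1))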
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools

l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
     '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))

grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
     Gender studyid age_interview  Ethnicity
0    1.Male       1            65  1.Chinese
1  2.Female    None          None   2.Indian
2      None    None          None    3.Malay

Pandas DataFrame : selection of multiple elements in several columns

I have this Python Pandas DataFrame DF:
DICT = {'letter': ['A','B','C','A','B','C','A','B','C'],
        'number': [1,1,1,2,2,2,3,3,3],
        'word':   ['one','two','three','three','two','one','two','one','three']}
DF = pd.DataFrame(DICT)
Which looks like:
  letter  number   word
0      A       1    one
1      B       1    two
2      C       1  three
3      A       2  three
4      B       2    two
5      C       2    one
6      A       3    two
7      B       3    one
8      C       3  three
And I want to extract the rows
letter  number  word
A       1       one
B       2       two
C       3       three
First I tried:
DF[(DF['letter'].isin(("A","B","C"))) &
   (DF['number'].isin((1,2,3))) &
   (DF['word'].isin(('one','two','three')))]
Of course it didn't work: everything was selected.
Then I tested:
Bool = DF[['letter','number','word']].isin(("A",1,"one"))
DF[np.all(Bool, axis=1)]
Good, it works! But only for one row...
If we take the next step and give an iterable of tuples to .isin():
Bool = DF[['letter','number','word']].isin((("A",1,"one"),
                                            ("B",2,"two"),
                                            ("C",3,"three")))
Then it fails: the Boolean array is full of False...
What am I doing wrong? Is there a more elegant way to do this selection based on several columns?
(In any case, I want to avoid a for loop, because the real DataFrames I'm using are really big, so I'm looking for the fastest way to do the job.)
The idea is to create a new DataFrame with all the triples and then merge it with the original DataFrame:
L = [("A",1,"one"),
     ("B",2,"two"),
     ("C",3,"three")]

df1 = pd.DataFrame(L, columns=['letter','number','word'])
print (df1)
  letter  number   word
0      A       1    one
1      B       2    two
2      C       3  three

df = DF.merge(df1)
print (df)
  letter  number   word
0      A       1    one
1      B       2    two
2      C       3  three
Another idea is to create a list of tuples, convert it to a Series, and then compare with isin:
s = pd.Series(list(map(tuple, DF[['letter','number','word']].values.tolist())),
              index=DF.index)
df1 = DF[s.isin(L)]
print (df1)
  letter  number   word
0      A       1    one
4      B       2    two
8      C       3  three
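A further option, not from the answers above but worth sketching: build a MultiIndex from the three columns and test it against the list of tuples directly, which skips the intermediate Series of tuples (assumes DF and L as defined above; needs pandas >= 0.24 for from_frame):
idx = pd.MultiIndex.from_frame(DF[['letter','number','word']])
print (DF[idx.isin(L)])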

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that keeps the top n rows of each group of a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0
As you can see, the resultant dataframe has three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
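For comparison, a more compact (though typically slower) sketch of the same idea using groupby.apply and head, keeping the first len(g) * n rows of each group; it assumes the df and n defined above, before the set_index step:
top = df.groupby(0, group_keys=False).apply(lambda g: g.head(int(len(g) * n)))
print(top)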
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of the rows of a dataframe 8 rows long, then we would be trying to take 2.4 rows, so we will need to either round up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group which only had one row, we would still keep that row. I kept the rounding separate so that you can change it as you wish.
import math

def round_func(x, up=True):
    '''Round a float up or down to an int.'''
    if up:
        return math.ceil(x)   # ceil rather than int(x+1), which would over-count whole numbers
    else:
        return math.floor(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything else follows, and I have commented the code so that hopefully you can follow it.
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,4], 'value': [1,2,3,1,2,3,4,1,1]})
p = 0.30  # top fraction to keep; currently set to 30%

df_top = df.groupby('id').apply(                      # group by the ids
    lambda x: x.reset_index()['value'].nlargest(      # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))             # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this:
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
df_top looks like this:
   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1

pandas dataframe groupby and get arbitrary member of each group

This question is similar to this other question.
I have a pandas dataframe. I want to split it into groups, and select an arbitrary member of each group, defined elsewhere.
Example: I have a dataframe that can be divided into 6 groups of 4 observations each. I want to extract one observation per group according to:
selected = [0,3,2,3,1,3]
This is very similar to
df.groupby('groupvar').nth(n)
But, crucially, n varies for each group according to the selected list.
Thanks!
Typically, everything that you do within groupby should be group independent: within any groupby.apply(), you will only get the group itself, not the context. An alternative is to compute the positional index into the whole sample (index, below) from the within-group positions (here, selected). Note that the dataset must be sorted by the grouping level for the following to work.
I use the frame test below, from which I want to select according to selected:
In [231]: test
Out[231]:
           score
  name
0 A    -0.208392
1 A    -0.103659
2 A     1.645287
0 B     0.119709
1 B    -0.047639
2 B    -0.479155
0 C    -0.415372
1 C    -1.390416
2 C    -0.384158
3 C    -1.328278
selected = [0, 2, 1]
c = test.groupby(level=1).count()

In [242]: index = c.shift(1).cumsum().add(np.array([selected]).T, fill_value=0)
In [243]: index
Out[243]:
      score
name
A         0
B         5
C         7
In [255]: test.iloc[index.values[:, 0]]
Out[255]:
        score
  name
0 A  -0.208392
2 B  -0.479155
1 C  -1.390416
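An alternative sketch that avoids the sorting requirement: label every row with its within-group position via cumcount, then compare that against the selected position for the row's group (this assumes the same test frame, with group names in the second index level, and selected = [0, 2, 1] for groups A, B, C):
pos = test.groupby(level=1).cumcount()               # within-group position of each row
wanted = pd.Series(selected, index=['A', 'B', 'C'])  # desired position per group
mask = pos == test.index.get_level_values(1).map(wanted)
test[mask]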
