Pandas comparison - python

I'm trying to simplify my Python/pandas syntax for a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a_score >= b_score, doc_type: a
b_score > a_score, doc_type: b
I'm struggling with how to handle the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I needed to add 2 additional comparisons, which at scale may be inefficient since I already compare the data earlier. I'm looking for a way to improve it.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)

# Replace empty strings (after stripping whitespace) with NaN
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)).replace('', np.nan)

m_score = df['a_score'] >= df['b_score']
m_doc = df['a_id'].isnull() & df['b_id'].isnull()

# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)

# Select type based on higher score
df['doc_type'] = np.where(m_score, 'a', np.where(m_doc, np.nan, 'b'))

# Additional lines I'm looking to improve:
df.loc[df['a_id'].isnull() & df['b_id'].notnull(), 'doc_type'] = 'b'
df.loc[df['a_id'].notnull() & df['b_id'].isnull(), 'doc_type'] = 'a'
print(df)

Use numpy.where, assuming your logic is:
Both exist: the doc_type is the one with the higher score;
One missing: the doc_type is the one that is not null;
Both missing: the doc_type is null.
The last line adds handling for that extra edge case:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
np.where(df.a_id.isnull(), None, 'a'), 'b')
df
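For the sample frame above, working through these conditions by hand gives:
print(df['doc_type'].tolist())
# ['a', 'a', 'b', 'b', 'b', 'a', 'a']
Row 4 (a missing, b present) gets 'b' and row 6 (b missing, a present) gets 'a', which is exactly the case the question was struggling with.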

Not sure I fully understand all the conditions or whether there are any particular edge cases, but I think you can just take np.argmax over the score columns and swap the 0/1 values for 'a'/'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
  a_id a_score b_id b_score doc_type
0    A       1    a    0.10        a
1    B       2    b    0.20        a
2    C       3    c    3.10        b
3    D       4    d    4.10        b
4            2    e    5.00        b
5    F       6    f    5.99        a
6    G       7         NaN         a
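A related idea, if the empty strings are replaced with NaN first: DataFrame.idxmax skips NaN and breaks ties in favour of the first column, which matches the a >= b rule. A sketch (not from the original answer; it assumes every row has at least one score):
import numpy as np

df2 = df.replace('', np.nan)
df2['doc_type'] = (
    df2[['a_score', 'b_score']]
    .astype(float)      # object -> float so NaN comparisons are well defined
    .idxmax(axis=1)     # column name of the row-wise max, skipping NaN
    .str[0]             # 'a_score' -> 'a', 'b_score' -> 'b'
)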

Use the apply method in pandas with a custom function; trying it out on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
'a_score': [1, 2, 3, 4, '', 6, 7],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if you need to.
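If the row-wise apply becomes a bottleneck at scale, the same decision table can be vectorized with np.select; a sketch under the same assumption (empty strings already replaced with NaN):
conditions = [
    df['b_score'].isnull() & df['a_score'].notnull(),  # only a exists -> 'a'
    df['a_score'].isnull() & df['b_score'].notnull(),  # only b exists -> 'b'
    df['a_score'] >= df['b_score'],                    # both exist, a wins -> 'a'
]
df['doc_type'] = np.select(conditions, ['a', 'b', 'a'], default='b')
# rows where both scores are missing get no label at all
df.loc[df['a_score'].isnull() & df['b_score'].isnull(), 'doc_type'] = np.nan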

Related

Count sequence within a column in pandas

I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that assigns a number to each group of projects per name. The desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa','ab','bc' in Project occurs in the first rows, so I add 1 to the new column. 'aa','ab' is the second project set from the beginning; it occurs for Names 'a' and 'b', so I add 2 to the new column for both. 'ca','cb' is the third project set and occurs only for Name 'd', so I add 3 only for Name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for any help!
This looks like a job for networkx, since Name and Project are related; you can use connected components:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
  Name Project  col2  new
0    a      aa     3    1
1    a      ab     4    1
2    b      bb     6    2
3    b      bc     6    2
4    c      aa     6    1
5    c      ab     6    1
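If the goal is only the numbering shown in the desired output (each distinct set of Projects, numbered by order of first appearance), a plain-pandas sketch without networkx reproduces it as well:
# fingerprint each Name by its sorted Projects, then number the
# fingerprints in order of first appearance
key = df.groupby('Name')['Project'].transform(lambda s: ','.join(sorted(s)))
df['New_column'] = pd.factorize(key)[0] + 1
# gives 1,1,1,2,2,2,2,3,3 for the question's frame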

How to string-join one column with another column - pandas

I just came across this question: how do I use each value in one column to str.join the value in the other? Here is my DataFrame:
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
a b
0 a hello
1 b good
2 c great
3 d nice
I would like the a column to join the values in the b column, so my desired output is:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
How would I go about that?
Hopefully you can see the correlation; here is what the first row would look like in plain Python:
>>> 'a'.join('hello')
'haealalao'
>>>
Just like in the desired output.
I think it might be useful to know how two columns can interact. join might not be the best example, but there are other functions you could use the same way: for instance, split on the values of the other column, or replace characters in one column using the other, as sketched below.
P.S. I have a self-answer below.
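To illustrate that broader point, here is a hypothetical pairwise use of split and replace across the two columns (the new column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
# split each b value on the corresponding a value
df['b_split'] = [b.split(a) for a, b in zip(df['a'], df['b'])]
# replace the a character inside each b value
df['b_marked'] = [b.replace(a, '*') for a, b in zip(df['a'], df['b'])]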
TL;DR
The below code is the fastest answer I could figure out from this question:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
The above code first creates an iterator over the a column; next then fetches the following value on each call, and the list comprehension joins the two strings.
Long answer:
Going to show my solutions:
Solution 1:
Use a list comprehension with an iterator:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)
Solution 2:
Group by the index, then apply and str.join the two columns' values:
df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)
Solution 3:
Use a list comprehension that iterates through both columns and str.joins:
df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
These codes all output:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
Timing:
Now it's time to move on to timing, using the timeit module; here is the code we use:
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))
Output:
Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698
So the first solution, using an iterator, is the quickest.
I tried achieving the output using df.apply
>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0 haealalao
1 gbobobd
2 gcrcecact
3 ndidcde
dtype: object
Timing it for performance comparison,
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))
Note that I am reinitializing df before every call so that all the functions process the same dataframe. It can also be done by passing the df as a parameter to the function, as sketched below.
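For completeness, that parameter-passing variant could look like this (a sketch; df.copy() hands each timed call a fresh frame):
def u11_1(frame):
    it = iter(frame['a'])
    frame['b'] = [next(it).join(i) for i in frame['b']]

print('Solution 1:', timeit(lambda: u11_1(df.copy()), number=5))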
Here's another solution using zip and a list comprehension. Should be better than df.apply:
In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]
In [1578]: df
Out[1578]:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde

How can I stack rows that share any column value in common?

I'm not sure the wording of the title is optimal, because the problem I have is a little tricky to explain. In code, I have a df that looks something like this:
import pandas as pd
import numpy as np
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame([a, b]).T
df
Yields
    0   1
0   A   3
1   A   1
2   A   2
3   B   3
4   B  12
5   B   4
6   C   7
7   C   8
8   D   3
9   E  10
10  E  12
I'm aware of groupby methods for grouping by values in a column, but that's not exactly what I want. I want to go a step past that: whenever groups from column 0 share any value in column 1, those groups should be merged. My wording is terrible (which is probably why I'm having trouble putting this into code), but here's basically what I want as output:
          0   1
0   A-B-D-E   3
1   A-B-D-E   1
2   A-B-D-E   2
3   A-B-D-E   3
4   A-B-D-E  12
5   A-B-D-E   4
6         C   7
7         C   8
8   A-B-D-E   3
9   A-B-D-E  10
10  A-B-D-E  12
Basically, A, B, and D all share the value 3 in column 1, so their labels get grouped together in column 0. Now, because B and E share value 12 in column 1, and B shares the value 3 in column 1 with A and D, E gets grouped in with A, B, and D as well. The only value in column 0 that remained independent is C, because it has no intersections with any other group.
In my head this ends up being a recursive loop, but I can't seem to figure out the exact logic. Any help would be appreciated.
If anyone in the future is experiencing the same thing, this works (it's probably not the best solution in the world, though):
import pandas as pd
import numpy as np

a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = ['3', '1', '2', '3', '12', '4', '7', '8', '3', '10', '12']
df = pd.DataFrame([a, b]).T
df.columns = 'a', 'b'
df2 = df.copy()

def flatten(container):
    for i in container:
        if isinstance(i, (list, tuple)):
            for j in flatten(i):
                yield j
        else:
            yield i

bad = True
i = 1
while bad:
    print("Round " + str(i))
    i = i + 1
    len_checker = []
    for variant in list(set(df.a)):
        eGenes = list(set(df.loc[df.a == variant, 'b']))
        inter_variants = []
        for gene in eGenes:
            inter_variants.append(list(set(df.loc[df.b == gene, 'a'])))
        if type(inter_variants[0]) is not str:
            inter_variants = [x for x in flatten(inter_variants)]
        inter_variants = list(set(inter_variants))
        len_checker.append(inter_variants)
        if len(inter_variants) != 1:
            df2.loc[df2.a.isin(inter_variants), 'a'] = '-'.join(inter_variants)
    good_checker = max([len(x) for x in len_checker])
    df['a'] = df2.a
    if good_checker == 1:
        bad = False

df.a = df.a.apply(lambda x: '-'.join(list(set(x.split('-')))))
df.drop_duplicates(inplace=True)
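For reference, the transitive merging described in the question is exactly connected components on a graph linking column values, so networkx can do the heavy lifting. A sketch (the v_ prefix only keeps value nodes from colliding with label nodes):
import networkx as nx
import pandas as pd

a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame({'label': a, 'value': b})

# edges between each label and its (prefixed) values; connected
# components are then the merged groups
G = nx.from_pandas_edgelist(df.assign(value='v_' + df['value'].astype(str)),
                            'label', 'value')
mapping = {}
for comp in nx.connected_components(G):
    labels = sorted(n for n in comp if not str(n).startswith('v_'))
    mapping.update({n: '-'.join(labels) for n in labels})
df['label'] = df['label'].map(mapping)
print(df)  # labels come out as 'A-B-D-E' and 'C', matching the desired output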
The following creates the output you want, without recursion. I have not tested it with other constellations, though (different orderings, more combinations, etc.).
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = list(zip(a, b))
print(df)

class Bucket:
    def __init__(self, keys, values):
        self.keys = set(keys)
        self.values = set(values)

    def contains_key(self, key):
        return key in self.keys

    def add_if_contained(self, key, value):
        if value in self.values:
            self.keys.add(key)
            return True
        elif key in self.keys:
            self.values.add(value)
            return True
        return False

    def merge(self, bucket):
        self.keys.update(bucket.keys)
        self.values.update(bucket.values)

    def __str__(self):
        return f'{self.keys} :: {self.values}>'

    def __repr__(self):
        return str(self)

res = []
for tup in df:
    added = False
    if res:
        selected_bucket = None
        remove_idx = None
        for idx, bucket in enumerate(res):
            if not added:
                added = bucket.add_if_contained(tup[0], tup[1])
                selected_bucket = bucket
            elif bucket.contains_key(tup[0]):
                selected_bucket.merge(bucket)
                remove_idx = idx
        if remove_idx is not None:
            res.pop(remove_idx)
    if not added:
        res.append(Bucket({tup[0]}, {tup[1]}))

print(res)
Generates the following output:
$ python test.py
[('A', 3), ('A', 1), ('A', 2), ('B', 3), ('B', 12), ('B', 4), ('C', 7), ('C', 8), ('D', 3), ('E', 10), ('E', 12)]
[{'B', 'D', 'A', 'E'} :: {1, 2, 3, 4, 10, 12}>, {'C'} :: {8, 7}>]
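A possible follow-up to turn the buckets back into a frame shaped like the desired output (a sketch; labels is a hypothetical helper dict):
import pandas as pd

# map every key to its bucket's joined label, then relabel column 0
labels = {key: '-'.join(sorted(bucket.keys)) for bucket in res for key in bucket.keys}
out = pd.DataFrame(df, columns=[0, 1])  # df is still the list of tuples here
out[0] = out[0].map(labels)
print(out)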

Lowercase columns by name using dataframe method

I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name, to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.

Creating a new column based on selecting by multiple conditions between two pandas dataframes

I have two dataframes that contain (some) common columns (A,B,C), but are ordered differently and have different values for C.
I'd like to replace the 'C' values in first dataframe with those from the second.
I can create a toy example like this:
A = [ 1, 1, 1, 2, 2, 2, 3, 3, 3 ]
B = [ 'x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x' ]
C = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i' ]
df1 = pd.DataFrame({'A': A, 'B': B, 'C': C})
A.reverse()
B.reverse()
C = [ c.upper() for c in reversed(C) ]
df2 = pd.DataFrame({'A': A, 'B': B, 'C': C})
I'd like to update df1 so that it looks like this - i.e. it has the 'C' values from df2:
A = [ 1, 1, 1, 2, 2, 2, 3, 3, 3 ]
B = [ 'x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x' ]
C = [ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' ]
I've tried:
df1['C'] = df2[ (df2['A'] == df1['A']) & (df2['B'] == df1['B']) ]['C']
but that doesn't work because, I think, the orderings of A and B differ between the two frames.
Merge on the shared columns and take C from df2:
merge_df = pd.merge(df1, df2, on=['A', 'B'])
df1['C'] = merge_df['C_y']
I think your toy code has a problem in [c.upper() for c in C.reverse()]: list.reverse() reverses in place and returns None.
It is not easy, because of the duplicates in columns A and B (the pair (3, 'x') appears three times).
So I create a new helper column D with cumcount, then
merge on it, and finally remove the unnecessary columns. df1 counts duplicates from the top and df2 from the bottom, because df2's rows are in reverse order, so the k-th duplicate in df1 lines up with the k-th-from-last in df2:
df1['D'] = df1.groupby(['A','B']).C.cumcount()
df2['D'] = df2.groupby(['A','B']).C.cumcount(ascending=False)
df3 = pd.merge(df1, df2, on=['A','B','D'], how='right', suffixes=('_',''))
df3 = df3.drop(['C_', 'D'], axis=1)
print (df3)
   A  B  C
0  1  x  A
1  1  y  B
2  1  z  C
3  2  x  D
4  2  y  E
5  2  y  F
6  3  x  G
7  3  x  H
8  3  x  I
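The same cumcount alignment could also be phrased with DataFrame.update, which overwrites df1's C in place after aligning on the key columns; a sketch (it assumes the (A, B, D) triples are unique, as they are here):
df1['D'] = df1.groupby(['A', 'B']).C.cumcount()
df2['D'] = df2.groupby(['A', 'B']).C.cumcount(ascending=False)
df1 = df1.set_index(['A', 'B', 'D'])
df1.update(df2.set_index(['A', 'B', 'D']))  # pull C values over from df2
df1 = df1.reset_index().drop('D', axis=1)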
