pandas dataframe with list elements: split, pad - python

I have a pandas dataframe (NROWS x 1) where each row is a list, such as
y
0 [[aa, bb], 0000001]
1 [[uz, mk], 0000011]
I want to flatten the list and split into (in this case three) columns like so:
1 2 3
0 aa bb 0000001
1 uz mk 0000011
Further, different rows have unequal lengths:
y
0 [[aa, bb], 0000001]
1 [[mk], 0000011]
What I really want is to detect the maximum length over all rows and pad the shorter rows with the empty string ''. In this example,
1 2 3
0 aa bb 0000001
1 '' mk 0000011
I've toyed around with doing .values.tolist() but it doesn't do what I need.
Edit- the answers below are super neat and much appreciated. I'm editing to include a solution for a similar but simpler problem, for completeness.
Read the data, then use the trim_all_columns() function from "Strip / trim all strings of a dataframe" to make sure there is no leading or trailing whitespace
df = pd.read_csv('data.csv',sep=',',dtype=str)
df = trim_all_columns(df)
Keep categorical/nominal ID and CODE columns, remove all NA
df.dropna(subset=['dg_cd'] , inplace=True) # drop rows where dg_cd is NaN
df2 = df[['id','dg_cd']]
Turn CODE into sentences by ID keeping all repeated instances
x = df2.groupby('id').apply(lambda x: x['dg_cd'].values.tolist()).apply(pd.Series).replace(np.nan, '', regex=True)
The reason for doing all that is that the result feeds into a k-modes cluster search, https://pypi.org/project/kmodes/. NA is not an acceptable input, but empty strings '' allow rows of the same length without introducing spurious similarity. For example,
km = KModes(n_clusters=4, init='Cao', n_init=1, verbose=1)
clusters = km.fit_predict( x )
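For reference, here is a minimal sketch of just the grouping and padding step, run on a small made-up frame (the id and dg_cd values are invented for illustration and are not from the real data.csv). It collects the codes per id as lists and lets the DataFrame constructor pad the short rows, which are then filled with ''.

```python
import pandas as pd

# Toy data standing in for data.csv (values are made up for illustration).
df2 = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "dg_cd": ["a", "b", "c", "a", "a", "d"],
})

# One list of codes per id, keeping repeats and their original order.
codes = df2.groupby("id")["dg_cd"].apply(list)

# Ragged lists become a rectangular frame; missing cells are filled with ''.
x = pd.DataFrame(codes.tolist(), index=codes.index).fillna("")
print(x)
```

This avoids apply(pd.Series), which tends to be slow on large frames, but it should give the same rectangular, right-padded result.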

Setup
df = pd.DataFrame(dict(y=[
[['aa', 'bb'], '0000001'],
[['uz', 'mk'], '0000011'],
[['mk'], '0000111']
]))
df
y
0 [[aa, bb], 0000001]
1 [[uz, mk], 0000011]
2 [[mk], 0000111]
flatten
From @wim
def flatten(x):
    try:
        it = iter(x)
    except TypeError:
        yield x
        return
    if isinstance(x, str):
        yield x
        return
    for elem in it:
        yield from flatten(elem)
d = dict(zip(df.index, [dict(enumerate([*flatten(x)][::-1])) for x in df.y]))
d = pd.DataFrame.from_dict(d, 'index').fillna('')
d.iloc[:, ::-1].rename(columns=lambda x: d.shape[1] - x)
1 2 3
0 aa bb 0000001
1 uz mk 0000011
2 mk 0000111

After using the same function flatten the list
pd.DataFrame(list(map(lambda x : list(flatten(x)),df.y.tolist()))).apply(lambda x : pd.Series(sorted(x,key=pd.notna)),1)
Out[85]:
0 1 2
0 aa bb 0000001
1 uz mk 0000011
2 None mk 0000111

In case you want to have control over which side to pad the sublists from:
max_len = df['y'].apply(lambda row: len(row[0])).max()
pd.DataFrame([*df['y'].apply(lambda row: ['']*(max_len - len(row[0])) + row[0] + row[1:])])
Which, using @piRSquared's setup gives
0 1 2
0 aa bb 0000001
1 uz mk 0000011
2 mk 0000111
Or, alternatively
pd.DataFrame([*df['y'].apply(lambda row: row[0] + ['']*(max_len - len(row[0])) + row[1:])])
giving you
0 1 2
0 aa bb 0000001
1 uz mk 0000011
2 mk 0000111
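Putting the pieces from the answers above together, here is a self-contained sketch that flattens each row and pads on the left up to the longest row. The helper name flatten_and_left_pad and the simplified flatten (which only descends into lists and tuples) are assumptions made for this example, not code from any of the answers.

```python
import pandas as pd

def flatten(x):
    """Yield the leaves of arbitrarily nested lists/tuples; strings stay whole."""
    if isinstance(x, (list, tuple)):
        for elem in x:
            yield from flatten(elem)
    else:
        yield x

def flatten_and_left_pad(series, fill=""):
    """Flatten every element of `series` and left-pad rows to a common width."""
    rows = [list(flatten(x)) for x in series]
    width = max(len(r) for r in rows)
    padded = [[fill] * (width - len(r)) + r for r in rows]
    # Columns are numbered from 1 to match the layout asked for in the question.
    return pd.DataFrame(padded, index=series.index, columns=range(1, width + 1))

df = pd.DataFrame(dict(y=[
    [['aa', 'bb'], '0000001'],
    [['uz', 'mk'], '0000011'],
    [['mk'], '0000111'],
]))
result = flatten_and_left_pad(df['y'])
print(result)
```

Padding on the left keeps the trailing code in the last column for every row, which is the alignment shown in the question.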

Related

Pandas sum last n rows of df.count() results into one row

I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, select rows with Series.iloc, create a DataFrame with Series.to_frame, then add a new row and a new column filled with percentages:
s = df['category'].value_counts()
n= 3
out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3) to get the last values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%
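The approaches above can be wrapped into one small reusable function. This is only a sketch: the name summarize_counts and its n_other parameter are made up for this example, and it assumes n_other is at least 1.

```python
import pandas as pd

def summarize_counts(series, n_other=3):
    """Value counts with the last `n_other` categories merged into one row."""
    counts = series.value_counts()
    head, tail = counts.iloc[:-n_other], counts.iloc[-n_other:]
    # Append the merged tail as a single labelled entry.
    merged = pd.concat([head, pd.Series({f"Other ({len(tail)})": tail.sum()})])
    out = merged.to_frame("count")
    # Share of the total, formatted as a whole-number percentage string.
    out["pct"] = (out["count"] / out["count"].sum()).map("{:.0%}".format)
    return out

df = pd.DataFrame({"category": ['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
summary = summarize_counts(df['category'], n_other=3)
print(summary)
```

Note that the relative order of tied categories (z and x here) depends on how value_counts breaks ties, so it may differ from the listings above.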

How to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python

I asked a simpler version of this question before, and it has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
Here is a more advanced version of that question:
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge some strings if they have sub strings in common. So, in this example, the strings 'b,c','a','a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So, I can have three groups and there is no common sub strings between any two groups.
Now, I am trying to build up a similarity data frame, in which 1 means two strings have sub strings in common. Here is my code:
commonWords=1
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords means how many sub strings I hope that two strings have in common. For example, if commonWords=2, then two strings will be merged together only if there are two, or more than two sub strings in them. When commonWords=2, the group should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
from itertools import combinations, chain
from collections import Counter
#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords=2
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
#forward filling values of new columns and use factorize for groups
new = pd.factorize(a[['ACTIVITY']].assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
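Another way to look at the problem is as connected components: treat each string as a node, link two nodes when they share at least commonWords substrings, and give every component one group number. Below is a minimal union-find sketch of that idea. It is a different technique from the answer above, and the function name group_by_shared_tokens is made up for this example. It compares all pairs, so it is only meant for modestly sized inputs.

```python
def group_by_shared_tokens(strings, common_words=1, sep=","):
    """Label strings so that linked strings (>= common_words shared tokens,
    directly or through a chain of other strings) get the same group id."""
    token_sets = [set(s.split(sep)) for s in strings]
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            if len(token_sets[i] & token_sets[j]) >= common_words:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so labels follow input order.
                    parent[max(ri, rj)] = min(ri, rj)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(strings))]

activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
print(group_by_shared_tokens(activities, common_words=1))  # [0, 0, 0, 1, 2, 2]
print(group_by_shared_tokens(activities, common_words=2))  # [0, 1, 2, 3, 4, 4]
```

The returned list can be attached to the frame with a[['ACTIVITY']].assign(group=...).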

IP Address Duplicate connections cleanup (Dataframes) [duplicate]

from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.
The algorithm must be fast, so it is recommended to use numpy. Converting to python object is not allowed.
You can sort the values, then groupby:
a= np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:
a= np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:,0], two=a[:,1]) # one and two can be changed
.drop_duplicates(['one','two']) # taken from above
.reindex(df.columns, axis=1)
)
One way is using np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
frozenset
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I will do
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
From pandas and numpy tri
s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()
Here's one NumPy based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
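The sum and absolute-difference trick can be sketched as follows. The function name drop_symmetric_pairs is made up for this example, and the code assumes two integer columns. Instead of mapping the keys back to [v, w], it uses the positions of the first occurrences to select the original rows, so the first member of each symmetric pair is kept.

```python
import numpy as np
import pandas as pd

def drop_symmetric_pairs(df, cols=("c1", "c2")):
    """Keep the first row of every unordered pair {c1, c2} (integer columns)."""
    v = df[cols[0]].to_numpy()
    w = df[cols[1]].to_numpy()
    # (v, w) and (w, v) give the same key, and distinct unordered pairs
    # give distinct keys, because v and w can be recovered from (x, y).
    key = np.column_stack([v + w, np.abs(v - w)])
    # return_index yields the position of the first occurrence of each key.
    _, first = np.unique(key, axis=0, return_index=True)
    return df.iloc[np.sort(first)]

pairs = pd.DataFrame({"c1": [0, 1, 1, 2, 0], "c2": [1, 0, 1, 0, 2]})
deduped = drop_symmetric_pairs(pairs)
print(deduped)
```

Sorting the selected positions restores the original row order, and the original index is preserved.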

Count Re-occurrence of a value in python

I have a data set which contains something like this:
SNo Cookie
1 A
2 A
3 A
4 B
5 C
6 D
7 A
8 B
9 D
10 E
11 D
12 A
So let's say we have 5 cookies 'A,B,C,D,E'. Now I want to count how often a cookie reoccurs after a different cookie has been encountered. For example, in the data above, cookie A was encountered again at the 7th position and then at the 12th position. NOTE: we wouldn't count A at the 2nd position because it immediately follows another A, but before positions 7 and 12 we had seen other cookies, hence we count those instances. So essentially I want something like this:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
Can anyone give me logic or python code behind this?
One way to do this would be to first get rid of consecutive Cookies, then find where the Cookie has been seen before using duplicated, and finally groupby cookie and get the sum:
no_doubles = df[df.Cookie != df.Cookie.shift()]
no_doubles['dups'] = no_doubles.Cookie.duplicated()
no_doubles.groupby('Cookie').dups.sum()
This gives you:
Cookie
A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
Name: dups, dtype: float64
Start by removing consecutive duplicates, then count the survivors:
no_dups = df[df.Cookie != df.Cookie.shift()] # Borrowed from @sacul
no_dups.groupby('Cookie').count() - 1
# SNo
#Cookie
#A 2
#B 1
#C 0
#D 2
#E 0
pandas.factorize and numpy.bincount
If immediately repeated values are not counted then remove them.
Do a normal counting of values on what's left.
However, that is one more than what is asked for, so subtract one.
factorize
Filter out immediate repeats
bincount
Produce pandas.Series
i, r = pd.factorize(df.Cookie)
mask = np.append(True, i[:-1] != i[1:])
cnts = np.bincount(i[mask]) - 1
pd.Series(cnts, r)
A 2
B 1
C 0
D 2
E 0
dtype: int64
pandas.value_counts
zip cookies with its lagged self, pulling out non repeats
c = df.Cookie.tolist()
pd.value_counts([a for a, b in zip(c, [None] + c) if a != b]).sort_index() - 1
A 2
B 1
C 0
D 2
E 0
dtype: int64
defaultdict
from collections import defaultdict
def count(s):
    d = defaultdict(lambda:-1)
    x = None
    for y in s:
        d[y] += y != x
        x = y
    return pd.Series(d)
count(df.Cookie)
A 2
B 1
C 0
D 2
E 0
dtype: int64
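The same logic (collapse immediate repeats, count what is left, subtract one) can also be written with only the standard library. This is a sketch that works on a plain list; the function name count_reoccurrences is made up for this example.

```python
from collections import Counter
from itertools import groupby

def count_reoccurrences(cookies):
    """Count how often each value comes back after a different value."""
    # groupby collapses each run of identical consecutive values into one key.
    runs = [key for key, _ in groupby(cookies)]
    # The first run of a value is not a re-occurrence, hence the -1.
    return {cookie: n - 1 for cookie, n in Counter(runs).items()}

cookies = list("AAABCDABDEDA")
print(count_reoccurrences(cookies))
```

With a DataFrame, passing df.Cookie.tolist() should give the same counts as the pandas answers above.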

protein sequence coding

I'm working on a Python program to compute a numerical coding of the mutated residues and positions in a set of strings (protein sequences), stored in a FASTA-format file in which the sequences are separated by commas. I'm trying to find which positions and residues are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another set of fasta file) will explain the algorithm behind this. In this figure first box represents alignment of input file sequences. The last box represents the output file. How can I do this with my fasta file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT
positions           1 2 3 4 5 6
protein sequence1   M T A Q D D
protein sequence2   M T A Q D D
protein sequence3   M T S Q E D
protein sequence4   M T A Q D D
protein sequence5   M K A Q H D

DISCARD NON-VARIABLE REGION
positions           2 3 5
protein sequence1   T A D
protein sequence2   T A D
protein sequence3   T S E
protein sequence4   T A D
protein sequence5   K A H

MUTATED RESIDUE IS SPLIT INTO SEPARATE COLUMNS
positions           2 2 3 3 5 5 5
protein sequence1   T   A   D
protein sequence2   T   A   D
protein sequence3   T     S   E
protein sequence4   T   A   D
protein sequence5     K A       H
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
(here I'm getting columns of mutated as well as non-mutated residues, but I want only columns for mutated residues)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print df.select(lambda x: not df[x].all(), axis = 1)
(here it gives output, but the columns are not in the order I want, i.e. first 2K, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.iloc[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows, however this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))
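For a version that matches the column naming and ordering asked for in the question (1-based positions, residues in order of first appearance), here is a minimal sketch. The function name encode_variable_positions is made up for this example, and it assumes all sequences have the same length.

```python
import pandas as pd

def encode_variable_positions(sequences):
    """One 0/1 column per (position, residue) for positions that vary."""
    df = pd.DataFrame([list(s) for s in sequences])
    df.columns = range(1, df.shape[1] + 1)  # 1-based positions
    out = {}
    for pos in df.columns:
        if df[pos].nunique() <= 1:
            continue  # every sequence has the same residue here
        # pd.unique keeps the order in which residues first appear.
        for residue in pd.unique(df[pos]):
            out[f"{pos}{residue}"] = (df[pos] == residue).astype(int)
    return pd.DataFrame(out)

sequences = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
encoded = encode_variable_positions(sequences)
print(encoded)
```

On the example input this produces the columns 2T, 2K, 3A, 3S, 5D, 5E, 5H, as in the desired output table of the question.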
