How to split a string into a matrix of characters - python

Assume that we have this DataFrame in Python:
import pandas as pd
arr = pd.DataFrame(['aabbc','aabccca','aa'])
I want to split each row into columns, one character per column. The lengths of the rows may differ.
This is the output I expect (a 3×7 matrix in this case):
1 2 3 4 5 6 7
1 a a b b c Na Na
2 a a b c c c a
3 a a Na Na Na Na Na
My matrix has 20,000 rows, and I'd prefer not to use for loops. The original data is protein sequences.
I read [1], [2], [3], etc., and they didn't help me.

Option 1
One simple way to do this is with a list comprehension.
pd.DataFrame([list(x) for x in arr[0]])
0 1 2 3 4 5 6
0 a a b b c None None
1 a a b c c c a
2 a a None None None None None
Alternatively, use apply(list), which does the same thing.
pd.DataFrame(arr[0].apply(list).tolist())
0 1 2 3 4 5 6
0 a a b b c None None
1 a a b c c c a
2 a a None None None None None
Option 2
An alternative with extractall + unstack. You'll end up with a MultiIndex of columns; you can drop the first level of the result.
v = arr[0].str.extractall(r'(\w)').unstack()
v.columns = v.columns.droplevel(0)
v
match 0 1 2 3 4 5 6
0 a a b b c NaN NaN
1 a a b c c c a
2 a a NaN NaN NaN NaN NaN
Option 3
Manipulating the underlying NumPy view -
v = arr[0].values.astype(str)
pd.DataFrame(v.view('U1').reshape(v.shape[0], -1))
0 1 2 3 4 5 6
0 a a b b c
1 a a b c c c a
2 a a
This gives you empty strings ('') instead of None in the empty cells. Use replace if you want NaNs back, as in the sketch below.
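For instance, a minimal sketch of that replace step, reusing v from above:
import numpy as np
res = pd.DataFrame(v.view('U1').reshape(v.shape[0], -1)).replace('', np.nan)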

Related

Pandas - how to count how many times a string appears in columns?

It's easier to explain starting from a simple example df:
df1:
A B C D
0 a 6 1 b/5/4
1 a 6 1 a/1/6
2 c 9 3 9/c/3
There are four columns in df1 (A, B, C, D). The task is to find out how many times column D's strings appear in columns A, B, and C (3 columns). Here is the expected output with more explanation:
df2(expect output):
A B C D E (New column)
0 a 6 1 b/5/4 0 <-- Found 0 of column D's strings in columns A, B, C
1 a 6 1 a/1/6 3 <-- Found a, 1 & 6, so it should return 3
2 c 9 3 9/c/3 3 <-- Found all strings (3 in total)
Does anyone have a good idea for this? Thanks!
You can use a list comprehension with set operations:
df['E'] = [len(set(l).intersection(s.split('/'))) for l, s in
           zip(df.drop(columns='D').astype(str).to_numpy().tolist(),
               df['D'])]
Output:
A B C D E
0 a 6 1 b/5/4 0
1 a 6 1 a/1/6 3
2 c 9 3 9/c/3 3
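To see what the comprehension pairs up, here is a quick look at the intermediates for the first row:
rows = df.drop(columns='D').astype(str).to_numpy().tolist()
print(rows[0])          # ['a', '6', '1'] -- row values of A, B, C as strings
print(df['D'].iloc[0])  # 'b/5/4'
print(set(rows[0]).intersection('b/5/4'.split('/')))  # set() -> len 0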
A plain-loop version of the same idea:
import pandas as pd

dt = {'A': ['a', 'a', 'c'], 'B': [6, 6, 9], 'C': [1, 1, 3],
      'D': ['b/5/4', 'a/1/6', 'c/9/3']}
nu_data = pd.DataFrame(data=dt)
E = []
for itxid, itx in enumerate(nu_data['D']):
    match = 0
    str_list = itx.split('/')
    for keyid, keys in enumerate(dt):
        if keyid < len(dt) - 1:  # skip the last key, 'D', itself
            for seg_str in str_list:
                if str(dt[keys][itxid]) == seg_str:
                    match += 1
    E.append(match)
nu_data['E'] = E
print(nu_data)

What does the function np.isreal do in a dataframe?

Can anyone explain the below code?
pima_df[~pima_df.applymap(np.isreal).all(1)]
pima_df is a dataframe.
You are extracting the rows in which at least one non-real value occurs (for example, a complex number).
e.g. pima_df =
a b
0 1 2
1 2 4+3j
2 3 5
result would be :
a b
1 2 (4+3j)
In short:
applymap - applies a function to each element of the DataFrame.
np.isreal - returns True for real values, otherwise False.
all - returns True if every element along an axis is True, otherwise False.
~ - negates the boolean index.
The sketch below walks through each of these steps.
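A minimal runnable sketch of the whole chain, using a hypothetical toy frame in place of pima_df:
import numpy as np
import pandas as pd

pima_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4 + 3j, 5]})
mask = pima_df.applymap(np.isreal)  # elementwise: is each cell real?
print(mask.all(1))                  # per row: are ALL cells real?
print(pima_df[~mask.all(1)])        # rows with at least one non-real cell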
Please look at the doc or help(np.isreal).
Returns a bool array, where True if input element is real.
If element has complex type with zero complex part, the return value
for that element is True.
To be precise, NumPy provides a set of methods for comparing and performing operations on arrays elementwise:
np.isreal : determines whether each element of an array is real.
np.all : determines whether all elements of an array evaluate to True.
tilde (~) : used for boolean indexing; it means "not".
applymap : works element-wise on a DataFrame.
all() : used to find rows where all the values are True.
The ~ is the operator equivalent of the __invert__ dunder, which has been overridden explicitly for the purpose of performing vectorized logical inversions on pd.DataFrame/pd.Series objects.
Example of boolean index (~):
>>> df
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
>>> df.query('a in b')
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
And its negation with ~:
>>> df[~df.a.isin(df.b)] # the complement of the above
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
Hope this helps.

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows so that any Type with a groupby(['Type']) count below 3 is brought up to 3? df is the input and df1 is my desired outcome. You can see that row 3 of df was duplicated twice at the end. This is only an example deck; the real data has approximately 20 million lines and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append requires pandas>=0.23.0; remove it if you're on an older version. In pandas 2.0, DataFrame.append itself was removed; a pd.concat sketch follows.
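For newer pandas, a minimal sketch of the same append step with pd.concat, starting from the frame that still has the repeat_num column:
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type', 'Val']]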
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)

Selecting specific columns with specific values pandas

So I have a data frame of 30 columns and I want to filter it for values found in 10 of those columns and return all the rows that match. In the example below, I want to search for values equal to 1 in all df columns that end with "good..."
df[df[[i for i in df.columns if i.endswith('good')]].isin([1])]
df[df[[i for i in df.columns if i.endswith('good')]] == 1]
Both of these find those columns, but everything that does not match appears as NaN. My question is: how can I query specific columns for specific values and keep only the matching rows, rather than getting NaN everywhere else?
You can filter the columns first with str.endswith, select them with [], and compare with eq. Finally, add any to require at least one 1 per row:
cols = df.columns[df.columns.str.endswith('good')]
df1 = df[df[cols].eq(1).any(axis=1)]
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [1, 1, 4, 5, 5, 1],
                   'C good': [7, 8, 9, 4, 2, 3],
                   'D good': [1, 3, 5, 7, 1, 0],
                   'E good': [5, 3, 6, 9, 2, 1],
                   'F': list('aaabbb')})
print (df)
A B C good D good E good F
0 a 1 7 1 5 a
1 b 1 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 1 3 0 1 b
cols = df.columns[df.columns.str.endswith('good')]
print (df[cols].eq(1))
C good D good E good
0 False True False
1 False False False
2 False False False
3 False False False
4 False True False
5 False False True
df1 = df[df[cols].eq(1).any(1)]
print (df1)
A B C good D good E good F
0 a 1 7 1 5 a
4 e 5 2 1 2 b
5 f 1 3 0 1 b
Your solution was really close; just add any:
df1 = df[df[[i for i in df.columns if i.endswith('good')]].isin([1]).any(axis=1)]
print (df1)
A B C good D good E good F
0 a 1 7 1 5 a
4 e 5 2 1 2 b
5 f 1 3 0 1 b
EDIT:
If you need only the 1s and want to drop all other rows and columns:
df1 = df.loc[:, df.columns.str.endswith('good')]
df2 = df1.loc[df1.eq(1).any(1), df1.eq(1).any(0)]
print (df2)
D good E good
0 1 5
4 1 2
5 0 1
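As an aside, the column selection can also be written with df.filter, which matches column labels against a regex; a small sketch:
good = df.filter(regex='good$')  # columns whose labels end with 'good'
df1 = df[good.eq(1).any(axis=1)]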

Using a column with a boolean to access other columns

I have a pandas dataframe like the following:
A B C
1 2 1
3 4 0
5 2 0
5 3 1
And I would like to get the value from A if the value of C is 1, and the value from B if C is zero. How would I do this? Ultimately I'd like to end up with a vector containing the values of A where C is 1 and of B where C is 0, which would be [1, 4, 2, 5].
Assuming you mean "from A if the value of C is 1 and from B if the value of C is 0", which matches your intended output, I might use Series.where:
>>> df
A B C
0 1 2 1
1 3 4 0
2 5 2 0
3 5 3 1
>>> df.A.where(df.C, df.B)
0 1
1 4
2 2
3 5
dtype: int64
which is read "make a series using the values of A where the corresponding value of C is true, otherwise use the corresponding value of B". Here, since 1 is truthy, we can just use df.C, but we could use df.C == 1 or df.C*5+3 < 4 or any other boolean Series.
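For reference, an equivalent one-liner with np.where, which returns a plain NumPy array rather than a Series:
import numpy as np
np.where(df.C == 1, df.A, df.B)  # -> array([1, 4, 2, 5])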
