I need to find the count of letters in each column as follows:
String: ATCG
TGCA
AAGC
GCAT
string is a pandas Series.
I need to write a program to get the following:
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
I have written the following code, but I am getting an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column) filled with NaN values. I should not be getting either the row or column 451; I need only 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
arr = [[i.count(x) for i in zip(*lst)] for x in 'ATCG']
res = pd.DataFrame(arr, index=list('ATCG'))
Result
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
Explanation
In the list comprehension, the inner loop handles the columns: zip(*lst) yields the first, second, third and fourth characters of each string in turn.
The outer loop handles the rows by iterating through 'ATCG' sequentially.
This produces a list of lists which can be fed directly into pd.DataFrame.
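For instance, the intermediates look like this on the sample data (a quick illustration):
>>> lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
>>> list(zip(*lst))
[('A', 'T', 'A', 'G'), ('T', 'G', 'A', 'C'), ('C', 'C', 'G', 'A'), ('G', 'A', 'C', 'T')]
>>> [[i.count(x) for i in zip(*lst)] for x in 'ATCG']
[[2, 1, 1, 1], [1, 1, 0, 1], [0, 1, 2, 1], [1, 1, 1, 1]]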
With Series.value_counts():
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
... .apply(lambda row: row.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
0 1 2 3
A 2 1 1 1
C 0 1 2 1
G 1 1 1 1
T 1 1 0 1
I'm not sure how you want the index ordered, but you could call .reindex() or .sort_index() on this result.
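For example, to restore the original ATCG order (a quick sketch; out is just a hypothetical name for the result above):
>>> out = s.str.join('|').str.split('|', expand=True)\
...          .apply(lambda row: row.value_counts(), axis=0)\
...          .fillna(0).astype(int)
>>> out.reindex(list('ATCG'))
   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1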
The first line, s.str.join('|').str.split('|', expand=True) gets you an "expanded" version
0 1 2 3
0 A T C G
1 T G C A
2 A A G C
3 G C A T
which should be faster than calling pd.Series(list(x)) on each row; for reference, that per-row version is sketched below.
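The per-row version would be something like this sketch; it builds the same expanded frame but constructs a Python list and a new Series for every row:
>>> s.apply(lambda x: pd.Series(list(x)))
   0  1  2  3
0  A  T  C  G
1  T  G  C  A
2  A  A  G  C
3  G  C  A  T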
Related
Let's assume I have the following data frame.
Id Combinations
1 (A,B)
2 (C,)
3 (A,D)
4 (D,E,F)
5 (F)
I would like to keep only the rows whose Combinations value contains more than one element, and count the number of occurrences of each element across the whole Combinations column. For example, Id 2 and 5 should be removed since their tuples contain only one value.
The result I am looking for is:
ID Combination Frequency
1 A 2
1 B 1
3 A 2
3 D 2
4 D 2
4 E 1
4 F 2
Can anyone help to get the above result in Python pandas?
First, if necessary, convert the values to lists:
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
If you need the counts after filtering out the single-element rows, filter with Series.str.len in a boolean index, then use DataFrame.explode and count the values with Series.map and Series.value_counts:
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print(df1)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 1
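As a side note, the map/value_counts step can equivalently be written as a groupby transform, which avoids building the counts Series separately (an alternative sketch with the same output):
df1['Frequency'] = df1.groupby('Combinations')['Combinations'].transform('size')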
Or, if you need the counts before removing the single-element rows, filter them with Series.duplicated as the last step:
df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())
df2 = df2[df2['Id'].duplicated(keep=False)]
Alternative:
df2 = df2[df2.groupby('Id').Id.transform('size') > 1]
Or:
df2 = df2[df2['Id'].map(df2['Id'].value_counts()) > 1]
print(df2)
Id Combinations Frequency
0 1 A 2
0 1 B 1
2 3 A 2
2 3 D 2
3 4 D 2
3 4 E 1
3 4 F 2
Sample and expected data: block one is the current data and block two is the expected output. That is, when I encounter a 1, I need every following row within the same country to be incremented by one, and the same should happen for the next country b.
First replace every value after the first 1 within each group with 1, so that GroupBy.cumsum can be used:
df = pd.DataFrame({'c':['a']*3 + ['b']*3+ ['c']*3, 'v':[1,0,0,0,1,0,0,0,1]})
s = df.groupby('c')['v'].cumsum()
df['new'] = s.where(s.eq(0), 1).groupby(df['c']).cumsum()
print(df)
c v new
0 a 1 1
1 a 0 2
2 a 0 3
3 b 0 0
4 b 1 1
5 b 0 2
6 c 0 0
7 c 0 0
8 c 1 1
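To see the trick at work, the helper s is just the per-group cumulative sum of v; in this sample every group contains a single 1, so the s.where(s.eq(0), 1) clamp is a no-op, but with several 1s per group the cumulative sum would exceed 1 and the clamp caps it back (an illustration):
>>> df.groupby('c')['v'].cumsum()
0    1
1    1
2    1
3    0
4    1
5    1
6    0
7    0
8    1
Name: v, dtype: int64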
Another solution is to replace all non-1 values with missing values and forward fill the 1s per group; the leading missing values are then replaced with 0, so the cumulative sum again works perfectly:
s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
df['new'] = s.groupby(df['c']).cumsum()
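To see what the forward fill works on, the masked series before the ffill looks like this (an illustration on the sample frame); after the per-group ffill and fillna(0), group a becomes 1 1 1, group b becomes 0 1 1 and group c becomes 0 0 1, and the cumulative sum then reproduces new:
>>> df['v'].where(df['v'].eq(1))
0    1.0
1    NaN
2    NaN
3    NaN
4    1.0
5    NaN
6    NaN
7    NaN
8    1.0
Name: v, dtype: float64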
This is my csv file:
A B C D
0 1 5 5
1 0 3 0
0 0 0 0
2 1 3 4
I want to check the B column: if I find a 0 there, the whole row should be deleted, so this is the output I need:
A B C D
0 1 5 5
2 1 3 4
I tried this code:
import pandas as pd

df = pd.read_csv('Book1.csv', sep=',', error_bad_lines=False, dtype='unicode')
for index, row in df.iterrows():
    if row['B'] == 0:
        df.drop(row, index=False)
df.to_csv('hello.csv')
df.to_csv('hello.csv')
It returns:
A B C D
0 0 1 5 5
1 1 0 3 0
2 0 0 0 0
3 2 1 3 4
It did not delete anything and I don't know where the problem is.
Any help please!
You could check which rows in B are not equal to 0, and perform boolean indexing with the result:
df[df.B.ne(0)]
A B C D
0 0 1 5 5
3 2 1 3 4
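One caveat: since the file was read with dtype='unicode', column B actually holds strings, so a numeric comparison will not match anything. Either compare against the string '0' or convert the column first; a minimal sketch:
df['B'] = df['B'].astype(int)  # make B numeric before the comparison
df = df[df.B.ne(0)]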
Note that in your approach, in order to drop a given row you need to specify its index, so you should be doing something like:
for index, row in df.iterrows():
    if row['B'] == 0:
        df.drop(index, inplace=True)
df.to_csv('hello.csv')
Also, don't forget that df.drop returns a new DataFrame by default: either set inplace=True (as above) or reassign the result back to df.
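For example, the reassignment variant of the same drop, done in one shot instead of row by row (a sketch, assuming B has been converted to a numeric dtype):
df = df.drop(df[df['B'] == 0].index)
df.to_csv('hello.csv')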
In the following dataset, what's the best way to duplicate rows so that any Type whose groupby(['Type']) count is below 3 gets padded up to 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
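Before the append step, the intermediate helper column on the sample data looks like this; only the b row gets a non-zero repeat count (an illustration):
>>> df[['Type', 'Val', 'repeat_num']]
  Type  Val  repeat_num
0    a    1           0
1    a    2           0
2    a    3           0
3    b    1           2
4    c    3           0
5    c    2           0
6    c    1           0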
Note: the sort=False argument of append is only available in pandas>=0.23.0; remove it if you are on a lower version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
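Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions pd.concat is the drop-in substitute (a sketch of the same single-column logic):
repeated = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, repeated], ignore_index=True)[['Type', 'Val']]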
I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
Going by your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But this doesn't look at the 'previous value of that same column'.
If you try to write df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because column c does not exist yet at that point.
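If you do need the self-referencing behaviour of the Excel formula, where each c depends on the previous c, one straightforward option is a plain row loop. A minimal sketch (note the sample frame stores strings, so the columns are converted to integers first):
import pandas as pd

data = pd.DataFrame({'a': ['1', '1', '1', '1', '1'], 'b': ['0', '0', '1', '0', '0']})
a = data['a'].astype(int).tolist()
b = data['b'].astype(int).tolist()

# Replicate =IF(A70=A69, MIN((P70+Q69), 1), P70) row by row,
# carrying the previous value of c forward as we go.
c = []
for i in range(len(data)):
    if i > 0 and a[i] == a[i - 1]:
        c.append(min(b[i] + c[i - 1], 1))
    else:
        c.append(b[i])

data['c'] = c
print(data)  # c comes out as 0 0 1 1 1, matching the expected output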