I have a large data set that I read in with pandas, and I want to do pairwise alignments with Biopython's pairwise2.
import pandas as pd
from pandas import DataFrame
from Bio import pairwise2 #for pairwise alignments
from Bio.pairwise2 import format_alignment #for printing alignments out neatly
But here I will use a mock data set:
data = { 'sequence': ['ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC', 'GCACGGGCCCTTGGCTAC', 'GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC']}
data = DataFrame(data)
It looks like this:
Out[34]:
sequence
0 ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
1 GCACGGGCCCTTGGCTAC
2 GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
My goal is to do pairwise alignments within the 'sequence' column: the first row compared with the second, the second with the third, the third with the first, and so on for a larger data set.
My code:
for seq in data['sequence']:
    for a in pairwise2.align.globalxx(seq, seq):
        print(format_alignment(*a))  # just to print the alignment out neatly
This prints out:
ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
|||||||||||||||||||||||||||||||||||||||
ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
Score=39
GCACGGGCCCTTGGCTAC
||||||||||||||||||
GCACGGGCCCTTGGCTAC
Score=18
GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
||||||||||||||||||||||||||||||||||||||||||||||||
GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
Score=48
This is close to what I want, but it only compares the first to the first, the second to the second, and the third to the third.
So I tried this:
for seq in data['sequence']:  # for each 'sequence' column value
    for index, row in data.iterrows():  # for each row
        for a in pairwise2.align.globalxx(seq, row['sequence']):  # compare the value to each row of the column
            print(format_alignment(*a))
This gave far too many lines of output to even try posting here.
My idea was to compare each 'sequence' value with every row of the 'sequence' column, but the double loop aligns every sequence against every sequence, including itself and each pair in both orders, so it produces many more alignments than expected. I think the double loop is not the way to go here. I guess my question doesn't really have anything to do with Biopython: simply, how can I do pairwise comparisons within one column?
Use the combinatoric generators from itertools.
import itertools

for seq0, seq1 in itertools.combinations(data['sequence'], 2):
    for a in pairwise2.align.globalxx(seq0, seq1):
        print(format_alignment(*a))
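If you also want to keep track of which rows were aligned, here is a small variation (a sketch building on the same data and imports; score_only is a standard pairwise2 keyword that returns just the score):

import itertools

# pair each sequence with its row index, then align all unordered pairs
for (i, seq0), (j, seq1) in itertools.combinations(data['sequence'].items(), 2):
    score = pairwise2.align.globalxx(seq0, seq1, score_only=True)
    print('rows {} vs {}: score {}'.format(i, j, score))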
I have to process an Excel file using pandas. The Excel file has three columns, as shown in the linked sample (excelfile). The 'LicNo' column is not unique.
My Task-1:
I have to group by 'LicNo' to bring all 'Licensee' values into the same row. The grouped table will contain only 'LicNo' and a bunch of 'Licensee' columns (all in one row), ignoring the middle column of the original Excel file.
My Task-2:
Now, to identify the duplicates, I have to check whether the 'Licensee' in the first column is repeated in the subsequent columns (i.e. along axis 1). The search is based on the first word of the text in the first column; if it is repeated in the later columns, the row is declared a 'Possible duplicate'.
Here is the code that I wrote:
import pandas as pd
import numpy as np
from itertools import chain
df=pd.read_excel("compare_usr.xlsx",dtype={'LicNo': int,'ScheduleNo':str, 'Licensee': str})
#df.isna().any()
df=df.dropna(axis='index')
df['Licensee'] = df.Licensee.str.replace(r'\n', '')
#to make the strings uniform as in certain fields the M/s is present.
df["Licensee"]=df.Licensee.str.replace("M/s ","")
# the character ';' is inserted to segregate the duplicate LicNo while transposing into rows
df.loc[:,"Licensee"]=df["Licensee"].astype(str)+";"
def func(x):
    ch = chain.from_iterable(y.split(';') for y in x.tolist())
    return '\n'.join(ch)

dfNew = df.groupby('LicNo')['Licensee'].apply(func).str.split("\n+", expand=True)  # .to_excel("test2.xls")
The following two lines, marked (1) and (2), do not produce the desired output, so they are left out of the code. The function fun(row) below, however, produces the intended result.
#dfNew['Status']=np.where((dfNew[x+1].str.contains(dfNew[0].str,na=False) for x in col),"match","unmatch") # (1)
#dfNew['Status']=np.where((dfNew[x+1].str.apply(lambda y: dfNew[0].str in y) for x in range(6)),"match","unmatch") #(2)
# in the original Excel file there are about 4000 rows and it produced 18 columns of duplicates.
def fun(row):
    col = [i + 1 for i in range(17)]
    for i in col:
        if row[i] is None:
            continue
        # extract the first word from row[0] and look for it in row[i]
        if row[0].split()[0].lower() in row[i].lower():
            return True
    return False
dfNew['Status'] = np.where(dfNew.apply(fun, axis=1), dfNew.index, "May be duplicate or only in one file")
Now my question: I would like to replace the function fun(row) (which works and produces the desired result) with either of the two lines marked (1) and (2) to produce dfNew['Status']. I cannot see what is going wrong in these two lines, as both produce wrong results.
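A likely reason (1) and (2) go wrong: np.where expects a boolean array, but a generator expression is a single, always-truthy object, and Series.str is an accessor object, not the string itself, so both lines collapse to one scalar condition. A minimal vectorized sketch, assuming (as fun(row) does) that dfNew has integer columns 0-17 from str.split(expand=True) and that column 0 is always populated:

import numpy as np
import pandas as pd

# first word of column 0, lower-cased
first_word = dfNew[0].str.split().str[0].str.lower()

# one boolean per row and comparison column, OR-ed across columns 1..17
match = pd.DataFrame(
    {i: [fw in cell.lower() if isinstance(cell, str) else False
         for fw, cell in zip(first_word, dfNew[i])]
     for i in range(1, 18)},
    index=dfNew.index,
).any(axis=1)

dfNew['Status'] = np.where(match, dfNew.index, "May be duplicate or only in one file")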
I am a beginner Python programmer and owe a lot to stackoverflow.com, having copied code from answers on other topics. Would you be able to help me?
Thanks.
Edit:
Desired outcome: see the linked result file.
I need to process data measured every 20 seconds throughout the whole of 2018. The raw file has the following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to make one pandas DataFrame out of it, or at least one DataFrame per block of data (each block's size is encoded as 'amount of samples'), keeping the time of measurement.
How can I ignore all the trash? I know it is written periodically (period = amount of samples), but:
- I don't know how many lines are in the file
- I don't want to call file.readline() in a loop, because it would run practically forever (especially in Python) and I don't have enough computing power for that
Is there a method to skip rows periodically in pandas or another library? Or how else can I solve this?
Here is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get a DataFrame similar to the data table in the picture, plus an additional date-time column, without the technical rows.
Use itertools.islice, where N below means read every Nth line.
from itertools import islice
import pandas as pd

N = 3
sep = ','
with open(file_path, 'r') as f:
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
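One caveat: islice(f, None, None, N) starts at line 0. If your data rows sit at an offset, pass a start index too; for example, to take every 4th line beginning at index 3 (a sketch, inside the same with block):

lines_gen = islice(f, 3, None, 4)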
I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data lies. The documentation for skiprows says:
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in test to the lambda function? That is what I am doing below: I create a list of the row indices I want to keep and pass the not in check to the skiprows argument. In English: skip all the rows that are not every 4th line.
import pandas as pd

# list of every 4th row index; if you need more than 1 million, raise the range
list_of_rows_to_keep = list(range(0, 1000000))[3::4]

# pass this list to the lambda function using `not in`
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
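As a side note, the same selection can be expressed without materializing a million-element list; a sketch using a modulo test (same hypothetical path as above):

import pandas as pd

# keep row indices 3, 7, 11, ... and skip everything else
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x % 4 != 3)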
Just count how many lines are in the file and pass the list of those that should be skipped (call it useless_rows) to pandas.read_csv(..., skiprows=useless_rows).
My problem was counting the rows cheaply.
There are a few ways to do it:
- On Linux, the command wc -l (here is how to call it from your code: Running "wc -l <filename>" within Python Code)
- Generators. I have a key in my relevant rows: it is in the last column. Not really informative, but a rescue for me. So I can count the lines with it; it turns out there are about 500,000 lines, and counting took 0.00011 s:
def useless_rows_gen(filename):
    with open(filename) as f:
        for row in f:
            if '2147483647' in row:  # the key that marks my relevant rows
                continue
            yield row  # everything else is a useless row
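Tying this back to read_csv, a sketch under the assumption that the key '2147483647' marks every relevant row (the file name is hypothetical):

import pandas as pd

filename = 'raw_2018.txt'  # hypothetical path

# indices of the lines NOT containing the key, i.e. the rows to skip
with open(filename) as f:
    useless_rows = [i for i, row in enumerate(f) if '2147483647' not in row]

df = pd.read_csv(filename, skiprows=useless_rows, header=None)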
This is what I have currently; I get the error "'int' object is not iterable". If I understand correctly, my issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column it is looking at that number and hitting an error. How should I go about going through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) - 1):
    most_bikes = max(stations[BIKES_AVAILABLE])
    sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
    if most_bikes == max(stations[BIKES_AVAILABLE]):
        second_most = max(stations[BIKES_AVAILABLE])
        index_1 = index(most_bikes)
        index_2 = index(second_most)
        most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another approach that might be better for data manipulation is the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers:
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly, if you enjoy using indexes, pandas has a built-in function called idxmax that returns the index of the max value.
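For example (a sketch on the same assumed 'sold' column):

# index label of the row with the largest 'sold' value
best_index = data['sold'].idxmax()
print(data.loc[best_index])  # the full row for the best-selling bicycle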
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv").readlines())
which would give 9 for this example.
Update
To get the row (index), you need to store the index of the max number in the column so that you can access it:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv").readlines())),key=lambda t:t[1])[0]
which gives 2 here, as the line in test.csv (above) with the max number in column 2 (which is 9) is the third line (index 2).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i,int(l.split(',')[1])) for i,l in enumerate(lines)),key=lambda t:t[1])[0]
Assuming a csv structure like so:
data = ['1,blue,15,True',
        '2,red,25,False',
        '3,orange,35,False',
        '4,yellow,24,True',
        '5,green,12,True']
If I want to get the max value from the 3rd column I would do this:
largest_number = max(int(n.split(',')[2]) for n in data)  # cast to int so values compare numerically, not as strings
My problem: I have a pandas DataFrame, and one column I need to process contains values separated by ':'. In some cases one of those values can itself be value=value, and this can appear at the start, middle, or end of the string. The length of the string differs from cell to cell as we iterate through the rows, e.g.:
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.:
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing each with the 'event' column entry corresponding to its event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value per cell, but I'm struggling to work out how to work across the values and ignore the '=value' part of the string.
Edit to ignore missing keys in the dict:
import pandas as pd

EventsDict = {1: '1:3:5:7', 2: '23:45:1:5:3', 39: '0:8:46:65:3:44:56', 4: '1:3:5:4'}
clickstream = pd.Series(EventsDict)

# keep this as a dictionary
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction'}

def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)
Since you are using a full DataFrame and not just a Series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
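The question's data also has value=value tokens like 7=23. Here is a hedged sketch of a variant that maps only the event_no side of such tokens and leaves the raw value untouched (EventLookupWithValues is a made-up name, and the EventsLookup above lacks key 23, so it would print 'Missing' there unless extended):

def EventLookupWithValues(x):
    parts = []
    for item in x.split(':'):
        if '=' in item:  # e.g. '7=23': event_no on the left, raw value on the right
            ev, val = item.split('=', 1)
            parts.append('{}={}'.format(EventsLookup.get(int(ev), 'Missing'), val))
        else:
            parts.append(EventsLookup.get(int(item), 'Missing'))
    return ':'.join(parts)

# clickstream['events'].apply(EventLookupWithValues)  # on the question's DataFrame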
I asked a question about two hours ago regarding reading and writing data from a website. I've spent the two hours since then trying to find a way to read the maximum date value from column 'A' of the output, compare that value against the refreshed website data, and append any new data to the csv file without overwriting the old rows or creating duplicates.
The code that is currently 100% working is this:
import requests

symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)

with open("trades_{}.csv".format(symbol), "r+") as f:
    f.write(data.text)
I've tried various ways of finding the maximum value of column 'A': several approaches using dicts and other methods of sorting/finding the max, and even the pandas and numpy libraries. None of them seem to work. Could someone point me in the direction of a decent way to find the maximum of a column from the .csv file? Thanks!
If you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column
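For completeness, a sketch of getting there from the saved CSV (the file name follows the question's symbol, and the 'time' column is assumed to exist after loading):

import pandas as pd

data = pd.read_csv("trades_mtgoxUSD.csv")  # assumed file name from the question
print(data['time'].max())  # idiomatic pandas equivalent of max(data['time'])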
I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests

symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)

data = requests.get(url)
with open(csv_file, "w") as f:
    f.write(data.text)

# just the max value:
with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_value = max(row[0] for row in csv.reader(f))

# the whole row containing the max value:
with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
    return row[0]

max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data to a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it line by line.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
It seems like something like this should work:
import requests
import csv

symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)

with open("trades_{}.csv".format(symbol), "r+") as f:
    all_values = list(csv.reader(f))

max_value = max([int(row[2]) for row in all_values[1:]])
# (write-out-the-value?)
EDITS: I used row[2] because that was the sample column I was taking the max of in my csv. Also, I had to strip off the column headers, which were all text, which is why I looked at all_values[1:], i.e. from the second row to the end of the file.