Suggestions on analyzing protein sequence similarity - Python

I want to write code to analyze short protein sequences and determine their similarity. I have no reference sequence; instead, I want to write some sort of for loop that compares them all to each other to see how many duplicate sequences I have, as well as which regions are similar.
I currently have all of the sequences in a CSV file.
I have taken a bioinformatics course and have done something similar with Illumina sequencing data, but in that case I started from an SRA table and had FASTA files.
Also, I am trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but I am still running into the problem, and the make CXX=g++-9 CC=gcc-9 command won't work.
I was wondering whether there is a more up-to-date method than CD-HIT, because I have noticed that hardly anyone has used CD-HIT since 2020.
Also, the only coding languages I know are R and shell, but I am currently learning Python.
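Since you're learning Python, here is a minimal sketch of the pairwise loop you describe. It assumes the CSV has a column named "sequence" (an assumption; adjust it to your file) and uses a crude position-by-position identity score rather than a real alignment:

```python
import csv
from collections import Counter
from itertools import combinations

# Load sequences from the CSV; the column name "sequence" is an assumption --
# change it to match your file's header.
with open("sequences.csv", newline="") as f:
    seqs = [row["sequence"].strip().upper() for row in csv.DictReader(f)]

# Exact duplicates are just repeated strings, so a Counter finds them directly.
dupes = {s: n for s, n in Counter(seqs).items() if n > 1}
print(f"{len(dupes)} sequences appear more than once")

def identity(a, b):
    # Crude percent identity: position-by-position matches over the longer
    # length. For unequal-length sequences, a real alignment (e.g. Biopython's
    # PairwiseAligner) would be more appropriate.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

# Compare every pair; fine for short lists, but O(n^2) as the list grows.
for (i, a), (j, b) in combinations(enumerate(seqs), 2):
    score = identity(a, b)
    if score > 0.9:  # arbitrary similarity threshold
        print(i, j, round(score, 3))
```

For finding locally similar regions (rather than whole-sequence identity), an alignment library is the better tool; the sketch above is only for duplicates and near-duplicates.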

https://bioinfo.lifl.fr/yass/index.php
I have used it for SARS-CoV-2 and found similarity to many viruses.

Related

Split pack of text files into multiple subsets according to the content of the files

I have a lot of PDF, DOC[X], TIFF, and other files (scans from a shared folder). Each file has been converted into a pack of text files: one text file per page.
Each pack of files could contain multiple documents (for example, three contracts), and the documents are not necessarily contracts.
While processing a pack of files I don't know what kinds of documents the pack contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for some possible approaches to solve this programmatically.
I tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are "scans", this sounds at its core like something that could be approached with computer vision; however, that is currently far above my level of programming.
E.g., projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or you could possibly get away with OCR: reading the "scans" and working from their contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
However, that still leaves the question of how you would tell your program that this part of the image means these are three separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or something else? That will be the main factor determining how complex a problem you are trying to solve.
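For the OCR route, a minimal sketch (the filename and the "1 of 3" heuristic below are only illustrations):

```python
import re
from PIL import Image
import pytesseract

# OCR a single scanned page; "page_001.tiff" is a placeholder filename.
text = pytesseract.image_to_string(Image.open("page_001.tiff"))

# Naive heuristic: a "1 of 3"-style page marker hints at document boundaries.
marker = re.search(r"\b(\d+)\s+of\s+(\d+)\b", text)
if marker:
    page, total = map(int, marker.groups())
    print(f"page {page} of a {total}-page document")
```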
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
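For reference, a minimal sketch of that approach; the page texts and labels below are placeholders, so supply your own labeled training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def trim(text, n=100):
    # Trim each page to its first n tokens, as described above.
    return " ".join(text.split()[:n])

# Placeholder training data: page texts labeled 1 for "first page
# of a document" and 0 for "not a first page".
pages = [trim("CONTRACT No. 42 between ..."), trim("continued from page 1 ...")]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), SGDClassifier())
clf.fit(pages, labels)

# Splitting a pack then amounts to cutting wherever a first page is predicted.
print(clf.predict([trim("INVOICE No. 7 issued to ...")]))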

How to get started with using Python for animal tracking?

I am using Python 3.X on macOS Sierra and want to write a Python programme that should be able to: (1) load a 1 min video (format probably .avi), (2) identify white larvae (I am a biologist working on fly larvae) on a dark-ish background, (3) track the larvae over the 1 min, and (4) output a .csv file with x- and y-coordinates for each larva.
Now, I am not asking anyone to write the code; I want to do that myself. Rather, since I am totally new to working with videos and only have marginal Python experience, it would help if people could suggest a strategy I could follow to write that programme, such as specific modules that contain functions useful for this task.
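Not a full solution, but a sketch of the usual OpenCV approach (threshold the bright larvae, take contour centroids, write them to CSV); the filename and threshold value are placeholders:

```python
import csv
import cv2  # OpenCV; install with `pip install opencv-python`

cap = cv2.VideoCapture("larvae.avi")  # placeholder filename
with open("tracks.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["frame", "x", "y"])
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # White larvae on a dark background: a fixed threshold isolates them.
        _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            m = cv2.moments(c)
            if m["m00"] > 0:  # centroid of each detected blob
                writer.writerow([frame_no, m["m10"] / m["m00"],
                                 m["m01"] / m["m00"]])
        frame_no += 1
cap.release()
```

This only gives per-frame detections; linking them into per-larva tracks (e.g. nearest-neighbour matching between consecutive frames) is the remaining step.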

Recognition of a sound (a word) with machine learning in Python

I'm preparing an experiment, and I want to write a program in Python that recognizes a certain word spoken by the participants.
I searched a lot about speech recognition in Python, but the results are complicated (e.g. CMUSphinx).
What I want to achieve is a program that receives a sound file (containing only one word, not in English), where I tell the program what the sound means and what output I want to see.
I have seen the sklearn example on recognizing hand-written digits. I want to know if I can do something like that example:
training the program to return a certain output (e.g. numbers) according to sound files from different people saying the same word;
then, when it takes in new sound files from another person saying the same word, returning the same values.
Can I do this with Python and sklearn?
If so, where should I start?
Thank you!
I've written such a program for text recognition. I can tell you that if you choose to "teach" your program manually, you will have a lot of work; think about the variation in voice due to accents, etc.
You could start by looking for a sound analyzer (musical analysis). Try to identify the waveform of a simple word like "yes" and write an algorithm that scores how much a sound file deviates from it as a percentage; that way you can build in a margin to protect yourself from false positives and vice versa.
Also, you might need to remove background noise from the sound files first, as it may interfere with your identification patterns.
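To make the sklearn analogy concrete: one common approach is to turn each clip into a fixed-length feature vector (e.g. averaged MFCCs via librosa) and train an ordinary classifier on those vectors. A minimal sketch, with placeholder file names and labels:

```python
import numpy as np
import librosa  # audio feature extraction; `pip install librosa`
from sklearn.svm import SVC

def mfcc_features(path):
    # Average the MFCCs over time so every clip becomes a fixed-length vector.
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# Placeholder training data: paths to recordings and the word each contains.
files = ["p1_yes.wav", "p2_yes.wav", "p1_no.wav", "p2_no.wav"]
labels = ["yes", "yes", "no", "no"]

X = np.array([mfcc_features(f) for f in files])
clf = SVC().fit(X, labels)

# Classify a new recording from a different speaker.
print(clf.predict([mfcc_features("new_speaker.wav")]))
```

In practice you would want many recordings per word and per speaker, exactly like the many example digits in the sklearn hand-written-digits demo.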

Python interval-based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficient, because my global corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts with sections of quotes within them. I want to know which topic a word belonged to, and whether it's a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK. I haven't looked into whether that can do what I want; please comment if it can. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I'm looking at that now.
Edit:
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals, not ranges. I have used Boost extensively; however, making it a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL).
The question now is: does anyone know of an interval container library or data structure that can index nested intervals in a sparse fashion, or, put differently, that provides semantics similar to boost.icl?
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. An in-memory database will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures against sqlite3), but it should be more effective than building a C++ std::vector-style approach on top of Python containers.
You can store each interval as two integers and put a low <= offset <= high predicate in your WHERE clause; depending on your table structure, this will return nested/overlapping ranges.
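A minimal sketch of that sqlite3 approach (the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, as suggested above
con.execute("CREATE TABLE spans (lo INTEGER, hi INTEGER, kind TEXT, meta TEXT)")

# Nested intervals: a post covering offsets 0-99 containing a quote at 10-30.
con.executemany("INSERT INTO spans VALUES (?, ?, ?, ?)", [
    (0, 99, "post", "topic=python"),
    (10, 30, "quote", "quoted_user=alice"),
])
con.execute("CREATE INDEX idx_spans ON spans (lo, hi)")

# All metadata covering word offset 15: returns both the post and the quote,
# which is the nested-lookup behaviour described in the question.
for row in con.execute(
        "SELECT kind, meta FROM spans WHERE lo <= ? AND ? <= hi", (15, 15)):
    print(row)
```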

Parser generation

I am doing a project on software plagiarism detection, which I intend to do for the C language. For that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?
I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized. For that I need to create a syntax analyzer, and I don't know where to start.
i.e., I want to create a parser for C programs in Python.
If you want to create a parser in Python you can look at these libraries:
PLY
pyparsing
and Lepl - new but very powerful
Building a real C parser by yourself is a really big task.
I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.
You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, parsing is probably going to complicate things more than anything.
What you're really looking for is sequences of original source code that have a very strong similarity to a suspect code sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
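As a starting point that skips parsing entirely (and as a simpler stand-in for the classifier idea above), here is a sketch that tokenizes C source with a crude regex, collapses identifiers so renamed variables still match, and compares the token streams with difflib:

```python
import re
from difflib import SequenceMatcher

C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return"}

def tokens(source):
    # Crude C tokenizer: identifiers, numbers, and single-character operators.
    # Non-keyword identifiers are collapsed to "ID" so that renaming variables
    # does not hide copied structure.
    out = []
    for t in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source):
        if t[0].isalpha() or t[0] == "_":      # identifier or keyword
            out.append(t if t in C_KEYWORDS else "ID")
        else:
            out.append(t)                      # number or operator
    return out

def similarity(a, b):
    # Ratio of matching token runs between the two programs (0.0 to 1.0).
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()

prog1 = "int main() { int count = 0; return count; }"
prog2 = "int main() { int total = 0; return total; }"
print(similarity(prog1, prog2))  # near 1.0 despite the renamed variable
```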
