Objective: To classify each tweet as positive or negative and write it to an output file which will contain the username, original tweet and the sentiment of the tweet.
Code:
import re,math
input_file="raw_data.csv"
fileout=open("Output.txt","w")
wordFile=open("words.txt","w")
expression=r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
fileAFINN = 'AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))
pattern=re.compile(r'\w+')
pattern_split = re.compile(r"\W+")
words = pattern_split.split(input_file.lower())
print "File processing started"
with open(input_file, 'r') as myfile:
    for line in myfile:
        line = line.lower()
        line = re.sub(expression, " ", line)
        words = pattern_split.split(line.lower())
        sentiments = map(lambda word: afinn.get(word, 0), words)
        #print sentiments
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        """
        Returns a float for sentiment strength based on the input text.
        Positive values are positive valence, negative values are negative valence.
        """
        if sentiments:
            sentiment = float(sum(sentiments)) / math.sqrt(len(sentiments))
            #wordFile.write(sentiments)
        else:
            sentiment = 0
        wordFile.write(line + ',' + str(sentiment) + '\n')
        fileout.write(line + '\n')
print "File processing completed"
fileout.close()
myfile.close()
wordFile.close()
Issue: At the moment the Output.txt file looks like this:
abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0
Question 1: How do I add a comma between the userid, tweet text and sentiment? The output should look like this:
abc,some tweet text,0
bcd,some other tweet,1
efg,more tweets,0
Question 2: The tweets are in Bahasa Melayu (BM) and the AFINN dictionary that I am using is of English words. So the classification is wrong. Do you know any BM dictionary that I can use?
Question 3: How do I pack this code in a JAR file?
Thank you.
Question 1:
Output.txt is currently just composed of the lines you are reading in, because of fileout.write(line+'\n'). Since the line is space separated, you can split it pretty easily:
line_data = line.split(' ') # Split the line into a list, separated by spaces
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment +'\n')
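Alternatively, the csv module will handle the commas (and any quoting of tweet text that itself contains commas) for you. A minimal sketch, assuming the same user_id / tweets / sentiment values from the snippet above:
import csv

fileout = open("Output.txt", "w")
writer = csv.writer(fileout)  # writes comma-separated rows, quoting fields that contain commas
# ... inside your loop, once you have user_id, tweets and sentiment:
writer.writerow([user_id, " ".join(tweets), sentiment])
fileout.close()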
Question 2:
A quick google search gave me this. Not sure if it has everything you will need though: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt
Question 3:
Try Jython http://www.jython.org/archive/21/docs/jythonc.html
Related
I have a function that scores words. I have lots of text, ranging from single sentences to documents of several pages. I'm stuck on how to score the words and return the text near its original state.
Here's an example sentence:
"My body lies over the ocean, my body lies over the sea."
What I want to produce is the following:
"My body (2) lies over the ocean (3), my body (2) lies over the sea."
Below is a dummy version of my scoring algorithm. I've figured out how to take text, tear it apart and score it.
However, I'm stuck on how to put it back together into the format I need it in.
Here's a dummy version of my function:
def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    for word in passed_text.words:  # iterate over the words of the input text
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        if word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return
I'm a relative newbie so I have two questions:
How can I put the text back together, and
Should that logic be put into the function or outside of it?
I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.
Thank you for helping me!
So basically, you want to attribute a score to each word. The function you give may be improved by using a dictionary instead of several if statements.
Also, you have to return all the scores, instead of just the score of the first word in words_to_work_with, which is the current behavior of the function since it returns an integer on the first iteration.
So the new function would be:
def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}  # etc., extend with all your scored words
    # if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]
For the second part, which is reconstructing the string, I would actually do it in the same function (so this also answers your second question):
def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    reconstructed_text = ''
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}
    word_scores = []
    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None))  # we still construct the scores list here
        # we add 'word' + '(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        reconstructed_text += word + dict_strings.get(word, '') + ' '  # trailing space between words
    return reconstructed_text, word_scores
I'm not guaranteeing this code will work at the first try (I can't test it), but it should give you the main idea.
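A rough usage sketch (also untested, assuming TextBlob and the lemmatizer are set up as in your snippet):
text = "My body lies over the ocean, my body lies over the sea."
reconstructed, scores = word_score_and_reconstruct(text)
print(reconstructed)  # lemmatized words, with " (2)" / " (3)" appended after scored words
print(scores)         # one entry per word; None for words without a score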
Hope this helps. Based on your question, the following has worked for me.
Best regards!
"""
Python 3.7.2
Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea.
"""
input_file = open('original_text.txt', 'r')    # Reading text from file
output_file = open('processed_text.txt', 'w')  # Saving output text in file
output_text = []
for line in input_file:
    words = line.split()
    for word in words:
        if word == 'body':
            output_text.append('body (2)')
            output_file.write('body (2) ')
        elif word == 'body,':
            output_text.append('body (2),')
            output_file.write('body (2), ')
        elif word == 'ocean':
            output_text.append('ocean (3)')
            output_file.write('ocean (3) ')
        elif word == 'ocean,':
            output_text.append('ocean (3),')
            output_file.write('ocean (3), ')
        else:
            output_text.append(word)
            output_file.write(word + ' ')
print(output_text)
input_file.close()
output_file.close()
Here's a working implementation. The function first parses the input text into a list, such that each list element is either a word or a run of punctuation and whitespace (e.g. a comma followed by a space). Once the words in the list have been processed, it joins the list back into a string and returns it.
import re
import inflection  # provides singularize()
from nltk.stem import WordNetLemmatizer  # the question doesn't show which lemmatizer is used; NLTK's is assumed here

lemmatizer = WordNetLemmatizer()

def word_score(text):
    words_to_work_with = re.findall(r"\b\w+|\b\W+", text)
    for i, word in enumerate(words_to_work_with):
        if word.isalpha():
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)
txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)
Output:
My body (2) lie over the ocean (3), my body (2) lie over the sea.
If you have more than 2 words that you want to score, using a dictionary instead of if conditions is indeed a good idea.
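For instance, a minimal sketch of that dictionary-based variant (it reuses the regex tokenization from the function above and skips the singularize/lemmatize step for brevity; the SCORES dict is just an illustration):
import re

SCORES = {'body': 2, 'ocean': 3}  # illustrative; add as many scored words as you need

def word_score(text, scores=SCORES):
    words_to_work_with = re.findall(r"\b\w+|\b\W+", text)
    for i, word in enumerate(words_to_work_with):
        score = scores.get(word.lower())
        if word.isalpha() and score is not None:
            # append the score in parentheses after the word
            words_to_work_with[i] = "%s (%d)" % (word, score)
    return ''.join(words_to_work_with)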
Just to preface, this code is from a great guy on Github / Youtube:
https://github.com/the-javapocalypse/
I made some minor tweaks for my personal use.
One thing that has always stood between me and sentiment analysis on Twitter is the fact that so many bot posts exist. I figure that if I cannot avoid the bots altogether, maybe I can at least remove duplication to hedge their impact.
For example - "#bitcoin" or "#btc" - bot accounts exist under many different handles posting the same exact tweet. It could say "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"
This would read as a positive sentiment post. If 25 accounts post this 2 times per account, we have some inflation if I am only analyzing the most recent 500 tweets containing "#btc".
So to my questions:
What is an effective way to remove duplication before writing to the csv file? I was thinking of adding a simple if statement that checks an array to see whether the tweet already exists. There is an issue with this: say I input 1000 tweets to analyze. If 500 of these are duplicates from bots, my 1000 tweet analysis just became a 501 tweet analysis. This leads to my next question.
What is a way to check for duplication and, each time a duplicate is found, add 1 to my total request for tweets to analyze? Example - I want to analyze 1000 tweets. Duplication was found once, so there are 999 unique tweets to include in the analysis. I want the script to analyze one more to make it 1000 unique tweets (1001 tweets including the 1 duplicate).
Small change, but I think it would be useful to know how to remove all tweets with embedded hyperlinks. This plays into the objective of question 2 by compensating for the dropped hyperlink tweets. Example - I want to analyze 1000 tweets. 500 of the 1000 have embedded URLs, and those 500 are removed from the analysis. I am now down to 500 tweets but I still want 1000. The script needs to keep fetching non-URL, non-duplicate tweets until 1000 unique, non-URL tweets have been accounted for.
See below for the entire script:
import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt
class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a')
        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # Append to temp so that we can store in csv later. I use encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove Links, Special Characters etc from tweet
        return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()
Answer to your first question
You can remove duplicates by using this one liner.
self.tweets = list(set(self.tweets))
This will remove every duplicated tweet. In case you want to see it working, here is a simple example:
>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
Answer to your second question
Since you have now removed the duplicates, you can get the number of tweets that were removed by taking the difference between NoOfTerms and the length of self.tweets:
tweets_to_further_scrape = NoOfTerms - len(self.tweets)
Now you can scrape tweets_to_further_scrape more tweets and repeat this process of removing duplicates and scraping until you have found the desired number of unique tweets.
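A rough sketch of that fetch-until-enough loop (untested; it reuses api, searchTerm and NoOfTerms from DownloadData, strips embedded links as in the third answer below, and caps the retries because repeated searches may keep returning the same tweets):
unique_tweets = []
seen_texts = set()
attempts = 0

while len(unique_tweets) < NoOfTerms and attempts < 10:  # cap retries so the loop cannot run forever
    attempts += 1
    batch = list(tweepy.Cursor(api.search, q=searchTerm, lang="en")
                 .items(NoOfTerms - len(unique_tweets)))
    if not batch:
        break  # nothing more to fetch
    for tweet in batch:
        # drop embedded links, then use the cleaned text as the duplicate key
        text = ' '.join(w for w in tweet.text.split() if 'http' not in w)
        if text and text not in seen_texts:
            seen_texts.add(text)
            unique_tweets.append(tweet)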
Answer to your third question
When iterating tweets list, add this line to remove external links.
tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
Hope this will help you out. Happy coding!
You could simply keep a running count of the tweet instances using a defaultdict. You may want to remove the web addresses as well, in case they are blasting out new shortened URLs.
from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic here to check the number of repeats in the dictionary.
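For example, the duplicate check inside DownloadData's loop could look roughly like this (a sketch based on the methods above; repeat_threshold is a made-up name for how many copies of a tweet you want to tolerate):
repeat_threshold = 1  # made-up setting: only analyze the first copy of each tweet
for tweet in self.tweets:
    self.track_tweet(tweet.text)
    if self.tweet_count[self.clean_tweet(tweet.text)] > repeat_threshold:
        continue  # skip duplicates beyond the threshold
    # ... existing TextBlob sentiment analysis goes here ...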
I have this code that I've been struggling for a while to optimize.
My dataframe is a csv file with 2 columns, of which the second column contains the texts (screenshot of the dataframe omitted here).
I have a function summarize(text, n) that needs a single text and an integer as input.
# imports used by these snippets
from collections import defaultdict
from heapq import nlargest
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are fewer sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
So, to summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, which I then iterate over again to send them one by one to the summarize() function so I can get the summary of each text. These for loops are making my code really, really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # ilocating the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts
our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)
ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in frequency.keys():
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency

def rank(ranking, n):
    # return the n sentences with the highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
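If you want to use your own summarize(text, n) from the question, which takes the summary length as a second argument and returns a list of sentences, the same idea works with a small lambda (assuming the text column is called Summary as in the test data above):
# summarize() from the question returns a list of sentences, so take the first one
df['Result'] = df['Summary'].map(lambda t: summarize(t, 1)[0])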
Such a long story...
I'm going to assume that, since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This concatenates all the strings in the reviewText column into one big string, with each review separated by a white space.
You can then just throw this result at your functions.
I am relatively new to Python, so apologies in advance for sounding a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in its second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear, so not just 'the' once, but every time the program encounters it when it differs from the first edition (yep, I know I'm asking for a lot, sorry).
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
for j in list2:
if i==j:
file3.write(i)
but of course this doesn't work, because the texts are two giant balls of text and not separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words', or to the text in general, to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ
s1 = open("FRANKENST18.txt", "r")
s2 = open("FRANKENST31.txt", "r")
def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>' + i[2:] + '</b>' if i[:1] == '+' else i[2:] for i in dif
                     if not i[:1] in '-?'])
print appendBoldChanges
but I couldn't get it to work.
So my question is: is there any way to output the differences between texts that don't line up neatly like this? It sounded quite do-able, but I've greatly underestimated how difficult I find Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
if word1 not in wordlist1:
wordlist1[word1] = 1
else:
wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
#print "%s %s" % (k, v)
col1 = ("%s %s" % (k, v))
print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
if word2 not in wordlist2:
wordlist2[word2] = 1
else:
wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
#print "%s %s" % (k, v)
col2 = ("%s %s" % (k, v))
print col2
What I still hope to edit and output is something like this, using the dictionaries' key and value system (applied to col1 and col2):
{apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re
s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()
words_s1 = re.findall(r"[A-Za-z]+", s1)
words_s2 = re.findall(r"[A-Za-z]+", s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output", "w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output", "w") as s2_output:
    s2_output.write(words_in_s2)
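Since you also mentioned wanting the words with their counts (your {apple 3, ...} - {apple 1, ...} example), here is a short, untested sketch using collections.Counter on the same words_s1/words_s2 lists. Note that Counter subtraction keeps only positive differences, so you get one Counter per direction:
from collections import Counter

counts_s1 = Counter(words_s1)           # word -> occurrences in edition 1
counts_s2 = Counter(words_s2)           # word -> occurrences in edition 2
surplus_in_s1 = counts_s1 - counts_s2   # words edition 1 has more of, with the excess count
surplus_in_s2 = counts_s2 - counts_s1   # words edition 2 has more of, with the excess count
print surplus_in_s1.most_common(10)
print surplus_in_s2.most_common(10)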
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through the lines of a file, which you can do very easily in Python. Here's an example where I strip the newline character at the end of each line and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
    lines.append(line[:-1])
Hope this helps!
I'm not completely sure if you're trying to compare the differences in words as they occur or lines as they occur; however, one way you could do this is by using a dictionary. If you want to see which lines change, you could split the text on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the previously created array, make it a key in the dictionary with a corresponding value (could be anything since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
    dict[sentence] = 1  # put a counter for each sentence
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in dict:  # if the sentence is in the dictionary
        pass
        # or do whatever you want here
    else:  # if the sentence isn't
        print(sentence)
Again, not sure if this is what you're looking for, but I hope it helps.
I am currently working with a text file that has a list of DNA extraction sequences (contigs), each with a header followed by lines of nucleotides. There are 120 contigs; each entry is marked by a line that starts with ">" to denote the sequence information, and after this line the nucleotide sequence of that contig is given.
example:
>gi|571136972|ref|XM_006625214.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 5 (Rps5) (rps5) mRNA, complete cds
ATGAGAAATATTTTATTAAAGAAAAAATTATATAATAGTAAAAATATTTATATTTTATATTATATTTTAATAATATTTAAAAGTATTTTTATTATTTTATTTAATAGTAAATATAATGTGAATTATTATTTATATAATAAAATTTATAATTTATTTATTATATATATAAAATTATATTATATTATAAATAATATATATTATAATAATAATTATTATTATATATATAATATGAATTATATA
TATTTTTATATTTATAAATATAATAGTTTAAATAATA
>gi|571136996|ref|XM_006625226.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 2 (Rps2) (rps2) mRNA, complete cds
ATGTTTATTACATTTAAAGATTTATTAAAATCTAAAATATATATAGGAAATAATTATAAAAATATTTATATTAATAATTATAAATTTATATATAAAATAAAATATAATTATTGTATTTTAAATTTTACATTAATTATATTATATTTATATAAATTATATTTATATATTTATAATATATCTATATTTAATAATAAAATTTTATTTATTATTAATAATAATTTAATTACAAATTTAATTATT
AATATATGTAATTTAACTAATAATTTTTATATTATTA
What I would like to do is make a list of the length of every contig. My problem is that I do not know the syntax needed to tell Python to:
find the line after the line that starts with ">"
take a count of all of the characters in the lines of that sequence
return the value to a list of all contig lengths (a list that gives the length of every contig, i.e. 126, 300, 25 ...)
make sure the last contig (which has no following ">" to denote its end) is counted.
I would like a list of integers, so that I can calculate things like the mean length of the contigs, standard deviation, cool gene equations etc.
I am relatively new to programming. If I am unclear or further information is needed, please let me know.
Don't reinvent the wheel, use Biopython as Martin has suggested. Here's a start for you that will print each sequence ID and length to the terminal. You can install Biopython with pip, i.e. pip install biopython
from Bio import SeqIO
import sys

FileIn = sys.argv[1]
handle = open(FileIn, 'rU')
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords:  # loop through each fasta entry
    length = len(record.seq)  # get sequence length
    print "%s: %i bp" % (record.id, length)  # print sequence ID: seq length
Or you could store the results in a dictionary:
handle = open(FileIn, 'rU')
sequence_lengths = {}
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords:  # loop through each fasta entry
    length = len(record.seq)  # get sequence length
    sequence_lengths[record.id] = length
# access dictionary outside of loop
print sequence_lengths
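If you then want the summary statistics you mentioned (mean contig length and so on), a quick sketch building on that dictionary:
lengths = sequence_lengths.values()               # the contig lengths collected above
mean_length = float(sum(lengths)) / len(lengths)  # float() avoids integer division in Python 2
print "mean contig length: %.1f bp" % mean_length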
This might work for you: it prints the number of ACGT's in the lines that follow a line that includes >:
import re
with open("input.txt") as input_file:
    data = input_file.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print(data)
Thanks for all the help. I have looked at the Biopython stuff and am excited to understand it and incorporate it. The overall goal of this assignment was to teach me how to understand Python rather than to find the solution outright, or at least, if I find the solution, I have to be able to explain it in my own words.
Anyway, I have created a piece of code incorporating that element as well as others. I have a few more things to do, and if I am confused, I will return to ask.
Here is my first working code outside of working directly with my supervisor or tutorials (woo!):
import re

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    contigs = 0
    for line in fasta:
        if line.strip().startswith('>'):
            contigs = contigs + 1

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    data = fasta.read()

data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]

print "Total number of contigs: %s" % contigs

total_contigs = sum(data)
N50 = sum(data)/2
print "number used to determine N50 = %s" % N50

average = 0
total = 0
for n in data:
    total = total + n
mean = total / len(data)
print "mean length of contigs: %s" % mean
print "total nucleotides in fasta = %s" % total_contigs
#print "list of contigs by length: %s" % sorted([data])

l = data
l.sort(reverse=True)
print "list of contigs by length: %s" % l
This does what I want it to do, but if you have any comments or advice, I would love to hear them.
Next up: determining N50 with this sweet sweet list. Thanks again!
I created a function to calculate N50 and it seems to work nicely. I can parse the command line and run any .fa file through the program:
def calc_n50(array):
    array.sort(reverse=True)
    n50 = 0  # running sum of lengths
    n = 0    # the N50 sequence length
    half = sum(array)/2
    for val in array:
        n50 += val
        if n50 >= half:
            n = val
            break  # breaks the loop when the condition is met
    print "N50 is", n