Splitting synset and removing duplicate words - python

Score SynsetTerms
1 soft#18 mild#3 balmy#2
1 love#2 enjoy#3
0.625 love#1
From the input above, how can I achieve the output below? I want to create a file that has each word and its score, removing duplicate words and splitting each word onto its own row. When a word is duplicated, the entry with the higher score should be kept.
Score SynsetTerms
1 soft
1 mild
1 balmy
1 enjoy
1 love
Note that the word 'love' with score 0.625 was removed; only the 'love' with score 1 is kept, as it has the higher score.

import re

scores = {}  # word -> highest score seen so far
first = True
fhand = open('C:/Users/10648879/Documents/python_prog/data/test.csv', 'r')
for line in fhand:
    if first:  # skip the header row
        first = False
        continue
    line = re.sub('#[0-9]*', '', line).strip()  # drop the #n sense suffixes
    fields = re.split(r'\s+', line)
    score = float(fields[0])  # compare scores as numbers, not strings
    for word in fields[1:]:
        if word not in scores or score > scores[word]:
            scores[word] = score
print 'Score' + ' ' + 'SynsetTerms'
for word, score in scores.items():
    print '%g' % score, word
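Since the goal is to write the result to a file, here is a minimal Python 3 sketch of the same idea that writes the deduplicated words out, sorted by score in descending order; the input and output filenames are placeholders:
import re

scores = {}
with open('test.csv') as fin:   # placeholder input path
    next(fin)                   # skip the header row
    for line in fin:
        fields = re.split(r'\s+', re.sub(r'#\d+', '', line).strip())
        score = float(fields[0])
        for word in fields[1:]:
            scores[word] = max(score, scores.get(word, score))

with open('out.csv', 'w') as fout:  # placeholder output path
    fout.write('Score SynsetTerms\n')
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        fout.write('{:g} {}\n'.format(score, word))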

Related

Python - Find unique counts of words and letters using dictionaries and tuples

I'm trying to create a script that runs through the text in a file and counts the number of words and distinct words, lists the top 10 most frequent words with their counts, and sorts the character frequencies from most to least frequent.
Here's what I have so far:
import sys
import os
os.getcwd()
import string

path = ""
os.chdir(path)

#Prompt for user to input filename:
fname = input('Enter the filename: ')
try:
    fhand = open(fname)
except IOError:
    #Invalid filename error
    print('\n')
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()

#Initialize char counts and word counts dictionary
counts = {}
worddict = {}

#For character and word frequency count
for line in fhand:
    #Remove leading spaces
    line = line.strip()
    #Convert everything in the string to lowercase
    line = line.lower()
    #Take into account punctuation
    line = line.translate(line.maketrans('', '', string.punctuation))
    #Take into account white spaces
    line = line.translate(line.maketrans('', '', string.whitespace))
    #Take into account digits
    line = line.translate(line.maketrans('', '', string.digits))
    #Splitting line into words
    words = line.split(" ")
    for word in words:
        #Is the word already in the word dictionary?
        if word in worddict:
            #Increase by 1
            worddict[word] += 1
        else:
            #Add word to dictionary with count of 1 if not there already
            worddict[word] = 1
    #Character count
    for word in line:
        #Increase count by 1 if letter
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

#Initialize lists
lst = []
countlst = []
freqlst = []
#Count up the number of letters
for ltrs, c in counts.items():
    lst.append((c, ltrs))
    countlst.append(c)
#Sum up the count
totalcount = sum(countlst)
#Calculate the frequency of each character
for ec in countlst:
    efreq = (ec / totalcount) * 100
    freqlst.append(efreq)
#Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)
#Print out word counts
for key in list(worddict.keys()):
    print(key, ":", worddict[key])
#Print out all letters and counts:
for ltrs, c in lst:
    print(c, '-', ltrs, '-', round(ltrs / totalcount * 100, 2), '%')
When I run the script on something like romeo.txt:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
I get this output:
butsoftwhatlightthroughyonderwindowbreaks : 1
itistheeastandjulietisthesun : 1
arisefairsunandkilltheenviousmoon : 1
whoisalreadysickandpalewithgrief : 1
i - 14 - 10.45 %
t - 12 - 8.96 %
e - 12 - 8.96 %
s - 11 - 8.21 %
a - 11 - 8.21 %
n - 9 - 6.72 %
h - 9 - 6.72 %
o - 8 - 5.97 %
r - 7 - 5.22 %
u - 6 - 4.48 %
l - 6 - 4.48 %
d - 6 - 4.48 %
w - 5 - 3.73 %
k - 3 - 2.24 %
g - 3 - 2.24 %
f - 3 - 2.24 %
y - 2 - 1.49 %
b - 2 - 1.49 %
v - 1 - 0.75 %
p - 1 - 0.75 %
m - 1 - 0.75 %
j - 1 - 0.75 %
c - 1 - 0.75 %
When I run the script on frequency.txt:
I am you you you you you I I I I you you you you I am
I get this output:
iamyouyouyouyouyouiiiiyouyouyouyouiam : 1
y - 9 - 24.32 %
u - 9 - 24.32 %
o - 9 - 24.32 %
i - 6 - 16.22 %
m - 2 - 5.41 %
a - 2 - 5.41 %
Could I get some guidance on how to separate out the words on each line so they are counted as distinct words, with the counts summed up in the manner desired?
line = line.translate(line.maketrans('', '', string.whitespace))
This line removes all whitespace from the line, so there is nothing left to split on. Remove it and the script should work as you intend.
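For example, a minimal sketch of the fix: keep the punctuation and digit translations, drop the whitespace one, and call split() with no argument so any run of whitespace separates words:
line = line.translate(line.maketrans('', '', string.punctuation))
line = line.translate(line.maketrans('', '', string.digits))
# split() with no separator handles any amount of whitespace
words = line.split()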
Your code removes the spaces and then tries to split on a space, which doesn't make sense. Since you want to extract every word from the text, I would suggest normalizing the text so that all words sit next to each other with a single space in between; that means removing not only newlines, unnecessary spaces, special/undesired characters, and digits, but control characters as well.
That should do the trick:
import sys
import os
os.getcwd()
import string

path = "/your/path"
os.chdir(path)

# Prompt for user to input filename:
fname = input("Enter the filename: ")
try:
    fhand = open(fname)
except IOError:
    # Invalid filename error
    print("\n")
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()

# Initialize char counts and word counts dictionary
counts = {}
worddict = {}

# Create a single line of text with undesired characters removed
text = fhand.read().replace("\n", " ").replace("\r", "")
text = text.lower()
text = text.translate(text.maketrans("", "", string.digits))
text = text.translate(text.maketrans("", "", string.punctuation))
text = " ".join(text.split())

words = text.split(" ")
for word in words:
    # Is the word already in the word dictionary?
    if word in worddict:
        # Increase by 1
        worddict[word] += 1
    else:
        # Add word to dictionary with count of 1 if not there already
        worddict[word] = 1

# Character count
for char in text:
    # Increase count by 1 if letter
    if char in counts:
        counts[char] += 1
    else:
        counts[char] = 1

# Initialize lists
lst = []
countlst = []
freqlst = []
# Count up the number of letters
for ltrs, c in counts.items():
    # Skip spaces
    if ltrs == " ":
        continue
    lst.append((c, ltrs))
    countlst.append(c)
# Sum up the count
totalcount = sum(countlst)
# Calculate the frequency of each character
for ec in countlst:
    efreq = (ec / totalcount) * 100
    freqlst.append(efreq)
# Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)
# Print out the ten most common words and their counts
for key in sorted(worddict.keys(), key=worddict.get, reverse=True)[:10]:
    print(key, ":", worddict[key])
# Print out all letters and counts:
for ltrs, c in lst:
    print(c, "-", ltrs, "-", round(ltrs / totalcount * 100, 2), "%")

How to get most common words with a specific value in a dataframe Python

I have a dataframe with scores of 0 and 1 and corresponding reviews. I want to find the most common words in reviews with score 0 and score 1. I tried this, but it counts occurrences of whole summaries rather than individual words:
from collections import defaultdict

count = defaultdict(int)
l = df['Summary']
for number in l:
    count[number] += 1
print(count)
How can I find the most common words across all the rows with score 1 and score 0?
Try using a frequency dict. If your columns can be viewed as a list of lists:
data = [[0, "text sample 1"], [0, "text sample 2"], [1, "text sample 3"]]
...then you can:
fd0 = dict()
fd1 = dict()
for list_item in data:
    associated_value = list_item[0]
    # note that split(' ') splits the string into a list of words
    for word in list_item[1].split(' '):
        if associated_value == 0:
            fd0[word] = 1 if word not in fd0 else fd0[word] + 1
        elif associated_value == 1:
            fd1[word] = 1 if word not in fd1 else fd1[word] + 1
At the end of the loop, fd0 holds the word frequencies for label 0 and fd1 the frequencies for label 1.
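To pull the most common words out of those dicts, a small sketch (showing the top three for label 0; adjust the slice as needed):
for word, freq in sorted(fd0.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(word, freq)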
Assuming your data looks like this
            review  score
0       bad review      0
1      good review      1
2  very bad review      0
3   movie was good      1
You could do something like
words = pd.concat([pd.Series(row['score'], row['review'].split(' '))
                   for _, row in df.iterrows()]).reset_index()
words.columns = ['word', 'score']
print(words.groupby(['score', 'word']).size())
which gives you
score  word
0      bad       2
       review    2
       very      1
1      good      2
       movie     1
       review    1
       was       1
dtype: int64
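If you only want the top few words within each score group, one way (a sketch building on the words frame above) is to sort the counts and take the head of each group:
counts = words.groupby(['score', 'word']).size()
# top 2 words per score; change head(2) as needed
print(counts.sort_values(ascending=False).groupby(level='score').head(2))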
from collections import Counter

most_common_0 = ''
most_common_1 = ''
for text, score in zip(df['Summary'], df['Score']):
    if score == 1:
        most_common_1 += ' ' + text
    else:
        most_common_0 += ' ' + text

c = Counter(most_common_1.split())
print(c.most_common(2))  # change this 2 to the number of top words you want
Output
[('good', 2), ('and', 1)]

Program that reads a vote text file and prints the outcome is not printing the right numbers

I need my code to read lines of text composed of y's, a's, and n's: y meaning yes, n meaning no, and a meaning abstain. I'm trying to add up the number of yes votes. The text file looks like this:
Aberdeenshire
yyynnnnynynyannnynynanynaanyna
Midlothian
nnnnynyynyanyaanynyanynnnanyna
Berwickshire
nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny
Here is my code:
def main():
    file = open("votes.txt")
    lines = file.readlines()
    votes = 0
    count = 0
    count_all = 0
    for m in range(1, len(lines), 2):
        line = lines[m]
        for v in line:
            if v == 'a':
                votes += 1
            elif v == 'y':
                count_all += 1
                count += 1
                votes += 1
            else:
                count_all += 1
    print("percentage:" + str(count / count_all))
    print("Overall there were ", (count / count_all), " yes votes")

main()
First of all, you should note that file.readlines() actually gives you the \n at the end of each line, which in your code falls into the else block and so gets counted as a no:
>>> with open("votes.txt", "r") as f:
...     print(f.readlines())
...
['Aberdeenshire\n', 'yyynnnnynynyannnynynanynaanyna\n', 'Midlothian\n',
 'nnnnynyynyanyaanynyanynnnanyna\n', 'Berwickshire\n',
 'nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny\n']
So that might explain why you aren't getting the right numbers.
Now, to make the code a bit more efficient, we could use the count method of str, and also get rid of those \n by using a split rather than readlines:
with open("votes.txt","r") as f:
full = f.read()
lines = full.split("\n")
votes = 0
a = 0
y = 0
n = 0
for m in range(1,len(lines),2):
line = lines[m]
votes += len(line) # I'm counting n's as well here
a += line.count("a")
y += line.count("y")
n += line.count("n")
print("Overall, there were " + str(100 * y / (y + n)) + "% yes votes.")
Hope that helped!
A more or less Pythonic one-liner; it doesn't give you the votes for each city, though:
from collections import Counter
l = """Aberdeenshire
yyynnnnynynyannnynynanynaanyna
Midlothian
nnnnynyynyanyaanynyanynnnanyna
Berwickshire
nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny"""
Counter([char for line in l.split('\n')[1::2] for char in line.strip()])
Returns:
Counter({'a': 11, 'n': 60, 'y': 22})
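If you do want the tallies per city, a small sketch along the same lines, pairing each name line with the vote line that follows it:
lines = l.split('\n')
for city, votes in zip(lines[0::2], lines[1::2]):
    print(city, Counter(votes.strip()))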

I have a list and I want to count the occurrence of items at a certain position in my list

I have a file that I read into a list. Now I want to count occurrences of specific elements at a specific position in each row. Here is my code so far:
Inpu = open("I.txt", "r")
entries = []
for line in Inpu:
    line = line.lstrip()
    if not line.startswith('#'):
        row = line.split()
        entries.append(row)

count0 = 0
count1 = 0
for item in entries:
    try:
        if item[3] == '1':
            count1 += 1
        if item[3] == '0':
            count0 += 1
            print item
    except IndexError:
        continue
This for loop works fine and gives me the totals of count1 and count0 in my file.
for line in Inpu:
    line = line.lstrip
    if not line.startwith('#'):
        rows = line.split()
        peptide_count.append(row)

for line in Inpu:
    for i in line:
        while i in line[0] == '1':
            peptide_length += 1
            if i in line[3] == '1':
                count1 = str(line[3]).count(1)
                print(count1)
            if i in line[3] == '0':
                count0 = str(item[3]).count(0)
                print str(count0)
        else:
            if i in line[0] == '>':
                break

print("peptide_Lengths|1s|0s")
print(str(peptide_length) + "\t" + str(countones) + "\t" + str(countz))
This, on the other hand, is supposed to count the number of occurrences of zeros and ones in item 3 when position line[0] starts with '1', and should break when it comes across '>' in line[0]. But here is the output I get, which is obviously wrong:
peptide_Lengths|1s|0s
0 0 0
Is there anything I'm doing wrong or missing?
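No answer was posted, but a few problems stand out: line.lstrip is missing its parentheses, startwith should be startswith, the file iterator is already exhausted by the first loop, and expressions like i in line[3] == '1' are chained comparisons rather than equality tests. A minimal sketch of the counting pass, assuming the entries list built by the first snippet:
count0 = 0
count1 = 0
for row in entries:
    if row and row[0].startswith('>'):  # stop at the first '>' marker
        break
    if len(row) > 3:
        if row[3] == '1':
            count1 += 1
        elif row[3] == '0':
            count0 += 1
print("1s|0s")
print(str(count1) + "\t" + str(count0))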

appending specific words to list from file in python

I am writing a program that reads a file of 50,000 words and needs to compute the percentage of words that do not contain the letter 'e'. I can get the program to print all the words without an 'e', but I want to append them to a list so that I can count them. What I have now gives me a result of 0 every time I run it, although it does produce the correct total number of lines. Sorry, I am not the best at Python.
f = open("hardwords.txt")

def has_no_e(f):
    words = []
    sum_words = len(words)
    total = sum(1 for s in f)
    print total
    print sum_words
    letter = 'e'
    for line in f:
        for l in letter:
            if l in line:
                break
        else:
            words.append(line)

has_no_e(f)
You don't need to collect the words, just count them.
Untested:
total = 0
without_e = 0
with open("hardwords.txt") as f:
    for line in f:
        total = total + 1
        if 'e' not in line:
            without_e = without_e + 1
percentage = float(without_e) / float(total)
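You could then print it, for example (multiplying by 100 to express it as a percentage):
print("{:.2f}% of the words have no 'e'".format(percentage * 100))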
What about this:
def has_no_e():
    with open(path, "r") as f:
        words = [word.strip() for line in f.readlines() for word in line.strip().split(',')]
    words_without_e = [word for word in words if 'e' not in word]
    print len(words), words
    print len(words_without_e), words_without_e

has_no_e()
Now you just need to calculate the percentage,
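which could be, for instance (a sketch to append at the end of has_no_e; the 100.0 literal forces float division on Python 2):
print 100.0 * len(words_without_e) / len(words)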
This does just that:
def has_no_e(path):
    total_words = 0
    words_without_e = 0
    with open(path, "r") as f:
        for line in f:
            words = line.lower().split()
            total_words += len(words)
            words_without_e += sum("e" not in w for w in words)
    return (float(words_without_e) / total_words) * 100
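Usage is then just (assuming the filename from the question):
print(has_no_e("hardwords.txt"))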
This is a possible way to do it:
with open(r'G:\Tmp\demo.txt', 'r') as f:
    total = 0
    count = 0
    for line in f:
        words = line.split()
        total = total + len(words)
        # count words that do not contain the letter 'e'
        count = count + len([w for w in words if 'e' not in w])
    print 'Total words: {0}, counted: {1}'.format(total, count)
