Converting plural to singular in a text file with Python - python

I have txt files that look like this:
word, 23
Words, 2
test, 1
tests, 4
And I want them to look like this:
word, 23
word, 2
test, 1
test, 4
I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:
import nltk
f = raw_input("Please enter a filename: ")
def openfile(f):
    with open(f,'r') as a:
        a = a.read()
        a = a.lower()
        return a
def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a
def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
        #d.close()
print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
I have also tried these 2 definitions instead of the stem definition:
def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line
def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
        return word
Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:
word, 25
test, 5
I'm not sure how to do that. A solution would be nice but not necessary.

If you have complex words to singularize, I don't advise stemming; use a proper Python package like pattern instead:
from pattern.text.en import singularize
plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
'families', 'dogs', 'child', 'wolves']
singles = [singularize(plural) for plural in plurals]
print(singles)
returns:
>>> ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']
It's not perfect, but it's the best I have found. According to the docs, accuracy is about 96%: http://www.clips.ua.ac.be/pages/pattern-en#pluralization

It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.
def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a
This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.
def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
This definitely doesn't work for your purposes, and there are a few different things we can do.
1. We can change it so that we read the input file as one list of lines.
2. We can use the big string and break it down into a list ourselves.
3. We can go through and stem each line in the list of lines one at a time.
Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:
def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will equal ['soc, 32\n', 'socs, 1\n', ...] in your example
        b = [x.lower() for x in a]
        return b
This should give us b as a list of lines, i.e. ['soc, 32\n', 'socs, 1\n', ...]. So the next problem is what to do with the list of strings when we pass it to stem(). One way is the following:
def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] # put it back together
        b.append(new_line) # add it to the new list of lines
    return b
This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.
def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)
print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
When I have the following input.txt
soc, 32
socs, 1
dogs, 8
I get the following stdout:
Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None
And input.txt looks like this:
soc, 32
soc, 1
dog, 8
The second question, merging the numbers for identical words, changes our solution from above. As suggested in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way is to iterate through each line of your input and stem each one as you process it. I'll write up code about this in a bit, if you're still working to figure it out.
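A minimal sketch of the dictionary approach for the merging step, assuming every non-blank line has the form word, count. The singularize argument is a placeholder for whichever stemmer or singularizer you settle on; the default, naive_singular, is a hypothetical helper (not part of NLTK) that only strips a trailing s.

```python
from collections import defaultdict

def naive_singular(word):
    """Hypothetical stand-in for a real stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith('s') and len(word) > 1 else word

def merge_counts(lines, singularize=naive_singular):
    """Sum the counts of all lines whose singularized word is the same."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, count = line.split(',')
        totals[singularize(word.strip().lower())] += int(count)
    return dict(totals)

lines = ['word, 23', 'Words, 2', 'test, 1', 'tests, 4']
merged = merge_counts(lines)
for word, total in merged.items():
    print('%s, %d' % (word, total))
```

Passing a real stemmer, for example singularize=nltk.PorterStemmer().stem, should work the same way, since the function only needs a callable that maps a word to its base form.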

The Nodebox English Linguistics library contains scripts for converting plural forms to singular and vice versa. Check out the tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization
To convert plural to singular, just import the singular module and use the singular() function. It handles words with different endings, irregular forms, etc.
from en import singular
print(singular('analyses'))
print(singular('planetoids'))
print(singular('children'))
>>> analysis
>>> planetoid
>>> child

Related

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacement" file. The file should have two columns: old word, new word, like the arguments of replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete the words that are unnecessary for me, e.g. replace('to',''). But I also want to be able to change special names, e.g. replace("D'Amelio",'DAmelio'). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I couldn't find anything by Googling, but it may well be that I don't know the right terms or how to phrase the question.
with open('<filepath>','r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]
def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
I was unable to understand your question properly but as far as I understand you have strings like cat, dog, etc. and you have a file in which you have data with which you want to replace the string. If this was your requirement, I have given the solution below, so try running it if it satisfies your requirement.
If that's not what you meant, please comment below.
TXT File(Don't use '' around the strings in Text File):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ") # string you want to replace
with open('textfile.txt', "r") as file1: # open your file
    lines = file1.readlines()
    for line in lines: # take the lines of the file one by one
        string1 = line.split() # split the line into a list like ['cat,', 'kitten']
        if your_string == string1[0][:-1]: # compare your string with the first column, minus its trailing comma
            your_string = string1[1] # on a match, e.g. input cat, replace it with kitten
            print(your_string)
If you got the correct answer please upvote my answer as it took my time to make and test the python file.
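For the keyword-radar use case in the question, a dictionary lookup per word may be a better fit than str.replace, because replace('to', '') also removes "to" inside longer words. The following is only a sketch under assumptions: the replacement file holds one old, new pair per line, an empty new column means "delete this word", and the names load_replacements, apply_replacements and replacements.txt are made up for illustration.

```python
def load_replacements(lines):
    """Build an {old: new} dict from 'old, new' lines; an empty new word means delete."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        old, _, new = line.partition(',')
        mapping[old.strip()] = new.strip()
    return mapping

def apply_replacements(words, mapping):
    """Replace whole words via the mapping and drop words mapped to ''."""
    replaced = (mapping.get(word, word) for word in words)
    return [word for word in replaced if word]

# In practice these lines would come from open('replacements.txt') (hypothetical filename).
replacement_lines = ["to,", "up,", "D'Amelio, DAmelio"]
words = ['truck', 'rental', "D'Amelio", 'family', 'to', 'invest', 'up', 'to', '$25M']
keywords = apply_replacements(words, load_replacements(replacement_lines))
print(keywords)
```

Because the lookup is by whole word, a word such as "tomato" would pass through untouched even though "to" is on the delete list.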

python script not joining strings as expected

I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and sequence does not appear in one of the lists in the lists (i.e. no redFish or blueFish in the first list of testNames) I want to add in a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
The main issue with your code, which makes it very hard to understand, is you're not really leveraging the language elements that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])
result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
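As a side note on why the loop in the question printed ['', '', '', '']: Python strings are immutable, so str.join returns a new string and never changes the one it is called on. The short illustration below uses made-up values; the fix in the original loop would be to assign the result back, for example with +=.

```python
# str.join returns a new string; the list element itself is left alone.
complete = ['', '']
complete[0].join('aaaa')      # the returned string is discarded
unchanged = list(complete)    # still ['', '']

# Assigning the result back is what actually builds up the strings.
complete[0] += 'aaaa'
complete[1] += '-' * 4
print(unchanged, complete)
```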
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'): # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']

output differences between 2 texts when lines are dissimilar

I am relatively new to Python, so apologies in advance for sounding a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in the second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear so not just 'the' once, but every time the program encounters it when it differs from the first edition (yep I know I'm asking for a lot sorry)
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
    for j in list2:
        if i==j:
            file3.write(i)
but of course this doesn't work because the texts are two giant balls of texts and not separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words' or the text in general to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ
s1 = open("FRANKENST18.txt", "r")
s1 = open("FRANKENST31.txt", "r")
def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>'+i[2:]+'</b>' if i[:1] == '+' else i[2:] for i in dif
                     if not i[:1] in '-?'])
print appendBoldChanges
but I couldn't get it to work.
So my question is is there any way to output the differences between texts that are not similar in lines like this? It sounded quite do-able but I've greatly underestimated how difficult I found Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
    if word1 not in wordlist1:
        wordlist1[word1] = 1
    else:
        wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
    #print "%s %s" % (k, v)
    col1 = ("%s %s" % (k, v))
    print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
    if word2 not in wordlist2:
        wordlist2[word2] = 1
    else:
        wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
    #print "%s %s" % (k, v)
    col2 = ("%s %s" % (k, v))
    print col2
what I hope still to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 5}?
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re
s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()
# the + makes the pattern match whole words rather than single letters
words_s1 = re.findall("[A-Za-z]+", s1)
words_s2 = re.findall("[A-Za-z]+", s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output","w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output","w") as s2_output:
    s2_output.write(words_in_s2)
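A set difference reports each word only once. If the number of occurrences matters, as in the {apple 3, ...} - {apple 1, ...} idea from the question, collections.Counter can subtract per-word counts instead. This is a rough sketch with two short made-up strings standing in for the two editions; word_counts and count_difference are illustrative names, and the word pattern (lowercase letters and apostrophes) is an assumption you may want to adjust.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words (runs of letters and apostrophes) in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def count_difference(text1, text2):
    """Signed per-word difference: positive means more frequent in text1."""
    diff = word_counts(text1)
    diff.subtract(word_counts(text2))  # keeps negative counts, unlike the - operator
    return {word: n for word, n in diff.items() if n != 0}

edition1 = "the apple the bridge the chair"
edition2 = "the bridge bridge chair"
difference = count_difference(edition1, edition2)
print(difference)
```

Words with equal counts in both texts are dropped from the result; a negative number means the word is more frequent in the second text.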
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through lines of a file, which you can do very easily in python. Here's an example, where I omit the newline character at the end of each line, and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
    lines.append(line.rstrip('\n')) # drop the trailing newline from the line, not from the file object
Hope this helps!
I'm not completely sure if you're trying to compare the differences in words as they occur or lines as they occur, however one way you could do this is by using a dictionary. If you want to see which lines change you could split the lines on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with seen = {}, loop over each sentence in the previously created list, and make it a key in the dictionary with a corresponding value (which could be anything, since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
seen = {} # avoid calling it dict, which would shadow the built-in
for sentence in version1:
    seen[sentence] = 1 # mark the sentence as seen
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in seen: # if the sentence is in the dictionary
        pass
        # or do whatever you want here
    else: # if the sentence isn't
        print(sentence)
Again not sure if this is what you're looking for but hope it helps

Using conditionals with variable strings in python

I'm pretty new to python, but I think I catch on fast.
Anyways, I'm making a program (not for class, but to help me) and have come across a problem.
I'm trying to document a list of things, and by things I mean close to a thousand of them, with some repeating. So my problem is this:
I would not like to add redundant names to the list, instead I would just like to add a 2x or 3x before (or after, whichever is simpler) it, and then write that to a txt document.
I'm fine with reading and writing from text documents, but my only problem is the conditional statement, I don't know how to write it, nor can I find it online.
for lines in list_of_things:
    if(lines=="XXXX x (name of object here)"):
And then whatever under the if statement. My only problem is that the "XXXX" can be replaced with any string number, but I don't know how to include a variable within a string, if that makes any sense. Even if it is turned into an int, I still don't know how to use a variable within a conditional.
The only thing I can think of is making multiple if statements, which would be really long.
Any suggestions? I apologize for the wall of text.
I'd suggest looping over the lines in the input file and inserting a key in a dictionary for each one you find, then incrementing the value at the key by one for each instance of the value you find thereafter, then generating your output file from that dictionary.
catalog = {}
for line in input_file:
    if line in catalog:
        catalog[line] += 1
    else:
        catalog[line] = 1
alternatively
from collections import defaultdict
catalog = defaultdict(int)
for line in input_file:
    catalog[line] += 1
Then just run through that dict and print it out to a file.
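The last step, turning the counts into "2x name" style lines, could look roughly like the sketch below. It assumes one item per input line and uses collections.Counter for the counting; format_catalog and the catalog.txt filename are made up for illustration.

```python
from collections import Counter

def format_catalog(items):
    """Return one line per distinct item, prefixed with 'Nx ' when it repeats."""
    counts = Counter(item.strip() for item in items)
    return [name if n == 1 else '%dx %s' % (n, name) for name, n in counts.items()]

things = ['sword\n', 'shield\n', 'sword\n', 'potion\n', 'sword\n']
output_lines = format_catalog(things)
print(output_lines)

# To save the result (hypothetical filename):
# with open('catalog.txt', 'w') as out:
#     out.write('\n'.join(output_lines))
```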
You may be looking for regular expressions and something like
import re
for line in text:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        count = int(match.group(1))
        object_name = match.group(2)
        ...
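Building on the regular expression idea, here is a hedged sketch that totals quantities when some lines already carry a count ("2 x sword") and others are plain names ("sword"). The exact line format is an assumption based on the question, and tally and COUNTED are illustrative names.

```python
import re
from collections import defaultdict

COUNTED = re.compile(r'(\d+)\s*x\s+(.*)')  # matches e.g. "3 x sword" or "3x sword"

def tally(lines):
    """Total the quantity per name; lines without a count prefix count as 1."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        match = COUNTED.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
        elif line:
            totals[line] += 1
    return dict(totals)

totals = tally(['2 x sword', 'shield', 'sword', '10x potion', 'shield'])
print(totals)
```

This avoids writing one if statement per possible number: the pattern captures whatever number appears, so the comparison is done on the name alone.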
Something like this?
list_of_things=['XXXX 1', 'YYYY 1', 'ZZZZ 1', 'AAAA 1', 'ZZZZ 2']
for line in list_of_things:
    for e in ['ZZZZ','YYYY']:
        if e in line:
            print(line)
Output:
YYYY 1
ZZZZ 1
ZZZZ 2
You can also use if line.startswith(e): or a regex (if I am understanding your question...)
To include a variable in a string, use format():
>>> i = 123
>>> s = "This is an example {0}".format(i)
>>> s
'This is an example 123'
In this case, the {0} indicates that you're going to put a variable there. If you have more variables, use "This is an example {0} and more {1}".format(i, j) (so a number for each variable, starting from 0).
This should do it:
from itertools import groupby
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
# groupby only merges adjacent equal items, so sort the list first if it is not already grouped
print(["%dx %s" % (len(list(group)), key) for key, group in groupby(a)])
There are two options to approach this. 1) something like the following using a dictionary to capture the count of items and then a list to format each item with its count
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = {}
countedList = []
for lines in list_of_things:
    if lines in listItemCount:
        listItemCount[lines] += 1
    else:
        listItemCount[lines] = 1
for name in listItemCount:
    if listItemCount[name] > 1:
        countedList.append(name + ' - x' + str(listItemCount[name]))
    else:
        countedList.append(name)
for item in countedList:
    print(item)
the output of the above would be (the order of the items may vary)
sun - x2
grass
green - x2
grey
moon
or 2) using collections to make things simpler as shown below
import collections
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = collections.Counter(list_of_things)
listItemCountDict = dict(listItemCount)
countedList = []
for name in listItemCountDict:
    if listItemCountDict[name] > 1:
        countedList.append(name + ' - x' + str(listItemCountDict[name]))
    else:
        countedList.append(name)
for item in countedList:
    print(item)
the output of the above would be (the order of the items may vary)
sun - x2
grass
green - x2
grey
moon

manipulating and searching within large files in python

I apologize if this question has been answered already, but I cannot seem to find a page that describes this process. What I am trying to do is to take a large file (The new york times corpus), change it to a list of words using the split function, and then search through that long list for certain words. I have been able to get python to print the file with this code
words=open('nyt.txt')
for line in words:
    print (line)
but I would like to be able to use words.split() on this function afterward.
So far, I have been developing the program using a small corpus that I just type in like this
words= ('A B. C D E F G A. B C D E F G A B C D E F G A B C D E F G')
but rather than copying and pasting the NYT corpus into the parentheses (this doesn't work because the file is too large), I would like to read the file into the variable.
Once again, I am sorry if this has been asked and answered before, as is likely.
What you probably want is called a generator. In your case, it could look like this:
def words(lines):
    for line in lines:
        for word in line.split():
            yield word
This processes the file line by line, so it doesn't have to read the entire file into memory at once. The yield keyword turns the function result into a generator. Usage:
import sys
for word in words(sys.stdin):
    print(word)
Edit: If I understand you correctly this time round, you just want to read all words into a list? Easy enough:
lines = open('nyt.txt')
words = []
for line in lines:
    words.extend(line.split())
print(words)
Take a look at nltk. It's a huge project and it has tools for working with corpora. The project is written in Python and available at http://www.nltk.org/.
I'm not sure what you want, but are you looking for something like this?
words = open('README')
word_list = []
for l in words:
    word_list.append(l)
If you're going to be doing anything more sophisticated with words, you should look at the NLTK package.
To count the number of times your target word occurs in the corpus
corpus = ('A B C A B C', 'A D')
target = {'A':0, 'D':0}
# also works for a file
# for line in open('file.txt'):
for line in corpus:
    for word in line.split():
        if word in target:
            target[word] += 1
for (word, count) in target.items():
    print('word "' + word + '" occurs ' + str(count) + ' times')
Begets
word "A" occurs 3 times
word "D" occurs 1 times
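The generator and the counting loop from the answers above can be combined so that a large file is searched without ever holding the full word list in memory. This is a sketch under assumptions: io.StringIO stands in for the real file, count_targets is an illustrative name, and words are split on whitespace only, so punctuation stays attached to them.

```python
import io
from collections import Counter

def words(lines):
    """Yield words one at a time so the whole file never sits in memory."""
    for line in lines:
        for word in line.split():
            yield word

def count_targets(lines, targets):
    """Count how often each target word occurs in an iterable of lines."""
    targets = set(targets)  # set membership tests are fast
    return Counter(word for word in words(lines) if word in targets)

# io.StringIO stands in for open('nyt.txt'); both yield lines when iterated.
corpus = io.StringIO(u"A B C A B C\nA D\n")
counts = count_targets(corpus, ['A', 'D'])
print(counts['A'], counts['D'])
```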
