I am relatively new to Python, so apologies in advance if I sound a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in the second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear, so not just 'the' once, but every time the program encounters it where the editions differ (yep, I know I'm asking for a lot, sorry)
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
for j in list2:
if i==j:
file3.write(i)
but of course this doesn't work, because the texts are two giant balls of text rather than separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words', or to the text in general, to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ

s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()

def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>' + i[2:] + '</b>' if i[:1] == '+' else i[2:]
                     for i in dif if not i[:1] in '-?'])

print appendBoldChanges(s1, s2)
but I couldn't get it to work.
So my question is: is there any way to output the differences between texts that don't line up neatly like this? It sounded quite do-able, but I've greatly underestimated how difficult I'd find Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
if word1 not in wordlist1:
wordlist1[word1] = 1
else:
wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
#print "%s %s" % (k, v)
col1 = ("%s %s" % (k, v))
print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
if word2 not in wordlist2:
wordlist2[word2] = 1
else:
wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
#print "%s %s" % (k, v)
col2 = ("%s %s" % (k, v))
print col2
What I still hope to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?
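(For anyone else landing here: collections.Counter supports exactly this kind of per-word arithmetic. A minimal sketch with the hypothetical counts above; note that Counter.subtract() keeps zero and negative differences, while the - operator drops them:)

from collections import Counter

c1 = Counter({'apple': 3, 'bridge': 7, 'chair': 5})
c2 = Counter({'apple': 1, 'bridge': 9, 'chair': 5})

diff = c1.copy()
diff.subtract(c2)  # keeps zeros and negatives
print(diff)        # Counter({'apple': 2, 'chair': 0, 'bridge': -2})
print(c1 - c2)     # Counter({'apple': 2}) -- the - operator drops non-positive counts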
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re

s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()

# [A-Za-z]+ matches whole words; a bare [A-Za-z] would match single letters
words_s1 = re.findall(r"[A-Za-z]+", s1)
words_s2 = re.findall(r"[A-Za-z]+", s2)

set_s1 = set(words_s1)
set_s2 = set(words_s2)

words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1

words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)

with open("s1_output", "w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output", "w") as s2_output:
    s2_output.write(words_in_s2)
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through the lines of a file, which you can do very easily in Python. Here's an example, where I omit the newline character at the end of each line and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
lines.append(f[:-1])
Hope this helps!
I'm not completely sure whether you're trying to compare differences in words as they occur or in lines as they occur; however, one way you could do this is by using a dictionary. If you want to see which lines change, you could split the text into sentences on the periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which I assume contains the entire text) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the list you just created, and make each one a key in the dictionary with a corresponding value (the value could be anything, since most sentences probably don't occur more than once). After doing this for the first version, you can go through the second version and check which sentences are the same. Here is some code to give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
    dict[sentence] = 1  # put a counter for each sentence
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in dict:  # if the sentence is in the dictionary
        pass
        # or do whatever you want here
    else:  # if the sentence isn't
        print(sentence)
Again, not sure if this is what you're looking for, but I hope it helps.
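Putting those pieces together, a minimal end-to-end sketch (with toy strings standing in for the real novels) might look like:

text1 = 'this is a sentence. this is another sentence.'
text2 = 'this is a sentence. this is a changed sentence.'

seen = {}
for sentence in text1.split('.'):
    seen[sentence] = 1

# Print the sentences from version 2 that never appear in version 1.
for sentence in text2.split('.'):
    if sentence not in seen:
        print(sentence)  # prints " this is a changed sentence" (with its leading space)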
Some of the unique words in the text file are not counted, and I have no idea what's wrong in my code.
file = open('tweets2.txt', 'r')
unique_count = 0
lines = file.readlines()
line = lines[3]
per_word = line.split()
for i in per_word:
    if line.count(i) == 1:
        unique_count = unique_count + 1
print(unique_count)
file.close()
Here is the text file:
"I love REDACTED and Fiesta and all but can REDACTED host more academic-related events besides strand days???"
The output of this code is:
16
The expected output of the code came from the text file should be:
17
"i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
The output of this code is:
20
The expected output of the code came from the text file should be:
23
If you want to count the number of unique whitespace-delimited tokens (case-sensitive) in the entire file, then:
with open('myfile.txt') as infile:
    print(len(set(infile.read().split())))
The problem may be that count() matches characters (substrings), not whole words. Instead, use the Pythonic way with the set() function to clear out duplicated words:
per_word = set(line.split())
print(len(per_word))
You are counting each word as a substring of the whole line, because you do:
for i in per_word:
    if line.count(i) == 1:
So some words repeat as substrings but not as words. For example, the first word is "i": line.count("i") gives 7 (it is also in "if", "im", etc.), so you don't count it as a unique word (even though it is). If you instead do:
for i in per_word:
    if per_word.count(i) == 1:
then you will count each word as a whole word and get the output you need.
Anyway, this is very inefficient (O(n^2)), since you iterate over each word and then count() iterates over the whole list again to count it. Either use a set as suggested in other answers, or use a Counter:
from collections import Counter

unique_count = 0
line = "i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
per_word = line.split()
counter = Counter(per_word)
for count in counter.values():
    if count == 1:
        unique_count += 1

# Or simply
unique_count = sum(count == 1 for count in counter.values())
print(unique_count)
from docx import Document

alphaDic = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','!','?','.','~',',','(',')','$','-',':',';',"'",'/']
doc = Document('realexample.docx')
docIndex = 0

def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None

while docIndex < len(doc.paragraphs):
    firstSen = doc.paragraphs[docIndex].text
    rep_dic = {ord(k): None for k in alphaDic + [x.upper() for x in alphaDic]}
    translation = firstSen.translate(rep_dic)
    removeExcessSpaces = " ".join(translation.split())
    if removeExcessSpaces != '':
        doc.paragraphs[docIndex].text = removeExcessSpaces
    else:
        delete_paragraph(doc.paragraphs[docIndex])
        docIndex -= 1  # go one step back in the loop because of the deleted index
    docIndex += 1
So the test document looks like this
Hello
你好
Good afternoon
朋友们
Good evening
晚上好
And I'm trying to achieve this result below.
你好
朋友们
晚上好
Right now the code removes all empty paragraphs and excessive spaces and does this, so I'm kinda stuck here. I only want to erase the line breaks that were caused by the English words.
你好
朋友们
晚上好
What you can do is look for English words: once you find an English word "WORD", append "\n" to it and then remove this new result "WORD\n" from the document. The way you append strings in Python is with the + sign; just do "WORD" + "\n".
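For instance, a minimal sketch of that idea on a plain string (the docx plumbing is left out, and the regex assumes the English text consists of runs of ASCII letters and spaces):

import re

text = "Hello\n你好\nGood afternoon\n朋友们\nGood evening\n晚上好\n"

# Remove each run of English words together with the line break that follows it.
cleaned = re.sub(r'[A-Za-z ]+\n', '', text)
print(cleaned)  # 你好, 朋友们, 晚上好 -- each on its own line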
I have txt files that look like this:
word, 23
Words, 2
test, 1
tests, 4
And I want them to look like this:
word, 23
word, 2
test, 1
test, 4
I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:
import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f, 'r') as a:
        a = a.read()
        a = a.lower()
    return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f, 'w') as d:
        d = d.write(a)
        #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
I have also tried these 2 definitions instead of the stem definition:
def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
    return word
Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:
word, 25
test, 5
I'm not sure how to do that. A solution would be nice but not necessary.
If you have complex words to singularize, I don't advise you to use stemming but a proper Python package like pattern:
from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']
singles = [singularize(plural) for plural in plurals]
print(singles)
returns:
>>> ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'foo', 'family', 'family', 'dog', 'dog', 'child', 'wolf']
It's not perfect, but it's the best I found: 96% accurate, per the docs: http://www.clips.ua.ac.be/pages/pattern-en#pluralization
It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.
def openfile(f):
    with open(f, 'r') as a:
        a = a.read()  # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
    return a
This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.
def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a]  # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
This definitely doesn't work for your purposes, and there are a few different things we can do.
1. We can change it so that we read the input file as one list of lines.
2. We can use the big string and break it down into a list ourselves.
3. We can go through and stem each line in the list of lines one at a time.
Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:
def openfile(f):
    with open(f, 'r') as a:
        a = a.readlines()  # a will equal ['soc, 32\n', 'soc, 1\n', ...] in your example
        b = [x.lower() for x in a]
    return b
This should give us b as a list of lines, i.e. ['soc, 32\n', 'soc, 1\n', ...] (note that readlines keeps the newlines). So the next problem becomes what we do with the list of strings when we pass it to stem(). One way is the following:
def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',')  # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1]  # put it back together
        b.append(new_line)  # add it to the new list of lines
    return b
This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.
def returnfile(f, a):
    with open(f, 'w') as d:
        for line in a:
            d.write(line)

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
When I have the following input.txt
soc, 32
socs, 1
dogs, 8
I get the following stdout:
Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None
And input.txt looks like this:
soc, 32
soc, 1
dog, 8
The second question, regarding merging numbers for the same words, changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way is to iterate through each line of your input and stem words as you process them. I'll write up code about this in a bit if you're still working to figure it out.
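In the meantime, here is a minimal sketch of that dictionary approach (assuming the same 'word, count' line format as the input above):

from collections import defaultdict
import nltk

def merge_counts(lines):
    # Map each stemmed word to the sum of its counts.
    p = nltk.PorterStemmer()
    totals = defaultdict(int)
    for line in lines:
        word, _, count = line.partition(',')
        totals[p.stem(word.strip())] += int(count)
    return totals

# merge_counts(['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']) -> {'soc': 33, 'dog': 8}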
The Nodebox English Linguistics library contains scripts for converting plural form to singular form and vice versa. Check out the tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization
To convert plural to singular, just import the singular module and use the singular() function. It handles proper conversions for words with different endings, irregular forms, etc.
from en import singular
print(singular('analyses'))
print(singular('planetoids'))
print(singular('children'))
>>> analysis
>>> planetoid
>>> child
I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text after the identifier (but before the next > symbol). The identifier always comes on its own line, but the text can take up multiple lines.
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
After execution I might have the data structured something like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
It seems I would want to use regex to find (1) things after > but before the carriage return, and (2) things between >'s, having temporarily deleted the identifier string and EOL and replaced them with "".
The thing is, I will have hundreds of these identifiers, so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in Python, but feel free to use whatever language you want in your response.
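(For readers who want the regex route described above, here is one possible sketch; it assumes > appears only at the start of identifier lines:)

import re

teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa'''

# Capture the identifier line, then everything up to the next '>' (or the end).
pairs = re.findall(r'>(.+)\n([^>]*)', teststring)
ids = [name for name, body in pairs]                    # ['identifier1', 'identifier2']
txt = [body.replace('\n', '') for name, body in pairs]  # ['lalala...', 'bababa...']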
Update 1: code from slater is getting closer, but things are still not partitioned sequentially into id, text, id, text, etc.
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
but the output was:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
Note: it needs to work for a multiline string, dealing with all the \n's. A better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
Personally, I feel that you should use regex as little as possible. It's slow, difficult to maintain, and generally unreadable.
That said, solving this in python is extremely straightforward. I'm a little unclear on what exactly you mean by running this "sequentially", but let me know if this solution doesn't fit your needs.
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously, you could make the code more efficient, but if you're only dealing with hundreds of identifiers it really shouldn't be needed.
If you want to remove any blank entries that might occur, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
filter(None, list_with_blanks)
>>> ['hello', 'world']
Let me know if you have any more questions.
Unless I misunderstood the question, it's as easy as
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to concatenate multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
I found a solution. It's certainly not very pythonic but it works.
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''

i = 0
j = 0
#split the multiline string by line
dsplit = teststring.split('\n')
#the indices of identifiers
index = list()
for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i + 1
index.append(i)  # add this so you get the last block of text

#the text corresponding to each index
thetext = list()
#the names corresponding to each gene
thenames = list()
for n in range(0, len(index) - 1):
    thetext.append("")
    for k in range(index[n] + 1, index[n + 1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:])  # the [1:] removes the first character (>) from the line

print "the indices", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
this gives the following output:
>identifier1
>identifier2
>identifier3
the indices [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3
I'm pretty new to python, but I think I catch on fast.
Anyways, I'm making a program (not for class, but to help me) and have come across a problem.
I'm trying to document a list of things, and by things I mean close to a thousand of them, with some repeating. So my problem is this:
I would not like to add redundant names to the list; instead, I would just like to add a 2x or 3x before it (or after, whichever is simpler), and then write that to a txt document.
I'm fine with reading and writing from text documents, but my only problem is the conditional statement; I don't know how to write it, nor can I find it online.
for lines in list_of_things:
    if(lines == "XXXX x (name of object here)"):
And then whatever goes under the if statement. My only problem is that "XXXX" can be replaced by any number string, but I don't know how to include a variable within a string, if that makes any sense. Even if it is turned into an int, I still don't know how to use a variable within a conditional.
The only thing I can think of is making multiple if statements, which would be really long.
Any suggestions? I apologize for the wall of text.
I'd suggest looping over the lines in the input file, inserting a key in a dictionary for the first instance of each value you find and incrementing the value at that key by one for each instance thereafter, then generating your output file from that dictionary.
catalog = {}
for line in input_file:
    if line in catalog:
        catalog[line] += 1
    else:
        catalog[line] = 1
Alternatively:
from collections import defaultdict
catalog = defaultdict(int)
for line in input_file:
    catalog[line] += 1
Then just run through that dict and print it out to a file.
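For instance, a minimal sketch (the output filename is made up here) that prefixes repeated lines with a count, as the question asks:

with open('output.txt', 'w') as out:
    for line, count in catalog.items():
        prefix = '%dx ' % count if count > 1 else ''
        out.write(prefix + line)  # each line still ends with its newline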
You may be looking for regular expressions and something like
import re

for line in text:
    match = re.match(r'(\d+) x (.*)', line)
    if match:
        count = int(match.group(1))
        object_name = match.group(2)
        ...
Something like this?
list_of_things=['XXXX 1', 'YYYY 1', 'ZZZZ 1', 'AAAA 1', 'ZZZZ 2']
for line in list_of_things:
    for e in ['ZZZZ', 'YYYY']:
        if e in line:
            print line
Output:
YYYY 1
ZZZZ 1
ZZZZ 2
You can also use if line.startswith(e): or a regex (if I am understanding your question...)
To include a variable in a string, use format():
>>> i = 123
>>> s = "This is an example {0}".format(i)
>>> s
'This is an example 123'
In this case, the {0} indicates that you're going to put a variable there. If you have more variables, use "This is an example {0} and more {1}".format(i, j) (so a number for each variable, starting from 0).
This should do it:
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
from itertools import groupby
print ["%dx %s" % (len(list(group)), key) for key, group in groupby(a)]
There are two options to approach this. 1) Something like the following, using a dictionary to capture the count of items and then a list to format each item with its count:
list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = {}
countedList = []
for lines in list_of_things:
    if lines in listItemCount:
        listItemCount[lines] += 1
    else:
        listItemCount[lines] = 1
for id in listItemCount:
    if listItemCount[id] > 1:
        countedList.append(id + ' - x' + str(listItemCount[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
grass
green - x2
grey
moon
Or 2) using collections to make things simpler, as shown below:
import collections

list_of_things = ['sun', 'moon', 'green', 'grey', 'sun', 'grass', 'green']
listItemCount = collections.Counter(list_of_things)
listItemCountDict = dict(listItemCount)
countedList = []
for id in listItemCountDict:
    if listItemCountDict[id] > 1:
        countedList.append(id + ' - x' + str(listItemCountDict[id]))
    else:
        countedList.append(id)
for item in countedList:
    print(item)
the output of the above would be
sun - x2
grass
green - x2
grey
moon