Get N characters from string, respecting full words - python

I am using this code to get the first 4000 characters of a long text.
text = data[0:4000]
print(text)
data is the variable containing the long text. The problem is that when I print text, I get half a word at the end, for example "con" when the full word is "content".
I am wondering if there is a way to ensure the words aren't truncated.

Find the first space at or after index 4000. The max accounts for text that runs a few characters past 4000 with no space at the end: find returns -1 in that case, and max clamps the slice index back to 4000.
ix = max(data.find(' ', 4000), 4000)
text = data[:ix]

A simple find call that looks for a space starting at character 4000 gets this started:
x = txt.find(' ',4000)
But to avoid truncating the last word, you need to test the result of that find call.
If the starting point of 4000 falls within the last word, find will return -1 and you should print/return the entire text.
If the starting point is before the last word, find will return the index of the next space, and you should print up to that index:
x = txt.find(' ', 4000)
if x < 0:
    print(txt)
else:
    print(txt[:x])
Also remember that the starting point for find is zero-based, so if the 4000th character is itself a space, find will locate the next space. As a simple example, the following code returns "four five" rather than simply "four". If this is not the desired result, consider using 3999 in your find.
txt = "four five six"
x = txt.find(' ',5)
print(txt[:x])
# returns "four five"

Related

str.split() raises an AttributeError: 'NoneType' object has no attribute 'split'

I'm having a hard time splitting a string on \n. I'm passing a ~138M-character string of Japanese into a tokenizer/word tagger, and I'm getting the "AttributeError: 'NoneType' object has no attribute 'split'" error.
The tokenizer is MeCab. It takes a string, finds the words in it, and returns a string with each word's characteristics (noun, particle, reading, etc.). The words the tokenizer tags are separated by "\n", so I want to split the result into a list on the newlines.
first 25 characters of the string:
str_text[:25]
output:
'このページは以下にある削除依頼の議論を保存したもの'
When I split the first 1 million characters using the below code I have no errors, but when I expand it to 10 million I get the error I mentioned above.
Code and output of the first 25 characters:
import MeCab
#opening the file containing the long string of Japanese text
file = open('output_text.txt')
str_text = file.read()
#passing the string into the MeCab tokenizer/tagger and splitting the long result string into a list based on newlines
tagger = MeCab.Tagger()
words = tagger.parse(str_text[:25]).split('\n')[:-2] #last two entries are just some tagger info
for word in words:
    temp_str = word.split('\t')
    print(temp_str)
output (the first element is the word and the second element contains information about the word):
['この', '連体詞,*,*,*,*,*,この,コノ,コノ']
['ページ', '名詞,一般,*,*,*,*,ページ,ページ,ページ']
['は', '助詞,係助詞,*,*,*,*,は,ハ,ワ']
['以下', '名詞,非自立,副詞可能,*,*,*,以下,イカ,イカ']
['に', '助詞,格助詞,一般,*,*,*,に,ニ,ニ']
['ある', '動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル']
['削除', '名詞,サ変接続,*,*,*,*,削除,サクジョ,サクジョ']
['依頼', '名詞,サ変接続,*,*,*,*,依頼,イライ,イライ']
['の', '助詞,連体化,*,*,*,*,の,ノ,ノ']
['議論', '名詞,サ変接続,*,*,*,*,議論,ギロン,ギロン']
['を', '助詞,格助詞,一般,*,*,*,を,ヲ,ヲ']
['保存', '名詞,サ変接続,*,*,*,*,保存,ホゾン,ホゾン']
['し', '動詞,自立,*,*,サ変・スル,連用形,する,シ,シ']
['た', '助動詞,*,*,*,特殊・タ,基本形,た,タ,タ']
['もの', '名詞,非自立,一般,*,*,*,もの,モノ,モノ']
I replaced all "\n" occurrences in the str_text file, so that's not the issue. The string can't really be passed into the tokenizer/tagger one character at a time, as it determines what counts as a word based on a long string. The fact that it works on the first 1M characters but fails at 10M tells me this may be a one-in-ten-million occurrence. I've looked for a solution for a few hours now but can't find anything that would help. I could potentially pass the string in 1M chunks, but it feels wrong losing that much data when there might be a solution somewhere.
Any help will be greatly appreciated.
@mrivanlima, thank you for fixing the grammar of my post.
@Karl Knechtel's advice in the comments led to the solution of my problem. Thank you!
For those who are interested, below is the full code that ended up working:
%%time
import MeCab
#load the txt file with the Japanese characters:
file = open('output_text.txt')
str_text = file.read()
#boundaries for the text blocks used in the loop below
lower = 0
upper = 100000
#dictionaries for words and kanji characters
counts_words = dict()
counts_kanji = dict()
word_counter = 0
#tokenizer/tagger
tagger = MeCab.Tagger()
#splits a string into a list; used on words longer than one character to get the individual characters
def splitter(word):
    return list(word)
#break condition for the loop
condition = 'no'
while True:
    if condition == 'yes':
        break
    #this is for the last block of the 100k increments
    elif lower > 133400001:
        #initiate the break condition
        condition = 'yes'
        words = tagger.parse(str_text[lower:]).split('\n')[:-2]
        print('Last block, chief!', lower, ':', upper)
        lower += 100000
        upper += 100000
        for word in words:
            temp_str = word.split('\t')
            word_counter += 1
            counts_words[temp_str[0]+' '+temp_str[1]] = counts_words.get(temp_str[0]+' '+temp_str[1], 0) + 1
            if len(temp_str[0]) > 1:
                for i in splitter(temp_str[0]):
                    counts_kanji[i] = counts_kanji.get(i, 0) + 1
            else:
                counts_kanji[temp_str[0]] = counts_kanji.get(temp_str[0], 0) + 1
    else:
        #pass a 100k-long string block into the tokenizer/tagger
        words = tagger.parse(str_text[lower:upper]).split('\n')[:-2]
        #increment the lower and upper boundaries of the str blocks
        lower += 100000
        upper += 100000
        #iterate through each word parsed by the tokenizer
        for word in words:
            temp_str = word.split('\t') #split each word's data on the tab: [word, info]
            word_counter += 1 #count the number of words
            #either add the entry to the words dict or increment its count
            counts_words[temp_str[0]+' '+temp_str[1]] = counts_words.get(temp_str[0]+' '+temp_str[1], 0) + 1
            #if the word has more than one character, split it and add each character to the kanji dict
            if len(temp_str[0]) > 1:
                for i in splitter(temp_str[0]):
                    #either add the entry to the kanji dict or increment its count
                    counts_kanji[i] = counts_kanji.get(i, 0) + 1
            else:
                counts_kanji[temp_str[0]] = counts_kanji.get(temp_str[0], 0) + 1
output:
Last block, chief! 133500000 : 133600000
CPU times: user 3min 7s, sys: 2.83 s, total: 3min 10s
Wall time: 3min 10s
I'm the developer of mecab-python3.
I think you might have emailed me about this, but please do not pass MeCab 1M-character strings. It was developed on the assumption that the input is a single sentence. It's robust and handles much longer strings - you won't have trouble with a paragraph, for example - but you're basically running into untested territory for no benefit.
Split your input text into paragraphs or sentences before passing it to MeCab.
Also, regarding this:
I could potentially pass the string in 1M chunks but it feels wrong losing that much data when there might be a solution somewhere.
You will not lose any data by passing shorter strings. I'm not sure what you're referring to here.
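As a sketch of that advice (the sentence-splitting regex is my own choice, not part of mecab-python3, and zero-width splits need Python 3.7+), you could feed MeCab one sentence at a time like this:
import re
import MeCab

tagger = MeCab.Tagger()

def parse_by_sentence(text):
    #split after each Japanese full stop and on newlines, keeping the '。' with its sentence
    sentences = re.split(r'(?<=。)|\n', text)
    for sentence in (s.strip() for s in sentences):
        if sentence:
            #same [:-2] trick as above: drop the trailing 'EOS' marker and empty string
            for line in tagger.parse(sentence).split('\n')[:-2]:
                yield line.split('\t')
The word and kanji counting from the accepted code works unchanged on the [word, info] pairs this yields.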

Using nested for loop and if statement to replace character with integer

I need to output any repeated character as a reference back to the previous occurrence of that character.
For example: a(-1)rdv(-4)(-4)k or hel(-1)o
This is my code so far:
text = 'aardvark'
i = 0
j = 0
for i in range(len(text)-1):
    for j in range(i+1, len(text)):
        if text[j] == text[i]:
            sub = text[j]
            val2 = text.find(sub, i+1, len(text))
            p = val2+1
            val = str(i-j)
            text = text[:val2] + val + text[p:]
            break
print(text)
Output: a-1rdva-4k
The second 'a' is not recognised. And I'm not sure how to include brackets in my print.
By updating the text in-place each time you find a back-reference, you muck up your indices (your text gets longer each time) and you never process the last characters properly. You stop checking when you find the first repeat of the 'current' character, so the 3rd a is never processed. This applies to every 3rd repeat in an input string. In addition, if your input text contains any - characters or digits they'll end up being tested against the -offset references you inserted before them too!
For your specific example of aardvark, a string with 8 characters, what happens is this:
You find the second a and set text to a-1rdvark. The text is now 9 characters long, so the last r will never be checked (you loop to i = 6 at most); this would be a problem if your test string ended in a double letter. You break out of the loop, so the j for loop never comes to the 3rd a, and the second a can't be tested for anymore as it has already been replaced.
Your code then finds - (not repeated), 1 (not repeated) and then r (repeated once), so now you replace text with a-1rdva-4k. Now you have a string of 10 characters, so the trailing -, 4 and k will never be tested. Not a big problem anymore, but what if there was a repeat in just the last 3 positions of the string?
Build a new object for the output (adding both letters you haven't seen before and backreferences). That way you won't cause the text you are looping over to grow, and you will continue to find repeats; for the parentheses you can use more string concatenation. You'll need to scan the part of the string before i, not after, for this to work, and go backwards, testing i - 1, i - 2, etc., down to 0. Naturally, this means your i loop should then range over the full length:
output = ''
for i in range(len(text)):
    current = text[i]
    for j in range(i - 1, -1, -1):
        if text[j] == current:
            current = '(' + str(j - i) + ')'
            break
    output = output + current
print(output)
I kept the fix to a minimum here, but ideally I'd also make some more changes:
Add all processed characters and references to a new list instead of a string, then use str.join() to join that list into the output afterwards. This is far more efficient than rebuilding the string each iteration.
Using two loops means you check every character in the string again while looping over the text, so the number of steps the algorithm takes grows quadratically with the length of the input. In Computer Science we talk about the time complexity of algorithms, and yours is an O(N^2) (N squared) quadratic algorithm. A text with 1000 letters would take up to 1 million steps to process! Rather than loop a quadratic number of times, you can use a dictionary to track the indices of letters you have seen. If the current character is in the dictionary, you can then trivially calculate the offset. Dictionary lookups take constant time (O(1)), making the whole algorithm take linear time (O(N)), meaning that the time the process takes is directly proportional to the length of the input string.
Use enumerate() to add a counter to the loop so you can just loop over the characters directly, no need to use range().
You can use string formatting to build a "(<offset>)" string; Python 3.6 and newer have formatted string literals, where f'...' strings take {} placeholders that are just expressions. f'({some - calculation + or * other})' will execute the expression and put the result in a string that has ( and ) characters in it too. For earlier Python versions, you can use the [str.format() method](https://docs.python.org/3/library/stdtypes.html#str.format) to get the same result; the syntax then becomes '({})'.format(some - calculation + or * other).
Put together, that becomes:
def add_backrefs(text):
    output = []
    seen = {}
    for i, character in enumerate(text):
        if character in seen:
            # add a back-reference, we have seen this already
            output.append(f'({seen[character] - i})')
        else:
            # add the literal character instead
            output.append(character)
        # record the position of this character for later reference
        seen[character] = i
    return ''.join(output)
Demo:
>>> add_backrefs('aardvark')
'a(-1)rdv(-4)(-4)k'
>>> add_backrefs('hello')
'hel(-1)o'
text = 'aardvark'
d = {}  # dictionary tracking the index where each character was last seen
new_text = ''  # new text to be generated
for i in range(len(text)):  # iterate over text from index 0 up to its length
    c = text[i]  # store the character in a temporary variable as it is used frequently
    if c not in d:  # check whether this character has been visited before
        d[c] = i  # first visit: record the character's index in the dictionary
        new_text += c  # concatenate the character to the result text
    else:  # visiting an already-visited character
        new_text += '({0})'.format(d[c] - i)  # string formatting inserts the offset from the last occurrence in place of {0}
        d[c] = i  # update the last-seen index
print(new_text)
Output:
a(-1)rdv(-4)(-4)k

How can I collect certain characters from a string separated by new lines?

I have a list of time strings followed by phone numbers:
00:12:23, 0712313412352
01:14:52, 0712312341256
What's the easiest way to get the time duration only?
duration = S[0:8] # duration is first 8 characters
If you know that all three parts of the time will always be formatted as two digits, meaning the entire time will always be exactly 8 characters long, then I think your way is easiest: duration = S[:8].
Otherwise, if you know that your time will always be followed by a comma, you could split on the comma and take the first element: duration = S.split(',')[0].
Otherwise, if you don't know that your time will always be 8 characters long and you don't know that it will always be followed by a comma, you could use a regex: r'(\d\d?:\d\d?:\d\d?)'
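For example, the regex approach might be applied to a line like this:
import re

line = '00:12:23, 0712313412352'
match = re.search(r'(\d\d?:\d\d?:\d\d?)', line)
if match:
    print(match.group(1))  # prints 00:12:23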
Edit:
In your comment you say you want to read through all the lines. If you have a string containing all the lines separated by newlines, first split the string into individual lines on the newline character, then iterate through the lines and take each time:
# Assume the text is stored in text_string
lines = text_string.split('\n')
times = []  # make an empty list to hold the times
for line in lines:
    time = line[:8]
    times.append(time)  # Add the time to our list
print(times)  # This will print our list of times
Assuming lines.txt contains your lines:
>>> [ x[:8] for x in open('lines.txt').readlines() ]
['00:12:23', '00:12:23', '00:12:23']
Or this, if the first field is variable length:
>>> [ x.split(',')[0] for x in open('lines.txt').readlines() ]
['00:12:23', '00:12:23', '00:12:23']
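As a minor variation on the same idea, a with block closes the file when done, which the one-liners above do not:
with open('lines.txt') as f:
    times = [line.split(',')[0] for line in f]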
One of the best ways is to use a regex and build a pattern that finds the needed part of the string:
import re

string = "00:12:23, 0712313412352"
request = re.match(r"(^\d*....\d*)", string)
print(request.group())
# 00:12:23
You can try different regex patterns at https://regex101.com/, where you can also select Python as the flavor.

Split a string into pieces of max length X - split only at spaces

I have a long string which I would like to break into pieces, of max X characters. BUT, only at a space (if some word in the string is longer than X chars, just put it into its own piece).
I don't even know how to begin to do this ... Pythonically
pseudo code:
declare a list
while still some string left:
    take the first X chars of the string
    find the last space in that
    write everything before the space to a new list entry
    delete everything to the left of the space
Before I code that up, is there some python module that can help me (I don't think that pprint can)?
Use the textwrap module (it will also break on hyphens):
import textwrap
lines = textwrap.wrap(text, width, break_long_words=False)
If you want to code it yourself, this is how I would approach it: First, split the text into words. Start with the first word in a line and iterate the remaining words. If the next word fits on the current line, add it, otherwise finish the current line and use the word as the first word for the next line. Repeat until all the words are used up.
Here's some code:
text = "hello, this is some text to break up, with some reeeeeeeeeaaaaaaally long words."
n = 16
words = iter(text.split())
lines, current = [], next(words)
for word in words:
if len(current) + 1 + len(word) > n:
lines.append(current)
current = word
else:
current += " " + word
lines.append(current)
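For this sample text and n = 16, printing the result should give:
print(lines)
# ['hello, this is', 'some text to', 'break up, with', 'some',
#  'reeeeeeeeeaaaaaaally', 'long words.']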

Python Regex Findall Lookahead

I've created a function which searches a protein string for an open reading frame. Here it is:
def orf_finder(seq, format):
    record = SeqIO.read(seq, format) #Reads in the sequence and tells biopython what format it is.
    string = [] #creates an empty list
    for i in range(3):
        string.append(record.seq[i:]) #creates a list of three sequences, each holding a different reading frame.
    protein_string = [] #creates an empty list
    protein_string.append([str(i.translate()) for i in string]) #translates each frame in 'string' and combines them into one long list
    regex = re.compile('M''[A-Z]'+r'*') #compiles a regular expression pattern: methionine, followed by any amino acids and ending with a stop codon.
    res = max(regex.findall(str(protein_string)), key=len) #res is a string of the longest translated orf in the sequence.
    print "The longest ORF (translated) is:\n\n", res, "\n"
    print "The first blast result for this protein is:\n"
    blast_records = NCBIXML.parse(NCBIWWW.qblast("blastp", "nr", res)) #blasts the sequence and puts the results into a 'record object'.
    blast_record = blast_records.next()
    counter = 0 #the counter is a mechanism for outputting only the first blast record. After it is printed, the counter equals '1' and the loop stops.
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if counter < 1: #mechanism for stopping the loop
                print 'Sequence:', alignment.title
                print 'Length:', alignment.length
                print 'E value:', hsp.expect
                print 'Query:', hsp.query[0:]
                print 'Match:', hsp.match[0:]
                counter = 1
The only issue is that my regex, re.compile('M''[A-Z]'+r'*'), does not find overlapping matches. I know that a lookahead, ?=, might solve my problem, but I can't seem to implement it without getting an error.
Does anyone know how I can get it to work?
The code above uses biopython to read in the DNA sequence, translate it, and then search for a protein reading frame: a sequence starting with 'M' and ending with '*'.
re.compile(r"M[A-Z]+\*")
This assumes that the string you are searching for starts with 'M', is followed by one or more upper-case letters A-Z, and ends with a '*'.
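If you do want every overlapping ORF rather than just the leftmost non-overlapping ones, the ?= lookahead mentioned in the question can be combined with a capturing group: the lookahead matches at every position without consuming characters, and findall returns the captured groups (the example string here is made up):
import re

protein = 'MAMSVK*MLP*'
print(re.findall(r'M[A-Z]+\*', protein))
# ['MAMSVK*', 'MLP*'] - the nested ORF 'MSVK*' is swallowed by the first match
print(re.findall(r'(?=(M[A-Z]+\*))', protein))
# ['MAMSVK*', 'MSVK*', 'MLP*'] - overlapping matches included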
