I have a file that contains:
hi I am human being I live  for money
How can my Python code tell that "hi I am human being I live" is one string and "for money" is another? The rule is: a single space between words keeps them in the same string, while two spaces (or a tab) start a new string. How can I do this in Python?
You can use regular expressions. This way you can split on double spaces and TAB.
import re
text = "hi I am human being I live  for money"
re.split(r'\s{2}|\t', text)
#["hi I am human being I live", "for money"]
This will split on double spaces or TABs. If you want something that catches any run of two or more whitespace characters as well as TABs, use r'\s{2,}|\t' as your regex.
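As a minimal sketch of the "two or more" variant: the sample text below is made up for illustration, and the pattern uses a literal space rather than \s, on the assumption that newlines should not count as separators.

```python
import re

# Two spaces between "live" and "for", a tab before "and more",
# and three spaces before "words" (made-up sample text).
text = "hi I am human being I live  for money\tand more   words"

# Split on runs of two or more spaces, or on a single tab.
# A single space is left alone, so words within a phrase stay together.
parts = re.split(r" {2,}|\t", text)
print(parts)
# ['hi I am human being I live', 'for money', 'and more', 'words']
```

With a plain str.split('  '), the run of three spaces would leave a stray leading space on the last item, which is why a quantified pattern may be the safer choice.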
I think what you really want to do is to split your string at instances of double white spaces.
def get_unique_strings(text):
    return text.split('  ')  # split at a double white space
You can use this line of code to split() your string and get a list of strings:
"hi I am human being I live  for money".split("  ")
#["hi I am human being I live", "for money"]
Related
I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
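Just change (\ |\.|\'|\d) to [\ \.\'\d] or (?:\ |\.|\'|\d). A capturing group inside the pattern makes re.split include the captured text in the result list; a character class or a non-capturing group avoids that.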
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
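To see the difference the capturing group makes, here is a small sketch that runs both patterns on the string from the question. The variable names with_group and without_group are only for illustration.

```python
import re

mystring = (
    "OBAMA: said something \n"
    "O'MALLEY: said something else \n"
    "GOV. HICKENLOOPER: said something else entirely"
)

# Capturing group inside the look-ahead: re.split also returns whatever
# the group captured at each split point, mixed in with the segments.
with_group = re.split(r"\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)", mystring)

# Character class instead of a group: only the segments are returned.
without_group = re.split(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)", mystring)

print(len(with_group))     # extra entries from the capturing group
print(len(without_group))  # one entry per speaker turn
print(without_group)
```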
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if not line.strip():
        continue  # skip blank lines, such as the one after the trailing \n
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
I have a long text that I need to be as clean as possible.
I have collapsed multiple spaces into a single space, removed \n and \t, and stripped the resulting string.
I then found characters like \u2003 and \u2019.
What are these? How do I make sure that I have removed all special characters from my text?
Besides \n, \t and \u2003, should I check for more characters to remove?
I am using Python 3.6.
Try this:
import re
# string contains the \u2003 character
string = u'This is a test string ’'
# this regex replaces each run of non-word characters (anything other than letters, digits and _) with a single space
re.sub(r'\W+',' ',string).strip()
Result
'This is a test string'
If you want to preserve ASCII special characters:
re.sub(r'[^!-~]+',' ',string).strip()
This regex reads: select [not characters 33-126] one or more times, where characters 33-126 are the visible range of ASCII.
In a regex character class, a leading ^ means "not" and - indicates a range. Looking at an ASCII table, 32 is the space and everything below it is either a control character or another form of whitespace such as tab and newline. Character 33 is the ! mark and the last displayable ASCII character is 126, or ~.
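A quick sketch to check the range and see the substitution at work. The sample string is made up; it contains an em space (\u2003) and a right single quotation mark (\u2019).

```python
import re

# '!' and '~' bound the printable, non-space ASCII range.
print(ord("!"), ord("~"))  # 33 126

# Made-up sample with U+2019 (in "It's") and U+2003 (between "test" and "string").
sample = "It\u2019s a test\u2003string!"

# Replace every run of characters outside '!'..'~' with one space.
cleaned = re.sub(r"[^!-~]+", " ", sample).strip()
print(cleaned)  # It s a test string!
```

Note that the typographic apostrophe is replaced by a space as well, so "It's" becomes "It s". If that matters, it may be better to map such characters to ASCII equivalents before applying the regex.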
Thank you Mike Peder, this solution worked for me. However I had to do it for both sides of the comparison
if((re.sub('[^!-~]+',' ',date).strip())==(re.sub('[^!-~]+',' ',calendarData[i]).strip())):
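On the original question of what \u2003 and \u2019 actually are, the standard unicodedata module can name any character. The sketch below is one possible way to inspect them; the info and normalized names are only for illustration.

```python
import unicodedata

# Look up the official name and general category of each character.
info = []
for ch in ("\u2003", "\u2019"):
    info.append((f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)))

for code, name, category in info:
    print(code, name, category)
# U+2003 EM SPACE Zs
# U+2019 RIGHT SINGLE QUOTATION MARK Pf

# NFKC normalization folds compatibility characters such as EM SPACE
# into their plain equivalents (here, an ordinary space).
normalized = unicodedata.normalize("NFKC", "a\u2003b")
print(repr(normalized))  # 'a b'
```

Normalization does not change \u2019, since it has no compatibility mapping, so a quotation mark like that still needs an explicit replacement if only ASCII is wanted.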
I am using nltk.word_tokenize on text in the Dari language. The problem is that some single words are written with a space inside them.
For example, the word "زنده گی", which means life, and there are many other words like it. After any part of a word that ends with the character "ه" we have to put a space; otherwise the letters join together, as in "زندهگی".
Can anyone help me, using regex or any other approach, to avoid splitting a word into two tokens when its first part ends with "ه" and is followed by the character "گ"?
To solve this problem in Persian we have a character called the zero-width non-joiner (نیمفاصله in Persian, also known as half space or semi space). Two code points are used for it: one is standard and the other is not standard but widely used:
\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As far as I know, Dari is very similar to Persian. So first of all you should correct all words like زنده گی to زندهگی, converting the wrong spaces to half spaces; then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in test string is half space which is not recognizable for regex101 but if you check the match information part and see Match 5 you will see that is correct)
For converting the wrong spaces of a huge text to half spaces there is a free and open-source add-on for Microsoft Word called Virastyar. You can install it and refine your whole text. But bear in mind that this add-on was created for Persian, not Dari. For example, in Persian we write زندهگی as زندگی, so it cannot correct this word for you. Other words like می شود, however, are easily corrected and converted to میشود. You can also add custom words to its database.
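In Python, the two steps described above (replace the wrong space with a half space, then match words with the regex) could be sketched as follows. The helper join_heh_gaf and its rule of joining every "ه" followed by a space and "گ" are assumptions made for illustration; a real text would likely need a word list, since the rule can also join words that should stay separate.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner ("half space")

# Character class quoted from the answer above.
WORD = re.compile("[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")

def join_heh_gaf(text):
    """Hypothetical helper: replace a plain space between HEH (U+0647)
    and GAF (U+06AF) with a zero-width non-joiner."""
    return re.sub("\u0647 (?=\u06AF)", "\u0647" + ZWNJ, text)

# The word for "life" written with a plain space, followed by another word.
raw = "\u0632\u0646\u062F\u0647 \u06AF\u06CC \u062E\u0648\u0628"

fixed = join_heh_gaf(raw)
tokens = WORD.findall(fixed)

print(len(WORD.findall(raw)))  # 3 tokens before the fix
print(len(tokens))             # 2 tokens after the fix
```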
I have a string that looks like this where the "-" are representing blocks of whitespace or just a newline (it is random on whether it is a number of spaces then a newline or just a newline):
"Hello my name is
Robert and I am trying to figure
-
-
-
out this code
Thanks"
All I really want to do is get rid of all the spaces/newlines between "figure" and "out"; I would like to keep the other spaces as they are if I could. The end string I want would look like this:
"Hello my name is
Robert and I am trying to figure
out this code
Thanks"
Is there an easy way to do this? Any help is greatly appreciated!
One way to accomplish this would be with regular expressions, which will simply allow us to find runs of spaces and newlines. The following code should work for your purposes:
import re
string = 'lorem ipsum dolor\n\n sic\n\n\n lorem'
string = re.sub(r' +', ' ', string)
string = re.sub(r'\n+', '\n', string)
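This will replace all runs of spaces with a single space, and all runs of newlines with a single newline. Note that a line containing only spaces still separates two newlines, so such lines are not removed by these two substitutions.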
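If the blank lines may contain spaces, as the question describes, one possible alternative is a single substitution that removes whitespace-only lines and leaves all other spacing alone. This is a sketch on a made-up version of the question's string, not part of the answer above.

```python
import re

text = "Robert and I am trying to figure\n   \n\n  \nout this code\nThanks"

# r"\n\s*\n" matches a newline, any whitespace-only lines after it, and
# the last newline before the next non-blank line. The whole run is
# replaced by a single newline; spaces inside lines are left as they are.
collapsed = re.sub(r"\n\s*\n", "\n", text)
print(repr(collapsed))
# 'Robert and I am trying to figure\nout this code\nThanks'
```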
I want to grab each word in a text file, using a comma as the end-of-word indicator, with Python, and also remove the extra quotation marks and white space. I also want to make every letter of the word lowercase, then loop to the next word in the text file.
For Example:
Text File:
"Test Word", "The Test", "Word Two", "Word Four", "Alpha", "Bravo", "Charlie"
I am willing to make further clarifications, any help will be appreciated. Thank you
Since you don't have any code to reference, I'll give a high level explanation of what I would do:
Use str.split() with a comma as your delimiter to break up the string into a list of strings.
Since you need to remove both whitespace and quotes, I would use regular expressions via a replace function, re.sub, to adapt these new strings. It would look something like: r'"|\s', replaced with "". Note that \s also removes the spaces inside multi-word entries such as "Test Word". You can use str.lower() to convert all characters to lower case. Hope that helps.
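A minimal sketch of those steps, assuming the whole line has already been read from the file into a string. It uses str.strip() instead of re.sub so that the space inside entries like "Test Word" is kept; whether that is the desired result is an assumption.

```python
line = '"Test Word", "The Test", "Word Two", "Word Four", "Alpha", "Bravo", "Charlie"'

# Split on commas, trim surrounding whitespace and double quotes,
# then lowercase each entry.
words = [item.strip().strip('"').lower() for item in line.split(",")]

for word in words:
    print(word)
```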