Convert every line in the string into dictionary key - python

Hi I am new to python and dont know whether can I ask this basic question in this site or not
I want to convert every line in the string into a key and assign 0 as an value
MY string is:
s = '''
sarika
santha
#
akash
nice
'''
I had tried this https://www.geeksforgeeks.org/ways-to-convert-string-to-dictionary/ ways but thought not useful for my requirement
Pls help anyone Thanks in advance
Edit:
Actually I had asked for basic string but I am literally for followed string
s="""
san
francisco
Santha
Kumari
this one
"""
Here it should take {sanfrancisco:0 , santha kumari:0 , this one: 0 }
This is the challenge I am facing
Here in my string if having more than 1 new line gap it should take the nextline string as one word and convert into key

You can do it in the below way:
>>> s="""
... hello
... #
... world
...
... vk
... """
>>> words = s.split("\n")
>>> words
['', 'hello', '#', 'world', '', 'vk', '']
>>> words = words[1:len(words)-1]
>>> words
['hello', '#', 'world', '', 'vk']
>>> word_dic = {}
>>> for word in words:
... if word not in word_dic:
... word_dic[word]=0
...
>>> word_dic
{'': 0, 'world': 0, '#': 0, 'vk': 0, 'hello': 0}
>>>
Please let me know if you have any question.

You could continuously match either all lines followed by 2 newlines, or match all lines followed by a single newline.
^(?:\S.*(?:\n\n\S.*)+|\S.*(?:\n\S.*)*)
The pattern matches
^ Start of string
(?: Non capture group
\S.* Match a non whitespace char and the rest of the line
(?:\n\n\S.*)+ Repeat matching 1+ times 2 newlines, a non whitespace char and the rest of the line
| Or
\S.* Match a single non whitespace char and the rest of the line
(?:\n\S.*)* Optionally match a newline, a non whitespace char and the rest of the line
) Close non capture group
Regex demo | Python demo
For those matches, replace 2 newlines with a space and replace a single newline with an empty string.
Then from the values, create a dictionary and initialize all values with 0.
Example
import re
s="""
san
francisco
Santha
Kumari
this one
"""
pattern = r"^(?:\S.*(?:\n\n\S.*)+|\S.*(?:\n\S.*)*)"
my_dict = dict.fromkeys(
[
re.sub(
r"(\n\n)|\n",
lambda n: " " if n.group(1) else "", s.lower()
) for s in re.findall(pattern, s, re.MULTILINE)
],
0
)
print(my_dict)
Output
{'sanfrancisco': 0, 'santha kumari': 0, 'this one': 0}

You could do it like this:
# Split the string into a list
l = s.split()
dictionary = {}
# iterate through every element of the list and assign a value of 0 to it
n = 0
for word in l:
while n < len(l) - 1:
if word == "#":
continue
w = l[n] + l[n+1]
dictionary.__setitem__(w, 0)
n+=2
print(dictionary)

steps -
Remove punctuations from a string via translate.
split words if they're separated by 2 \n character
remove the spaces from the list
remove \n character and use dict comprehension to generate the required dict
import string
s = '''
sarika
santha
#
akash
nice
'''
s = s.translate(str.maketrans('', '', string.punctuation))
word_list = s.split('\n\n')
while '' in word_list:
word_list.remove('')
result = {word.replace('\n', ''): 0 for word in word_list}
print(result)

Related

Python Split Strings While Preserving Order?

I have a list of strings in python, where I need to preserve order and split some strings.
The condition to split a string is that after first match of : there is a none space/new line/tab char.
For example, this must be split:
example: Test to ['example':, 'Test']
While this stays the same: example: , IGNORE_ME_EXAMPLE
Given an input like this:
['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']
I'm expecting:
['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
Please Note that split strings are yet stick to each other and follow original order.
Plus, whenever I split a string I don't want to check split parts again. In other words, I don't want to check 'Test' after I split it.
To make it more clear, Given an input like this:
['example: Test::YES']
I'm expecting:
['example:', 'Test::YES']
You can use regular expressions for that:
import re
pattern = re.compile(r"(.+:)\s+([^\s].+)")
result = []
for line in lines:
match = pattern.match(line)
if match:
result.append(match.group(1))
result.append(match.group(2))
else:
result.append(line)
You can use nested loop comprehension for the input list:
l = ['example: Test::YES']
l1 = [j.lower().strip() for i in l for j in i.split(":", 1) if j.strip().lower() != '']
print(l1)
Output:
['example', 'Test::YES']
you need to iterate over your list of words, for each word, you need to check if : present or not. if present the then split the word in 2 parts, pre : and post part. append these pre and post to final list and if there is no : in word add that word in the result list and skip other operation for that word
# your code goes here
wordlist = ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
result = []
for word in wordlist:
index = -1
part1, part2 = None, None
if ':' in word:
index = word.index(':')
else:
result.append(word)
continue
part1, part2 = word[:index+1], word[index+1:]
if part1 is not None and len(part1)>0:
result.append(part1)
if part2 is not None and len(part2)>0:
result.append(part2)
print(result)
output
['example:', 'Test', 'example:', ' ', 'IGNORE_ME_EXAMPLE']

Splitting string using different scenarios using regex

I have 2 scenarios so split a string
scenario 1:
"##$hello?? getting good.<li>hii"
I want to be split as 'hello','getting','good.<li>hii (Scenario 1)
'hello','getting','good','li,'hi' (Scenario 2)
Any ideas please??
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
This might be what your looking for \w+ it matches any digit or letter from 1 to n times as many times as possible. Here is a working Java-Script
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+ which matches word characters as many times as possible. You cound also use [A-Za-b] to match only letters which not numbers. As show here.
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what are in the brackets 1 to n timeas as many as possible. In this case the range a-z of lower case charactors and the range of A-Z uppder case characters. Hope this is what you want.
For first scenario just use regex to find all words that are contain word characters and <>.:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For second one you need to repleace the repeated characters only if they are not valid english words, you can do this using nltk corpus, and re.sub regex:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for solution without regex. string.punctuation will give you list of all special characters.
Use this list with list comprehension for achieving your desired result as:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is the step by step instruction regarding how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']

Python removing delimiters from strings

I have 2 related questions/ issues.
def remove_delimiters (delimiters, s):
for d in delimiters:
ind = s.find(d)
while ind != -1:
s = s[:ind] + s[ind+1:]
ind = s.find(d)
return ' '.join(s.split())
delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'", "..."]
d_dataset_list = ['hey-you...are you ok?']
d_list = []
for d in d_dataset_list:
d_list.append(remove_delimiters(delimiters, d[1]))
print d_list
Output = 'heyyouare you ok'
What is the best way of avoiding strings being combined together when a delimiter is removed? For example, so that the output is hey you are you ok ?
There may be a number of different sequences of ..., for example .. or .......... etc. How does one go around implementing some form of rule, where if more than one . appear after each other, to remove it? I want to try and avoid hard-coding all sequences in my delimiters list. Thankyou
You could try something like this:
Given delimiters d, join them to a regular expression
>>> d = ",.!?/&-:;#'..."
>>> "["+"\\".join(d)+"]"
"[,\\.\\!\\?\\/\\&\\-\\:\\;\\#\\'\\.\\.\\.]"
Split the string using this regex with re.split
>>> s = 'hey-you...are you ok?'
>>> re.split("["+"\\".join(d)+"]", s)
['hey', 'you', '', '', 'are you ok', '']
Join all the non-empty fragments back together
>>> ' '.join(w for w in re.split("["+"\\".join(d)+"]", s) if w)
'hey you are you ok'
Also, if you just want to remove all non-word characters, you can just use the character group \W instead of manually enumerating all the delimiters:
>>> ' '.join(w for w in re.split(r"\W", s) if w)
'hey you are you ok'
So first of all, your function for removing delimiters could be simplified greatly by using the replace function (http://www.tutorialspoint.com/python/string_replace.htm)
This would help solve your first question. Instead of just removing them, replace with a space, then get rid of the spaces using the pattern you already used (split() treats consecutive delimiters as one)
A better function, which does this, would be:
def remove_delimiters (delimiters, s):
new_s = s
for i in delimiters: #replace each delimiter in turn with a space
new_s = new_s.replace(i, ' ')
return ' '.join(new_s.split())
to answer your second question, I'd say it's time for regular expressions
>>> import re
... ss = 'hey ... you are ....... what?'
... print re.sub('[.+]',' ',ss)
hey you are what?
>>>

Iterating through a string word by word

I wanted to know how to iterate through a string word by word.
string = "this is a string"
for word in string:
print (word)
The above gives an output:
t
h
i
s
i
s
a
s
t
r
i
n
g
But I am looking for the following output:
this
is
a
string
When you do -
for word in string:
You are not iterating through the words in the string, you are iterating through the characters in the string. To iterate through the words, you would first need to split the string into words , using str.split() , and then iterate through that . Example -
my_string = "this is a string"
for word in my_string.split():
print (word)
Please note, str.split() , without passing any arguments splits by all whitespaces (space, multiple spaces, tab, newlines, etc).
This is one way to do it:
string = "this is a string"
ssplit = string.split()
for word in ssplit:
print (word)
Output:
this
is
a
string
for word in string.split():
print word
Using nltk.
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize("This is a string.")
words_in_each_sentence = word_tokenize(sentences)
You may use TweetTokenizer for parsing casual text with emoticons and such.
One way to do this is using a dictionary. The problem for the code above is it counts each letter in a string, instead of each word. To solve this problem, you should first turn the string into a list by using the split() method, and then create a variable counts each comma in the list as its own value. The code below returns each time a word appears in a string in the form of a dictionary.
s = input('Enter a string to see if strings are repeated: ')
d = dict()
p = s.split()
word = ','
for word in p:
if word not in d:
d[word] = 1
else:
d[word] += 1
print (d)
s = 'hi how are you'
l = list(map(lambda x: x,s.split()))
print(l)
Output: ['hi', 'how', 'are', 'you']
You can try this method also:
sentence_1 = "This is a string"
list = sentence_1.split()
for i in list:
print (i)

Python: replace nth word in string

What is the easiest way in Python to replace the nth word in a string, assuming each word is separated by a space?
For example, if I want to replace the tenth word of a string and get the resulting string.
I guess you may do something like this:
nreplace=1
my_string="hello my friend"
words=my_string.split(" ")
words[nreplace]="your"
" ".join(words)
Here is another way of doing the replacement:
nreplace=1
words=my_string.split(" ")
" ".join([words[word_index] if word_index != nreplace else "your" for word_index in range(len(words))])
Let's say your string is:
my_string = "This is my test string."
You can split the string up using split(' ')
my_list = my_string.split()
Which will set my_list to
['This', 'is', 'my', 'test', 'string.']
You can replace the 4th list item using
my_list[3] = "new"
And then put it back together with
my_new_string = " ".join(my_list)
Giving you
"This is my new string."
A solution involving list comprehension:
text = "To be or not to be, that is the question"
replace = 6
replacement = 'it'
print ' '.join([x if index != replace else replacement for index,x in enumerate(s.split())])
The above produces:
To be or not to be, it is the question
You could use a generator expression and the string join() method:
my_string = "hello my friend"
nth = 0
new_word = 'goodbye'
print(' '.join(word if i != nth else new_word
for i, word in enumerate(my_string.split(' '))))
Output:
goodbye my friend
Through re.sub.
>>> import re
>>> my_string = "hello my friend"
>>> new_word = 'goodbye'
>>> re.sub(r'^(\s*(?:\S+\s+){0})\S+', r'\1'+new_word, my_string)
'goodbye my friend'
>>> re.sub(r'^(\s*(?:\S+\s+){1})\S+', r'\1'+new_word, my_string)
'hello goodbye friend'
>>> re.sub(r'^(\s*(?:\S+\s+){2})\S+', r'\1'+new_word, my_string)
'hello my goodbye'
Just replace the number within curly braces with the position of the word you want to replace - 1. ie, for to replace the first word, the number would be 0, for second word the number would be 1, likewise it goes on.

Categories

Resources