Get sub-strings of a string - python

For example, given string = 'AbcDEfGhIJK',
I want to get the list:
['A','bc','DE','f','G','h','IJK']
I have been trying to work out the logic for this, but so far no luck.
EDIT:
I don't know regex, so I just used loops.
This is what I came up with. It doesn't give the last 'IJK', though:
u_count = 0
l_count = 0
l_string = ''
u_string = ''
output = []
data = 'AbcDEfGhIJK'
for c in data:
    if(c.isupper()):
        if(l_count !=0):
            output.append(l_string)
            l_count = 0
            l_string = ''
        u_string += c
        u_count += 1
    if(c.islower()):
        if(u_count !=0):
            output.append(u_string)
            u_count = 0
            u_string = ''
        l_string +=c
        l_count += 1
print(output)
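The loop above only appends a run when the case changes, so the run that is still being built when the string ends is never added to the output. Here is a minimal sketch of one way around that. It tracks a single current run and flushes it after the loop. The function name split_case_runs is made up for illustration, and the sketch assumes the input contains only letters (any other character is treated as "not upper-case").

```python
def split_case_runs(data):
    """Split data into runs of consecutive upper-case or lower-case letters."""
    output = []
    current = ''
    for c in data:
        # Start a new run when the case differs from the run being built.
        if current and c.isupper() != current[-1].isupper():
            output.append(current)
            current = ''
        current += c
    # Flush the final run, which the loop itself never appends.
    if current:
        output.append(current)
    return output

print(split_case_runs('AbcDEfGhIJK'))
# ['A', 'bc', 'DE', 'f', 'G', 'h', 'IJK']
```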

You could do that using itertools.groupby:
from itertools import groupby
string = 'AbcDEfGhIJK'
out = [''.join(group) for key, group in groupby(string, key=lambda c: c.islower())]
print(out)
# ['A', 'bc', 'DE', 'f', 'G', 'h', 'IJK']
Here, groupby groups consecutive characters that give the same result for islower().

You could use a regex:
import re
text = 'AbcDEfGhIJK'
result = re.split('([a-z]+)', text)
print(result)
Output
['A', 'bc', 'DE', 'f', 'G', 'h', 'IJK']
The idea is to split the string on runs of lowercase letters, '([a-z]+)', while keeping the matched separators: the capturing parentheses make re.split include them in the result.
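One caveat with splitting on a captured pattern is that re.split returns an empty string when the text starts or ends with a match. A possible alternative, sketched here under the assumption that the input is made of ASCII letters only (anything else is silently dropped), is to match the runs directly with re.findall. The helper name case_runs is hypothetical.

```python
import re

def case_runs(text):
    # Match a run of upper-case ASCII letters or a run of lower-case ones.
    return re.findall(r'[A-Z]+|[a-z]+', text)

print(case_runs('AbcDEfGhIJK'))
# ['A', 'bc', 'DE', 'f', 'G', 'h', 'IJK']

# With a leading lower-case run, re.split yields an empty first item:
print(re.split('([a-z]+)', 'abC'))
# ['', 'ab', 'C']
print(case_runs('abC'))
# ['ab', 'C']
```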

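# named chars rather than str, to avoid shadowing the built-in str type
chars = list('AbcDEfGhIJK')
for k, v in enumerate(chars[:-1]):
    joined = ''.join([chars[k], chars[k+1]])
    # merge neighbours that share the same case into the right-hand slot
    if joined.isupper() or joined.islower():
        chars[k+1] = joined
        chars[k] = ''
chars = [x for x in chars if x != '']
print(chars)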
Output
['A', 'bc', 'DE', 'f', 'G', 'h', 'IJK']

Related

How to extract list of pairs in a list enclosed by hash symbols?

For example, from the 'tokens' list below, I want to extract the pair_list:
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']
pair_list = [['a', 'b'], ['c'], ['g', 'h', 'g']]
I was trying to do something like the code below, but haven't succeeded:
hashToken_begin_found = True
hashToken_end_found = False
previous_token = None
pair_list = []
for token in tokens:
    if hashToken_begin_found and not hashToken_end_found and previous_token and previous_token == '#':
        hashToken_begin_found = False
    elif not hashToken_begin_found:
        if token == '#':
            hashToken_begin_found = True
            hashToken_end_found = True
        else:
            ...
ADDITION:
My actual problem is more complicated. What's inside each pair of # symbols are words from social media, like hashed phrases on Twitter, but they are not English. I simplified the problem to illustrate it. The logic would be something like what I wrote: find the 'start' and 'end' of each # pair and extract what is between them. In my data, anything inside a pair of hash signs is a phrase, e.g. I live in #United States# and #New York#!. I need to get United States and New York. No regex. These words are already in a list.
I think you're overcomplicating the issue here. Think of the parser as a very simple state machine. You're either in a sublist or not. Every time you hit a hash, you toggle the state.
When entering a sublist, make a new list. When inside a sublist, append to the current list. That's about it. Here's a sample:
pair_list = []
in_pair = False
for token in tokens:
    if in_pair:
        if token == '#':
            in_pair = False
        else:
            pair_list[-1].append(token)
    elif token == '#':
        pair_list.append([])
        in_pair = True
You could try itertools.groupby in a single line:
from itertools import groupby
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']
print([list(y) for x, y in groupby(tokens, key=lambda x: x.isalpha()) if x])
Output:
[['a', 'b'], ['c'], ['g', 'h', 'g']]
This groups consecutive tokens by whether they are alphabetic and keeps only the alphabetic groups.
If you want to use a for loop, you could try:
l = [[]]
for i in tokens:
    if i.isalpha():
        l[-1].append(i)
    else:
        if l[-1]:
            l.append([])
print(l[:-1])
Output:
[['a', 'b'], ['c'], ['g', 'h', 'g']]
Another way:
it = iter(tokens)
pair_list = []
while '#' in it:
    pair_list.append(list(iter(it.__next__, '#')))
Yet another:
pair_list = []
try:
    i = 0
    while True:
        i = tokens.index('#', i)
        j = tokens.index('#', i + 1)
        pair_list.append(tokens[i+1 : j])
        i = j + 1
except ValueError:
    pass
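To connect the toggle idea back to the phrase example in the question, here is a rough sketch that wraps the state machine in a function and joins each group into a phrase. The function name extract_hash_groups and the word list are assumptions made for illustration. The question does not show how its real data is tokenised.

```python
def extract_hash_groups(tokens):
    """Collect the tokens found between each opening and closing '#'."""
    groups = []
    in_pair = False
    for token in tokens:
        if token == '#':
            if not in_pair:
                groups.append([])  # an opening hash starts a new group
            in_pair = not in_pair  # every hash toggles the state
        elif in_pair:
            groups[-1].append(token)
    return groups

# Hypothetical tokenisation of: I live in #United States# and #New York#!
words = ['I', 'live', 'in', '#', 'United', 'States', '#', 'and',
         '#', 'New', 'York', '#', '!']
phrases = [' '.join(group) for group in extract_hash_groups(words)]
print(phrases)
# ['United States', 'New York']
```

Unlike the isalpha-based variants, this keeps any token that sits between two hashes, which may matter for non-English words or tokens containing digits.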

How to join subsequent digits in a Python list into a double (or more) digit number

I have the following string:
string = 'TAA15=ATT'
I make a list out of it:
string_list = list(string)
print(string_list)
and the result is:
['T', 'A', 'A', '1', '5','=', 'A', 'T', 'T']
I need to detect subsequent digits and join them into a single number, as shown below:
['T', 'A', 'A', '15','=', 'A', 'T', 'T']
I'm also quite concerned about performance. This string conversion is done thousands of times.
Thank you for any hints you can provide.
Here is a very short solution:
import re
def digitsMerger(source):
    return re.findall(r'\d+|.', source)
digitsMerger('TAA15=ATT')
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']
Using itertools.groupby
Ex:
from itertools import groupby
string = 'TAA15=ATT'
result = []
for k, v in groupby(string, str.isdigit):
    if k:
        result.append("".join(v))
    else:
        result.extend(v)
print(result)
Output:
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']
Another regexp:
import re
s = 'TAA15=ATT'
pattern = r'\d+|\D'
m = re.findall(pattern, s)
print(m)
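Since the question mentions that the conversion runs many times, one option is to compile the pattern once and reuse it. This is only a sketch: the re module already caches recently used patterns, so any gain is likely to be small and would need measuring. The names TOKEN_RE and merge_digits are made up for illustration. The second call shows a difference between the two patterns suggested above: \D matches a newline, while . does not unless re.DOTALL is set.

```python
import re

# Compiled once at import time and reused for every call.
TOKEN_RE = re.compile(r'\d+|\D')

def merge_digits(source):
    """Return the characters of source with each digit run kept as one item."""
    return TOKEN_RE.findall(source)

print(merge_digits('TAA15=ATT'))
# ['T', 'A', 'A', '15', '=', 'A', 'T', 'T']

# \D also matches the newline, so it is kept as an item of its own.
print(merge_digits('A1\n23'))
# ['A', '1', '\n', '23']
```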
You can use regular expressions, available in Python through the re module:
import re
string = 'TAA15=ATT'
num = re.sub('[^0-9,]', "", string)
pos = string.find(num)
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
print(str2)
l = list()
for el in str2:
    l.append(el)
l.insert(pos, num)
print(l)
Basically, re.sub('[^0-9,]', "", string) takes the string, matches every character that is not (^ means negation) a digit (0-9) or a comma, and replaces it with the second argument, i.e. an empty string. What is left is only the digits, which you can convert to an integer if needed. Note that the code above assumes a single run of digits and also drops the '=' from the result.
If the = always comes right after the digits, then instead of
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
you can do
str2 = re.sub('\\d+=',"", string)
You can create a function that compares the last value seen with the next one and use functools.reduce:
from functools import reduce
string_list = ['T', 'A', 'A', '1', '5', 'A', 'T', 'T']
def combine_nums(lst, nxt):
    if lst and all(map(str.isdigit, (lst[-1], nxt))):
        # replace the previous digit run with the extended one
        return lst[:-1] + [lst[-1] + nxt]
    return lst + [nxt]
print(reduce(combine_nums, string_list, []))
Results:
['T', 'A', 'A', '15', 'A', 'T', 'T']

Convert list of strings into dictionary

I would like to convert a list of strings into a dictionary.
The list looks like this after I have split it into separate words:
[['ice'], ['tea'], ['silver'], ['gold']]
I want to convert it to a dictionary that looks like this:
{1: ['i', 'c', 'e'],
 2: ['t', 'e', 'a'],
 3: ['s', 'i', 'l', 'v', 'e', 'r'],
 4: ['g', 'o', 'l', 'd']}
This is my code so far:
import itertools
def anagram1(dict):
    with open('words.txt', 'r') as f:
        data = f.read()
        data = data.split()
        x = []
        y = []
        for word in data:
            x1 = word.split()
            x.append(x1)
            for letters in word:
                y1 = letters.split()
                y.append(y1)
        d = dict(itertools.zip_longest(*[iter(y)] * 2, fillvalue=""))
For which I receive the following error:
TypeError: 'dict' object is not callable
import pprint
l = [['ice'], ['tea'], ['silver'], ['gold']]
d = {idx: list(item[0]) for idx, item in enumerate(l, start=1)}
pprint.pprint(d)
{1: ['i', 'c', 'e'],
 2: ['t', 'e', 'a'],
 3: ['s', 'i', 'l', 'v', 'e', 'r'],
 4: ['g', 'o', 'l', 'd']}
The following should do the job:
with open('file.txt', 'r') as f:
    data = f.read()
data = data.split()
# number the words from 1 and split each word into its letters
data_dict = {i: list(v) for i, v in enumerate(data, start=1)}
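The TypeError in the question most likely comes from the parameter being named dict: inside anagram1 that name refers to the argument rather than the built-in type, so dict(...) tries to call whatever was passed in. Below is a small sketch that avoids the clash by not reusing the built-in name. The function name letters_by_position is hypothetical, and the text variable stands in for the contents of words.txt, which are assumed here.

```python
def letters_by_position(words):
    """Map 1-based positions to the list of letters of each word."""
    return {idx: list(word) for idx, word in enumerate(words, start=1)}

# Stand-in for the text read from words.txt (assumed content).
text = 'ice tea silver gold'
d = letters_by_position(text.split())
print(d[1])
# ['i', 'c', 'e']
print(d[4])
# ['g', 'o', 'l', 'd']
```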

How to concatenate strings in a list between two points within the list?

mylist = ['A','12','D']
I also cannot apply ''.join(mylist) and type-check for a list, since it's all in a str.
You can use regex:
import re
my_string = ''.join(mylist)
# 'A[BC]D[EFG]'
# Replace '[' or ']' with ',' in my_string
pattern = r'[\[\]]'
my_string = re.sub(pattern, ',', my_string)
# 'A,BC,D,EFG,'
my_list = my_string.split(',')
# ['A', 'BC', 'D', 'EFG', '']
new_list = [letter for letter in my_list if letter]
# ['A', 'BC', 'D', 'EFG']
You can try this:
mylist = ['A','[','B','C',']','D']
final_list = []
temp1_val = ''
flag = False
for i in mylist:
    if i == '[':
        flag = True
    elif i == ']':
        final_list.append(temp1_val)
        temp1_val = ''
        flag = False
    elif flag:
        temp1_val += i
    elif not flag:
        final_list.append(i)
print(final_list)
Output:
['A', 'BC', 'D']
Output for second example:
['A', 'BC', 'D', 'EFG']
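The flag-based loop can also be packaged as a reusable function, which makes it easier to run on both examples. This is a sketch, not the answerer's code: the name merge_bracketed is invented, and the second input is assumed to be the characters of 'A[BC]D[EFG]', as suggested by the comments in the regex answer.

```python
def merge_bracketed(items):
    """Join the items between '[' and ']' into one string; keep the rest."""
    result = []
    buffer = None  # None means we are currently outside brackets
    for item in items:
        if item == '[':
            buffer = ''
        elif item == ']':
            if buffer is not None:
                result.append(buffer)
            buffer = None
        elif buffer is not None:
            buffer += item
        else:
            result.append(item)
    return result

print(merge_bracketed(['A', '[', 'B', 'C', ']', 'D']))
# ['A', 'BC', 'D']
print(merge_bracketed(list('A[BC]D[EFG]')))
# ['A', 'BC', 'D', 'EFG']
```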

How to convert comma-delimited string to list in Python?

Given a string that is a sequence of several values separated by commas:
mStr = 'A,B,C,D,E'
How do I convert the string to a list?
mList = ['A', 'B', 'C', 'D', 'E']
You can use the str.split method.
>>> my_string = 'A,B,C,D,E'
>>> my_list = my_string.split(",")
>>> print(my_list)
['A', 'B', 'C', 'D', 'E']
If you want to convert it to a tuple, just
>>> print(tuple(my_list))
('A', 'B', 'C', 'D', 'E')
If you are looking to append to a list, try this:
>>> my_list.append('F')
>>> print(my_list)
['A', 'B', 'C', 'D', 'E', 'F']
If the string contains integers and you want to avoid casting them to int individually, you can do:
mList = [int(e) if e.isdigit() else e for e in mStr.split(',')]
This is called a list comprehension, and it is based on set-builder notation.
Example:
>>> mStr = "1,A,B,3,4"
>>> mList = [int(e) if e.isdigit() else e for e in mStr.split(',')]
>>> mList
[1, 'A', 'B', 3, 4]
Consider the following in order to handle the case of an empty string:
>>> my_string = 'A,B,C,D,E'
>>> my_string.split(",") if my_string else []
['A', 'B', 'C', 'D', 'E']
>>> my_string = ""
>>> my_string.split(",") if my_string else []
[]
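The guard is needed because ''.split(',') returns [''] rather than an empty list. A related situation is input with spaces around the commas or with empty fields. The following sketch, with a made-up function name split_csv_line, trims each field and drops the empty ones. Whether empty fields should be dropped is a design choice that depends on the data.

```python
def split_csv_line(text):
    """Split on commas, trim whitespace, and drop empty fields."""
    return [field.strip() for field in text.split(',') if field.strip()]

print(''.split(','))
# ['']
print(split_csv_line('A, B ,C,,D'))
# ['A', 'B', 'C', 'D']
print(split_csv_line(''))
# []
```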
>>> some_string='A,B,C,D,E'
>>> new_tuple= tuple(some_string.split(','))
>>> new_tuple
('A', 'B', 'C', 'D', 'E')
You can split that string on , and directly get a list:
mStr = 'A,B,C,D,E'
list1 = mStr.split(',')
print(list1)
Output:
['A', 'B', 'C', 'D', 'E']
You can also convert it to an n-tuple:
print(tuple(list1))
Output:
('A', 'B', 'C', 'D', 'E')
You can use this function to convert a comma-delimited string of single characters to a list:
def stringtolist(x):
    mylist=[]
    for i in range(0,len(x),2):
        mylist.append(x[i])
    return mylist
# splits a string according to delimiters
'''
Let's make a function that can split a string
into a list according to the given delimiters.
example data: cat;dog:greff,snake/
example delimiters: ,;- /|:
'''
def string_to_splitted_array(data,delimeters):
    # result list
    res = []
    # we will add chars into sub_str until
    # we reach a delimiter
    sub_str = ''
    for c in data: # iterate over data char by char
        # if we reached a delimiter, we store the result
        if c in delimeters:
            # avoid empty strings
            if len(sub_str)>0:
                # looks like a valid string.
                res.append(sub_str)
                # reset sub_str to start over
                sub_str = ''
        else:
            # c is not a delimiter, so it is
            # part of the string.
            sub_str += c
    # there may not be a delimiter at the end of data.
    # if sub_str is not empty, we should add it to the list.
    if len(sub_str)>0:
        res.append(sub_str)
    # result is in res
    return res
# test the function.
delimeters = ',;- /|:'
# read the csv data from the console.
csv_string = input('csv string:')
# let's check if it is working.
splitted_array = string_to_splitted_array(csv_string,delimeters)
print(splitted_array)
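For comparison, the same multi-delimiter split can be sketched with re.split and a character class built from the delimiter string. The function name split_on_any is invented for this example, and the default delimiters are copied from the code above. re.escape is used so that characters such as '-' and '|' are treated literally inside the brackets.

```python
import re

def split_on_any(data, delimiters=',;- /|:'):
    """Split data on any run of the given delimiter characters."""
    # re.escape keeps characters such as '-' and '|' literal inside [...]
    pattern = '[' + re.escape(delimiters) + ']+'
    # A trailing delimiter leaves an empty string, which is filtered out.
    return [part for part in re.split(pattern, data) if part]

print(split_on_any('cat;dog:greff,snake/'))
# ['cat', 'dog', 'greff', 'snake']
```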
