I have an arbitrarily nested array of values that looks like this:
['"multiply"', 'ALAssn ', ['ACmp ', ['Ge ', ['Var "n"'], ' ', ['Num 0']]], ['ALAssn ', ['ACmp ', ['Eq ', ['Var "p"'], ' ', ['Mul ', ['Var "n"'], ' ', ['Var "m"']]]]]]
and I need to figure out a way to go through every value in the array and format it so that:
Each array of length 1 is split into two separate values:
-- Example: ['Var "n"'] should now become ["Var", "n"] and ['Num 0'] now becomes ["Num", 0].
All empty (whitespace-only) values are removed.
-- Example: ['Ge ', ['Var "n"'], ' ', ['Num 0']] now becomes ['Ge ', ['Var "n"'], ['Num 0']]
The whitespace in any string is removed.
-- Example: 'Ge ' now becomes 'Ge'
The given snippet is a portion of a much larger string that needs to be parsed. I understand what needs to be done at a high level, i.e.:
Once I get to a list of length 1, call split(" ") on its string to split it into two separate elements, then trim arr[1] to get rid of the extra quotation marks.
For every element el in the list, if el is an empty string, list.remove(el).
Check isinstance(el, str) for every element when traversing and, if true, use el.replace(" ", "") to get rid of the whitespace.
My only issue comes when traversing through every single element in the list. I've tried doing so recursively and iteratively, but so far haven't been able to crack it.
Ideally, I traverse through every single element, and then once I hit an element that meets the criteria, set that element equal to the change that I want to make on it. This is only really the case for points 1 and 3.
EDIT:
Thank you so much for the answers given. I have one more addition I would like to make.
Assume, too, that I have nested identifiers like 'Read "a"' as the first value of an array, with the possibility of having additional identifiers like 'Write "a"' at the same level. These also need to be converted to the format ["Read", "a"]. See the change in the large list below. How would I go about doing this?
['Read "a"', ['Add', ['Var', 'i'], ['Num', '1']], 'Write "a"', ['Add', ['Var', 'i'], ['Num', '1']], ['Var', 't']]
The point of these values 'Read' and 'Write' is so that, when traversing the list, we know the "type" of the next n elements of the list corresponding to that identifier. We can distinguish them by the fact that they are the only values in the nested list that are not lists themselves.
For example: ['identifier', [], [], []]
Assume it is known that the identifier type contains 3 lists, first, second, third. The goal is to read identifier and then store first, second, and third as nodes in a tree, for example.
This problem seems like it would be easiest to deal with by constructing a new list with the fixed-up items, rather than trying to modify the existing list in place. This would let you use recursion to deal with the nesting, while using iteration over the flat parts of each list.
I'd structure the code like this:
def process(lst):
    if len(lst) == 1:                      # special case for one-element lists
        result = lst[0].split()
        result[1] = result[1].strip('"')   # strip quotation marks
        return result
    result = []
    for item in lst:
        if isinstance(item, list):
            result.append(process(item))   # recurse on nested lists
        else:                              # item is a string
            stripped = item.strip()        # remove leading and trailing whitespace
            if stripped:
                result.append(stripped)    # keep only non-empty strings
    return result
Seems you can collapse 1 and 3 into one operation:
def sanitize(item):
    if isinstance(item, list):
        if len(item) == 1:
            item = item[0].split()
        return [output for i in item if (output := sanitize(i))]
    return item.strip('" ')  # Strips both '"' and ' '.
item = ['"multiply"', 'ALAssn ', ['ACmp ', ['Ge ', ['Var "n"'], ' ', ['Num 0']]], ['ALAssn ', ['ACmp ', ['Eq ', ['Var "p"'], ' ', ['Mul ', ['Var "n"'], ' ', ['Var "m"']]]]]]
sanitize(item)
# Returns: ['multiply', 'ALAssn', ['ACmp', ['Ge', ['Var', 'n'], ['Num', '0']]], ['ALAssn', ['ACmp', ['Eq', ['Var', 'p'], ['Mul', ['Var', 'n'], ['Var', 'm']]]]]]
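For the follow-up in the question's EDIT, here is a minimal sketch of one way to group each identifier with the lists that follow it. It assumes, as the question states, that identifiers are the only non-list values at that level. The names split_identifier and group_by_identifier are hypothetical, and the (identifier, children) tuple is just one possible node shape, not something the question prescribes.

```python
def split_identifier(token):
    """Turn 'Read "a"' into ['Read', 'a'] (hypothetical helper)."""
    return [part.strip('"') for part in token.split()]


def group_by_identifier(items):
    """Pair every string identifier with the lists that follow it."""
    nodes = []
    for item in items:
        if isinstance(item, str):
            # A new identifier starts a new node with no children yet.
            nodes.append((split_identifier(item), []))
        elif nodes:
            # A list belongs to the most recent identifier.
            nodes[-1][1].append(item)
        else:
            raise ValueError("found a list before any identifier")
    return nodes


data = ['Read "a"', ['Add', ['Var', 'i'], ['Num', '1']],
        'Write "a"', ['Add', ['Var', 'i'], ['Num', '1']], ['Var', 't']]
nodes = group_by_identifier(data)
for identifier, children in nodes:
    print(identifier, len(children))
```

If each identifier type has a known number of children, the length of each children list could be checked against that number before building tree nodes.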
Suppose you have a string:
text = "coding in python is a lot of fun"
And character positions:
positions = [(0,6),(10,16),(29,32)]
These are intervals, which cover certain words within text, i.e. coding, python and fun, respectively.
Using the character positions, how could you split the text on those words, to get this output:
['coding','in','python','is a lot of','fun']
This is just an example, but it should work for any string and any list of character positions.
I'm not looking for this:
[text[i:j] for i,j in positions]
I'd flatten positions to be [0,6,10,16,29,32] and then do something like:
positions.append(len(text))  # end of the string, so the last piece is not cut short
prev_positions = [0] + positions
words = []
for begin, end in zip(prev_positions, positions):
    words.append(text[begin:end])
This exact code produces ['', 'coding', ' in ', 'python', ' is a lot of ', 'fun', ''], so it needs some additional work to strip the whitespace and drop the empty strings.
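The remaining clean-up can be folded into the same flattening idea. The sketch below is one possible version, assuming the spans are sorted and do not overlap; split_on_spans is a hypothetical name.

```python
from itertools import chain


def split_on_spans(text, spans):
    """Split text at every span boundary, dropping empty pieces."""
    # Flatten [(0, 6), (10, 16)] into [0, 6, 10, 16] and add both ends of text.
    boundaries = [0, *chain.from_iterable(spans), len(text)]
    pieces = (text[begin:end].strip()
              for begin, end in zip(boundaries, boundaries[1:]))
    return [piece for piece in pieces if piece]


text = "coding in python is a lot of fun"
positions = [(0, 6), (10, 16), (29, 32)]
words = split_on_spans(text, positions)
print(words)  # ['coding', 'in', 'python', 'is a lot of', 'fun']
```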
The code below works as expected:
text = "coding in python is a lot of fun"
positions = [(0,6),(10,16),(29,32)]
textList = []
lastIndex = 0
for n, indexes in enumerate(positions):
    if n > 0:
        # text between the previous span and this one
        textList.append(text[lastIndex: indexes[0]])
    textList.append(text[indexes[0]: indexes[1]])
    lastIndex = indexes[1] + 1  # skip the space right after the span
print(textList)
Output: ['coding', 'in ', 'python', 'is a lot of ', 'fun']
Note: If the trailing spaces are not needed, you can strip them.
['key=IAfpK', ' age=58', ' key=WNVdi', ' age=64', ' key=jp9zt', ' age=47', ' key=0Sr4C', ' age=68', ' key=CGEqo', ' age=76', ' key=IxKVQ', ' age=79', ' key=eD221', ' age=29']
I have the list above and need to convert it to a dictionary, like
{"IAfpK":58,"WNVdi":64,.....}
I have tried the ast library and json.loads, but in vain.
Simple one-liner using a dict comprehension:
{x.split("=")[1]: int(y.split("=")[1]) for x,y in zip(arr[::2],arr[1::2])}
zip(arr[::2],arr[1::2]) iterates over pairs of the array, and str.split extracts the correct value for the key and value.
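The same pairing idea can be spelled out as a loop. This is only a sketch, assuming the list strictly alternates key and age entries; pairs_to_dict is a hypothetical name, and strip() is added because most entries in the example start with a space.

```python
def pairs_to_dict(items):
    """Build {key: age} from alternating 'key=...' / 'age=...' strings."""
    result = {}
    for key_item, age_item in zip(items[::2], items[1::2]):
        # partition("=") returns (before, "=", after); we want the part after.
        key = key_item.strip().partition("=")[2]
        age = age_item.strip().partition("=")[2]
        result[key] = int(age)
    return result


arr = ['key=IAfpK', ' age=58', ' key=WNVdi', ' age=64']
ages_by_key = pairs_to_dict(arr)
print(ages_by_key)  # {'IAfpK': 58, 'WNVdi': 64}
```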
If you know that your list is always following this exact format and order, just loop through the list:
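mydict = {}
last_key = None
for element in mylist:
    element = element.strip()  # most entries start with a space
    if element.startswith("key="):
        last_key = element[len("key="):]
        mydict[last_key] = None
    else:
        # dict.keys() cannot be indexed in Python 3, so remember the last key instead
        mydict[last_key] = int(element[len("age="):])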
Given a list arr of the shape [' key=aKey', ' age=valueForAKey', ' key=bKey', ...] (note the space at the start of each list element), you can use this dictionary comprehension to extract the matching keys and values and build the resulting dictionary.
{arr[i][5:]: arr[i+1][5:] for i in range(0, len(arr), 2)}
Try it out here: https://www.online-python.com/iGI3A2YEnr
If the number of leading spaces is inconsistent (as in the example you gave), you can use the lstrip() method to remove the leading spaces.
{arr[i].lstrip()[4:]: arr[i+1].lstrip()[4:] for i in range(0, len(arr), 2)}
Up to this function everything is fine: I get 4999 rows, which is the number of rows I have. Can you check the code below and tell me where I go wrong, so that I end up with 5095 instead of 4999 and, in the second function, 5032 instead of 4999 instances?
I must get no more than 4999.
Any help is appreciated.
a=[]
for i in matches:
    a.append([i for i in list(dict.fromkeys(i))])
print(len(a))
print(a)
result:
4999
[['23-year-old'], [' '], ['42 years old'], ['-year-old']..]
Can the '-year-old' entry be a problem here?
Now here I face the problem:
t=[]
for i in a:
    for j in i:
        p=len(j)
        if p>1:
            r=j.replace('-', ' ').split(' ')
            # print(r)
            t+=[s for s in r if s.isdigit()]
        else:
            t+=['']
print(len(t))
print(t)
output:
5095 #This should be 4999
['23', '', '42', '', '', '30', '31', ''...]
I also have the same issue with the gender list: I end up with 5032.
This part is not answered yet
import re
fil = data['transcription']
print(fil)
gender_aux = []
for i in fil:
    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
        # pass
    gender_dict = {"male": ["gentleman", "man", "male", "boy",'he'],
                   "female": ["lady","female", "woman", "girl",'she']}
    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux+=[' ']
            break
print(len(gender_aux))
print(gender_aux)
output:
5032 #this should be 4999
['female', 'male', 'male', ' ',
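One likely source of the extra gender entries, assuming the loop over gender sits inside the for loop after the try/except: when re.findall raises (for example on a missing, non-string value), the except branch appends ' ' and the loop that follows then appends a second label from the previous row's matches. The sketch below returns exactly one label per row. label_gender and the sample rows are hypothetical, and the \b word boundaries are an added assumption so that "he" does not match inside words such as "the".

```python
import re

GENDER_WORDS = {
    "male": {"gentleman", "man", "male", "boy", "he"},
    "female": {"lady", "female", "woman", "girl", "she"},
}
# \b keeps short words like "he" from matching inside longer words.
PATTERN = re.compile(
    r"\b(female|gentleman|woman|lady|man|male|girl|boy|she|he)\b")


def label_gender(text):
    """Return 'male', 'female', or ' ' for a single transcription."""
    if not isinstance(text, str):   # e.g. a missing value
        return " "
    match = PATTERN.search(text)
    if match is None:
        return " "
    word = match.group(1)
    return "male" if word in GENDER_WORDS["male"] else "female"


# Hypothetical sample rows standing in for data['transcription'].
rows = ["A 23-year-old female presents", None,
        "The gentleman reports pain", "no clue"]
labels = [label_gender(row) for row in rows]
print(len(labels), labels)
```

Because the function returns once per call, the output list always has the same length as the input.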
This answer assumes there are no decimal values in your dataset, and that each list item contains only one string with one number per string.
If all you're after is a list containing the integer values of everyone's ages, starting from your completed list a, you can simply use:
import re
t = [re.findall(r'\d+', item[0])[0] for item in a if re.findall(r'\d+', item[0])]
This list comprehension accomplishes a few things.
Firstly, because your a list is a list of single item lists, as we iterate through each item, we obtain the value of the first (and only) item in the list using item[0]. We then perform a regex operation (hence import re) on this item, with the search pattern r'\d+' which extracts only the integer values from each string (You can check out https://regex101.com/ to play around with regex patterns to better understand how they work).
Because re.findall returns a list of matches, and because it seems each string in your dataset will only contain one match (at most), we simply take the [0] index of the resulting list as our chosen value. Where there are no matches, re.findall returns an empty list. Because empty lists evaluate to false, the if statement in our list comprehension will prevent index errors on strings where there are no numbers to be extracted.
Using your example, the resulting t array would be as follows:
['23', '42']
Note that the empty strings are not included in the final list. If you wanted to include them, you could simply add an else condition to our if statement as follows:
t = [re.findall(r'\d+', item[0])[0] if re.findall(r'\d+', item[0]) else '' for item in a]
this would result in
['23', '', '42', '', '']
Lastly, if you wanted to convert each number (currently strings) to integer values, you could instead write:
t = [int(re.findall(r'\d+', item[0])[0]) if re.findall(r'\d+', item[0]) else '' for item in a]
which finally, would result in:
[23, '', 42, '', '']
Of course, this all assumes there are no decimal values in your dataset, and that each list item will only contain one string, with each string only containing one desired number.
For example, our re.findall with the string "I am 42 years old, and my son is 16", would return ['42', '16'], and because we only return the first item of the list, the final list would not include '16'. Keep this in mind.
Because we aren't creating any additional items (e.g. by using str.split()), we can be sure the resulting list consists of the same number of elements (so long as we use the variant with the else '' statement). If we use the first variant, the resulting list will contain only as many elements as there are elements in a containing numbers.
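To make the length guarantee easy to check, here is a small sketch that pulls the extraction into a helper and compares the list lengths. first_number is a hypothetical name, and re.search is used in place of calling re.findall twice; the one-number-per-string assumption from above still applies.

```python
import re


def first_number(row):
    """Return the first run of digits in a one-item row, or '' if none."""
    match = re.search(r"\d+", row[0])
    return match.group() if match else ""


a = [['23-year-old'], [' '], ['42 years old'], ['-year-old']]
ages = [first_number(row) for row in a]
print(ages)                  # ['23', '', '42', '']
print(len(ages) == len(a))   # True
```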
I am a newbie in Python. I have a list which contains 100 separate strings, each 300 characters long. After splitting each string, the result acts like a 2D array, and I want to join the pieces back together to get a list like the one I started with.
Below is my sample list and what I have tried, but it does not work. I want to replace '1' with ' ', remove words shorter than 3 characters, and join the rest together. Replacing alone is not enough, because I cannot remove the short words that way.
1 c1|FaAO120O'8ovfoy1W#atvGs1[1s1[1/1]O-a8o1-...
2 O8v^10O#to1'#^'^tv1^]s111t01Otaq>-ata_1...
3 *#^-G1_#O-#b^'ta8a2%e1|28Oot^12#O-#ys1>c...
def tokenize(text):
    return text.split("1")
def trimm(text):
    return ' '.join([i for i in data if len(i) > 3])
token_data = [tokenize(i) for i in X]
#trim_data = [trimm(i) for i in token_data]
for n in token_data:
    for i in token_data[n]:
        res=trimm(i)
Below is after tokenize function.
['c', '|FaAO', "20O'8o\x02vfoy", 'W#at\x1bvGs', '[', 's', '[', '/', ']O-a8o', '-\x1b-\x03\x1b#', '^]', '-a\x02\x1b', 'av', 'vc]]\x1b#a\x02d', ']#^-', 'O', 'v\x1bz\x1b#\x1b', "A\x1b'#\x1bvva^\x02", '\x03#^cd0t', '^\x02s', '[', '\x03o', "-\x1b\x02^'Ocv\x1b", 'Ov', 'W\x1b88', 'Ov', 'O', '-\x1b\x02tO8', '\x03#\x1bOf', 'A^W\x02\x08', '', '>0\x1b', 'av', '\x03\x1ba\x02d', 't#\x1bOt\x1bA', 'Wat0s', '[', 'gO8oA^8', 'Wat0', 'v^-\x1b', 'vc__\x1bvv', '\x03ct', 't0\x1b', 't#\x1bOt-\x1b\x02tv', '\x03\x1ba\x02d', "'#^zaA\x1bA", 't0#^cd0', '0\x1b#s', '[', "'vo_0aOt#avt", 'O#\x1b', '\x02^t', 'vOtav]O_t^#o\x08', '', '>^-']
Below is after trimm function
|FaAO 20O'8ovfoy W#atvGs ]O-a8o --# -a vc]]#ad ]#^- vz# A'#vva^ #^cd0t -^'Ocv W88 -tO8 #Of A^W ad t#OtA Wat0s gO8oA^8 Wat0 v^- vc__vv t#Ot-tv ad '#^zaAA t0#^cd0 0#s 'vo_0aOt#avt vOtav]O_t^#
I can do the above for only one 300-character string. However, I want to do it for all strings in the original list. How can I write a loop that trims and joins every string?
These two lines look wrong:
for n in token_data:
    for i in token_data[n]:
n will be an element of token_data, not an index, so token_data[n] does not make sense; I would use for i in n: for the second for loop instead.
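Putting that together, a minimal sketch of the whole pipeline might look like this. It assumes trimm is meant to filter the tokens it is given (the question's version reads an undefined name, data), keeps the question's len > 3 threshold, and uses a short hypothetical X in place of the real 300-character strings.

```python
def tokenize(text):
    """Split one string on the '1' separator."""
    return text.split("1")


def trimm(tokens):
    """Join the tokens that are longer than 3 characters."""
    return " ".join(tok for tok in tokens if len(tok) > 3)


# Hypothetical stand-in for the real list of 100 strings.
X = ["abc1de1fghi1j", "kl1mnop1qrs"]

# One trimmed string per original string, so the list keeps its length.
trimmed = [trimm(tokenize(s)) for s in X]
print(trimmed)  # ['fghi', 'mnop']
```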
values = ['Limpets', 'Mussels', 'Phytoplankton', 'Zooplankton', 'Prawn', 'Crab', 'Whelk', 'Seaweed']
keys = ['Whelk ', 'Mussels ', 'Bird ', 'Prawn ', 'Fish ', 'Zooplankton ', 'Crab ', 'Lobster ', 'Limpets ']
What I want is the items in values that are not in keys. I have tried writing it as:
for item in values:
    if item not in keys:
        print(item)
The answer I should get is
phytoplankton
seaweed
but what I get instead is:
Phytoplankton
Seaweed
Limpets
Mussels
Crab
Whelk
Prawn
Zooplankton
I also tried storing the item in a list and then printing that list but nothing I've tried is working for me. I saw some answers using list comprehension but I'm taking an introductory course so all I've got is loops... I'm using python3.5 if that makes any difference.
Just use sets
set(values).difference(set(keys))
Or for this particular example OP can use
set(values).difference(set([i.strip() for i in keys]))
Each item in the keys list has a trailing space, so we need to strip that first.
In the keys list you have spaces after the words, for example 'Whelk ', but in values you don't, e.g. 'Whelk'. 'Whelk ' and 'Whelk' are two different strings, so when you write
if item not in keys:
it returns True. You should remove the whitespace after the words in the keys list first and then try your code sample.
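Since the question asks for a loops-only solution, the advice above can be sketched without sets or comprehensions. The names clean_keys and missing are hypothetical.

```python
values = ['Limpets', 'Mussels', 'Phytoplankton', 'Zooplankton',
          'Prawn', 'Crab', 'Whelk', 'Seaweed']
keys = ['Whelk ', 'Mussels ', 'Bird ', 'Prawn ', 'Fish ',
        'Zooplankton ', 'Crab ', 'Lobster ', 'Limpets ']

# First pass: build a copy of keys without the trailing spaces.
clean_keys = []
for key in keys:
    clean_keys.append(key.strip())

# Second pass: collect the values that have no matching key.
missing = []
for item in values:
    if item not in clean_keys:
        missing.append(item)

for item in missing:
    print(item)
```

This prints Phytoplankton and Seaweed, each on its own line.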
Your code is correct. The thing is, each string in your keys list contains a space at the end. This code:
values = ['Limpets', 'Mussels', 'Phytoplankton', 'Zooplankton', 'Prawn', 'Crab', 'Whelk', 'Seaweed']
keys = ['Whelk', 'Mussels', 'Bird', 'Prawn', 'Fish', 'Zooplankton', 'Crab', 'Lobster', 'Limpets']
for item in values:
    if item not in keys:
        print(item)
produces this output:
Phytoplankton
Seaweed
If for some reason you cannot modify the entries of keys, you can modify your loop to be:
for item in values:
    if item + " " not in keys:
        print(item)
which will give you the same output:
Phytoplankton
Seaweed
I think this is what you want (once the trailing spaces have been stripped from keys):
[x for x in values if x not in keys]
As per your question, the below will help:
[i for i in values if i not in [k.strip() for k in keys]]
Also, you can use set in Python for more evaluations of this kind.
It will return the whole list because your keys contain additional spaces in each string, so you have to remove the spaces first.
updated_keys = [i.strip() for i in keys]
answer_list = [i for i in values if i not in updated_keys]
list(set(values) - set([i.strip() for i in keys]))
You first need to use map(lambda x: x.strip(), keys) and keep its result, then use reduce or any other solution:
from functools import reduce  # reduce is not a builtin in Python 3
values = ['Limpets', 'Mussels', 'Phytoplankton', 'Zooplankton', 'Prawn', 'Crab', 'Whelk', 'Seaweed']
keys = ['Whelk ', 'Mussels ', 'Bird ', 'Prawn ', 'Fish ', 'Zooplankton ', 'Crab ', 'Lobster ', 'Limpets ']
keys = list(map(lambda x: x.strip(), keys))  # keep the stripped keys
reduce(lambda x, y: x + [y] if y not in keys else x, values, [])
O/P : ['Phytoplankton', 'Seaweed']