This code is based on an elegant answer I received to this question and scaled up to accept nested lists of up to 5 elements. The overall goal is to merge nested lists that share the same value at index position 1.
The exception pass suppresses the IndexError when a nested list in marker_array has 4 elements. But the code fails to include the last list after the 4 element list in the final output. My understanding was that the purpose of defaultdict was to avoid IndexErrors in the first place.
# Nested list can have 4 or 5 elements per list. Sorted by [1]
marker_array = [
['hard','00:01','soft','tall','round'],
['heavy','00:01','light','skinny','bouncy'],
['rock','00:01','feather','tree','ball'],
['fast','00:35','pidgeon','random'],
['turtle','00:40','wet','flat','tail']]
from collections import defaultdict
d1= defaultdict(list)
d2= defaultdict(list)
d3= defaultdict(list)
d4= defaultdict(list)
# Suppress IndexError due to 4 element list.
# Add + ' ' because ' '.join(d2[x])... create spaces between words.
try:
    for pxa in marker_array:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
except IndexError:
    pass
# Combine all the pieces.
res = [[' '.join(d1[x]),
x,
''.join(d2[x]),
''.join(d3[x]),
''.join(d4[x])]
for x in sorted(d1)]
# Remove empty elements.
for p in res:
    if not p[-1]:
        p.pop()
print(res)
The output is almost what I need:
[['hard heavy rock', '00:01', 'soft light feather ', 'tall skinny tree ', 'round bouncy ball '], ['fast', '00:35', 'pidgeon ', 'random ']]
This scaled up version has certainly lost some of the original elegance due to my skill level. Any general pointers on improving this code are much appreciated, but my two main questions in order of importance are:
How can I make sure that the ['turtle','00:40','wet','flat','tail'] nested list is not ignored?
What can I do to avoid trailing white space as in 'soft light feather '?
The problem is the placement of your try block. The IndexError isn't being caused by the defaultdict, it is because you're trying to access pxa[4] in the 4th row of marker_array, which doesn't exist.
Move your try / except inside the for loop, like this:
for pxa in marker_array:
    try:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
    except IndexError:
        pass
The output will now include the last row (the 'turtle' list), which was previously skipped.
To answer your second question, you can remove the trailing whitespace by calling strip() or rstrip() on the result of each ''.join() call (e.g. ''.join(d2[x]).rstrip()).
Because your try statement starts outside the for loop, an exception in the for loop causes the program to go to the except block and not return to the loop afterwards. Instead, put the try before the main block inside the loop:
for pxa in marker_array:
    try:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
    except IndexError:
        pass
Technically it's best practice to include as little code as possible inside the try block, so if you're sure that lists will never have fewer than 4 items, you can move the start of the try block down to the line immediately before you extend d4.
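As a rough sketch of that narrower try block (assuming, as stated above, that every row has at least four items), only the access to pxa[4] is guarded. The sample rows are a shortened copy of marker_array from the question.

```python
from collections import defaultdict

# Shortened sample data: one 4-element row between two 5-element rows.
marker_array = [
    ['rock', '00:01', 'feather', 'tree', 'ball'],
    ['fast', '00:35', 'pidgeon', 'random'],
    ['turtle', '00:40', 'wet', 'flat', 'tail'],
]

d1, d2, d3, d4 = (defaultdict(list) for _ in range(4))

for pxa in marker_array:
    # Indices 0 to 3 exist in every row, so these lines need no guard.
    d1[pxa[1]].extend(pxa[:1])
    d2[pxa[1]].extend(pxa[2] + ' ')
    d3[pxa[1]].extend(pxa[3] + ' ')
    try:
        # Only this access can fail, and only for 4-element rows.
        d4[pxa[1]].extend(pxa[4] + ' ')
    except IndexError:
        pass

print(sorted(d1))
```

Because the guard sits inside the loop, a short row no longer stops the rows after it from being processed.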
If I understand your code correctly, you're getting the trailing white space because you're adding a space after every word, including the last one collected for each key. Moving the space around (for example, adding it before pxa[4] instead of after pxa[3]) won't help, because d2, d3 and d4 are each joined separately. Instead, you can append each word without a space, like this:
d3[pxa[1]].append(pxa[3])
d4[pxa[1]].append(pxa[4])
and then combine with ' '.join(d3[x]) and ' '.join(d4[x]) rather than ''.join(...). I think that should fix it.
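Putting both points together (short rows must not end the loop, and words should be joined with single spaces rather than padded), one possible restructuring is sketched below. It is not the code from the question: the function name merge_rows is hypothetical, and it replaces the four defaultdicts with a single dictionary that groups whole rows by the value at index 1.

```python
from collections import defaultdict

marker_array = [
    ['hard', '00:01', 'soft', 'tall', 'round'],
    ['heavy', '00:01', 'light', 'skinny', 'bouncy'],
    ['rock', '00:01', 'feather', 'tree', 'ball'],
    ['fast', '00:35', 'pidgeon', 'random'],
    ['turtle', '00:40', 'wet', 'flat', 'tail'],
]


def merge_rows(rows):
    """Merge rows sharing the same value at index 1 (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[1]].append(row)

    result = []
    for key in sorted(grouped):
        group = grouped[key]
        width = max(len(row) for row in group)
        merged = []
        for col in range(width):
            if col == 1:
                merged.append(key)
            else:
                # Skip rows that are too short to have this column.
                merged.append(' '.join(row[col] for row in group if col < len(row)))
        result.append(merged)
    return result


res = merge_rows(marker_array)
print(res)
```

Since each column is built with ' '.join, there is no trailing whitespace to strip, and no exception handling is needed because the row length is checked explicitly.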
I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but not sure how?
Any help is much appreciated!
Given your series as s:
s.str.slice(0, 5) + s.str.slice(15, 19) # when taking substrings by position
s.str.replace(r'\d{5}', '', regex=True) # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '', regex=True).str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
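The idea of breaking the value into separate fields, instead of counting character positions, can be sketched in plain Python as follows. This assumes every value has exactly four underscores before the free-text part; the function name shorten_label and the field names are hypothetical, chosen only for illustration.

```python
def shorten_label(label):
    """Keep the first, third and fifth underscore-separated fields."""
    # maxsplit=4 leaves any underscores in the free-text part untouched.
    ticker, _number, form, _date, section = label.split('_', 4)
    return ' '.join([ticker, form, section])


s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(shorten_label(s))  # AMAT 10Q Filing Section: Risk
```

On a DataFrame column, the same function could be applied with df['section'].map(shorten_label). A value with fewer fields would raise a ValueError when unpacking, which makes malformed rows easy to spot.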
If you want to replace characters at a fixed position and length, use str.slice_replace (adjust the start and stop positions to fit your data):
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
Other people would probably use regex to replace pieces in your string. However, I would:
Split the string
append the piece if it isn't a number
Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing pieces by position:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4]) # indices 0 to 3, the first 4 characters: 'AMAT'
print(s[15:19]) # indices 15 to 18: '_10Q'
print(s[15:]) # index 15 to the end
If you would like to just replace pieces:
print(s.replace('_', ' '))
you could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'
I have a dataframe which contains nested lists. Some of those lists are empty, or contain only whitespace, e.g.:
df=pd.DataFrame({'example':[[[' ', ' '],['Peter'],[' ', ' '],['bla','blaaa']]]})
For my further operations they are not allowed to be empty and cannot be deleted. Is there a way to fill them with e.g. 'some_string'?
I thought of something similar to
df.example = [[[a.replace(' ','some_string')if all a in i =='\s'for a in i]for i in x] for x in df.example], but this yields an invalid syntax error; furthermore, it wouldn't just fill the list, but each whitespace in the list.
Since I am still learning Python, my idea of a solution might be too complicated or completely wrong.
i.e. the solution should look like:
example
0 [[some_string], [Peter], [some_string], [bla, blaaa]]
Using apply
Ex:
df=pd.DataFrame({'example':[[[' ', ' '],['Peter'],[' ', ' '],['bla','blaaa']]]})
df["example"] = df["example"].apply(lambda x: [i if "".join(i).strip() else ['some_string'] for i in x])
print(df)
Output:
example
0 [[some_string], [Peter], [some_string], [bla, ...
Note: This will be slow if your data is very large because of the row-by-row iteration.
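The lambda can also be pulled out into a named helper, which is easier to read and to test without a DataFrame. This is only a sketch of the same logic; the name fill_blanks and its placeholder parameter are hypothetical.

```python
def fill_blanks(nested, placeholder='some_string'):
    """Replace sub-lists that are empty or whitespace-only with [placeholder]."""
    return [
        sub if ''.join(sub).strip() else [placeholder]
        for sub in nested
    ]


cell = [[' ', ' '], ['Peter'], [], ['bla', 'blaaa']]
print(fill_blanks(cell))
```

With pandas, it would be used as df["example"] = df["example"].apply(fill_blanks).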
I tend to use list comprehensions a lot in Python because I think they are a clean way to generate lists, but often I find myself coming back a week later and thinking to myself "What the hell did I do this for?!" and it's a 70+ character nested conditional list comprehension statement. I am wondering whether, past a certain point, I should break it out into if/elif/else, and what the performance impact, if any, of doing so would be.
My current circumstance:
Returned structure from call is a list of tuples. I need to cast it to a list, some values need to be cleaned up, and I need to strip the last element from the list.
e.g.
[(val1, ' ', 'ChangeMe', 'RemoveMe'),
(val1, ' ', 'ChangeMe', 'RemoveMe'),
(val1, ' ', 'ChangeMe', 'RemoveMe')]
So in this case, I want to remove RemoveMe, replace all ' ' with '' and replace ChangeMe with val2. I know it is a lot of changes, but the data I am returned is terrible sometimes and I have no control over what is coming to me as a response.
I currently have something like:
response = cursor.fetchall()
response = [['' if item == ' ' else item if item != 'ChangeMe' else 'val2' for item in row][:-1] for row in response]
Is a nested multi-conditional comprehension statement frowned upon? I know stylistically Python prefers to be very readable, but also compact and not as verbose.
Any tips or info would be greatly appreciated. Thanks all!
Python favors one-liners only on the condition that they make the code more readable, not more complicated.
In this case, you use two nested list comprehensions, two chained ternary operators and a list slice, all on a single line that exceeds 100 characters. It is anything but readable.
Sometimes it's better to use a classic for loop.
result = []
for val, space, item, remove in response:
    result.append([val, '', 'val2'])
And then you realise you can write it as a much more comprehensible list comprehension (assuming your filter condition is simple):
result = [[val, '', 'val2'] for val, *_ in response]
Remember: code is written once, but it is read many times.
This is one quick way you could do a list-comprehension making use of a dictionary for mapping items:
response = [('val1', ' ', 'ChangeMe', 'RemoveMe'), ('val1', ' ', 'ChangeMe', 'RemoveMe'), ('val1', ' ', 'ChangeMe', 'RemoveMe')]
map_dict = {' ': '', 'ChangeMe': 'val2', 'val1': 'val1'}
response = [tuple(map_dict[x] for x in tupl if x != 'RemoveMe') for tupl in response]
# [('val1', '', 'val2'), ('val1', '', 'val2'), ('val1', '', 'val2')]
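A variation on the dictionary approach, sketched under the assumption that the unwanted value is always the last element of each row: dict.get(item, item) falls back to the item itself, so values that need no change do not have to be listed in the mapping. The names REPLACEMENTS and clean_row are hypothetical.

```python
# Only the values that actually change need an entry.
REPLACEMENTS = {' ': '', 'ChangeMe': 'val2'}


def clean_row(row):
    """Drop the last element, then swap any value that has a replacement."""
    return [REPLACEMENTS.get(item, item) for item in row[:-1]]


response = [
    ('val1', ' ', 'ChangeMe', 'RemoveMe'),
    ('val1', ' ', 'ChangeMe', 'RemoveMe'),
    ('val1', ' ', 'ChangeMe', 'RemoveMe'),
]
cleaned = [clean_row(row) for row in response]
print(cleaned)
```

Moving the per-row logic into a small function keeps the outer comprehension short enough to read at a glance.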
I'm doing a for loop over a list in Python, which to all of my knowledge should create a shallow copy of each element in the list. However, when I perform operations on elements, the changes aren't being reflected in the original list.
Here's the code:
interactions = f.readlines()
for interaction in interactions:
    orig = interaction
    interaction = interaction.replace('.', '').replace(' ', '').strip('\n')
    if len(interaction) == 0:
        interactions.remove(orig)
    print(interactions)
The output of this operation given an initial list of ['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. '] is:
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n']
When I follow the debugger I can see the correct operations being performed on interaction, but the changes just aren't being reflected in interactions. What am I doing wrong here?
Why should it reflect? You're modifying str elements and they can't get altered in place, new strings containing the altered value get returned; nothing reflects in the original list. :-)
Instead, you should iterate over a copy of the list, interactions[:], and assign using the index that enumerate produces:
for i, interaction in enumerate(interactions[:]):
    # redundant in this snippet
    orig = interaction
    # strip() will remove '\n' and leading-trailing ' '
    interaction = interaction.replace('.', '').replace(' ', '').strip('\n')
    # if interaction: suffices
    if len(interaction) == 0:
        interactions.remove(orig)
    else:
        interactions[i] = interaction
    print(interactions)
This now gives output of:
['Bph', 'Cardiovascular\n', 'Diabetes\n', '. ']
['Bph', 'Cardiovascular', 'Diabetes\n', '. ']
['Bph', 'Cardiovascular', 'Diabetes', '. ']
['Bph', 'Cardiovascular', 'Diabetes']
Python strings are immutable, so you either need to replace each element with your cleaned version (using indices like in Jim's answer) or build up a new list. IMO the neatest way to do this is with a list comprehension:
interactions = [i.replace('.', '').replace(' ', '').strip('\n') for i in interactions]
interactions = list(filter(None, interactions)) # Remove empty elements
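To make the underlying point concrete, here is a small sketch showing that rebinding the loop variable leaves the list untouched, followed by a helper that cleans and filters in a single pass. The name clean_lines is hypothetical. Building a new list also avoids removing items from a list while iterating over it.

```python
interactions = ['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']

# Rebinding the loop variable creates a new string; the list is unchanged.
for interaction in interactions:
    interaction = interaction.replace('.', '')
unchanged = list(interactions)


def clean_lines(lines):
    """Return cleaned copies of the lines, skipping any that end up empty."""
    cleaned = (
        line.replace('.', '').replace(' ', '').strip('\n')
        for line in lines
    )
    return [line for line in cleaned if line]


result = clean_lines(interactions)
print(result)
```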
Basically, I print a long message but I want to group all of those words into 5 character long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on whitespace and joining the pieces back together without spaces, then slicing the result into fixed-size pieces and collecting them in a list. An elegant way to do that is a list comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
chunks_of_five = [joined_text[i:i+5] for i in range(0, len(joined_text), 5)]
' '.join(chunks_of_five)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be
Simply do the following.
import re
sentence="iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)
count=0
new_sentence=''
for i in sentence:
    if(count%5==0 and count!=0):
        new_sentence=new_sentence+' '
    new_sentence=new_sentence+i
    count=count+1
print(new_sentence)
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .
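The same grouping can be wrapped in a small reusable function using slicing instead of a manual counter. This is a sketch only; the name chunk_text and the size parameter are hypothetical.

```python
def chunk_text(text, size=5):
    """Remove all whitespace, then regroup the characters into size-long pieces."""
    compact = ''.join(text.split())
    pieces = [compact[i:i + size] for i in range(0, len(compact), size)]
    return ' '.join(pieces)


sentence = "iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
print(chunk_text(sentence))
```

For the sentence above this prints the same grouping as the loop version; the last piece is shorter whenever the character count is not a multiple of size.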