I have a dataframe which contains nested lists. Some of those lists are empty, or contain only whitespace, e.g.:
df=pd.DataFrame({'example':[[[' ', ' '],['Peter'],[' ', ' '],['bla','blaaa']]]})
For my further operations they are not allowed to be empty and cannot be deleted. Is there a way to fill them with e.g. 'some_string'?
I thought of something similar to
df.example = [[[a.replace(' ','some_string')if all a in i =='\s'for a in i]for i in x] for x in df.example]
but this yields an invalid syntax error. Furthermore, it wouldn't fill the list as a whole, but replace each whitespace in the list.
Since I am still learning Python, my idea of a solution might be too complicated or completely wrong.
I.e. the solution should look like:
example
0 [[some_string], [Peter], [some_string], [bla, blaaa]]
Use apply, for example:
df=pd.DataFrame({'example':[[[' ', ' '],['Peter'],[' ', ' '],['bla','blaaa']]]})
df["example"] = df["example"].apply(lambda x: [i if "".join(i).strip() else ['some_string'] for i in x])
print(df)
Output:
example
0 [[some_string], [Peter], [some_string], [bla, ...
Note: This will be slow if your data is very large, because apply iterates over the rows in Python.
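The test inside the lambda can also be pulled out into a named function, which is easier to read and to check on its own. A minimal sketch in plain Python; the names is_blank and fill_blank are hypothetical, and fill_blank could be passed to df["example"].apply in place of the lambda:

```python
def is_blank(inner):
    """True if the inner list is empty or holds only whitespace strings."""
    return not "".join(inner).strip()


def fill_blank(outer, filler="some_string"):
    """Replace every blank inner list with [filler]; keep the others as they are."""
    return [[filler] if is_blank(inner) else inner for inner in outer]


row = [[' ', ' '], ['Peter'], [' ', ' '], ['bla', 'blaaa']]
filled = fill_blank(row)
print(filled)
```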
I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format. For example, I have "trous", "trouse" and "trousers", and I would like to replace the first two with "trousers".
I have tried using string.replace. It changes the first "trous" to "trousers" as it should, and it also works for "trouse", but when it gets to "trousers" it produces "trousersersers"! I think it matches every string that contains "trous" or "trouse", including "trousers", and changes them all.
Is there a way I can limit string.replace to look for exactly "trous"?
Here's what I've tried so far. As you can see, I have a good few changes to make. Most of them work fine, but the likes of trousers and t-shirts, which need several similar changes, are causing the trouble.
newTypes=[]
for string in types:
    underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
    newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous" , "trouse"]
for i in range(len(lst)):
    if "trous" in lst[i]:
        lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
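One caveat: the substring test rewrites the entire string, so a longer label that merely contains "trous" would be overwritten as well. If only the misspelled word itself should change, a regular expression with word boundaries is one option. A minimal sketch, assuming uppercase labels as in the question; the pattern and the name normalize are illustrative, not part of the original answer:

```python
import re

# \b marks a word boundary, so TROUS and TROUSE match only as whole words
# and the already-correct TROUSERS is left alone.
TROUSERS = re.compile(r"\bTROUSE?\b")


def normalize(label):
    """Rewrite the whole words TROUS / TROUSE to TROUSERS."""
    return TROUSERS.sub("TROUSERS", label)


labels = ["TROUS", "TROUSE", "TROUSERS", "BLUE TROUS"]
fixed = [normalize(label) for label in labels]
print(fixed)
```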
Use a dict that maps each misspelled string to its replacement:
d={
    'trous': 'trousers',
    'trouse': 'trousers',
    # ...
}
newtypes=[d.get(string,string) for string in types]
d.get(string,string) returns the replacement if string is a key in d, and string itself if it is not.
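The chained replace calls misbehave because str.replace matches substrings, so "trous" is also found inside "trousers". A dictionary lookup compares whole strings only. A small sketch with made-up sample values (the names corrections and new_types are illustrative):

```python
types = ["trous", "trouse", "trousers", "shirt"]

# Substring replacement also hits the text inside the already-correct word.
print("trousers".replace("trous", "trousers"))  # trousersers

# Whole-string lookup: keys must match exactly, anything else passes through.
corrections = {"trous": "trousers", "trouse": "trousers"}
new_types = [corrections.get(t, t) for t in types]
print(new_types)
```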
I looked for a while on here, but couldn't find the answer.
df['Products'] = ['CC: buns', 'people', 'CC: help me']
Trying to get only text after colon or keep text if no colon is in the string.
Tried a lot of things, but this was my final attempt.
x['Product'] = x['Product'].apply(lambda i: i.extract(r'(?i):(.+)') if ':' in i else i)
I get this error:
Might take two steps, I assume.
I tried this:
x['Product'] = x['Product'].str.extract(r'(?i):(.+)')
Got me everything after the colon and a bunch of NaN, so my regex is working. I am assuming my lambda sucks.
Use str.split and get last item
df['Products'] = df['Products'].str.split(': ').str[-1]
Out[9]:
Products
0 buns
1 people
2 help me
Try this
df['Products'] = df.Products.apply(lambda x: x.split(': ')[-1] if ':' in x else x)
print(df)
Output:
Products
0 buns
1 people
2 help me
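Both answers split on ': ' and keep the last piece, so a value that contains that separator more than once loses everything before the final one. If only the first colon should count, str.partition is an alternative. A minimal sketch in plain Python; the name after_colon is hypothetical, and the function could be passed to df['Products'].apply:

```python
def after_colon(text):
    """Return the text after the first colon, or the whole text if there is none."""
    head, sep, tail = text.partition(":")
    return tail.strip() if sep else text


products = ["CC: buns", "people", "CC: help me", "CC: ratio 1:2"]
cleaned = [after_colon(p) for p in products]
print(cleaned)
```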
I tend to use list comprehensions a lot in Python because I think they are a clean way to generate lists, but often I find myself coming back a week later and thinking to myself "What the hell did I do this for?!" and it's a 70+ character nested conditional list comprehension statement. I am wondering whether, past a certain point, I should break it out into if/elif/else, and what the performance impact, if any, of doing so would be.
My current circumstance:
Returned structure from call is a list of tuples. I need to cast it to a list, some values need to be cleaned up, and I need to strip the last element from the list.
e.g.
[(val1, ' ', 'ChangeMe', 'RemoveMe'),
(val1, ' ', 'ChangeMe', 'RemoveMe'),
(val1, ' ', 'ChangeMe', 'RemoveMe')]
So in this case, I want to remove RemoveMe, replace all ' ' with '' and replace ChangeMe with val2. I know it is a lot of changes, but the data I am returned is terrible sometimes and I have no control over what is coming to me as a response.
I currently have something like:
response = cursor.fetchall()
response = [['' if item == ' ' else item if item != 'ChangeMe' else 'val2' for item in row][:-1] for row in response]
Is a nested multi-conditional comprehension statement frowned upon? I know stylistically Python prefers to be very readable, but also compact and not as verbose.
Any tips or info would be greatly appreciated. Thanks all!
Python favors one-liners only on the condition that they make the code more readable, not more complicated.
In this case, you use two nested list comprehensions, two adjacent ternary operators and a list slice, all on a single line that exceeds 100 characters... It is anything but readable.
Sometimes it's better to use a classic for loop.
result = []
for val, space, item, remove in response:
    result.append([val, '', 'val2'])
And then you realise you can write it as a much more comprehensible list comprehension (assuming your filter condition is simple):
result = [[val, '', 'val2'] for val, *_ in response]
Remember, code is written once, but it is read many times.
This is one quick way you could do a list-comprehension making use of a dictionary for mapping items:
response = [('val1', ' ', 'ChangeMe', 'RemoveMe'), ('val1', ' ', 'ChangeMe', 'RemoveMe'), ('val1', ' ', 'ChangeMe', 'RemoveMe')]
map_dict = {' ': '', 'ChangeMe': 'val2', 'val1': 'val1'}
response = [tuple(map_dict[x] for x in tupl if x != 'RemoveMe') for tupl in response]
# [('val1', '', 'val2'), ('val1', '', 'val2'), ('val1', '', 'val2')]
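Note that map_dict[x] raises a KeyError for any value that is not listed, which is why 'val1' has to map to itself. Using dict.get with the item as its own default removes that requirement, and slicing off the last element by position does not depend on its text. A minimal sketch; the names REPLACEMENTS and clean_row are hypothetical, and 'val3' is made-up sample data:

```python
REPLACEMENTS = {' ': '', 'ChangeMe': 'val2'}


def clean_row(row):
    """Drop the last column, then swap any value that has a replacement."""
    return [REPLACEMENTS.get(item, item) for item in row[:-1]]


response = [('val1', ' ', 'ChangeMe', 'RemoveMe'),
            ('val3', ' ', 'ChangeMe', 'RemoveMe')]
cleaned = [clean_row(row) for row in response]
print(cleaned)
```

Naming the per-row step keeps the outer comprehension short, which speaks to the readability concern raised in the question.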
I'm doing a for loop over a list in Python, which to the best of my knowledge should create a shallow copy of each element in the list. However, when I perform operations on the elements, the changes aren't reflected in the original list.
Here's the code:
interactions = f.readlines()
for interaction in interactions:
    orig = interaction
    interaction = interaction.replace('.', '').replace(' ', '').strip('\n')
    if len(interaction) == 0:
        interactions.remove(orig)
    print(interactions)
The output of this operation given an initial list of ['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. '] is:
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
['. Bph\n', 'Cardiovascular\n', 'Diabetes\n']
When I follow the debugger I can see the correct operations being performed on interaction, but the changes just aren't reflected in interactions. What am I doing wrong here?
Why should they be reflected? You're working with str elements, which can't be altered in place; methods like replace return new strings containing the altered value, so nothing changes in the original list. :-)
Instead, you should iterate over a copy, interactions[:], and assign using the index that enumerate produces:
for i, interaction in enumerate(interactions[:]):
    # redundant in this snippet
    orig = interaction
    # strip() will remove '\n' and leading-trailing ' '
    interaction = interaction.replace('.', '').replace(' ', '').strip('\n')
    # if interaction: suffices
    if len(interaction) == 0:
        interactions.remove(orig)
    else:
        interactions[i] = interaction
    print(interactions)
This now gives output of:
['Bph', 'Cardiovascular\n', 'Diabetes\n', '. ']
['Bph', 'Cardiovascular', 'Diabetes\n', '. ']
['Bph', 'Cardiovascular', 'Diabetes', '. ']
['Bph', 'Cardiovascular', 'Diabetes']
Python strings are immutable, so you either need to replace each element with your cleaned version (using indices like in Jim's answer) or build up a new list. IMO the neatest way to do this is with a list comprehension:
interactions = [i.replace('.', '').replace(' ', '').strip('\n') for i in interactions]
interactions = list(filter(None, interactions)) # Remove empty elements
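The two steps can also be folded into a single function that cleans and filters in one pass, which avoids calling remove on a list while looping over it. A minimal sketch using the sample data from the question; the name clean_lines is hypothetical:

```python
def clean_lines(lines):
    """Strip dots, spaces and newlines from each line; drop lines left empty."""
    cleaned = (line.replace('.', '').replace(' ', '').strip('\n') for line in lines)
    return [line for line in cleaned if line]


interactions = ['. Bph\n', 'Cardiovascular\n', 'Diabetes\n', '. ']
result = clean_lines(interactions)
print(result)
```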
This code is based on an elegant answer I received to this question and scaled up to accept nested lists of up to 5 elements. The overall goal is to merge nested lists that have a repeating value in index position 1.
The exception pass suppresses the IndexError when a nested list in marker_array has 4 elements. But the code fails to include the last list after the 4 element list in the final output. My understanding was that the purpose of defaultdict was to avoid IndexErrors in the first place.
# Nested list can have 4 or 5 elements per list. Sorted by [1]
marker_array = [
['hard','00:01','soft','tall','round'],
['heavy','00:01','light','skinny','bouncy'],
['rock','00:01','feather','tree','ball'],
['fast','00:35','pidgeon','random'],
['turtle','00:40','wet','flat','tail']]
from collections import defaultdict
d1= defaultdict(list)
d2= defaultdict(list)
d3= defaultdict(list)
d4= defaultdict(list)
# Suppress IndexError due to 4 element list.
# Add + ' ' because ' '.join(d2[x])... create spaces between words.
try:
    for pxa in marker_array:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
except IndexError:
    pass
# Combine all the pieces.
res = [[' '.join(d1[x]),
x,
''.join(d2[x]),
''.join(d3[x]),
''.join(d4[x])]
for x in sorted(d1)]
# Remove empty elements.
for p in res:
    if not p[-1]:
        p.pop()
print res
The output is almost what I need:
[['hard heavy rock', '00:01', 'soft light feather ', 'tall skinny tree ', 'round bouncy ball '], ['fast', '00:35', 'pidgeon ', 'random ']]
This scaled up version has certainly lost some of the original elegance due to my skill level. Any general pointers on improving this code are much appreciated, but my two main questions in order of importance are:
How can I make sure that the ['turtle','00:40','wet','flat','tail'] nested list is not ignored?
What can I do to avoid trailing white space as in 'soft light feather '?
The problem is the placement of your try block. The IndexError isn't being caused by the defaultdict, it is because you're trying to access pxa[4] in the 4th row of marker_array, which doesn't exist.
Move your try / except inside the for loop, like this:
for pxa in marker_array:
    try:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
    except IndexError:
        pass
The output will now include the rows that come after the 4-element row.
To answer your second question, you can remove the whitespace by calling strip() or rstrip() on the result of each ''.join() call (e.g. ''.join(d2[x]).strip()).
Because your try statement starts outside the for loop, an exception in the for loop causes the program to go to the except block and not return to the loop afterwards. Instead, put the try before the main block inside the loop:
for pxa in marker_array:
    try:
        d1[pxa[1]].extend(pxa[:1])
        d2[pxa[1]].extend(pxa[2] + ' ')
        d3[pxa[1]].extend(pxa[3] + ' ')
        d4[pxa[1]].extend(pxa[4] + ' ')
    except IndexError:
        pass
Technically it's best practice to include as little code as possible inside the try block, so if you're sure that lists will never have fewer than 4 items, you can move the start of the try block down to the line immediately before you extend d4.
If I understand your code correctly, you're getting the trailing white space because you're adding a space after pxa[4]. Of course, removing the space in d4[pxa[1]].extend(pxa[4] + ' ') so that it reads d4[pxa[1]].extend(pxa[4]) won't solve your problem for the shorter lists. Instead, you can leave out the space after pxa[3] and add one before pxa[4], like this:
d3[pxa[1]].extend(pxa[3])
d4[pxa[1]].extend(' ' + pxa[4])
I think that should fix it.
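A different route that sidesteps both problems is to store whole words with append and join them with ' ' at the end, looping over however many columns each row actually has. That needs no try/except and leaves no trailing spaces. A minimal sketch in Python 3, not the approach from the answers above; the name columns and the nested defaultdict layout are assumptions made for illustration:

```python
from collections import defaultdict

marker_array = [
    ['hard', '00:01', 'soft', 'tall', 'round'],
    ['heavy', '00:01', 'light', 'skinny', 'bouncy'],
    ['rock', '00:01', 'feather', 'tree', 'ball'],
    ['fast', '00:35', 'pidgeon', 'random'],
    ['turtle', '00:40', 'wet', 'flat', 'tail']]

# key -> column position -> list of words (position 1, the key itself, is skipped)
columns = defaultdict(lambda: defaultdict(list))
for row in marker_array:
    key = row[1]
    columns[key][0].append(row[0])
    # Only the columns that exist in this row are visited, so short rows are fine.
    for pos, word in enumerate(row[2:], start=2):
        columns[key][pos].append(word)

res = []
for key in sorted(columns):
    cols = columns[key]
    merged = [' '.join(cols[0]), key]
    merged += [' '.join(cols[pos]) for pos in sorted(cols) if pos >= 2]
    res.append(merged)
print(res)
```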