I have the following DataFrame:
x = [{'name': 'a.1,b,c,d,e,a,f,g,h'}, {'name': 'b.1,c,a.1,d,e.1,g,a,h'}, {'name': 'b.1,d,e,a,f.1,c,r'}]
df = pd.DataFrame(x)
I need to filter the 'name' column and remove the values a and a.1. I would be grateful for any help.
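A runnable version of this setup (with the pandas import added; the desired result, judging from the sample data, keeps every token except a and a.1):

```python
import pandas as pd

x = [{'name': 'a.1,b,c,d,e,a,f,g,h'},
     {'name': 'b.1,c,a.1,d,e.1,g,a,h'},
     {'name': 'b.1,d,e,a,f.1,c,r'}]
df = pd.DataFrame(x)
print(df)
# desired name_1: b,c,d,e,f,g,h / b.1,c,d,e.1,g,h / b.1,d,e,f.1,c,r
```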
You can use list comprehension:
df['name_1'] = [','.join([a for a in d.split(',') if a not in {'a', 'a.1'}]) for d in df['name']]
You can also apply the very same function to the 'name' column:
df['name_1'] = df['name'].apply(lambda row: ','.join([a for a in row.split(',') if a not in {'a', 'a.1'}]))
Output:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
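A small refinement to the comprehension above, as a sketch: build the exclusion set once instead of constructing it for every element, since a set gives O(1) membership tests:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a.1,b,c,d,e,a,f,g,h',
                            'b.1,c,a.1,d,e.1,g,a,h',
                            'b.1,d,e,a,f.1,c,r']})
drop = {'a', 'a.1'}  # built once, reused for every token
df['name_1'] = [','.join(t for t in s.split(',') if t not in drop)
                for s in df['name']]
print(df['name_1'].tolist())
# ['b,c,d,e,f,g,h', 'b.1,c,d,e.1,g,h', 'b.1,d,e,f.1,c,r']
```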
You can use the DataFrame's apply function like this:
def update(r):
    items = r.split(',')
    return ','.join(filter(lambda x: x != 'a.1' and x != 'a', items))
df['name_1'] = df['name'].apply(update)
df
Output:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
You can just use replace (escaping the dot, which is otherwise a regex wildcard, and anchoring the bare a so it only matches a whole token):
df['name'] = df['name'].replace({r'a\.1,': '', r'\ba\b(?!\.)': ''}, regex=True).str.replace(',,', ',').str.strip(',')
name
0 b,c,d,e,f,g,h
1 b.1,c,d,e.1,g,h
2 b.1,d,e,f.1,c,r
Another regex option:
df['name'].str.replace(r"a\.1|\ba\b", "", regex=True).str.replace(r'(?<=,),|^,', "", regex=True)
0 b,c,d,e,f,g,h
1 b.1,c,d,e.1,g,h
2 b.1,d,e,f.1,c,r
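Note that in recent pandas releases Series.str.replace treats the pattern as a literal string unless regex=True is passed, so a runnable version of this two-pass idea (with a word boundary added as an extra safeguard, so an a inside longer tokens is untouched) looks like:

```python
import pandas as pd

s = pd.Series(['a.1,b,c,d,e,a,f,g,h', 'b.1,c,a.1,d,e.1,g,a,h'])
out = (s.str.replace(r'a\.1|\ba\b', '', regex=True)          # drop the tokens
        .str.replace(r'(?<=,),|^,|,$', '', regex=True))      # clean leftover commas
print(out.tolist())
# ['b,c,d,e,f,g,h', 'b.1,c,d,e.1,g,h']
```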
Try this:
df['name_1'] = df['name'].str.split(',').apply(lambda cell_value: ",".join(filter(lambda x: x not in {'a', 'a.1'}, cell_value)))
I am surprised that all proposed regex answers use two passes to achieve the goal.
Here is a regex answer that relies on a single expression:
# list of substrings to remove
# IMPORTANT: if one element is a prefix of another,
# the shorter string must be ordered after the longer one
# (here "a" comes after "a.1")
remove = ['a.1', 'a']
import re
regex = '({0}),|,({0})'.format('|'.join(map(re.escape, remove)))
df['name_1'] = df['name'].str.replace(regex, '', regex=True)
An alternative if capturing groups are unwanted:
regex = '(?:{0}),|,(?:{0})'.format('|'.join(map(re.escape, remove)))
Example:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h,a b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
Format of the generated regex:
'(a\\.1|a),|,(a\\.1|a)'
# or for the alternative
'(?:a\\.1|a),|,(?:a\\.1|a)'
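The generated expression can be sanity-checked with plain re before handing it to pandas:

```python
import re

remove = ['a.1', 'a']
# same construction as above: token followed by a comma, or preceded by one
regex = '(?:{0}),|,(?:{0})'.format('|'.join(map(re.escape, remove)))
print(re.sub(regex, '', 'a.1,b,c,d,e,a,f,g,h'))
# b,c,d,e,f,g,h
```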
I am facing an issue with my for loop with respect to lists.
I have two lists, shown below. I want to remove a name when it appears in both lists. My code:
Input:
col = ['cat','dog','bird','fish']
col_names = ['cat','bird']
r = []
for i in col:
    print(i)
    if i in col_names:
        col.remove(i)
        r.append(col)
print(r)
Then I get output like this:
r = [['dog', 'fish'], ['dog', 'fish']]
What I want is:
r = [['dog', 'bird', 'fish'], ['cat', 'dog', 'fish']]
A simpler way to achieve this is a nested list comprehension:
>>> col = ['cat','dog','bird','fish']
>>> col_names= ['cat', 'bird']
>>> [[c for c in col if c != cn] for cn in col_names]
[['dog', 'bird', 'fish'], ['cat', 'dog', 'fish']]
The code you shared is not logically correct. If you want to do it with an explicit for loop, you can write it like this:
new_list = []
for cn in col_names:
    temp_list = []
    for c in col:
        if c != cn:
            temp_list.append(c)
    new_list.append(temp_list)
print(new_list)
The problem here is that col is edited every time you remove an element from it, and r.append(col) stores a reference to that same list rather than a snapshot, so every entry of r reflects the final state. If you want your desired output, create a copy first, like the following:
col = ['cat','dog','bird','fish']
col_names = ['cat','bird','elephant']
r = []
for name in col_names:
    tmp = col.copy()      # creates a copy of the col list
    try:
        tmp.remove(name)
        r.append(tmp)
    except ValueError:    # name is not in the list
        pass              # don't do anything
print(r)
Output
[['dog', 'bird', 'fish'], ['cat', 'dog', 'fish']]
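To see why the original loop ends up holding the same list twice, a minimal demo of list aliasing:

```python
col = ['cat', 'dog', 'bird', 'fish']
r = []
r.append(col)        # stores a reference to col, not a copy
col.remove('cat')
r.append(col)        # the same object again
print(r)             # both entries reflect the mutated list
# [['dog', 'bird', 'fish'], ['dog', 'bird', 'fish']]
```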
I have a list of dictionaries with text as values, and I want to remove the dictionaries whose text contains certain words.
df = [{'name':'jon','text':'the day is light'},{'name':'betty','text':'good night'},{'name':'shawn','text':'good afternoon'}]
I want to remove the dictionaries whose 'text' value includes the word 'light' or 'night':
words = ['light','night']
pattern = re.compile(r"|".join(words))
Expected result:
df = [{'name':'shawn','text':'good afternoon'}]
[x for x in df if not any(w in x['text'] for w in words)]
You're close. All you need to do is write your list comprehension and apply the search pattern:
result = [x for x in df if not re.search(pattern, x['text'])]
Full example:
import re
df = [{'name':'jon','text':'the day is light'},{'name':'betty','text':'good night'},{'name':'shawn','text':'good afternoon'}]
words = ['light','night']
pattern = re.compile(r"|".join(words))
result = [x for x in df if not re.search(pattern, x['text'])]
print(result) # => [{'name': 'shawn', 'text': 'good afternoon'}]
I found my answer:
[x for x in df if not pattern.search(x['text'])]
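One hedged caveat on the pattern-building step: if the words could ever contain regex metacharacters, or should only match whole words, escape and anchor them when compiling:

```python
import re

words = ['light', 'night']
# re.escape guards against metacharacters; \b restricts to whole words
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
data = [{'name': 'jon', 'text': 'the day is light'},
        {'name': 'betty', 'text': 'good night'},
        {'name': 'shawn', 'text': 'good afternoon'}]
result = [x for x in data if not pattern.search(x['text'])]
print(result)
# [{'name': 'shawn', 'text': 'good afternoon'}]
```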
If I have aaabbbccc, I'd like to change it into a3b3c3.
I am using an if statement for this, but it looks inefficient.
Maybe regex would be helpful, but is a regex that matches only consecutive runs possible?
If I have aaabbbcccaaa, then I'd like to change it to a3b3c3a3, i.e. the algorithm should find each consecutive run, count it, and replace the run with its character and length.
Any hint on how to proceed would be appreciated.
def comp(string):
    index = []
    for i in range(len(string)):
        try:
            if string[i] is not string[i+1]:
                index.append(i)
        except:
            pass
    first = string[index[0]] + str(index[0]+1)
    print(first)
    message_comp = [first]
    for i in range(1, len(message_comp)):
        message_comp.append(message[index[i]]*(index[i-1]+1))
    final = ''.join(message_comp)
    return final
itertools.groupby:
Make an iterator that returns consecutive keys and groups from the iterable
import itertools
x = 'aaabbbcccaaa'
groups = [i + str(len([*j])) for i, j in itertools.groupby(x)]
# ['a3', 'b3', 'c3', 'a3']
join to finish up:
''.join(groups)
# a3b3c3a3
If needed, replace to remove 1:
''.join(groups).replace('1', '') instead of ''.join(groups)
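One caveat with that replace: it would also mangle counts that merely contain a 1 (e.g. a run of length 10 or 12). Suppressing the count per group avoids this:

```python
import itertools

x = 'aaabbbcccaaad'
# emit the count only when a run is longer than one character
out = ''.join(k + (str(n) if n > 1 else '')
              for k, n in ((k, sum(1 for _ in g))
                           for k, g in itertools.groupby(x)))
print(out)
# a3b3c3a3d
```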
Maybe itertools groupby?
from itertools import groupby
s = "aaabbbcccaaa"
groups = groupby(s)
a = [(label, sum(1 for _ in group)) for label, group in groups]
b = [i for sub in a for i in sub]
print("".join(map(str,b)))
output: a3b3c3a3
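Since the question asked whether regex can match consecutive runs: it can, with a backreference, in a single pass (a sketch alternative to groupby):

```python
import re

x = 'aaabbbcccaaa'
# (.)\1* captures one character and greedily consumes its repeats;
# the replacement emits the character plus the run length
out = re.sub(r'(.)\1*', lambda m: m.group(1) + str(len(m.group(0))), x)
print(out)
# a3b3c3a3
```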
I want to create a key list for a large hdf file from an Excel file:
Table:
    A     B   C
0   blue  a   a1
1   blue  a   a2
2   blue  b   b1
...
I want the key list to look like this:
/blue/a/a1
/blue/a/a2
/blue/b/b1
...
my code so far is:
import pandas as pd
import numpy as np
df = pd.read_excel('file.xlsx', usecols = ['A', 'B', 'C'])
print(df)
list1, list2, list3 = df['A'].tolist(), df['B'].tolist(), df['C'].tolist()
print(list1,list2,list3)
for i in list1:
    list1[i].append(list2[i]).append(list3[i])
print(list1)
The conversion to 3 lists works. Then I try to append the rows of each list together, without success. Is there a simple way to do that?
Use zip and then str.join to get your required output.
Ex:
res = []
list1, list2, list3 = df['A'].tolist(), df['B'].tolist(), df['C'].tolist()
for i in zip(list1, list2, list3):
    val = map(str, i)
    res.append("/{0}".format("/".join(val)))
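The same zip-based construction can be written as a one-line comprehension; a self-contained sketch with stand-in lists matching the desired keys:

```python
# stand-in data; in the real code these come from df['A'], df['B'], df['C']
list1, list2, list3 = ['blue', 'blue'], ['a', 'a'], ['a1', 'a2']
res = ['/' + '/'.join(map(str, t)) for t in zip(list1, list2, list3)]
print(res)
# ['/blue/a/a1', '/blue/a/a2']
```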
You can use an apply function to create a new column as required and then convert it to a list:
import pandas as pd
import numpy as np
df = pd.read_excel('file.xlsx', usecols = ['A', 'B', 'C'])
print(df)
x=df[['A','B','C']].apply(lambda row: '/'+row['A']+'/'+row['B']+'/'+row['C'],axis=1)
x.tolist()
Update shorter code:
x=df[['A','B','C']].apply(lambda row: '/'+'/'.join(row),axis=1)
x.tolist()
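A hedged alternative: when the three columns are already strings, plain vectorized concatenation avoids apply entirely:

```python
import pandas as pd

# stand-in frame; the real one comes from pd.read_excel
df = pd.DataFrame({'A': ['blue', 'blue'], 'B': ['a', 'a'], 'C': ['a1', 'a2']})
keys = ('/' + df['A'] + '/' + df['B'] + '/' + df['C']).tolist()
print(keys)
# ['/blue/a/a1', '/blue/a/a2']
```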