I have a list of strings with a few unclean entries and I want to replace the unclean entries with clean entries
list = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
I want to get a list that is like this
cleanList = ['created_DATE', 'column1', 'timestamp', 'location']
I tried the following:
str_match = [s for s in list if "(case" in s]  # find the intersecting elements
print(str_match)
new = []
for k in str_match:
    a = k.replace("(case", "")
    new.append(a)  # make a list of the words without the substring
print(new)
I am not sure how to replace the entries in the original list with the entries from the new list. Can someone please help?
Thank you
If you want to remove all occurrences of "(case" from your list's elements, then you could write it like this:
list = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
clean = []
for n in list:
    clean.append(n.replace("(case", ""))
print(clean)
You can either create a new list clean, as suggested by @alani:
import re
myList = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
# raw string so that \( reaches the regex engine unchanged
clean = [re.sub(r"\(.*", "", s) for s in myList]
print(clean)
or iterate over the elements of myList and update them in place:
for i in range(len(myList)):
    if "(case" in myList[i]:
        myList[i] = myList[i].replace("(case", "")
print(myList)
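If other variables refer to the same list object, the in-place update can also be written with slice assignment. The following is a minimal sketch, not taken from the answers above; the names columns and alias are hypothetical, and columns is used instead of list to avoid shadowing the built-in.

```python
# Hypothetical data mirroring the question's list.
columns = ['created_DATE', 'column1(case', 'timestamp', 'location(case']
alias = columns  # a second reference to the same list object

# Slice assignment replaces the contents of the existing list object
# instead of binding the name to a new list.
columns[:] = [name.replace("(case", "") for name in columns]

print(columns)
print(alias is columns)
```

Because the list object itself is modified, alias sees the cleaned entries as well.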
Related
I'm trying to filter list1 based on another list, list2, with the following code:
import csv
with open('screen.csv') as f:  # a file with a list of all the article titles
    reader = csv.reader(f)
    list1 = list(reader)
print(list1)
list2 = ["Knowledge Management", "modeling language"]  # keywords that an article title should have (at least one of them)
list2 = [str(x) for x in list2]
occur = [i for i in list1 for j in list2 if str(j) in i]
print(occur)
but the output is empty.
My list1 looks like this:
list_1 is actually a list of lists, not a list of strings, so you need to flatten it before trying to compare elements:
list_1 = [['foo bar'], ['baz beep bop']]
list_2 = ['foo', 'bub']
flattened_list_1 = [
    element
    for sublist in list_1
    for element in sublist
]
occurrences = [
    phrase
    for phrase in flattened_list_1
    if any(word in phrase for word in list_2)
]
print(occurrences)
# output:
# ['foo bar']
import pandas as pd
import numpy as np
df = pd.DataFrame(data)
print(df[df.column_of_list.map(lambda x: np.isin(x, another_list).all())])
#OR
print(df[df[0].map(lambda x: np.isin(x, another_list).all())])
Try with real data:
import numpy as np
import pandas as pd
data = ["Knowledge Management", "modeling language"]
another_list=["modeling language","natural language"]
df = pd.DataFrame(data)
a = df[df[0].map(lambda x: np.isin(x, another_list).all())]
print(a)
Your list1 is a list of lists, because the csv.reader that you're using to create it always returns lists for each row, even if there's only a single item. (If you're expecting a single name from each row, I'm not sure why you're using csv here, it's only going to be a hindrance.)
Later when you check if str(j) in i as part of your filtering list comprehension, you're testing if the string j is present in the list i. Since the values in list2 are not full titles but key-phrases, you aren't going to find any matches. If you were checking in the inner strings, you'd get substring checks, but when you test list membership it must be an exact match.
Probably the best way to fix the problem is to do away with the nested lists in list1. Try creating it with:
with open('screen.csv') as f:
    list1 = [line.strip() for line in f]
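The difference between list membership and substring checks can be seen in a small sketch. It uses io.StringIO with made-up titles in place of screen.csv, so the data and the names fake_file and keywords are assumptions for illustration only.

```python
import csv
import io

# Made-up stand-in for screen.csv: one article title per row.
fake_file = io.StringIO(
    "Knowledge Management in practice\n"
    "A modeling language for robots\n"
    "Unrelated title\n"
)
rows = list(csv.reader(fake_file))  # each row is a list, e.g. ['Unrelated title']
keywords = ["Knowledge Management", "modeling language"]

# List membership needs an exact match with a whole element, so nothing is found.
by_membership = [row for row in rows if any(k in row for k in keywords)]

# Substring test against the title string inside each row.
by_substring = [row[0] for row in rows if any(k in row[0] for k in keywords)]

print(by_membership)
print(by_substring)
```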
list1 = ['/mnt/1m/a_pre.geojson','/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
I have multiple lists, and I want to find all the elements of list1 that have no matching entry in list2 under a filtering condition.
The condition is that the 'm' directory (1m, 2m, ...) must match, and so must the name of the geojson file excluding the 'pre' or 'post' substring.
For example, '/mnt/1m/a_pre.geojson' in list1 has been processed but '/mnt/2m/b_pre.geojson' has not, so the output should be the list ['/mnt/2m/b_pre.geojson'].
I am using two for loops and then splitting the strings, but I am sure this is not the only way and there might be an easier one.
for i in list1:
    for j in list2:
        pre_tile = i.split("/")[-1].split('_pre', 1)[0]
        post_tile = j.split("/")[-1].split('_post', 1)[0]
        if pre_tile == post_tile:
            ...
I assume your file paths all share the same layout in their first part, so that the first seven characters (such as '/mnt/1m') identify the directory. If so, you can try this:
list1 = ['/mnt/1m/a_pre.geojson','/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
res = [x for x in list1 if x[:7] not in [y[:7] for y in list2]]
res:
['/mnt/2m/b_pre.geojson']
If I understand you correctly, using a regular expression to do this kind of string manipulation can be fast and easy.
Additionally, to do multiple member-tests in list2, it's more efficient to convert the list to a set.
import re
list1 = ['/mnt/1m/a_pre.geojson', '/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
# escape the dot so that it matches a literal '.' only
pattern = re.compile(r'(.*?/[0-9]m/.*?)_pre\.geojson')
set2 = set(list2)
result = [
    m.string
    for m in map(pattern.fullmatch, list1)
    if m and f"{m[1]}_post.geojson" not in set2
]
print(result)
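Another way to express the matching condition is to reduce each path to a key made of its 'm' directory and its file name without the _pre or _post suffix, and then compare keys. This is only a sketch, under the assumption that every path ends in <dir>/<name>_pre.geojson or <dir>/<name>_post.geojson; the helper tile_key is hypothetical.

```python
from pathlib import PurePosixPath

list1 = ['/mnt/1m/a_pre.geojson', '/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']

def tile_key(path, suffix):
    """Return (parent directory name, file stem without the given suffix)."""
    p = PurePosixPath(path)
    stem = p.stem  # e.g. 'a_pre'
    if stem.endswith(suffix):
        stem = stem[:-len(suffix)]
    return (p.parent.name, stem)

# Keys of everything that already has a post file.
processed = {tile_key(p, '_post') for p in list2}

# Keep the pre files whose key has no counterpart among the post files.
unprocessed = [p for p in list1 if tile_key(p, '_pre') not in processed]
print(unprocessed)
```

Unlike comparing a fixed-length prefix, this compares both the directory and the file name, at the cost of a little more code.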
I web scraped some stock tickers off a website, and the text inside the span tags looks like '\xa0AYTU\xa0', for example. I'm trying to remove '\xa0' from either side of the ticker using replace('xa0',''). However, when I append the result to the list after replacing the characters, the list still contains '\xa0AYTU\xa0' no matter what.
Here is my for loop in question.
fu_tickers = []
for t in match_fu.find_all('span'):
    temp = str(t.text)
    temp2 = temp.replace('xa0','')
    fu_tickers.append(temp2)
print(fu_tickers)
When I insert print(temp2) inside the for loop I can see it properly removes the characters but for some reason will not append temp2 string to the fu_tickers list with the characters removed.
Current results = ['\xa0AYTU\xa0', '\xa0CETX\xa0', '\xa0CHFS\xa0']
Desired results = ['AYTU', 'CETX', 'CHFS']
Add a backslash to 'xa0' in str.replace, so that it matches the single non-breaking-space character '\xa0':
lst = ['\xa0AYTU\xa0', '\xa0CETX\xa0', '\xa0CHFS\xa0']
lst = [i.replace('\xa0', '') for i in lst]
print(lst)
Prints:
['AYTU', 'CETX', 'CHFS']
Or use str.strip:
lst = ['\xa0AYTU\xa0', '\xa0CETX\xa0', '\xa0CHFS\xa0']
lst = [i.strip() for i in lst]
print(lst)
Prints:
['AYTU', 'CETX', 'CHFS']
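The reason the original call appeared to do nothing can be checked directly: 'xa0' is three ordinary characters, while '\xa0' is a single non-breaking space. The sketch below uses a literal string in place of the scraped text, which is an assumption.

```python
ticker = '\xa0AYTU\xa0'  # stand-in for t.text from the scraped page

print(len('xa0'), len('\xa0'))   # three characters versus one
print('xa0' in ticker)           # the three-letter text never occurs in the ticker

unchanged = ticker.replace('xa0', '')   # nothing to replace
cleaned = ticker.replace('\xa0', '')    # non-breaking spaces removed
print(repr(unchanged))
print(repr(cleaned))

# '\xa0' counts as whitespace, which is why str.strip() also works.
print('\xa0'.isspace())
```

Printing a single string displays the non-breaking space as blank space, whereas printing a list shows the repr of each element, including the '\xa0' escape. That likely explains why print(temp2) looked clean inside the loop while the list did not.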
I have a single list that could be any amount of elements.
['jeff','ham','boat','','my','name','hello']
How do I split this one list into two lists or any amount of lists depending on blank string elements?
All these lists can then be put into one list of lists.
If you are certain that there is only one blank string in the list, you can use list.index to find the index of the blank string, and then slice the list accordingly:
index = lst.index('')
[lst[:index], lst[index + 1:]]
If there could be more than one blank string in the list, you can use itertools.groupby like this:
lst = ['jeff','ham','boat','','my','name','hello','','hello','world']
from itertools import groupby
print([list(g) for k, g in groupby(lst, key=bool) if k])
This outputs:
[['jeff', 'ham', 'boat'], ['my', 'name', 'hello'], ['hello', 'world']]
Using itertools.groupby, you can do:
from itertools import groupby
lst = ['jeff','ham','boat','','my','name','hello']
[list(g) for k, g in groupby(lst, key=bool) if k]
# [['jeff', 'ham', 'boat'], ['my', 'name', 'hello']]
Using bool as grouping key function makes use of the fact that the empty string is the only non-truthy string.
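To see what groupby produces before the filtering step, the following sketch prints each key together with its group. The names runs and chunks are illustrative.

```python
from itertools import groupby

lst = ['jeff', 'ham', 'boat', '', 'my', 'name', 'hello']

# groupby yields one (key, group) pair per run of consecutive items
# that share the same key; here the key is bool(item).
runs = [(key, list(group)) for key, group in groupby(lst, key=bool)]
for key, group in runs:
    print(key, group)

# Keeping only the runs whose key is True drops the blank separators.
chunks = [group for key, group in runs if key]
print(chunks)
```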
This is one approach using a simple iteration.
Ex:
myList = ['jeff','ham','boat','','my','name','hello']
result = [[]]
for i in myList:
    if not i:
        result.append([])
    else:
        result[-1].append(i)
print(result)
Output:
[['jeff', 'ham', 'boat'], ['my', 'name', 'hello']]
Let list_string be your list. This should do the trick:
list_of_list = [[]]
for i in list_string:
    if len(i) > 0:
        list_of_list[-1].append(i)
    else:
        list_of_list.append([])
Basically, you create a list of lists and go through your original list of strings. Each time you encounter a word, you put it in the last list of your list of lists, and each time you encounter '', you append a new empty list. The output for your example would be:
[['jeff','ham','boat'],['my','name','hello']]
I'm not sure that this is what you're trying to do, but try:
my_list = ['jeff','ham','boat','','my','name','','hello']
list_tmp = list(my_list)
final_list = []
while '' in list_tmp:
    idx = list_tmp.index('')
    final_list.append(list_tmp[:idx])
    list_tmp = list_tmp[idx + 1:]
final_list.append(list_tmp)  # keep the items after the last blank string
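The loop-based answers can also be wrapped in a reusable function. The sketch below is one possible variant; the name split_on_blank is hypothetical, and unlike the loops above it skips the empty chunks that leading, trailing, or consecutive blank strings would otherwise produce, which is a design assumption rather than something the question asked for.

```python
def split_on_blank(items):
    """Split items into sublists at empty strings, skipping empty chunks."""
    chunks = []
    current = []
    for item in items:
        if item:
            current.append(item)
        elif current:
            # A blank string closes the current chunk, if it has any items.
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

print(split_on_blank(['jeff', 'ham', 'boat', '', 'my', 'name', '', 'hello']))
```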
So I have a long list of column headers. All are strings, some are several words long. I've yet to find a way to write a function that extracts the first word from each value in the list and returns a list of just those singular words.
For example, this is what my list looks like:
['Customer ID', 'Email','Topwater -https:', 'Plastics - some uml']
And I want it to look like:
['Customer', 'Email', 'Topwater', 'Plastics']
I currently have this:
def first_word(cur_list):
    my_list = []
    for word in cur_list:
        my_list.append(word.split(' ')[:1])
and it returns None when I run it on a list.
You can use a list comprehension to build a list of the first word of each string after splitting on whitespace.
my_list = [x.split()[0] for x in your_list]
To address "and it returns None when I run it on a list":
You didn't return my_list. The function builds a new list and does not modify the original cur_list, so without a return statement the call evaluates to None.
To extract the first word from every value in a list:
As @dfundako suggests, you can simplify it to
my_list = [x.split()[0] for x in cur_list]
The final code would be
def first_word(cur_list):
    my_list = [x.split()[0] for x in cur_list]
    return my_list
Here is a demo. Please note that some punctuation may be left behind especially if it is right after the last letter of the name:
names = ["OMG FOO BAR", "A B C", "Python Strings", "Plastics: some uml"]
first_word(names) would be ['OMG', 'A', 'Python', 'Plastics:']
>>> l = ['Customer ID', 'Email','Topwater -https://karls.azureedge.net/media/catalog/product/cache/1/image/627x470/9df78eab33525d08d6e5fb8d27136e95/f/g/fgh55t502_web.jpg', 'Plastics - https://www.bass.co.za/1473-thickbox_default/berkley-powerbait-10-power-worm-black-blue-fleck.jpg']
>>> list(next(zip(*map(str.split, l))))
['Customer', 'Email', 'Topwater', 'Plastics']
[column.split(' ')[0] for column in my_list] should do the trick.
and if you want it in a function:
def first_word(my_list):
    return [column.split(' ')[0] for column in my_list]
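One caveat when indexing the result of split() called with no arguments is that an empty or whitespace-only header produces an empty list, so [0] raises IndexError. A defensive variant might look like the sketch below; the name first_word_safe and the choice to return an empty string in that case are assumptions.

```python
def first_word_safe(headers):
    """Return the first whitespace-separated word of each header, or '' if none."""
    result = []
    for header in headers:
        words = header.split()
        # Fall back to an empty string when there is nothing to index.
        result.append(words[0] if words else '')
    return result

print(first_word_safe(['Customer ID', 'Email', 'Topwater -https:', '', '   ']))
```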
(?<=\d\d\d)\d* try using this in a loop to extract the words using regex
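A regex-based version could look like the sketch below. Note that it uses the pattern \w+, an assumption chosen to match the leading run of word characters; the lookbehind pattern quoted above matches digits that follow three digits, so it would not extract the first word from these headers.

```python
import re

headers = ['Customer ID', 'Email', 'Topwater -https:', 'Plastics - some uml']

first_words = []
for header in headers:
    # re.match anchors at the start of the string.
    match = re.match(r'\w+', header)  # leading run of letters, digits, underscores
    first_words.append(match.group(0) if match else '')

print(first_words)
```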