I have the following list:
print(sentences_fam)
>>>[['30973', 'ok'],
['3044', 'ok'],
['53690', 'fd', '65', 'ca'],
['36471', 'none','good','standing'],
['j6426', 'none'],
['500861', 'm', 'br'],
['j0076', 'none'],
['mf4422', 'ok'],
['jf1816', 'father', '64', 'ca'],
['500854', 'no', 'fam', 'none', 'hx'],
['54480n', 'none'],
['mf583', 'none'],
...]
print (len(sentences_fam))
>>> 1523613
The lists are of many different lengths and contain all sorts of different strings.
I am trying to remove all lists that contain the keyword 'none'. Based on the list above my desired output should look like this.
[['30973', 'ok'],
['3044', 'ok'],
['53690', 'fd', '65', 'ca'],
['500861', 'm', 'br'],
['mf4422', 'ok'],
['jf1816', 'father', '64', 'ca'],
...]
My list comprehension skills are still not so great so I'm not sure what to do. I have tried converting this list into a dataframe but I have had no luck because each string gets assigned an individual column and I have not found a good way of formatting the data again into a list of lists. I need that type of format to be able to pass the data to the word2vec library.
Basically the whole list is the body of text and each sublist is a sentence. Also please keep in mind that I will be needing to apply this to a large list so performance/efficiency might be important.
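A list comprehension that keeps only the sublists not containing 'none' does this in a single pass over the data:
filtered_list = [sublist for sublist in sentences_fam if "none" not in sublist]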
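As a quick check, here is the same comprehension on a small made-up sample. The 'nonesuch' row is hypothetical and is only there to show that the membership test compares whole tokens, not substrings:

```python
# Small hypothetical sample standing in for the full sentences_fam list.
sentences_fam = [
    ['30973', 'ok'],
    ['36471', 'none', 'good', 'standing'],
    ['j6426', 'none'],
    ['500861', 'm', 'br'],
    ['nonesuch', 'ok'],  # kept: 'none' is only part of a longer token here
]

# `in` on a list compares whole elements, so only an exact 'none' token matches.
filtered_list = [sublist for sublist in sentences_fam if 'none' not in sublist]
print(filtered_list)
# [['30973', 'ok'], ['500861', 'm', 'br'], ['nonesuch', 'ok']]
```

The result is still a list of lists, so it should be usable wherever the original list of token lists was.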
I have a line that looks like this:
Amount:Category:Date:Description:55544355
My requirement is to find a sequence of two characters, followed later by that same sequence of two characters, followed later by that same sequence of two characters again till all sequences are found. I achieved this as follows:
>>> import re
>>> my_str = 'Amount:Category:Date:Description:55544355'
>>> [item[0] for item in re.findall(r"((..)\2*)", my_str)]
['Am', 'ou', 'nt', ':C', 'at', 'eg', 'or', 'y:', 'Da', 'te', ':D', 'es', 'cr', 'ip', 'ti', 'on', ':5', '55', '44', '35']
This is obviously not the right output since the desired output is:
[[':D',':D'],['55','55'],['at', 'at']]
What am I doing wrong?
Would you please try the following:
import re
my_str = 'Amount:Category:Date:Description:55544355'
print(re.findall(r'(..)(?=.*?\1)', my_str))
Output:
['at', ':D', '55']
If you want to print all occurrences of the characters, another step is required.
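One possible version of that extra step is sketched below. It assumes that "all occurrences" means non-overlapping occurrences, which is what str.count() reports; the names repeated and occurrences are made up for this illustration:

```python
import re

my_str = 'Amount:Category:Date:Description:55544355'

# Two-character sequences that appear again later in the string.
repeated = re.findall(r'(..)(?=.*?\1)', my_str)

# Assumption: count non-overlapping occurrences of each pair with str.count()
# and repeat the pair that many times.
occurrences = [[pair] * my_str.count(pair) for pair in repeated]
print(occurrences)
# [['at', 'at'], [':D', ':D'], ['55', '55']]
```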
You have to use a lookahead with a backreference. To get both values, you can wrap the backreference also in a capture group which will be returned as a tuple by re.findall.
import re
print(re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355"))
Output
[('at', 'at'), (':D', ':D'), ('55', '55')]
If you want a list of lists:
import re
print([list(elem) for elem in re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355")])
Output
[['at', 'at'], [':D', ':D'], ['55', '55']]
I have some data that I have been trying to clean up. It is a list and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
The output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what is the best way to clean the data?
Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' in-between data, therefore just replacing each '\n' with another character (say '|') will give you as many of the new characters as there were '\n'.
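To make the one-for-one replacement visible, here is a minimal sketch on a shortened, made-up sample (not the full input above):

```python
# Shortened, hypothetical sample with runs of consecutive newlines.
s = '\nlondon\n\n18\n\n\n\n\njapan'

replaced = s.replace('\n', '|')
print(replaced)  # |london||18|||||japan

# Every newline became exactly one pipe, so runs of newlines become runs of pipes.
print(s.count('\n'), replaced.count('|'))  # 8 8
```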
The simplest approach is to use str.split() to get the non-blank data:
l = s.split()  # split() already returns a list and drops the empty pieces
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? The input as posted does not look like valid Python yet, so I have taken some liberties.
If your input was a string you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
I have a list of different types of strings, and from it I want to combine all the alphabetic strings into one single value.
For example:
['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']
Here, I want to get Aaron Appindangoye together.
You can access the two names by index:
items = ['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']
name = ' '.join(items[1:3])
print(name)
-> Aaron Appindangoye
See the section on lists in the Python documentation.
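If the positions of the names are not fixed, a different approach is to select the purely alphabetic strings with str.isalpha(). This is only a sketch: it assumes names contain letters only, so a name with a hyphen, space, or apostrophe would be skipped.

```python
items = ['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']

# Keep only the strings made up entirely of letters, then join them.
# Assumption: every alphabetic string in the list is part of the name.
name = ' '.join(item for item in items if item.isalpha())
print(name)  # Aaron Appindangoye
```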
I have a list comprised of strings that all follow the same format 'Name%Department%Age'
I would like to order the list by age, then name, then department.
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
after sorting would output:
['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
The closest I have found to what I want is the following (found here: How to sort a list by Number then Letter in python?)
import re
def sorter(s):
    # Raw string so that \d is passed to the regex engine unchanged.
    match = re.search(r'([a-zA-Z]*)(\d+)', s)
    return int(match.group(2)), match.group(1)
sorted(alist, key=sorter)
Out[13]: ['1', 'A1', '2', '3', '12', 'A12', 'B12', '17', 'A17', '25', '29', '122']
This however only sorted my layout of input by straight alphabetical.
Any help appreciated,
Thanks.
You are on the right track.
Personally, I:
would first use str.split() to chop the string up into its constituent parts;
would then make the sort key produce a tuple that reflects the desired sort order.
For example:
def key(name_dept_age):
    name, dept, age = name_dept_age.split('%')
    # Negating the age sorts it in descending order; name and dept stay ascending.
    return -int(age), name, dept
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
print(sorted(alist, key=key))
Use name, department, age = item.split('%') on each item.
Make a dict out of them {'name': name, 'department': department, 'age': age}
Then sort them using this code
https://stackoverflow.com/a/1144405/277267
sorted_items = multikeysort(items, ['-age', 'name', 'department'])
Experiment once with that multikeysort function, you will see that it will come in handy in a couple of situations in your programming career.
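The multikeysort helper lives in the linked answer and is not reproduced here. As a stand-in, the same ordering over a list of dicts can be sketched with the built-in sorted() and a tuple key; to_record and the variable names below are made up for this illustration:

```python
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30',
         'John%English%31', 'George%Maths%30']

def to_record(item):
    """Split 'Name%Department%Age' into a dict (hypothetical helper)."""
    name, department, age = item.split('%')
    return {'name': name, 'department': department, 'age': int(age)}

records = [to_record(item) for item in alist]

# Age descending (negated), then name and department ascending.
sorted_records = sorted(
    records,
    key=lambda r: (-r['age'], r['name'], r['department']),
)

# Rebuild the original string format from the sorted dicts.
result = ['%'.join([r['name'], r['department'], str(r['age'])])
          for r in sorted_records]
print(result)
# ['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
```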
I have a pandas data frame where I need to extract sub-string from each row of a column based on the following conditions
We have start_list ('one','once I','he') and end_list ('fine','one','well').
The sub-string should be preceded by any of the elements of the start_list.
The sub-string may be succeeded by any of the elements of the end_list.
When any of the elements of the start_list is available then the succeeding sub string should be extracted with/without the presence of the elements of the end_list.
Example Problem:
import pandas as pd

df = pd.DataFrame({'a': ['one was fine today', 'we had to drive', ' ',
                         'I think once I was fine eating ham ',
                         'he studies really well and is polite ',
                         'one had to live well and prosper',
                         '43948785943one by onej89044809', '827364hjdfvbfv',
                         '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
                         'they is one who makes me crazy'],
                   'b': ['11', '22', '33', '44', '55', '66', '77', '', '88', '99']})
Expected Result:
df = pd.DataFrame({'a': ['was', '', '', 'was ', 'studies really', 'had to live',
                         'by', '', 'kfnv dkfjn uuoiu', 'who makes me crazy'],
                   'b': ['11', '22', '33', '44', '55', '66', '77', '', '88', '99']})
I think this should work for you. This solution requires pandas, of course, and also the standard-library module functools.
Function: remove_preceders
This function takes as input a collection of words start_list and str string. It looks to see if any of the items in start_list are in string, and if so returns only the piece of string that occurs after said items. Otherwise, it returns the original string.
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            # Keep only the text after the first occurrence of this word.
            string = string[string.find(word) + len(word):]
    return string
Function: remove_succeeders
This function is very similar to the first, except it returns only the piece of string that occurs before the items in end_list.
def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            # Keep only the text before the first occurrence of this word.
            string = string[:string.find(word)]
    return string
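To see what the two functions do when chained, here is a small sketch on a single string from the example data. It repeats both definitions so that it runs on its own; the final strip() is an extra step assumed here, since the slicing leaves the surrounding spaces in place.

```python
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string

sample = 'one was fine today'
after_start = remove_preceders(('one', 'once I', 'he'), sample)
substring = remove_succeeders(('fine', 'one', 'well'), after_start)

print(repr(after_start))        # ' was fine today'
print(repr(substring))          # ' was '
print(repr(substring.strip()))  # 'was'
```

Note that both functions use plain substring matching, so a short entry such as 'he' will also match inside longer words.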
Function: to_apply
How do you actually run the above functions? The apply method allows you to run complex functions on a DataFrame or Series, but it will then look for as input either a full row or single value, respectively (based on whether you're running on a DF or S).
This function takes as input a function to run & a collection of words to check, and we can use it to run the above two functions:
import functools

def to_apply(func, words_to_check):
    # Pre-fill the first argument so the result takes only the string.
    return functools.partial(func, words_to_check)
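For readers new to functools.partial, here is a minimal sketch of what it does. The contains_any helper is hypothetical and exists only for this illustration: partial fixes the first argument, leaving a one-argument callable of the kind apply expects.

```python
import functools

def contains_any(words, string):
    """Hypothetical helper: True if any of the words occurs in the string."""
    return any(word in string for word in words)

# Fix the first argument; the result now takes just the string.
has_start = functools.partial(contains_any, ('one', 'once I', 'he'))

print(has_start('one was fine today'))  # True
print(has_start('we had to drive'))     # False
```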
How to Run
df['no_preceders'] = df.a.apply(
    to_apply(remove_preceders,
             ('one', 'once I', 'he'))
)
df['no_succeders'] = df.a.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
df['substring'] = df.no_preceders.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
And then there's one final step to remove the items from the substring column that were not affected by the filtering:
def final_cleanup(row):
    # An unchanged length means neither filter touched this row.
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']
df['substring'] = df.apply(final_cleanup, axis=1)
Hope this works.