Hello, I have a list as follows:
['2925729', 'Patrick did not shake our hands nor ask our names. He greeted us promptly and politely, but it seemed routine.']
My goal is a result as follows:
['2925729','Patrick did not shake our hands nor ask our names'], ['2925729', 'He greeted us promptly and politely, but it seemed routine.']
Any pointers would be very much appreciated.
>>> t = ['2925729', 'Patrick did not shake our hands nor ask our names. He greeted us promptly and politely, but it seemed routine.']
>>> [[t[0], a.strip() + '.'] for a in t[1].rstrip('.').split('.')]
[['2925729', 'Patrick did not shake our hands nor ask our names.'], ['2925729', 'He greeted us promptly and politely, but it seemed routine.']]
If you have a large dataset and want to conserve memory, you may want to create a generator instead of a list:
g = ([t[0], a.strip() + '.'] for a in t[1].rstrip('.').split('.'))
for key, sentence in g:
    pass  # do processing
Generators do not create lists all at once. They create each element as you access it. This is only helpful if you don't need the whole list at once.
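As a rough sketch of the generator idea applied to several records at once, the snippet below wraps the split in a generator function. The name split_records and the choice to split on whitespace that follows '.', '!' or '?' are assumptions made for illustration, not part of the original answer.

```python
import re

def split_records(records):
    """Yield [key, sentence] pairs lazily, one sentence at a time."""
    for key, text in records:
        # Split on whitespace that follows sentence-ending punctuation,
        # which keeps the punctuation attached to each sentence.
        for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
            if sentence:
                yield [key, sentence]

records = [
    ['2925729', 'Patrick did not shake our hands nor ask our names. '
                'He greeted us promptly and politely, but it seemed routine.'],
]
for key, sentence in split_records(records):
    print(key, sentence)
```

Because the function yields one pair at a time, only the current sentence needs to be held in memory while you process it.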
ADDENDUM: You asked about making dictionaries if you have multiple keys:
>>> data = ['1', 'I think. I am.'], ['2', 'I came. I saw. I conquered.']
>>> {key: [s.strip() for s in text.rstrip('.').split('.')] for key, text in data}
{'1': ['I think', 'I am'], '2': ['I came', 'I saw', 'I conquered']}
I am a middle school student studying Python. Is there a way to shuffle a list while leaving certain elements out of the shuffle?
Input list
['Hello', 'Middle school student', 'I am']
Expected output
['Middle school student', 'Hello', 'I am']
If you specify 'I am', everything except 'I am' is shuffled.
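Here is a simple, efficient shuffle (the Fisher-Yates algorithm). You swap each element with a randomly chosen element at or after its own position, which makes every ordering equally likely.
import random
def shuffle(lst):
    for i in range(len(lst) - 1):
        # Pick j from the part of the list that is not yet fixed, including i itself.
        j = random.randrange(i, len(lst))
        lst[i], lst[j] = lst[j], lst[i]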
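To cover the part of the question about leaving some elements alone, here is a minimal sketch. It assumes that "omitting" an element means it keeps its original position while the others are shuffled around it; the function name shuffle_except and its keep parameter are hypothetical.

```python
import random

def shuffle_except(items, keep):
    """Return a shuffled copy of items; elements found in keep stay put."""
    result = list(items)
    # Indices of the elements that are allowed to move.
    movable = [i for i, item in enumerate(result) if item not in keep]
    values = [result[i] for i in movable]
    random.shuffle(values)  # standard-library in-place shuffle
    # Write the shuffled values back into the movable positions only.
    for i, value in zip(movable, values):
        result[i] = value
    return result

words = ['Hello', 'Middle school student', 'I am']
print(shuffle_except(words, keep={'I am'}))
```

With only two movable elements, the result is sometimes identical to the input; that is expected from a fair shuffle.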
I have a large dataset all_transcripts with conversations and I have a small list gemeentes containing names of different cities. In all_transcripts, I want to replace each instance in which the name of a city is given, by 'woonplaats' (Dutch for city).
To do so, I have the following code:
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace('|'.join(gemeentes),' woonplaats ')
However, this replaces every occurrence of a city name, including occurrences inside longer words, rather than whole words only.
What I'm looking for is something like:
all_transcripts['filtered'] = all_transcripts['no_punc'].re.sub('|'r"\b{}\b".format(join(gemeentes)),' woonplaats ')
But this doesn't work.
As an example, I have:
all_transcripts['no_punc'] = ['i live in amsterdam', 'i come from haarlem', 'groningen is her favourite city']
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
The output that I want, after I run the code is as follows:
>>> ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
I have worked with the '\b' option of regex before, but I don't know how to apply it here. I could run a for loop over each word in gemeentes and apply it to the whole dataset. But given the sizes involved (gemeentes has over 300 entries and all_transcripts over 2.5 million rows), this would be very computationally expensive. I would therefore like an approach similar to the one above, in which I replace a string using the OR operator.
It looks like you're close, but you'll want to change your re.sub call a little. Something like this should work:
import re
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, gemeentes))))
all_transcripts['filtered'] = [pattern.sub("woonplaats", s) for s in all_transcripts['no_punc']]
Output
all_transcripts['filtered'] = ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
As for performance, I'm not sure this will be any faster than a traditional for-loop, as you still have to loop over the 2.5 million entries and apply the regex to each one.
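For a self-contained illustration of the whole-word replacement, here is a sketch that uses only the standard library. The helper name mask_cities and the extra test row 'amsterdammers are friendly' are made up for the example; sorting the names longest-first and escaping them are precautions I added, not something the question requires.

```python
import re

gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']

# Longest names first so a longer name wins over any shorter one it contains;
# re.escape guards against names containing regex metacharacters.
alternatives = sorted(map(re.escape, gemeentes), key=len, reverse=True)
city_pattern = re.compile(r"\b(?:{})\b".format("|".join(alternatives)))

def mask_cities(text):
    """Replace every whole-word city name with 'woonplaats'."""
    return city_pattern.sub("woonplaats", text)

rows = ['i live in amsterdam', 'i come from haarlem',
        'groningen is her favourite city', 'amsterdammers are friendly']
print([mask_cities(row) for row in rows])
```

Compiling the pattern once and reusing it avoids rebuilding the alternation for every row, which is the main saving available when the same pattern is applied to millions of strings.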
If you are using a pandas DataFrame, you can use the following:
import pandas as pd
all_transcripts['filtered'] = all_transcripts['no_punc'].replace([r'\b{}\b'.format(g) for g in gemeentes], "woonplaats", regex=True)
How do you get the sum of each nested list for each key in the dictionary below?
Let's say the following below is called msgs
I tried the following code:
I ended up getting the result:
It is almost right but for some reason the sum of the first nested list is incorrect, being 0 whereas it should be 19. I have a feeling this has to do with the total = 0 part in the above code I wrote but I am not sure if this is the case and I don't know how to fix the issue.
The way I got the values in the nested list was I summed the number of strings in each index of the nested list. So for instance, this here was for the first key. As you can see, there are 15 entries in the first one and 4 in the second one.
(this dictionary is called 'kakao' in my code)
{'Saturday, July 28, 2018': [['hey', 'ben', 'u her?', 'here?', 'ok so basically', 'farzam and avash dont wanna go to vegas', 'lol', 'im offering a spontaneous trip me and you to SF', 'lol otherwise ill just go back to LA', 'i mean sf is far but', 'i mean if u really wanna hhah', 'we could go and see chris', 'but otherwise its fine', 'alright send me the code too', 'im on my way right now'], ['Wtf is happening lol', '8 haha', 'Key is #8000', 'Hf']]}
The code I used to get the sums as a nested list was:
kakao = {
    'Saturday, July 28, 2018': [
        ['hey', 'ben', 'u her?', 'here?', 'ok so basically',
         'farzam and avash dont wanna go to vegas', 'lol', 'im offering a spontaneous trip me and you to SF',
         'lol otherwise ill just go back to LA', 'i mean sf is far but', 'i mean if u really wanna hhah',
         'we could go and see chris', 'but otherwise its fine', 'alright send me the code too', 'im on my way right now'],
        ['Wtf is happening lol', '8 haha', 'Key is #8000', 'Hf'],
    ],
    'Friday, August 3, 2018': [['Someone', 'said', 'something'], ['Just', 'test']],
}
# Add up the number of messages in every nested list for each day.
print({key: [sum(len(msgs) for msgs in val)] for key, val in kakao.items()})
# the result --> {'Saturday, July 28, 2018': [19], 'Friday, August 3, 2018': [5]}
I guess you want to count the messages sent on the same day; I hope this code helps.
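If the per-list counts are also needed, not just the daily total, a small sketch along these lines may help. The data is a shortened sample: the Friday entry comes from the example above, while the 'Saturday, August 4, 2018' entry is invented purely for illustration.

```python
# Shortened sample data; the second key is a made-up example.
kakao = {
    'Friday, August 3, 2018': [['Someone', 'said', 'something'], ['Just', 'test']],
    'Saturday, August 4, 2018': [['hi'], [], ['ok', 'bye']],
}

# Number of messages in each nested list, per day.
counts = {day: [len(msgs) for msgs in groups] for day, groups in kakao.items()}

# Total number of messages per day, built from the per-list counts.
totals = {day: sum(per_list) for day, per_list in counts.items()}

print(counts)
print(totals)
```

Building the totals inside a comprehension also sidesteps the kind of bug suspected in the question, where a running total variable is reset or reused at the wrong point in a loop.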
I am looking for a way to create several lists of keywords, extract the keywords that appear in the user's input, and match them with a response.
User Input: This is a good day I am heading out for a jog.
List 1 : Keywords : good day, great day, awesome day, best day.
List 2 : Keywords : a run, a swim, a game.
But for a huge database of words, can this be linked to just the lists? Or does it need to be specific words?
Also would you recommend Python for a huge database of keywords?
The first thing to do is to break the input string up into tokens. A token is just a piece of the string that you want to match. In your case, it looks like your token size is 2 words (but it doesn't have to be). You might also want to strip all punctuation from the input string as well.
Then for your input, your tokens are
['This is', 'is a', 'a good', 'good day', 'day I', 'I am', 'am heading', 'heading out', 'out for', 'for a', 'a jog']
Then you can iterate over the tokens and check to see if they're contained in each one of the lists. Might look like this:
list1 = ['good day', 'great day', 'awesome day', 'best day']
list2 = ['a run', 'a swim', 'a game']
text = 'This is a good day I am heading out for a jog'  # avoid shadowing the built-in input()
words = text.split()
tokens = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
for token in tokens:
    if token in list1:
        print('{} is in list1'.format(token))
    if token in list2:
        print('{} is in list2'.format(token))
One thing you will likely want to do to optimize this is to use sets for list1 and list2, instead of lists.
set1 = set(list1)
set2 = set(list2)
Sets offer O(1) average-case lookups, as opposed to O(n) for lists, which matters a great deal if your keyword lists are large.
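Putting the tokenizing and the set lookup together might look like the sketch below. The helper names ngrams and find_keywords and the keyword_sets dictionary are hypothetical; stripping punctuation with string.punctuation is one possible choice, not the only one.

```python
import string

def ngrams(text, n):
    """Return all runs of n consecutive words, with punctuation stripped."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    words = cleaned.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Keyword lists from the question, stored as sets for fast membership tests.
keyword_sets = {
    'list1': {'good day', 'great day', 'awesome day', 'best day'},
    'list2': {'a run', 'a swim', 'a game'},
}

def find_keywords(text, n=2):
    """Map each list name to the n-word tokens of text that it contains."""
    tokens = ngrams(text, n)
    return {name: [t for t in tokens if t in keywords]
            for name, keywords in keyword_sets.items()}

print(find_keywords('This is a good day I am heading out for a jog.'))
```

Keeping the sets in a dictionary means adding another keyword list only requires a new entry, not another if-statement in the loop.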
So say I have a string such as:
Hello There what have You Been Doing.
I am Feeling Pretty Good and I Want to Keep Smiling.
I'm looking for the result:
['Hello There', 'You Been Doing', 'I am Feeling Pretty Good and I Want to Keep Smiling']
After a long time of head scratching which later evolved into head slamming, I turned to the internet for my answers. So far, I've managed to find the following:
r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)"
The above works, but it clearly does not allow for 'and', 'to', 'for', 'am' (these are the only ones I'm looking for) to be in the middle of the words, and I cannot figure out how to add that in there. I'm assuming I have to use the pipe to do that, but where exactly do I put that group?
I've also tried the answers over here, but they didn't end up working for me either.
If you are able to enumerate the words you're ok with being uncapitalized in the middle of a capitalized sentence, I would use an alternation to represent them:
\b(?:and|or|but|to|am)\b
Then use that alternation to match a sequence of capitalized words and accepted uncapitalized words, which must start with a capitalized word:
[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|(?:and|or|but|to|am)\b))*
If you are ok with any word of three letters or fewer (including words like 'owl' or 'try', but not words like 'what') being uncapitalized, you can use the following:
[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|[a-z]{1,3}\b))*
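To try the enumerated-words pattern on the sample text from the question, something like the following should do. The variable names are mine, and the two sample sentences are joined with a single space here as an assumption about the input.

```python
import re

text = ("Hello There what have You Been Doing. "
        "I am Feeling Pretty Good and I Want to Keep Smiling.")

# A capitalized word, followed by any mix of capitalized words and the
# explicitly allowed lowercase connectors.
allowed = r"(?:and|or|but|to|am)\b"
pattern = re.compile(r"[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|" + allowed + r"))*")

matches = pattern.findall(text)
print(matches)
```

Since the pattern contains only non-capturing groups, findall returns the full text of each match as a plain list of strings.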
An alternative with itertools.groupby comes close, although it does not split at sentence boundaries and treats 'am' as a break:
from itertools import groupby
s = 'Hello There what have You Been Doing. I am Feeling Pretty Good and I Want to Keep Smiling.'
[' '.join(g) for k, g in groupby(s.split(), lambda x: x[0].islower() and x not in ['and', 'to']) if not k]
Output:
['Hello There',
'You Been Doing. I',
'Feeling Pretty Good and I Want to Keep Smiling.']