I have two lists, one with ids and one with corresponding comments for each id.
list_responseid = ['id1', 'id2', 'id3', 'id4']
list_paragraph = [['I like working and helping them reach their goals.'],
['The communication is broken.',
'Information that should have come to me is found out later.'],
['Try to promote from within.'],
['I would relax the required hours to be available outside.',
'We work a late night each week.']]
The ResponseID 'id1' is related to the paragraph ('I like working and helping them reach their goals.') and so on.
I can flatten the paragraphs into a single list of sentences using:
import itertools
list_sentence = list(itertools.chain(*list_paragraph))
What would be the syntax to get, as the end result, a data frame (or list) with a separate entry for each sentence and the ID associated with that sentence (the ID of the paragraph it came from)? The final result would look like this (I will convert the list to a pandas data frame at the end).
id1 'I like working and helping them reach their goals.'
id2 'The communication is broken.'
id2 'Information that should have come to me is found out later.'
id3 'Try to promote from within.'
id4 'I would relax the required hours to be available outside.'
id4 'We work a late night each week.'
Thanks.
If you do this often, it would be clearer (and probably more efficient, depending on the size of the arrays) to make a dedicated function for it with two regular nested loops; a sketch follows the length check below. But if you need a quick one-liner (it does just that):
id_sentence_tuples = [(list_responseid[id_list_idx], sentence) for id_list_idx in range(len(list_responseid)) for sentence in list_paragraph[id_list_idx]]
id_sentence_tuples will then be a list of tuples where each element is a pair like (paragraph_id, sentence), just as in the result you expect.
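For the sample lists above, this produces:
[('id1', 'I like working and helping them reach their goals.'),
 ('id2', 'The communication is broken.'),
 ('id2', 'Information that should have come to me is found out later.'),
 ('id3', 'Try to promote from within.'),
 ('id4', 'I would relax the required hours to be available outside.'),
 ('id4', 'We work a late night each week.')]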
Also, I would advise you to check that both lists have the same length before doing this, so that if they don't you get a meaningful error:
if len(list_responseid) != len(list_paragraph):
    raise IndexError('Lists must have same cardinality')
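For completeness, a minimal sketch of the dedicated-function version with the two nested loops (the function name is mine; the output is the same as the one-liner):
def pair_ids_with_sentences(response_ids, paragraphs):
    # emit one (id, sentence) pair per sentence, walking both lists in parallel
    if len(response_ids) != len(paragraphs):
        raise IndexError('Lists must have same cardinality')
    pairs = []
    for response_id, sentences in zip(response_ids, paragraphs):
        for sentence in sentences:
            pairs.append((response_id, sentence))
    return pairs

id_sentence_tuples = pair_ids_with_sentences(list_responseid, list_paragraph)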
I had a dataframe with an ID and a review (cols = ['ID', 'Review']). If you can combine your lists into a dataframe like that, then you can use my approach: I split the reviews into sentences using nltk and then linked the IDs back within the loop. Following is the code that you can use.
## Breaking feedback into sentences
import nltk
import pandas as pd

# nltk.download('punkt')  # required once for sent_tokenize

count = 0
rows = []
for index, row in df.iterrows():
    feedback = row['Review']
    sent_text = nltk.sent_tokenize(feedback)  # this gives us a list of sentences
    for sentence in sent_text:
        rows.append({'ID': row['ID'], 'Count': int(count), 'Sentence': sentence})
        count += 1
# build the frame once at the end; DataFrame.append was removed in recent pandas
df_sentences = pd.DataFrame(rows)
print(df_sentences)
Problem:
One day, Mary bought a one-way ticket from somewhere to somewhere, with some flight transfers in between.
For example: SFO->DFW DFW->JFK JFK->MIA MIA->ORD.
Obviously, transferring through the same city twice or more makes no sense, so Mary will not do that.
Unfortunately, after she received the tickets, she shuffled them and forgot the correct order.
Help Mary rearrange the tickets so that they are in the correct order.
Input:
The first line contains the number of test cases T, after which T cases follow.
Each case starts with an integer N; N flight tickets follow.
Each flight ticket is given on the next 2 lines: the source on one line and the destination on the next.
Output:
For each test case, output one line containing "Case #x: itinerary", where x is the test case number (starting from 1) and itinerary is the sorted list of flight tickets that represents the actual itinerary.
Each flight segment in the itinerary should be output as a pair of source-destination airport codes.
Sample Input:
2
1
SFO
DFW
4
MIA
ORD
DFW
JFK
SFO
DFW
JFK
MIA

Sample Output:
Case #1: SFO-DFW
Case #2: SFO-DFW DFW-JFK JFK-MIA MIA-ORD
My question:
I am a beginner in competitive programming. My question is how to interpret the given input in this case. How did Googlers program this input? When I write a function that takes a Python list as its argument, will that argument arrive in ready-to-use list form, or do I need to parse the T and N numbers out of the input myself and arrange the airport strings into a list before passing it to the function?
I have looked at the following official Google Kickstart Python solution to this problem and was confused by how they simply pass the ticket_list argument to the function. Don't they need to strip the numbers T and N from the input and arrange the airport strings into a list first, as I explained above?
Also, I could not understand how the attributes first and second can simply appear if no class has been initialized. But I think this should be another question...
def print_itinerary(ticket_list):
    arrival_map = {}
    destination_map = {}
    for ticket in ticket_list:
        arrival_map[ticket.second] += 1
        destination_map[ticket.first] += 1
    current = FindStart(arrival_map)
    while current in destination_map:
        next = destination_map[current]
        print current + "-" + next
        current = next
You need to implement it yourself to read data from standard input and write results to standard output.
Sample code for reading from standard input and writing to standard output can be found in the coding section of the FAQ on the KickStart Web site.
If you write the solution to this problem in Python, you can get T and N as follows:
T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ...
Then, if you want the source and destination of each flight ticket as a list, you can use the same input() call to collect them into a list:
ticket_list = [[input(), input()] for _ in range(N)]
# [['MIA', 'ORD'], ['DFW', 'JFK'], ['SFO', 'DFW'], ['JFK', 'MIA']]
If you want to use first and second, try a namedtuple:
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])
ticket_list = [Pair(input(), input()) for _ in range(N)]
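Putting the pieces together, a minimal end-to-end sketch (find_start and the itinerary walk here are my own reconstruction, not Google's exact code):
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])

def find_start(ticket_list):
    # The trip starts at the one source that never appears as a destination.
    destinations = {t.second for t in ticket_list}
    return next(t.first for t in ticket_list if t.first not in destinations)

def itinerary(ticket_list):
    # Map each source to its destination, then walk the chain from the start.
    destination_map = {t.first: t.second for t in ticket_list}
    current = find_start(ticket_list)
    segments = []
    while current in destination_map:
        nxt = destination_map[current]
        segments.append(current + "-" + nxt)
        current = nxt
    return " ".join(segments)

T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ticket_list = [Pair(input(), input()) for _ in range(N)]
    print("Case #%d: %s" % (t, itinerary(ticket_list)))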
The idea is to check whether or not specific words are present in a list of sentences. Also, as a second output, find integers that appear as strings.
Code to find the presence of words in a list:
import pandas as pd
import re
info = ['Crafting a compelling job description is essential to helping you attract the most qualified candidates for your job. With more than 25 million jobs listed on Indeed, a great job description can help your jobs stand out from the rest. Your job descriptions are where you start marketing your company and your job to your future hire.']
df = pd.DataFrame(info,columns=['One'])
df['New_Col'] = df.One.str.contains('jobs', flags = re.IGNORECASE, regex = True, na = False)
save = []
for i,e in enumerate(info):
save.append(e.isdigit())
df['New_Col2'] = save
Output:
One New_Col New_Col2
0 Crafting a compelling job description is essen... True False
Summary: ideally it would be nice to automate this so that I can just feed str.contains a list of words to look for (e.g. ['jobs', 'employment'] and so on), which can be done with the format function in a loop. However, I'm not a big fan of regex; applying a function would probably make more sense. All in all, any better way of tackling this issue would be beneficial.
You can do
code = "One Two Three \n59 results 46"
res = [int(s) for s in code.split() if s.isdigit()]
print(res)
result:
[59, 46]
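For the word-list automation mentioned in the summary above, one sketch is to join the keywords into a single alternation pattern and reuse str.contains on the df defined in the question (the word list here is illustrative):
import re

words = ['jobs', 'employment']  # any list of words to look for
pattern = '|'.join(re.escape(w) for w in words)  # 'jobs|employment'
df['New_Col'] = df['One'].str.contains(pattern, flags=re.IGNORECASE, regex=True, na=False)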
I have two dataframes, df and df1. I want to search for patterns in df based on the values given in df1. The DataFrames are given below:
import pandas as pd
data={"id":["I983","I873","I526","I721","I536","I327","I626","I213","I625","I524"],
"coltext":[ "I could take my comment back, I would do so in a second. I have addressed my teammates and coaches and while many understand my actions were totall", "We’re just trying to see if he can get on the field as a football player, and then we’ll make decision",
"TextNow offers low-cost, international calling to over 230 countries. Stay connected longer with rates starting at less than",
"Wi-Fi can provide you with added coverage in places where cell networks don't always work - like basements and apartments. No roaming fees for Wi-Fi connection",
"Send messages and make calls on your compute",
"even have a free, Wi-Fi only version of TextNow, available for download on you",
"the rest of the players accepted apologies this spring and are welcoming him back",
"was really looking at him and watching how much this really means to him and how much he really missed us",
"I’ll deal with the problem and I’ll remedy the problem",
"The first step was for him to be able to complete what we call our bottom line program which has been completed"]}
df=pd.DataFrame(data=data)
data1={"col1":["addressed teammates coaches","football player decision","watching really missed", "bottom line program","meassges make calls"],
"col2":["international calling over","download on you","rest players accepted","deal problem remedy","understand actions totall"],
"col3":["first step him","Wi-Fi only version","cell network works","accepted apologies","stay connected longer"]}
df1=pd.DataFrame(data=data1)
For example, the first element "addressed teammates coaches" from df1['col1'] lies within the first element of df['coltext'], and likewise I want to search for every element from every column of df1 in df['coltext']. If a pattern is found, create a third column in df.
Desired Output:
id coltext patternMatch
I983 I could take my comment back, col1, col2
I873 We’re just trying to see if he can col1
I526 TextNow offers low-cost, col3, col2
I721 Wi-Fi can provide you with col3
I536 Send messages and make calls col1
There may be more efficient ways; one way is as follows:
# create a dictionary from data1 with keys and values reversed
my_dict = {item: k for k, v in data1.items() for item in v}
# for each row of df, check which phrases have all their words in 'coltext'
df['patternMatch'] = df['coltext'].apply(
    lambda row: {v for k, v in my_dict.items()
                 if all(word in row for word in k.split())})
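A quick check on the sample data reproduces the desired sets for the first rows (note that the matching is case-sensitive substring containment):
print(df[['id', 'patternMatch']].head(2))
# I983 -> {'col1', 'col2'}
# I873 -> {'col1'}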
I have a dataframe column with documents like
38909 Hotel is an old style Red Roof and has not bee...
38913 I will never ever stay at this Hotel again. I ...
38914 After being on a bus for -- hours and finally ...
38918 We were excited about our stay at the Blu Aqua...
38922 This hotel has a great location if you want to...
Name: Description, dtype: object
I have a bag of words like keys = ['Hotel','old','finally'], but the actual length of keys is 44312.
Currently I'm using
df.apply(lambda x : sum([i in x for i in keys ]))
Which gives the following output based on sample keys
38909 2
38913 2
38914 3
38918 0
38922 1
Name: Description, dtype: int64
When I apply this to the actual data for just 100 rows, timeit gives
1 loop, best of 3: 5.98 s per loop
and I have 50000 rows. Is there a faster way of doing the same in nltk or pandas?
EDIT:
In case you are looking for the document array:
array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
"After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
"We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)
The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:
keyset = frozenset(keys)
df.apply(lambda x : len(keyset.intersection(x.split())))
Differences/limitations:
In your version a word is counted even if it is contained as a substring of a word in the document. For example, had your keys contained the word tyl, it would be counted due to the occurrence of "style" in your first document.
My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached. That can be fixed by preprocessing the documents (or postprocessing the result of split()) with a function that removes the punctuation, as sketched below.
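A minimal sketch of that preprocessing (str.translate with string.punctuation is just one reasonable choice):
import string

def tokenize(text):
    # drop punctuation before splitting, so 'again.' matches the key 'again'
    return text.translate(str.maketrans('', '', string.punctuation)).split()

keyset = frozenset(keys)
df.apply(lambda x: len(keyset.intersection(tokenize(x))))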
It seems you can just use np.char.count -
[np.count_nonzero(np.char.count(i, keys)) for i in arr]
Might be better to feed a boolean array for counting -
[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
If you only need to check whether the values of the list are present:
from numpy.core.defchararray import find
v = df['col'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print (a)
[2 1 1 0 0]
Or:
df = pd.concat([df['col'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print (df)
38909 2
38913 1
38914 1
38918 0
38922 0
dtype: int64
I tend to take notes quite regularly, and since the great tablet revolution I've been taking them electronically. I've been trying to see if I can find any patterns in the way I take notes, so I've put together a small hack to load the notes and filter out proper nouns and fluff, leaving a list of the key words I employ.
import os
import re
dr = os.listdir('/home/notes')
dr = [i for i in dr if re.search('.*txt$',i)]
ignore = ['A','a','of','the','and','in','at','our','my','you','your','or','to','was','will','because','as','also','is','eg','e.g.','on','for','Not','not']
words = set()
d1 = open('/home/data/en_GB.dic','r')
dic = d1.read().lower()
dic = re.findall('[a-z]{2,}',dic)
sdic = set(dic)
for i in dr:
    a = open(os.path.join('/home/notes', i), 'r')
    atmp = a.read()
    atmp = atmp.lower()
    atmp = re.findall('[a-z]{3,}', atmp)
    atmp = set(atmp)
    atmp.intersection_update(sdic)
    atmp.difference_update(set(ignore))
    words.update(atmp)
    a.close()
words = sorted(words)
I now have a list of about 15,000 words I regularly use while taking notes. It would be a little unmanageable to sort by hand, and I wondered if there is an open-source library of word lists along some meaning scale (positive-negative-neutral, optimistic-pessimistic-indifferent, or similar) that I could run the word list through.
In a perfect scenario I would also be able to run it through some kind of thesaurus, so I could group the words into meaning clusters and get a high-level view of what sense terms I've been employing most.
Does anyone know if there are any such lists out there, and if so, how would I go about employing them in Python?
Thanks
I found a list of words used for sentiment analysis of Twitter at: http://alexdavies.net/twitter-sentiment-analysis/
It includes example Python code for how to use it.
See also: Sentiment Analysis Dictionaries
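Once you have such a list, employing it in Python is mostly set membership. A minimal sketch, assuming each list is a plain-text file with one word per line (the filenames are placeholders for whatever you download):
# filenames below are placeholders for whichever word lists you download
with open('positive-words.txt') as f:
    positive = set(f.read().split())
with open('negative-words.txt') as f:
    negative = set(f.read().split())

# 'words' is the sorted word list built in the question above
sentiment = {}
for w in words:
    if w in positive:
        sentiment[w] = 'positive'
    elif w in negative:
        sentiment[w] = 'negative'
    else:
        sentiment[w] = 'neutral'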