Pandas - merge many rows into one - python

with this:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 3)
I print my dataset in this fashion:
lyrics,classification
0 "I should have known better with a girl like you
1 That I would love everything that you do
2 And I do, hey hey hey, and I do
3 Whoa, whoa, I
4 Never realized what I kiss could be
5 This could only happen to me
6 Can't you see, can't you see
7 That when I tell you that I love you, oh
8 You're gonna say you love me too, hoo, hoo, ho...
9 And when I ask you to be mine
10 You're gonna say you love me too
11 So, oh I never realized what I kiss could be
12 Whoa whoa I never realized what I kiss could be
13 You love me too
14 You love me too",0
But what I really need is to have everything between the quotes ("") in a single row. How do I make this conversion in pandas?

Solution that worked for OP (from comments):
Fixing the problem at its source (in read_csv):
@nbeuchat is probably right, just try:
dataset = pd.read_csv('lyrics.csv', quoting = 2)
That should give you a dataframe with one row and two columns: lyrics (with embedded line returns in the string) and classification (0).
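A minimal sketch of why fixing the read works, assuming the file wraps each song in double quotes as shown in the question. The sample text and the in-memory io.StringIO buffer are stand-ins for lyrics.csv, and the default comma delimiter and default quote handling are used here rather than quoting = 2:

```python
import io
import pandas as pd

# Stand-in for lyrics.csv: one quoted field that spans three lines
raw = 'lyrics,classification\n"line one\nline two\nline three",0\n'

# With the default comma delimiter and quote handling, newlines inside
# the quotes stay part of the field instead of starting new rows
dataset = pd.read_csv(io.StringIO(raw))

print(dataset.shape)
print(repr(dataset.loc[0, "lyrics"]))
```

The original call used delimiter = '\t' and quoting = 3 (no quote handling), so every physical line became its own row.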
General solution for collapsing series of strings:
You want to use pd.Series.str.cat:
import pandas as pd
dataset = pd.DataFrame({'lyrics':pd.Series(['happy birthday to you',
'happy birthday to you',
'happy birthday dear outkast',
'happy birthday to you'])})
dataset['lyrics'].str.cat(sep=' / ')
# 'happy birthday to you / happy birthday to you / happy birthday dear outkast / happy birthday to you'
The default sep is None, which would give you 'happy birthday to youhappy birthday to youhappy ...' so pick the sep value that works for you. Above I used slashes (padded with spaces) since that's what you typically see in quotations of songs and poems.
You can also try print(dataset['lyrics'].str.cat(sep='\n')) which maintains the line breaks but stores them all in one string instead of one string per line.
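When the frame holds lines from several songs, the same join can be done once per group. A sketch, assuming a hypothetical song_id column that is not part of the question's data:

```python
import pandas as pd

# song_id is a made-up grouping key for illustration
lines = pd.DataFrame({
    "song_id": [1, 1, 2, 2],
    "lyrics": ["happy birthday to you", "happy birthday dear outkast",
               "first line of song two", "second line of song two"],
})

# Join the lines of each song into one string, one row per song
merged = lines.groupby("song_id")["lyrics"].agg(" / ".join).reset_index()
print(merged)
```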

Related

Columnwise Summarize multiple sentences present in a list using the gensim summarizer

I have a data set consisting of faculty ids and the students' feedback on each faculty member. There are multiple comments for each faculty member, so the comments are stored as a list. I want to apply gensim summarization to the "comments" column of the data set to generate a summary of faculty performance according to the student feedback.
As a trial, I tried to summarize the feedback for the first faculty id. There are 8 distinct comments (sentences) in that particular feedback, but gensim still throws the error ValueError: input must have more than one sentence.
df_test.head()
csf_id comments
0 9 [' good subject knowledge.', ' he has good kn...
1 10 [' good knowledge of subject. ', ' good subjec...
2 11 [' good at clearing the concepts interactive w...
3 12 [' clears concepts very nicely interactive wit...
4 13 [' good teaching ability.', ' subject knowledg...
from gensim.summarization import summarize
text = df_test["comments"][0]
print("Text")
print(text)
print("Summary")
print(summarize(text))
ValueError: input must have more than one sentence
What changes should I make so that the summarizer reads all the sentences and summarizes them?
For gensim summarization, a newline or a full stop marks the end of a sentence.
from gensim.summarization.summarizer import summarize
summarize("punctual in time.")
This raises the same error, ValueError: input must have more than one sentence.
When there is text after the full stop, it is interpreted as more than one sentence:
summarize("punctual in time. good subject knowledge")
# The output is an empty string because the text is very short, but no error is raised
''
Coming to your problem: you need to join all the elements of the list into one string.
#example
import pandas as pd
df = pd.DataFrame([[["good subject."," punctual in time.","discipline person."]]], columns = ['comment'])
print(df)
comment
0 [good subject., punctual in time, discipline ...
df['comment'] = df['comment'].apply(' '.join)  # join with a space so the sentences stay separated
df['comment'].apply(summarize)  # this works, but keep in mind the text must be long enough to generate a summary
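Because the summarizer splits on full stops, a comment that lacks one would run into the next. A small sketch of a helper that normalizes the list before joining; join_comments is a hypothetical name, not part of gensim or pandas:

```python
def join_comments(comments):
    """Join a list of comment strings into one multi-sentence string."""
    sentences = []
    for comment in comments:
        comment = comment.strip()
        if not comment:
            continue  # skip empty entries
        if comment[-1] not in ".!?":
            comment += "."  # make sure each comment ends a sentence
        sentences.append(comment)
    return " ".join(sentences)


comments = [" good subject knowledge.", " punctual in time", "discipline person."]
text = join_comments(comments)
print(text)
```

It could be applied to the column with df['comment'].apply(join_comments) before calling summarize.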
Got the solution: pandas has built-in string methods for this. Use the code below if you face the same problem.
df["comments"] = df["comments"].str.replace(",", "", regex=False).astype(str)
df["comments"] = df["comments"].str.replace("[", "", regex=False).astype(str)
df["comments"] = df["comments"].str.replace("]", "", regex=False).astype(str)
df["comments"] = df["comments"].str.replace("'", "", regex=False).astype(str)
Doing this removes the square brackets, commas, and quotes from the list text, so the feedback is treated as one single string. Then you can summarize the text in a given row of the dataframe using:
from gensim.summarization import summarize
summary = summarize(df["comments"][i])  # i is the index of the row to summarize
print(summary)
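If the comments column holds the text of a Python list (an assumption, for example after a round trip through CSV), a sketch using the standard library's ast.literal_eval parses it back into a real list. This avoids deleting commas and apostrophes that belong to the feedback itself. The sample cell below is made up:

```python
import ast

# Made-up cell value: the string form of a list of comments
cell = '''[' good subject knowledge.', " he explains well, and it's clear."]'''

parts = ast.literal_eval(cell)  # parse the text back into a list of strings
joined = " ".join(part.strip() for part in parts)
print(joined)
```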

Replace only first character from a column in a dataframe

I am trying to replace certain words that occur at the very start of the statement in each row of the dataframe. However, passing in 1 as the position is replacing everything. Why is passing 1 to replace not working? Is there a different way to do this?
Thanks!
Initial:
df_test = pd.read_excel('sample.xlsx')
print('Initial: \n',df_test)
Initial:
some_text
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
Tried:
df_test['some_text'] = df_test['some_text'] \
.str.replace('ur ','Our ',1) \
.str.replace('he ','The ',1)
print('Tried:\n',df_test)
Tried: (Incorrect Results)
some_text
0 Our goal is to finish shopping for books today
1 OOur goal is to finish shopping for books today
2 TThe help is on the way
3 The way is clear … he is going to library
Final output needed:
some_text
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library
Not sure why the other answer got deleted; it was much more concise and did the job. (Sorry, I don't remember who posted it. I tried the answer and it worked, but it had certain limitations.)
df.some_text.str.replace('^ur', 'Our', regex=True).str.replace('^he', 'The', regex=True)
However, as pointed out in the comments, this would also change rows that begin with a longer word starting with 'ur' ('ursula') or 'he' ('helen').
The corrected code is:
df.some_text.str.replace(r'^ur\s', 'Our ', regex=True).str.replace(r'^he\s', 'The ', regex=True)
The '^' anchors the match to the start of the string, so only the incomplete word at the beginning is replaced. The '\s' requires whitespace after the word, so longer words that merely start with 'ur' or 'he' are left alone.
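A sketch that puts both points together: the count argument limits how many matches are replaced, not where they occur, while an anchored pattern restricts the match to the start. The fixes mapping and the use of a callable replacement are illustrative choices, not part of the answers above:

```python
import pandas as pd

# A count of 1 means "first occurrence anywhere", not "only at the start"
print("Our goal".replace("ur ", "Our ", 1))  # OOur goal

s = pd.Series([
    "ur goal is to finish shopping for books today",
    "Our goal is to finish shopping for books today",
    "The help is on the way",
    "he way is clear … he is going to library",
])

fixes = {"ur": "Our", "he": "The"}  # hypothetical lookup table

# ^ anchors at the start of each string, \b ends the match at a word boundary
fixed = s.str.replace(r"^(ur|he)\b", lambda m: fixes[m.group(1)], regex=True)
print(fixed.tolist())
```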
Programming languages, including Python, don't read like human beings. You need to tell Python to split by whitespace. For example, via str.split:
df = pd.DataFrame({'some_text': ['ur goal is to finish shopping for books today',
'Our goal is to finish shopping for books today',
'The help is on the way',
'he way is clear … he is going to library']})
d = {'ur': 'Our', 'he': 'The'}
df['result'] = [' '.join((d.get(i, i), j)) for i, j in df['some_text'].str.split(n=1)]
print(df)
some_text \
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
result
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library

How to retrieve the whole sentence around a selected word?

I would like to find a selected word and take everything from the first period(.) before it and up until the first period(.) after it.
example:
Inside a file called 'text.php':
'The price of blueberries has gone way up. In the year 2038 blueberries have
almost tripled in price from what they were ten years ago. Economists have
said that berries may going up 300% what they are worth today.'
Code example: (I know that with code like this I can find up to 5 words before the word 'that' and up to 5 words after it, but I would like to find everything between the period before and the period after a word.)
import re
text = ('The price of blueberries has gone way up, that might cause trouble for farmers. '
        'In the year 2038 blueberries have almost tripled in price from what they were ten years '
        'ago. Economists have said that berries may going up 300% what they are worth today.')
find = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}that(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}", text)
done = find.group()
print(done)
return:
'blueberries has gone way up, that might cause trouble for farmers'
I would like it to return every sentence with ['that'] in it.
Example return(what i'm looking to get):
'The price of blueberries has gone way up, that might cause trouble for farmers',
'Economists have said that berries may going up 300% what they are worth today'
I would do it like this:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
for sentence in text.split('.'):
    if 'that' in sentence:
        print(sentence.strip())
.strip() is there simply to trim the extra spaces left over from splitting on the period (.).
If you do want to use the re module, I would be using something like this:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
results = re.findall(r"[^.]+that[^.]+", text)
results = [x.strip() for x in results]  # a list, so print shows the sentences in Python 3 as well
print(results)
This gives the same results.
Things to keep in mind:
If you have words like thatcher in the sentence, the sentence will be printed too. In the first solution, you could use if 'that' in sentence.split(): instead so as to split the string into words, and in the second solution, you could use re.findall(r"[^.]+\bthat\b[^.]+", text) (note the \b tokens; these represent word boundaries).
The script relies on the period (.) to delimit sentences. If the sentences themselves contain words that use periods, the results might not be what you expect (e.g. for the sentence Dr. Tom is sick yet again today, so I'm substituting for him., the script will find Dr as one sentence and Tom is sick yet again today, so I'm substituting for him. as another sentence).
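The two caveats above can be folded into one helper. This is a sketch, not a full sentence tokenizer: sentences_containing is a hypothetical name, it splits after ., ! or ? followed by whitespace, and it still breaks on abbreviations such as Dr.:

```python
import re


def sentences_containing(text, word):
    """Return the sentences of text that contain word as a whole word."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps 'that' from matching inside 'thatcher'
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in sentences if pattern.search(s)]


text = ("The price of blueberries has gone way up, that might cause trouble "
        "for farmers. In the year 2038 blueberries have almost tripled in "
        "price from what they were ten years ago. Economists have said that "
        "berries may going up 300% what they are worth today.")

for sentence in sentences_containing(text, "that"):
    print(sentence)
```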
EDIT: To answer your question in the comments, I would make the following changes:
Solution 1:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
sentences = text.split('.')
for i, sentence in enumerate(sentences):
    if 'almost' in sentence:
        before = '' if i == 0 else sentences[i-1].strip()
        middle = sentence.strip()
        after = '' if i == len(sentences)-1 else sentences[i+1].strip()
        print(". ".join([before, middle, after]))
Solution 2:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
results = re.findall(r"(?:[^.]+\.)?[^.]+almost[^.]+\.(?:[^.]+\.)?", text)
results = [x.strip() for x in results]
print(results)
Note that these can potentially give overlapping results. E.g. if the text is a. b. b. c., and you are trying to find sentences containing b, you will get a. b. b and b. b. c.
This function should do the job:
old_text = 'test 1: test friendly, test 2: not friendly, test 3: test friendly, test 4: not friendly, test 5: not friendly'
replace_dict={'test 1':'tested 1','not':'very'}
The function:
def replace_me(text,replace_dict):
    for key in replace_dict.keys():
        text=text.replace(str(key),str(replace_dict[key]))
    return text
result:
print(replace_me(old_text,replace_dict))
Out: 'tested 1: test friendly, test 2: very friendly, test 3: test friendly, test 4: very friendly, test 5: very friendly'
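One thing to watch with sequential replacement is that a later key can rewrite text produced by an earlier one. A sketch of a single-pass variant built on re.sub; replace_all is a hypothetical name, and trying longer keys first is a design choice made here, not something from the answer above:

```python
import re


def replace_all(text, mapping):
    """Replace every key of mapping in one pass over text."""
    # Longer keys first, so 'test 1' is tried before a shorter overlapping key
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


old_text = ('test 1: test friendly, test 2: not friendly, test 3: test friendly, '
            'test 4: not friendly, test 5: not friendly')
print(replace_all(old_text, {'test 1': 'tested 1', 'not': 'very'}))

# A swap works in one pass; replacing key by key would turn both words into 'cat'
print(replace_all("cat dog", {"cat": "dog", "dog": "cat"}))
```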

word frequencies in text file in python

I want to find the frequencies of the words listed in wanted, and while the code does find the frequencies, the displayed result contains lots of unnecessary data.
Code:
from collections import Counter
import re
wanted = "whereby also thus"
cnt = Counter()
words = re.findall('\w+', open('C:/Users/user/desktop/text.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print (cnt)
Results:
Counter({'e': 131, 'a': 119, 'by': 38, 'where': 16, 's': 14, 'also': 13, 'he': 4, 'whereby': 2, 'al': 2, 'b': 2, 'o': 1, 't': 1})
Questions:
How do I omit all those 'e', 'a', 'by', 'where', etc.?
If I then wanted to sum up the frequencies of the words (also, thus, whereby) and divide them by the total number of words in the text, would that be possible?
Disclaimer: this is not a school assignment. I just have lots of free time at work now, and since I spend a lot of time reading texts, I decided to do this little project to remind myself a bit of what I was taught a couple of years ago.
Thanks in advance for any help.
As others have pointed out, you need to change your string wanted to a list. I just hardcoded a list, but you could use str.split(" ") if you were passed a string in a function. I also implemented the frequency calculation for you. Just as a note, make sure you close your files; it's also easier (and recommended) to use the with statement when opening them.
from collections import Counter
import re
wanted = ["whereby", "also", "thus"]
cnt = Counter()
with open('C:/Users/user/desktop/text.txt', 'r') as fp:
    fp_contents = fp.read().lower()
words = re.findall(r'\w+', fp_contents)
for word in words:
    if word in wanted:
        cnt[word] += 1
print (cnt)
total_cnt = sum(cnt.values())
# Share of the wanted words among all words in the text
print(float(total_cnt)/len(words))
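A short sketch of why the original result was full of fragments, and of the share the question asks for. The sample sentence is made up so the block runs without the text file:

```python
import re
from collections import Counter

# On a string, `in` is a substring test; on a list or set it is a membership test
print("e" in "whereby also thus")           # True
print("e" in ["whereby", "also", "thus"])   # False

text = "Whereby we also note, thus and also thus."  # made-up sample
wanted = {"whereby", "also", "thus"}

words = re.findall(r"\w+", text.lower())
cnt = Counter(word for word in words if word in wanted)

# Occurrences of the wanted words divided by the total number of words
share = sum(cnt.values()) / len(words)
print(cnt)
print(share)
```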
Reading from the web
I made this small modification of Axel's code to read from a text file on the web, Alice in Wonderland, and apply the code to it (as I don't have your txt file and I wanted to try it). I am publishing it here in case someone needs something like this.
from collections import Counter
import re
from urllib.request import urlopen
testo = str(urlopen("https://www.gutenberg.org/files/11/11.txt").read())
wanted = ["whereby", "also", "thus", "Alice", "down", "up", "cup"]
cnt = Counter()
words = re.findall('\w+', testo)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
total_cnt = sum(cnt.values())
print(float(total_cnt) / len(cnt))
output
Counter({'Alice': 334, 'up': 97, 'down': 90, 'also': 4, 'cup': 2})
105.4
>>>
How many times the same word is found in adjacent sentences
This answers the request (from the author of the question) to count how many times a word is found in adjacent sentences. If a word occurs several times in one sentence (e.g. 'had') and again in the next sentence, I count that as 1 repetition. That is why I use the wordfound list.
from collections import Counter
import re
testo = """There was nothing so VERY remarkable in that; nor did Alice think it so? Thanks VERY much. Out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed. Quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS? WAISTCOAT-POCKET, and looked at it, and then hurried on.
Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit. with either a waistcoat-pocket, or a watch to take out of it! and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop? Down a large rabbit-hole under the hedge.
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw. How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, but she could not even get her head through the doorway; 'and even if my head would go through,' thought poor Alice, 'it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! I think I could, if I only knew how to begin.'For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few things indeed were really impossible. There seemed to be no use in waiting by the little door, so she went back to the table, half hoping she might find another key on it, or at any rate a book of rules for shutting people up like telescopes: this time she found a little bottle on it, ('which certainly was not here before,' said Alice,) and round the neck of the bottle was a paper label, with the words 'DRINK ME' beautifully printed on it in large letters. It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. 'No, I'll look first,' she said, 'and see whether it's marked "poison" or not'; for she had read several nice little histories about children who had got burnt, and eaten up by wild beasts and other unpleasant things, all because they WOULD not remember the simple rules their friends had taught them: such as, that a red-hot poker will burn you if you hold it too long; and that if you cut your finger VERY deeply with a knife, it usually bleeds; and she had never forgotten that, if you drink much from a bottle marked 'poison,' it is almost certain to disagree with you, sooner or later. 
However, this bottle was NOT marked 'poison,' so Alice ventured to taste it, and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, custard, pine-apple, roast turkey, toffee, and hot buttered toast,) she very soon finished it off. """
frasi = re.findall("[A-Z].*?[\.!?]", testo, re.MULTILINE | re.DOTALL)
print("How many times this words are repeated in adjacent sentences:")
cnt2 = Counter()
for n, s in enumerate(frasi):
    words = re.findall("\w+", s)
    wordfound = []
    for word in words:
        try:
            if word in frasi[n + 1]:
                wordfound.append(word)
                if wordfound.count(word) < 2:
                    cnt2[word] += 1
        except IndexError:
            pass
for k, v in cnt2.items():
    print(k, v)
output
How many times this words are repeated in adjacent sentences:
had 1
hole 1
or 1
as 1
little 2
that 1
hot 1
large 1
it 5
to 5
a 6
not 3
and 2
s 1
me 1
bottle 1
is 1
no 1
the 6
how 1
Oh 1
she 2
at 1
marked 1
think 1
VERY 1
I 2
door 1
red 1
of 1
dear 1
see 1
could 2
in 2
so 1
was 1
poison 1
A 1
Alice 3
all 1
nice 1
rabbit 1
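Note that word in frasi[n + 1] is a substring test against the next sentence, so short tokens such as 'a' or 's' can match inside longer words. A sketch of a variant that compares sets of whole words instead; adjacent_repeats is a hypothetical name and the sample text is made up:

```python
import re
from collections import Counter


def adjacent_repeats(text):
    """Count, per word, how many adjacent sentence pairs share that word."""
    sentences = re.findall(r"[A-Z].*?[.!?]", text, re.DOTALL)
    # One set of whole words per sentence, so duplicates inside a sentence count once
    word_sets = [set(re.findall(r"\w+", s)) for s in sentences]
    counts = Counter()
    for current, following in zip(word_sets, word_sets[1:]):
        counts.update(current & following)
    return counts


sample = "The cat sat. The dog sat. A bird flew."
for word, n in sorted(adjacent_repeats(sample).items()):
    print(word, n)
```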

Python string,re.match,loop

OK guys, I have 4 examples:
I love #hacker,
I just scored 27 points in the Picking Cards challenge on #Hacker,
I just signed up for summer cup #hacker,
interesting talk by hari, co-founder of hacker,
I need to find how many times the word "hacker" repeats.
import re
count = 0
res = re.match("hacker")
for res in example:
    count += 1
return count
Here is my code "so far", since I don't know how to work out the solution for this exercise.
you can use re.findall:
my_string = """I love #hacker, I just scored 27 points in the Picking Cards challenge on #Hacker, I just signed up for summer cup #hacker, interesting talk by hari, co-founder of hacker,"""
>>> import re
>>> len(re.findall("hacker",my_string.lower()))
4
re.findall gives you all the matched substrings in the string, and len then tells you how many there are.
str.lower() is used to convert the string to lowercase.
Instead of str.lower() you can also use the re.IGNORECASE flag:
>>> len(re.findall("hacker",my_string,re.IGNORECASE))
4
You can also do this:
the_string = """I love #hacker, I just scored 27 points in the Picking Cards challenge on #Hacker, I just signed up for summer cup #hacker, interesting talk by hari, co-founder of hacker,"""
num = the_string.lower().count("hacker")
import re
string1="hello Hacker what are you doing hacker"
a=re.findall("hacker",string1.lower())
print (len(a))
Output:
>>>
2
>>>
re.findall finds every occurrence of the pattern you give it.
Edit: I added the string1.lower() too as mentioned by Rawing.
Your code does not work because re.match() only checks for a match at the beginning of the string; it does not find all occurrences.
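A sketch that contrasts the three calls on the question's examples. Counting whole words with \b and re.IGNORECASE is a choice made here for illustration:

```python
import re

examples = [
    "I love #hacker",
    "I just scored 27 points in the Picking Cards challenge on #Hacker",
    "I just signed up for summer cup #hacker",
    "interesting talk by hari, co-founder of hacker",
]

# match only looks at the start of the string, search scans the whole string
print(re.match("hacker", examples[0]))            # None
print(re.search("hacker", examples[0]).group())   # hacker

# findall returns every occurrence; sum the per-line counts
pattern = re.compile(r"\bhacker\b", re.IGNORECASE)
count = sum(len(pattern.findall(line)) for line in examples)
print(count)
```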
You can also use the count() method after splitting your string, so you don't need a regex. If you want to match upper case too, use lower():
>>> l='this is a test and not a Test'
>>> [x.lower() for x in l.split()].count('test')
2
>>> l='this is a test and not a rtest'
>>> [x.lower() for x in l.split()].count('test')
1
