I have a dataframe
0 i only need uxy to hit 20 eod to make up for a...
1 oh this isn’t good
2 lads why is my account covered in more red ink...
3 i'm tempted to drop my last 800 into some stup...
4 the sell offs will continue until moral improves.
and I have this code:
my_text = 'i only need uxy to hit 20 eod to make up for a...
oh this isn’t good'
seq = tokenizer.texts_to_sequences([my_text])
seq = pad_sequences(seq, maxlen=maxlen)
prediction = model.predict(seq)
print('positivity:',prediction)
What I want to do is calculate the positivity for each sentence in each row. It works fine with a single my_text, but I don't know how to change it to calculate for every row. I would also like to create an extra column showing the positivity of each row's sentence.
I'd appreciate your help.
Just create a function with the exact same code you have, plus a return statement to return the value, then .apply it on the column you want to calculate the values for:
def getPositivity(my_text):
    seq = tokenizer.texts_to_sequences([my_text])
    seq = pad_sequences(seq, maxlen=maxlen)
    prediction = model.predict(seq)
    return prediction

df['prediction'] = df['col'].apply(getPositivity)
The code above assumes that the dataframe variable is named df and that the column holding these string values is named col.
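One note: .apply calls model.predict once per row, which can be slow on large dataframes. Keras-style models accept a whole batch, so tokenizing and padding the entire column once and calling model.predict a single time is usually faster. A minimal sketch of the batching pattern, with a dummy scorer standing in for the tokenizer/model from the question (which aren't defined here):

```python
import pandas as pd

df = pd.DataFrame({'col': ['i only need uxy to hit 20 eod',
                           'oh this is not good',
                           'lads why is my account covered in red']})

# Stand-in for the real pipeline; with Keras you would instead do:
#   seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)
#   scores = model.predict(seqs)   # one batched call for every row
def predict_batch(texts):
    # dummy scorer so this sketch runs without a trained model
    return [float(len(t.split()) % 2) for t in texts]

# score the whole column in one call instead of once per row
df['positivity'] = predict_batch(df['col'].tolist())
print(df)
```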
I would like to flag sentences located in a pandas dataframe. As you can see in the example, some of the sentences are split across multiple rows (these are subtitles from an srt file that I would like to translate to a different language eventually, but first I need to put each sentence in a single cell). The end of a sentence is marked by the period at the end. I want to create a column like the sentence column, where I number each sentence (it doesn't have to be a string; it could be a number too).
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.contains(r'\.')
df
output:
subtitle sentence_number presence_of_period
0 This is an example of subtitle. sentence_1 True
1 I want to group by sentences, which sentence_2 False
2 the end is determined by a period. sentence_2 True
3 row 0 should have sentece_1, rows 1 and 2 sentence_3 False
4 should have sentence_2. and this sentence_3 True
5 last row should have sentence_3. sentence_4 True
How can I create the sentence_number column, given that it has to read the previous cells in the subtitle column? I was thinking of a window function or shift() but couldn't figure out how to make it work. I added a column to show whether the cell has a period, signifying the end of the sentence. Also, if possible, I would like to move the "and this" from row 4 to the beginning of row 5, since it is a new sentence (not sure if this would require a different question).
Any thoughts?
To fix the sentence number, here's an option for you.
import pandas as pd
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.count(r'\.')
df['end'] = df.subtitle.str.endswith('.').astype(int)
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
#print (df['subtitle'])
#print (df[['sentence_number','presence_of_period','end','sentence_#']])
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])
The output will be as follows:
subtitle sentence_#
0 This is an example of subtitle. sentence_1
1 I want to group by sentences, which sentence_2
2 the end is determined by a period. sentence_2
3 row 0 should have sentece_1, rows 1 and 2 sentence_3
4 should have sentence_2. sentence_3
5 and this last row should have sentence_3. sentence_4
If you need to move the partial sentence to the next row, I need to understand a bit more detail.
What do you want to do if there are more than two sentences in a row? For example, 'This is first sentence. This second. This is'.
What do you want to do in this case? Split the first one into a row, the second into another row, and concatenate the third with the next row's data?
Once I understand this, we can use df.explode() to solve it.
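To show what the df.explode() idea looks like in practice, here is a minimal sketch on a hypothetical two-row example: each cell is split after a period so a trailing fragment becomes its own element, then explode turns every fragment into its own row, preserving order.

```python
import pandas as pd

df = pd.DataFrame({'subtitle': ['should have sentence_2. and this',
                                'last row should have sentence_3.']})

# Split each cell after every '.' followed by whitespace, so a trailing
# fragment like 'and this' becomes its own element
parts = df['subtitle'].str.split(r'(?<=\.)\s+', regex=True)

# explode gives each fragment its own row, in the original order
out = df.assign(subtitle=parts).explode('subtitle', ignore_index=True)
print(out)
```

From here the fragments can be regrouped into sentences with the same cumsum trick shown above.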
I have a list
top = ['GME', 'MVIS', 'TSLA', 'AMC']
and I have a dataset
dt ... text
0 2021-03-19 20:59:49+06 ... I only need TSLA TSLA TSLA TSLA to hit 20 eod to make up for a...
1 2021-03-19 20:59:51+06 ... Oh this isn’t good
2 2021-03-19 20:59:51+06 ... lads why is my account covered in more GME ...
3 2021-03-19 20:59:51+06 ... I'm tempted to drop my last 800 into some TSLA...
So what I want to do is check whether a row's sentence contains more than 3 words from the list; if it does, I want to remove that row.
Thank you for your help.
Let's write a function that determines whether a given sentence contains more than 3 words from the list top:
def check_words(sentence, top):
    words = sentence.split()
    count = 0
    for word in words:
        if word in top:
            count += 1
    return count > 3
Then you want to create a True/False column indicating whether the sentence contains more than 3 words from the list. Let's use the pandas dataframe structure:
dataframe['Contains_3+_words'] = dataframe.apply(lambda r: check_words(r.text, top), axis=1)
Then we keep only the rows without sentences containing 3+ words from the list :
dataframe = dataframe[dataframe['Contains_3+_words']==False]
Additionally, you can remove the column we created:
dataframe.drop(['Contains_3+_words'], axis=1, inplace=True)
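The flag column, the filter, and the cleanup drop can also be collapsed into one step. A compact sketch of the same logic, on a small made-up dataframe:

```python
import pandas as pd

top = ['GME', 'MVIS', 'TSLA', 'AMC']
dataframe = pd.DataFrame({'text': ['I only need TSLA TSLA TSLA TSLA to hit 20 eod',
                                   'Oh this is not good']})

# Count list hits per row and keep only rows with 3 or fewer matches,
# without creating (and later dropping) a helper column
hits = dataframe['text'].apply(lambda s: sum(w in top for w in s.split()))
dataframe = dataframe[hits <= 3].reset_index(drop=True)
print(dataframe)
```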
How can I index a list inside a dataframe?
I have this code here that will get data from JSON and insert it into a dataframe
Here's what the JSON looks like
{"text_sentiment": "positive", "text_probability": [0.33917574607174916, 0.26495590980799744, 0.3958683441202534]}
Here's my code.
input_c = pd.DataFrame(columns=['Comments','Result'])
for i in range(input_df.shape[0]):
    url = 'http://classify/?text='+str(input_df.iloc[i])
    r = requests.get(url)
    result = r.json()["text_sentiment"]
    proba = r.json()["text_probability"]
    input_c = input_c.append({'Comments': input_df.loc[i].to_string(index=False),'Result': result, 'Probability': proba}, ignore_index = True)
st.write(input_c)
Here's what the results look like
result
Comments Result Probability
0 This movie is good in my eyes. neutral [0.26361889609129974, 0.4879752378104797, 0.2484058660982205]
1 This is a bad movie it's not good. negative [0.5210904912792065, 0.22073131008688818, 0.25817819863390534]
2 One of the best performance in this year. positive [0.14644707145500369, 0.3581522311734714, 0.49540069737152503]
3 The best movie i've ever seen. positive [0.1772046003747405, 0.026468108571479156, 0.7963272910537804]
4 The movie is meh. neutral [0.24349393167653663, 0.6820982528652574, 0.07440781545820596]
5 One of the best selling artist in the world. positive [0.07738688706903311, 0.3329095061233371, 0.5897036068076298]
The data in the Probability column is the one I want to index.
For example: if the value in Result is "positive", then I want to index proba at 2, and if the result is "neutral", index at 1.
Like this
Comments Result Probability
0 This movie is good in my eyes. neutral [0.4879752378104797]
1 This is a bad movie it's not good. negative [0.5210904912792065]
2 One of the best performance in this year. positive [0.49540069737152503]
3 The best movie i've ever seen. positive [0.7963272910537804]
4 The movie is meh. neutral [0.6820982528652574]
5 One of the best selling artist in the world. positive [0.5897036068076298]
Are there any ways on how to do it?
In your code you have already decided the Result content (whether it's negative, neutral, or positive), so you only need to store the maximum value of the probability list in the dataframe input_c.
This means, change 'Probability': proba to 'Probability': max(proba), so modify:
input_c = input_c.append({'Comments': input_df.loc[i].to_string(index=False),'Result': result, 'Probability': proba}, ignore_index = True)
to
input_c = input_c.append({'Comments': input_df.loc[i].to_string(index=False),'Result': result, 'Probability': max(proba)}, ignore_index = True)
then, if you want to set the Probability column as the index of input_c, use (note that set_index returns a new dataframe, so assign the result back):
input_c = input_c.set_index('Probability')
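If you specifically want to pick the probability by its position (2 for positive, 1 for neutral, 0 for negative) rather than take the maximum, a label-to-index mapping does it. A sketch with made-up numbers, assuming the [negative, neutral, positive] ordering that the sample output suggests:

```python
import pandas as pd

input_c = pd.DataFrame({
    'Result': ['neutral', 'negative', 'positive'],
    'Probability': [[0.26, 0.49, 0.25], [0.52, 0.22, 0.26], [0.15, 0.36, 0.49]],
})

# assumed ordering of the classifier's probability list
idx = {'negative': 0, 'neutral': 1, 'positive': 2}

# pick the element matching each row's predicted label
input_c['Probability'] = input_c.apply(
    lambda r: r['Probability'][idx[r['Result']]], axis=1)
print(input_c)
```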
I'm trying to put 0 or 1 in the 'winner' column if somebody in the member list won an award in that year.
There is a dictionary with an award winner.
award_winner = {'2010':['Momo','Dahyum'],'2011':['Nayeon','Sana'],'2012':['Moon','Jihyo']}
And This is the data frame:
df = pd.DataFrame({'member':[['Jeong-yeon','Momo'],['Jay-z','Bieber'],['Kim','Moon']],'year' : ['2010','2011','2012']})
From the data frame, I would like to see if there's any award winner in each year (the dataframe's year) based on the dictionary.
For example, let's look at the first row: Momo won in 2010 and Moon won in 2012, so the desired output of the dataframe should be like this:
So this is the code so far:
df['winner'] = 0 #empty column

def winner_classifier():
    for i in range(len(df['member'])): #searching if there are any award winner in df
        if df['member'][row][i] in award_winner[df['year'][row]]: #I couldn't make row to
            return 1
        else:
            continue

df['winner'] = df['member'].apply(winner_classifier)
or
Here, I can't assign row. I want the code to look up whether there's any winner based on the year from the dictionary, so the code should go row by row and check, but I can't make it work.
I summarized the problem like this to ask on Stack Overflow, but there are more than 10,000 rows, and I thought it would be possible to solve this with pandas apply.
I already tried a double for loop without pandas, and that took too long.
I also tried groupby(), but I was wondering how I should use it, like:
df['winner'] = df['year'].groupby().apply(winner_classifier)..?
Could you help me with this?
Thank you :)
Create a df from the dictionary so that you can merge it later:
winners = pd.DataFrame({
'year' : list(award_winner.keys()),
'winner': list(award_winner.values())})
print (winners)
year winner
0 2010 [Momo, Dahyum]
1 2011 [Nayeon, Sana]
2 2012 [Moon, Jihyo]
Now merge and find the intersection of awards with members
result = df.merge(winners, on="year")
result['result'] = result.apply(
lambda x: len(set(x.member).intersection(x.winner)) != 0, axis=1)
result = result.drop(['winner'], axis=1)
print (result)
member year result
0 [Jeong-yeon, Momo] 2010 True
1 [Jay-z, Bieber] 2011 False
2 [Kim, Moon] 2012 True
You can make use of Python's set() capability here to easily compare two lists of arbitrary length.
I have written this as a row-wise iterator as I wasn't entirely sure what you wanted the result to look like (i.e. do you just want a True/False, or do you want to record the winner for each row?). With 10k rows it shouldn't be a problem to iterate over the dataframe row by row.
for index, row in df.iterrows():
    members_who_were_winners = set(row.member) & set(award_winner[row.year])
    if len(members_who_were_winners) > 0:
        # You could also write the member name to a new column etc
        df.at[index, 'winner_this_year'] = True
    else:
        df.at[index, 'winner_this_year'] = False
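The same set-intersection check can also be written as a single apply, which avoids iterrows and yields the 0/1 column the question asked for. A sketch using the question's data:

```python
import pandas as pd

award_winner = {'2010': ['Momo', 'Dahyum'], '2011': ['Nayeon', 'Sana'],
                '2012': ['Moon', 'Jihyo']}
df = pd.DataFrame({'member': [['Jeong-yeon', 'Momo'], ['Jay-z', 'Bieber'],
                              ['Kim', 'Moon']],
                   'year': ['2010', '2011', '2012']})

# non-empty intersection between the row's members and that year's winners
# becomes 1, otherwise 0; .get handles years missing from the dictionary
df['winner'] = df.apply(
    lambda r: int(bool(set(r['member']) & set(award_winner.get(r['year'], [])))),
    axis=1)
print(df)
```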
I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically, I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could get the unique values from the INFO column of the second dataframe, convert them to a list, and then build a single alternation pattern for Series.str.contains on the first dataframe's TEXT column:
unique = dataframe2['INFO'].unique().tolist()
dataframe1['MATCH'] = dataframe1['TEXT'].str.contains('|'.join(unique)).astype(int)
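One caveat when joining raw search terms into a regex alternation: if any INFO value contains a regex metacharacter such as '.' or '+', the pattern changes meaning. re.escape guards against that. A self-contained sketch using the question's sample data:

```python
import pandas as pd
import re

dataframe1 = pd.DataFrame({'TEXT': ['some random words', 'another word',
                                    'blah', 'random words']})
dataframe2 = pd.DataFrame({'INFO': ['random', 'blah']})

# escape each term so metacharacters are matched literally,
# then join them into one alternation pattern
pattern = '|'.join(re.escape(t) for t in dataframe2['INFO'].unique())
dataframe1['MATCH'] = dataframe1['TEXT'].str.contains(pattern).astype(int)
print(dataframe1)
```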