Applying language detector to every row in pandas - python

I am trying to do what has been asked in this question. The problem I am having is that .apply() does not properly iterate over the rows. I have a dataframe which looks like this:
stuff, body
12, "Je parle francais"
25, "This is english"
I have tried three things. Running df['body'].apply(lambda row: (detect == "en")) returned False for every row regardless of language, because it puts <function detect at random_bytes> into every row instead of calling the function. Both df['body'].apply(detect) and df['body'].apply(lambda row: detect(row)) ended up raising:
LangDetectException: No features in text.
I cannot really afford to run through every single row with a for loop because of the amount of data I have. So how would I find out which rows in the body column are English and which are not, using the langdetect library?

Try this:
import pandas as pd
from langdetect import detect, LangDetectException
df = pd.read_clipboard(sep=', ') #Create dataframe from clipboard
df.loc[3, :] = [30,''] #Add blank text to dataframe
def f(x):
    try:
        result = detect(x)
    except LangDetectException as e:
        result = str(e)
    return result
df["lang"] = df["body"].apply(f)
Output:
stuff body lang
0 12.0 "Je parle francais" fr
1 25.0 "This is english" en
3 30.0 No features in text.
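Both failure modes from the question can be reproduced without installing langdetect. The sketch below is illustrative only: fake_detect is a hypothetical stand-in for langdetect.detect (it raises on text with no letters, much as the real detector raises LangDetectException), and safe_detect is a made-up helper name. With the real library you would presumably pass detect and catch LangDetectException instead of ValueError.

```python
import pandas as pd

def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect, for illustration only."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "fr" if "parle" in text else "en"

def safe_detect(text, detector=fake_detect):
    """Call the detector; return None instead of raising on unusable text."""
    try:
        return detector(text)
    except ValueError:
        return None

df = pd.DataFrame({"stuff": [12, 25, 30],
                   "body": ["Je parle francais", "This is english", ""]})

# Bug from the question: this compares the function object itself to "en",
# so every row is False no matter what the text says.
wrong = df["body"].apply(lambda row: fake_detect == "en")

# Call the detector on each value, then compare the result.
df["lang"] = df["body"].apply(safe_detect)
df["is_english"] = df["lang"].eq("en")
```

Returning None for undetectable rows (rather than the error message) keeps the lang column free of strings that are not language codes, which makes later filtering simpler.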

Related

When adding SpaCy output to existing dataframe, columns do not align

I have a csv with a column of article titles from which I've used SpaCy to extract any people's names that appear in the titles. When trying to add a new column to the csv with the names extracted by SpaCy, they do not align with the rows from which they were extracted.
I believe this is because the SpaCy results have their own index which is independent of the original data's index.
I've tried adding , index=df.index) to the new column line but I get "ValueError: Length of passed values is 2, index implies 10."
How do I align the SpaCy output to the rows from which they originated?
Here's my code:
import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
usecols=['article_title']))
article = [_ for _ in df['article_title']]
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())
This is the resulting dataframe:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... (Dylan, Mulvaney)
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... NaN
This is what I'm expecting:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... NaN
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... (Dylan, Mulvaney)
You can see the 5th value in artist_names column is related to the 5th article title. How can I get them to align?
Thank you for your help.
I would iterate through the articles, detect entities from each article separately, and put the detected entities in a list with one element per article:
nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]
entities_by_article = []
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    entities_by_article.append(people)
df['artist_names'] = pd.Series(entities_by_article)
Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping through a list of texts and could be replaced by:
for a in article:
    doc = nlp(a)
    ## rest of code within loop
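To see why the original attempt misaligns, here is a small sketch that does not need spaCy. find_people is a hypothetical stand-in for the PERSON extraction (it just grabs pairs of capitalised words), and the titles are made up. The point is only that a flat list of hits gets its own 0..n-1 index, while one result per row keeps the DataFrame's index.

```python
import re
import pandas as pd

def find_people(title):
    """Hypothetical stand-in for a spaCy PERSON extractor: two capitalised words."""
    return re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", title)

df = pd.DataFrame({"article_title": [
    "An interview with Hannah Ward",
    "a festival of design returns",
    "Dylan Mulvaney on rebranding",
]})

# Flattening every hit into one list loses track of the source row:
flat = [name for title in df["article_title"] for name in find_people(title)]
misaligned = pd.Series(flat)  # index 0..1, unrelated to the rows of df

# One result per row keeps the original index, so the assignment lines up.
df["artist_names"] = df["article_title"].apply(find_people)
```

Rows with no match end up with an empty list here; whether to convert those to NaN is a design choice that depends on what consumes the column next.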
if ent.label_ == "PERSON":
    people.append(ent)
else:
    people.append(np.nan) # if ent.label_ is not a PERSON
Include an else statement so that any entity whose label_ is not PERSON is recorded as NaN.

spacy stemming on pandas df column not working

How can I apply stemming to a pandas DataFrame column?
I am using this function for stemming, which works perfectly on a string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas DataFrame column, it does not work, and it does not raise any error either:
x = list(df['text']) #my df column
x = str(x)#converting into string otherwise it is giving error
make_to_base(x)
I've tried different things, but nothing works. For example:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value you got inside the make_to_base method, use
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
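The difference between printing and returning is easy to check in isolation. In this sketch, shout_print and shout_return are hypothetical stand-ins for the two versions of make_to_base (upper-casing instead of lemmatising, so no spaCy model is needed): apply stores whatever the function returns, and a function that only prints returns None.

```python
import pandas as pd

def shout_print(text):
    print(text.upper())   # displays the value, but the function returns None

def shout_return(text):
    return text.upper()   # hands the value back to apply

df = pd.DataFrame({"text": ["hope you are well", "give back my thanks"]})

df["printed"] = df["text"].apply(shout_print)    # column of missing values
df["returned"] = df["text"].apply(shout_return)  # column of results
```

Note that the lambda wrapper is optional: df['text'].apply(make_to_base) passes each value to the function directly.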

Match the Exact substring from string of pandas series object

I am trying to match an exact substring within the strings of a pandas Series, but str.contains doesn't seem to work here. The documentation suggests passing regex=False, but that does not work either. Can anyone suggest a solution?
Output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual
Expected Output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual + behavioral
1 ff~tg~conbhv contextual + behavioral
2 ff~tg~con contextual
Approach:
import pandas as pd
import numpy as np
column = {'col_name': ['Revised Targeting Type']}
data = {"Creative Name":["ff~pd~q4-smartphones-note10-pdp-iphone7_mk~gb_ch~social_md~h_ad~ss1x1_dt~cross_fm~spost_pb~fcbk_sz~1x1_rt~cpm_tg~conbhv_sa~lo_vv~ia_it~soc_ts~lo-iphone7_ff~ukp q4 smartphones ukc q4 - smartphones - static ukt lo-iphone7 ukcdj buy_ct~fb_cs~1x1_lg~engb_cv~ge_ce~loc_mg~oth_ta~lrn_cw~na",
"ff~tg~conbhv",
"ff~tg~con"], "Revised Targeting Type":["ABC", "NA", "NA"]}
mapping = {"Code": ['con', 'conbhv'], "Actual": ['contextual', 'contextual + behavioral'], "OtherPV": [np.nan, np.nan],
"SheetName": ['tg', 'tg']}
# Creating a dataFrame
dataframe_data = pd.DataFrame(data)
mapping_data = pd.DataFrame(mapping)
column_data = pd.DataFrame(column)
print(dataframe_data)
print(mapping_data)
print(column_data)
# loop through the DataFrame columns listed in column_data
for i in column_data.iloc[:,0]:
    print(i)
    # loop through mapping dataframe (mapping_data)
    for k, l, m in zip(mapping_data.iloc[:, 0], mapping_data.iloc[:, 1], mapping_data.iloc[:, 3]):
        # mask the dataframe (dataframe_data)
        mask_null_revised_new_col = (dataframe_data['{}'.format(i)].isin(['NA']))
        # apply dataframe values in main dataframe (dataframe_data)
        dataframe_data['{}'.format(i)] = np.select([mask_null_revised_new_col &
            dataframe_data['Creative Name'].str.contains('{}~{}'.format(m, k))],
            [l], default=dataframe_data['{}'.format(i)])
print(dataframe_data)
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual
To be honest, I'm a little confused by your question, but is this what you're looking for?
dataframe_data['Revised Targeting Type'] = np.where(dataframe_data['Creative Name'].str.contains('conbhv', regex=False), 'contextual + behavioral', 'contextual')
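A more general option, sketched here under the assumption that fields in Creative Name are separated by "_" and that the code follows "tg~", is to extract the code itself and look it up in the mapping. The shortened first string is made up for the example. Listing longer codes first and requiring "_" or end-of-string after the code prevents "con" from matching inside "conbhv".

```python
import re
import pandas as pd

df = pd.DataFrame({"Creative Name": [
    "ff~pd~q4_rt~cpm_tg~conbhv_sa~lo",  # shortened, made-up example string
    "ff~tg~conbhv",
    "ff~tg~con",
]})
mapping = {"con": "contextual", "conbhv": "contextual + behavioral"}

# Try longer codes first so 'conbhv' wins over its prefix 'con'.
codes = sorted(mapping, key=len, reverse=True)
# The lookahead requires the code to end at '_' or at the end of the string.
pattern = r"tg~(" + "|".join(re.escape(c) for c in codes) + r")(?=_|$)"

extracted = df["Creative Name"].str.extract(pattern, expand=False)
df["Revised Targeting Type"] = extracted.map(mapping)
```

Rows whose name contains no known code come out as NaN, which makes unmapped values easy to spot.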

Creating a term frequency matrix from a Python Dataframe

I am doing some natural language processing on some twitter data. So I managed to successfully load and clean up some tweets and placed it into a data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
The problem is that I am trying to construct a term frequency matrix where each row is a tweet and each column holds the number of times a given word occurs in that row. My only problem is that the other posts I found about term frequency distributions deal with text files. Here is the code I used to generate the data frame above:
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
df_tweetText = df_tweet
#Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '').str.lower())
#Removing Stop words
#nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
#Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}",'',regex=True, inplace=False)
#Tokenize the words
df_tweetText
At first I tried to use the function word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting each entire sentence instead of each word in the row.
Another thing I tried was to tokenize each row using df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then call FreqDist again, but that gives me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
Is there some alternative way for trying to construct this term frequency matrix? Ideally, I want my data to look something like this
id |collusion | president |
------------------------------------------
1104159474368024599 | 1 | 0 |
1104155456019357703 | 0 | 2 |
EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the Term Document Matrix with every single tweet in one row.
import textmining
#Creates Term Matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))
for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT 2: So I tried scikit-learn, and that sort of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns which should not exist. Also, my columns are showing up as numbers for some reason.
from sklearn.feature_extraction.text import CountVectorizer
corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Iterating over each row is probably not optimal, but it works. Mileage may vary depending on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter
# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]
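# result dataframe: one row of word counts per tweet
rows = []
for i, row in df.iterrows():
    rows.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
# DataFrame.append was removed in pandas 2.0, so concatenate the rows once instead
df2 = pd.concat(rows, ignore_index=True)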
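If the loop turns out to be too slow, one alternative is to let pandas do the counting. This is only a sketch with made-up ids and text: it splits on whitespace (a real tokenizer may be preferable), explodes to one word per row, and counts words per tweet id.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "text": ["no collusion", "president trump and president"],
})

# One row per (tweet id, word); the id is carried along in the index.
tokens = df.set_index("id")["text"].str.split().explode()

# Count each word within each tweet, then pivot words into columns.
tf = tokens.groupby(level="id").value_counts().unstack(fill_value=0)
```

Here tf has one row per tweet id and one column per distinct word, with 0 where a word does not occur in a tweet.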

Setting pandas conditions for columns by row, Python 2.7

(I suck at titling these questions...)
So I've gotten 90% of the way through a very laborious learning process with pandas, but I have one thing left to figure out. Let me show an example (actual original is a comma-delimited CSV that has many more rows):
Name Price Rating URL Notes1 Notes2 Notes3
Foo $450 9 a.com/x NaN NaN NaN
Bar $99 5 see over www.b.com Hilarious Nifty
John $551 2 www.c.com Pretty NaN NaN
Jane $999 8 See Over in Notes Funky http://www.d.com Groovy
The URL column can say many different things, but they all include "see over," and do not indicate with consistency which column to the right includes the site.
I would like to do a few things, here: first, move websites from any Notes column to URL; second, collapse all notes columns to one column with a new line between them. So this (NaN's removed because pandas makes me in order to use them in df.loc):
Name Price Rating URL Notes1
Foo $450 9 a.com/x
Bar $99 5 www.b.com Hilarious
Nifty
John $551 2 www.c.com Pretty
Jane $999 8 http://www.d.com Funky
Groovy
I got partway there by doing this:
df['URL'] = df['URL'].fillna('')
df['Notes1'] = df['Notes1'].fillna('')
df['Notes2'] = df['Notes2'].fillna('')
df['Notes3'] = df['Notes3'].fillna('')
to_move = df['URL'].str.lower().str.contains('see over')
df.loc[to_move, 'URL'] = df['Notes1']
What I don't know is how to find the Notes column with either www or .com. If I, for example, try to use my above method as a condition, e.g.:
if df['Notes1'].str.lower().str.contains('www'):
    df.loc[to_move, 'URL'] = df['Notes1']
I get back ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() But adding .any() or .all() has the obvious flaw that they don't give me what I'm looking for: with any, e.g., every line that meets the to_move requirement in URL will get whatever's in Notes1. I need the check to occur row by row. For similar reasons, I can't even get started collapsing the Notes columns (and I don't know how to check for non-null empty string cells, either, a problem I created at this point).
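The row-by-row check described above is usually expressed by combining boolean Series with & rather than with a Python if. A minimal sketch with made-up data (empty strings already substituted for NaN, and only two Notes columns for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["a.com/x", "see over", "See Over in Notes"],
    "Notes1": ["", "www.b.com", "Funky"],
    "Notes2": ["", "Hilarious", "http://www.d.com"],
})

# Rows whose URL cell is only a pointer to the notes.
to_move = df["URL"].str.contains("see over", case=False)

for col in ["Notes1", "Notes2"]:
    has_url = df[col].str.contains(r"www|\.com")
    rows = to_move & has_url              # True only where both hold, per row
    df.loc[rows, "URL"] = df.loc[rows, col]
    df.loc[rows, col] = ""                # do not leave the URL behind in the notes

# Empty-string cells can be found with an ordinary comparison.
empty_notes1 = df["Notes1"].eq("")
```

Each mask is evaluated for every row at once, so no any() or all() is needed; the combined mask simply selects the rows to assign.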
Where it stands, I know I also have to move in Notes2 to Notes1, Notes3 to Notes2, and '' to Notes3 when the first condition is satisfied, because I don't want the leftover URLs in the Notes columns. I'm sure pandas has easier routes than what I'm doing, because it's pandas, and when I try to do anything with pandas, I find out that it can be done in one line instead of my 20...
(PS, I don't care if the empty columns Notes2 and Notes3 are left over, b/c I'm not using them in my CSV import in the next step, though I can always learn more than I need)
UPDATE: So I figured out a crummy verbose solution using my non-pandas python logic one step at a time. I came up with this (same first five lines above, minus the df.loc line):
url_in1 = df['Notes1'].str.contains(r'\.com')
url_in2 = df['Notes2'].str.contains(r'\.com')
to_move = df['URL'].str.lower().str.contains('see over')
to_move1 = to_move & url_in1
to_move2 = to_move & url_in2
df.loc[to_move1, 'URL'] = df.loc[url_in1, 'Notes1']
df.loc[url_in1, 'Notes1'] = df['Notes2']
df.loc[url_in1, 'Notes2'] = ''
df.loc[to_move2, 'URL'] = df.loc[url_in2, 'Notes2']
df.loc[url_in2, 'Notes2'] = ''
(Lines moved around and to_move repeated in actual code) I know there has to be a more efficient method... This also doesn't collapse in the Notes columns, but that should be easy using the same method, except that I still don't know a good way to find the empty strings.
I'm still learning pandas, so some parts of this code may be not so elegant, but general idea is - get all notes columns, find all urls in there, combine it with URL column and then concat remaining notes into Notes1 column:
import pandas as pd
import numpy as np
# Return the first non-null entry of a row, or NaN if there is none
def geturl(s):
    try:
        return next(e for e in s if not pd.isnull(e))
    except StopIteration:
        return np.nan
df = pd.read_csv("d:/temp/data2.txt")
dfnotes = df[[e for e in df.columns if 'Notes' in e]]
# Notes1 Notes2 Notes3
# 0 NaN NaN NaN
# 1 www.b.com Hilarious Nifty
# 2 Pretty NaN NaN
# 3 Funky http://www.d.com Groovy
dfurls = dfnotes.apply(lambda x: x.str.contains(r'\.com'), axis=1)
dfurls = dfurls.fillna(False).astype(bool)
# Notes1 Notes2 Notes3
# 0 False False False
# 1 True False False
# 2 False False False
# 3 False True False
turl = dfnotes[dfurls].apply(geturl, axis=1)
df['URL'] = np.where(turl.isnull(), df['URL'], turl)
# Join the remaining non-null notes (replaces the private pandas.core.strings.str_cat helper)
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: ' '.join(x.dropna()), axis=1)
del df['Notes2']
del df['Notes3']
df
# Name Price Rating URL Notes1
# 0 Foo $450 9 a.com/x
# 1 Bar $99 5 www.b.com Hilarious Nifty
# 2 John $551 2 www.c.com Pretty
# 3 Jane $999 8 http://www.d.com Funky Groovy
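A variation on the same idea, sketched here with the sample rows typed in by hand, is to reshape the Notes columns into long form first. It joins the remaining notes with a newline, as the question asked, and treats anything containing "www" or ".com" as a URL; both the test pattern and the inline data are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "Bar", "John", "Jane"],
    "URL": ["a.com/x", "see over", "www.c.com", "See Over in Notes"],
    "Notes1": [np.nan, "www.b.com", "Pretty", "Funky"],
    "Notes2": [np.nan, "Hilarious", np.nan, "http://www.d.com"],
    "Notes3": [np.nan, "Nifty", np.nan, "Groovy"],
})

# Long form: one entry per non-empty note, indexed by (row, column).
notes = df.filter(like="Notes").stack().dropna()
is_url = notes.str.contains(r"www|\.com")

# First URL found in each row's notes replaces that row's URL cell.
urls = notes[is_url].groupby(level=0).first()
df.loc[urls.index, "URL"] = urls

# Remaining notes are joined per row; rows without notes get "".
joined = notes[~is_url].groupby(level=0).agg(lambda s: "\n".join(s))
df["Notes1"] = joined.reindex(df.index, fill_value="")
df = df.drop(columns=["Notes2", "Notes3"])
```

Working in long form avoids per-column bookkeeping, so the same code should handle any number of Notes columns.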
