I am doing some natural language processing on some Twitter data. I managed to successfully load and clean up some tweets and placed them into the data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
The problem is that I am trying to construct a term frequency matrix where each row is a tweet and each column is the number of times a given word occurs in that tweet. My only problem is that the other posts I have found only mention term frequency distributions for text files. Here is the code I used to generate the data frame above:
import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet
# Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())
# Removing stop words
# nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
# Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}", '', regex=True, inplace=False)
#Tokenize the words
df_tweetText
At first I tried to use the function word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting each entire sentence instead of each word in the row.
Another thing I tried was to tokenize each word using df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then call FreqDist again, but that gives me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
Is there some alternative way to construct this term frequency matrix? Ideally, I want my data to look something like this:
id |collusion | president |
------------------------------------------
1104159474368024599 | 1 | 0 |
1104155456019357703 | 0 | 2 |
EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the term-document matrix with a single row containing every tweet.
import textmining

#Creates Term Matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT2: So I tried sklearn, and that sort of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns, which should not exist. Also, my columns are showing up as numbers for some reason.
from sklearn.feature_extraction.text import CountVectorizer
corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Probably not optimal since it iterates over each row, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

# result dataframe
df2 = pd.DataFrame()
for i, row in df.iterrows():
    df2 = df2.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
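One caveat (my addition, not part of the answer above): every appended row keeps index 0, and words that do not occur in a given tweet come out as NaN, so a small clean-up step helps:
df2 = df2.fillna(0).astype(int)   # missing words -> count of 0
df2.index = df.index              # one row per original tweet again
print(df2)
# each row now holds that tweet's word counts, e.g. the third row gets xd=2 and sam=1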
Related
In my dataframe highlighting product sales on the internet, I have a column that contains the description of each product sold.
I would like to create an algorithm to check whether the combination and/or redundancy of words correlates with the number of sales.
But I would like to be able to filter out words that are too redundant like the product type. For example, my dataframe deals with the sale of wines, so the algorithm must not take into account the word "wine" in the description.
In my df I have 700 rows consisting of 4 columns:
product_id: id for each product
product_price: product price
total_sales: total number of product sales
product_description: product description (e.g.: "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Wine combining strength and character"; "Wine with a ruby color, full-bodied "; etc...)
Edit:
I added:
the column 'CA': the product's total sales * the product's price
an example of my df
My DataFrame example:
import pandas as pd
data = {'Product_id': [1, 2, 3, 4, 5],
'Price': [24, 13.5, 12.9, 34, 26],
'Total_sales': [28, 34, 29, 42, 10],
'CA': [672, 459, 374.1, 1428, 260],
'Product_description': ["Fruity wine, perfect as a starter",
"Dry and full-bodied wine",
"Fresh and perfect wine as a starter",
"Wine combining strength and character",
"Wine with a ruby color, full-bodied "]}
df = pd.DataFrame(data)
df
Edit 2:
My goal is to find out whether certain words (and/or combinations of words) correlate with the number of sales. I thought that for this I could create a heatmap with the distinct values of my ["total_sales"] column on the ordinate and a list of the most used words from ["Product_description"] on the abscissa. I thought that an ANOVA could help me verify a correlation between these two variables, or a chi-square test...
My action process:
find the number of unique values in my ["total_sales"] column; I have 43 different ones.
create a list stopwords = [list of redundant words (e.g. 'the', 'a', 'by', etc.)]
split the words of all my lines for the column ["description"]
wordslist = df["description"].str.split()
But I can't filter the results of my wordslist variable with stopwords:
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in words_split]
print(z)
I get
TypeError: expected string or bytes-like object
After that I intend to get the frequency of each word in the column df["description"].
The words with a significant frequency should appear on the abscissa of my heatmap, with the numbers of sales on the ordinate.
Is this a good method (provided I find the solution to my error) to check whether the use of a word or a combination of words has an impact on the sales of a product?
Could you please give me some hints?
Edit 3:
Thanks to @maaniB for the great help; it got me a big step closer to the final solution, but I still have a little way to go. Here is where I am:
I'm French, so for the stop-word cleaning I replaced nltk with spaCy.
import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words with 2 list
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
To extract the features I tried both CountVectorizer and TfidfVectorizer (I had confused it with TfidfTransformer), and I get better results with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
ngram_range=(1, 4),
encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
columns=tfidf_vector.get_feature_names(),
index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
axis=1)
df_total_bow
I'm stuck on the last step. I tried @maaniB's version with ordinary least squares:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
To run it and get a result in a Jupyter notebook I had to change --NotebookApp.iopub_data_rate_limit from the command line:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
It worked after 3 minutes of processing, but I'm totally lost with the result: it returned 46987 lines and I don't know how to interpret them.
Here is a screenshot of my results.
Could someone explain to me how to interpret it, please?
I tried another method, but after an hour of processing without a result I cancelled it:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Edit 4:
I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt
tx = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n
c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()
But I get:
ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
Your question is a combination of text mining tasks, which I'll try to address briefly here. The first step is, as always in NLP and text mining projects, the cleaning step, including removing stop words, stop characters, etc.:
import re
import pandas as pd
from nltk.corpus import stopwords
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
# Product_id Price Total_sales CA Product_description
# 0 1 24.0 28 672.0 fruity perfect starter
# 1 2 13.5 34 459.0 dry fullbodied
# 2 3 12.9 29 374.1 fresh perfect starter
# 3 4 34.0 42 1428.0 combining strength character
# 4 5 26.0 10 260.0 ruby color fullbodied
Next, you need to extract features (you mentioned count of words, phrases).
from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
tpl_cntvec.toarray(), columns=count_vector.get_feature_names(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0
Finally, you can build your models on the data:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('Total_sales', 1).columns].to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
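As a hedged follow-up (not part of the original code): since no constant was added, fit.params holds one coefficient per n-gram column of x, in the same order, so the coefficients can be mapped back to the feature names:
coef = pd.Series(fit.params, index=df_total_bow.drop('Total_sales', 1).columns)
print(coef.sort_values(ascending=False).head(10))   # word combinations with the largest OLS weights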
Regarding your other questions:
There are various statistical methods to find the importance of words in a text and their correlation with some other variables. CountVectorizer is just a simple feature-extraction method; there are better ones, such as TfidfTransformer.
The type of statistical tests or models depends on the problem. Since you just need to find out the correlation of word combinations with sales statistics, simple regression-based methods with feature extraction are helpful. To rank features (find the word combinations with the highest correlation and significance), recursive feature elimination (sklearn.feature_selection.RFE) might be practical; see the sketch below.
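For illustration, a minimal sketch of that RFE idea on the df_total_bow frame built above (my addition; LinearRegression is used only because RFE needs an estimator that exposes coef_):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = df_total_bow.drop(columns=['Total_sales'])
y = df_total_bow['Total_sales']

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)

print(X.columns[rfe.support_].tolist())   # the 5 word combinations RFE keeps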
I have a csv with a column of article titles from which I've used SpaCy to extract any people's names that appear in the titles. When trying to add a new column to the csv with the names extracted by SpaCy, they do not align with the rows from which they were extracted.
I believe this is because the SpaCy results have their own index which is independent of the original data's index.
I've tried adding , index=df.index) to the new column line but I get "ValueError: Length of passed values is 2, index implies 10."
How do I align the SpaCy output to the rows from which they originated?
Here's my code:
import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
usecols=['article_title']))
article = [_ for _ in df['article_title']]
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())
This is the resulting dataframe:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... (Dylan, Mulvaney)
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... NaN
This is what I'm expecting:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... NaN
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... (Dylan, Mulvaney)
You can see the 5th value in artist_names column is related to the 5th article title. How can I get them to align?
Thank you for your help.
I would iterate through the articles, detect entities from each article separately, and put the detected entities in a list with one element per article:
nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]
entities_by_article = []
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    entities_by_article.append(people)
df['artist_names'] = pd.Series(entities_by_article)
Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping through a list of texts and could be replaced by:
for a in article:
    doc = nlp(a)
    ## rest of code within loop
if ent.label_ == "PERSON":
    people.append(ent)
else:
    people.append(np.nan)  # if ent.label_ is not a PERSON
Include an else statement so that if label_ is not PERSON it will be considered NaN.
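A hedged way to combine both suggestions (a sketch, not either answer's exact code): keep one entry per article and fall back to np.nan when an article has no PERSON entity, so the resulting Series stays aligned with df:
import numpy as np
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_lg')

entities_by_article = []
for doc in nlp.pipe(df['article_title']):
    people = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    # one entry per article: the PERSON spans, or NaN when none were found
    entities_by_article.append(people if people else np.nan)

df['artist_names'] = pd.Series(entities_by_article, index=df.index)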
I have the following code, which reads a csv file and then analyzes it. One patient can have more than one illness and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
"Count_of_Patientes_Having":[],
"Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally, another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
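One more hedged note (my addition): if "Count_of_Patientes_Having" is meant to be the number of distinct patients with the illness, as the column name suggests, nunique() on the filtered Patient ID column gives that without looping over patient ids:
for index, ctr in enumerate(data):
    mask = raw_data['Finding Labels'].str.contains(ctr)
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = mask.sum()  # rows mentioning the illness
    illnesses.at[index, "Count_of_Patientes_Having"] = raw_data.loc[mask, 'Patient ID'].nunique()  # distinct patients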
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which involves repeated in-memory copying. Usually this is reminiscent of general-purpose Python and not pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses, lining up Finding Labels with each illness in the same row, then filtering where the longer string contains the shorter item. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
.merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
.drop(columns=['key'])
)
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
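Since the question's real Finding Labels column is '|'-delimited (e.g. IllnessA|IllnessB), another sketch worth mentioning (my suggestion, not part of the answer above; it assumes exact label names rather than substrings and is applied to the original, unmodified raw_data) is to expand the labels with str.get_dummies and aggregate, avoiding the row-wise apply:
# indicator matrix: one column per illness, 1 where that row's labels include it
dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')

illnesses = pd.DataFrame({
    'Count_of_Times_Being_Shown_In_An_Image': dummies.sum(),
    # each patient counts once per illness, however many of their images show it
    'Count_of_Patients_Having': dummies.groupby(raw_data['Patient ID']).max().sum(),
})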
I am trying to filter (and consequently change) certain rows in pandas that depend on values in other columns. Say my dataFrame looks like this:
SENT ID WORD POS HEAD
1 1 I PRON 2
1 2 like VERB 0
1 3 incredibly ADV 4
1 4 brown ADJ 5
1 5 sugar NOUN 2
2 1 Here ADV 2
2 2 appears VERB 0
2 3 my PRON 5
2 4 next ADJ 5
2 5 sentence NOUN 0
The structure is such that the 'HEAD' column points at the index of the word on which the row depends. For example, 'brown' depends on 'sugar', so the head of 'brown' is 5, because the index of 'sugar' is 5.
I need to extract a df of all the rows whose POS is ADV and whose head's POS is VERB, so 'Here' will be in the new df but not 'incredibly' (and potentially make changes to their WORD entry).
At the moment I'm doing it with a loop, but I don't think it's the pandas way and it also creates problems further down the road. Here is my current code (the split("-") is from another story - ignore it):
def get_head(df, dependent):
    head = dependent
    target_index = int(dependent['HEAD'])
    if target_index == 0:
        return dependent
    else:
        if target_index < int(dependent['INDEX']):
            # 1st int in cell
            while (int(head['INDEX'].split("-")[0]) > target_index):
                head = data.iloc[int(head.name) - 1]
        elif target_index > int(dependent['INDEX']):
            while int(head['INDEX'].split("-")[0]) < target_index:
                head = data.iloc[int(head.name) + 1]
    return head
A difficulty I had when I wrote this function is that I didn't (at the time) have a 'SENTENCE' column, so I had to manually find the nearest head. I hope that adding the SENTENCE column will make things somewhat easier, though it is important to note that, since there are hundreds of such sentences in the df, simply searching for an index of '5' won't do, as there are hundreds of rows where df['INDEX'] == '5'.
Here is an example of how I use get_head():
def change_dependent(extract_col, extract_value, new_dependent_pos, head_pos):
    name = 0
    sub_df = df[df[extract_col] == extract_value]  # this is another condition on the df
    for i, v in sub_df.iterrows():
        if (get_head(df, v)['POS'] == head_pos):
            df.at[v.name, 'POS'] = new_dependent_pos
    return df
change_dependent('POS', 'ADV', 'ADV:VERB', 'VERB')
Can anyone here think of a more elegant/efficient/pandas way in which I can get all the ADV instances whose head is VERB?
import pandas as pd
df = pd.DataFrame([[1,1,'I','NOUN',2],
[1,2,'like','VERB',0],
[1,3,'incredibly','ADV',4],
[1,4,'brown','ADJ',4],
[1,5,'sugar','NOUN',5],
[2,1,'Here','ADV',2],
[2,2,'appears','VERB',0],
[2,3,'my','PRON',5],
[2,4,'next','ADJ',5],
[2,5,'sentance','NOUN',0]]
,columns=['SENT','ID','WORD','POS','HEAD'])
adv=df[df['POS']=='ADV']
temp=df[df['POS']=='VERB'][['SENT','ID','POS']].merge(adv,left_on=['SENT','ID'],right_on=['SENT','HEAD'])
temp['WORD']
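The merge above lines each ADV row up with its head by joining the VERB rows' (SENT, ID) onto the ADV rows' (SENT, HEAD), which is why only 'Here' survives. If you also want to modify those rows in the original frame (the change_dependent part of the question), a hedged sketch is to carry the adverbs' original index through the merge:
adv = df[df['POS'] == 'ADV'].reset_index()                 # keep the original row index as a column
verbs = df[df['POS'] == 'VERB'][['SENT', 'ID']]
matched = adv.merge(verbs, left_on=['SENT', 'HEAD'],
                    right_on=['SENT', 'ID'], suffixes=('', '_head'))
df.loc[matched['index'], 'POS'] = 'ADV:VERB'               # e.g. 'Here' becomes ADV:VERB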
I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame so that each response represents a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases, the text was processed in such a way that anything but spaces, letters, and ,.?! was removed (for simplicity).
As you can see a pandas field converted into a string returns fewer matches and the length of the string is also shorter.
Is there any way to improve the above code?
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that will retain document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
Many thanks.
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
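If what you are after is the full term-document matrix rather than counts of a single word (the TermDocumentMatrix() part of the question), one sketch (my addition, not part of the answer above) is to hand the already-tokenized rows to scikit-learn's CountVectorizer with a pass-through analyzer; transpose the result if you want R's terms-by-documents orientation:
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

tokenized = df.text.apply(word_tokenize)                # one token list per document
vec = CountVectorizer(analyzer=lambda tokens: tokens)   # use the tokens as-is, no re-tokenizing
X = vec.fit_transform(tokenized)

dtm = pd.DataFrame(X.toarray(), columns=vec.get_feature_names(), index=df.index)
dtm['abac'].head()   # per-document counts, comparable to the nltk.Text(x).count('abac') numbers above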