import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
df = pd.read_csv("last1.csv",sep=',',header=0,encoding='utf-8')
df['rev'] = df['reviewContent'].apply(lambda x : filter(None,x.split(" ")))
I am trying to stem the text in my DataFrame. While tokenizing, I get this error for
df['rev'] = df['reviewContent'].apply(lambda x : filter(None,x.split(" ")))
AttributeError: 'float' object has no attribute 'split'
When stemming, I also run into the float problem:
df['reviewContent'] = df["reviewContent"].apply(lambda x: [stemmer.stem(y) for y in x])
TypeError: 'float' object is not iterable
What can I do?
You don't need the apply call to tokenise your data; str.split does just fine. Called without arguments, it also splits on runs of whitespace, so you don't have to filter out empty strings. As for the errors: pandas stores missing text as the float NaN, which is why you see the 'float' object complaints; astype(str) converts those cells to the string 'nan' so the operations no longer fail.
df['rev'] = df['reviewContent'].astype(str).str.split()
You can now run your stemmer as before:
df['rev'] = df['rev'].apply(lambda x: [stemmer.stem(y) for y in x])
Alternatively, if each row holds a single word rather than a list of tokens, you can stem the string directly:
df['rev'] = df['rev'].astype(str).apply(lambda x: stemmer.stem(x))
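Putting it together, here is a minimal end-to-end sketch; the sample rows are hypothetical stand-ins for last1.csv, and the fillna('') step is my suggestion for handling the NaN rows that caused the original error:
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical rows standing in for last1.csv; the None reproduces the float error.
df = pd.DataFrame({'reviewContent': ['Great food and friendly service', None, 'Would eat here again']})

# fillna('') turns missing reviews into empty strings instead of the literal string 'nan'.
df['rev'] = df['reviewContent'].fillna('').str.split()
df['rev'] = df['rev'].apply(lambda words: [stemmer.stem(w) for w in words])
print(df['rev'].tolist())  # the None row becomes an empty list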
I have created a process_textData function that takes a pandas DataFrame column of text and performs the following:
1. Convert text to lower case and remove all punctuation
2. Optionally apply stemming
3. Apply Ngram Tokenisation
4. Returns the tokenised text as a list.
import string
from nltk.stem.snowball import SnowballStemmer
from nltk import everygrams, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def process_textData(data, n=1):
    stemmer = SnowballStemmer('english')
    data = data.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data
After that, I feed the function's output into scikit-learn's CountVectorizer, and it gives me this error:
AttributeError: 'list' object has no attribute 'lower'.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_textData(df.news, n=3))
X.toarray()
What am I doing wrong? Can somebody help with this?
This returns a Series of lists:
# ...
data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x),n,n)])
data = data.apply(lambda x: [stemmer.stem(word) for word in x])
return data
fit_transform, however, expects an iterable of strings. I suggest editing it like this:
# ...
data = data.apply(lambda x: ' '.join(' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)))
data = data.apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
return data
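For completeness, a runnable sketch of the corrected pipeline under the question's setup (the df.news sample below is hypothetical, and the nltk punkt download may be needed on first run):
import string
import pandas as pd
from nltk import everygrams, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# import nltk; nltk.download('punkt')  # word_tokenize needs the punkt model once

def process_textData(data, n=1):
    stemmer = SnowballStemmer('english')
    data = data.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    # Join the n-grams into one space-separated string per row, so each element is a string.
    data = data.apply(lambda x: ' '.join(' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)))
    # Split before stemming, otherwise the loop would iterate over characters.
    data = data.apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
    return data

df = pd.DataFrame({'news': ['Stocks rallied sharply on Friday afternoon trading']})  # hypothetical
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_textData(df.news, n=3))
print(X.toarray())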
I have imported an Excel file with some data and removed the missing values.
df = pd.read_excel(r'file.xlsx', na_values=missing_values)
I'm trying to split the string values into lists for later processing.
df['GENRE'] = df['GENRE'].map(lambda x: x.split(','))
df['ACTORS'] = df['ACTORS'].map(lambda x: x.split(',')[:3])
df['DIRECTOR'] = df['DIRECTOR'].map(lambda x: x.split(','))
But it gives me the following error: AttributeError: 'list' object has no attribute 'split'
I've done the same with a CSV and it worked. Could it be because it's Excel? I'm sure it's simple, but I can't get my head around it.
Try str.split, the pandas way. The error means some cells already contain lists (for example, if the splitting cell was re-run); the .str accessor simply returns NaN for non-string values instead of raising:
df['GENRE'] = df['GENRE'].str.split(',')
df['ACTORS'] = df['ACTORS'].str.split(',').str[:3]
df['DIRECTOR'] = df['DIRECTOR'].str.split(',')
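As a quick illustration with hypothetical rows, the .str accessor also degrades gracefully on missing values:
import pandas as pd

df = pd.DataFrame({'GENRE': ['Action,Comedy', None, 'Drama']})  # hypothetical data
df['GENRE'] = df['GENRE'].str.split(',')
print(df['GENRE'].tolist())  # [['Action', 'Comedy'], nan, ['Drama']]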
I have a DataFrame df that has a list in each row, and I want to apply the remove_stops function to each row.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')
def remove_stops(row):
    meaningful_words = [w for w in row if w not in stop]
    return meaningful_words
df.apply(remove_stops)
When I run the code, I get the following error
meaningful_words = [w for w in row if not w in stop]
TypeError: ("unhashable type: 'list'", 'occurred at index original')
After some research, I understood the error is caused because lists are unhashable.
print(type(df))
print(type(df.iloc[0, 0]))
<class 'pandas.core.frame.DataFrame'>
<class 'list'>
How can I solve this issue?
After explicitly selecting the column I wanted to apply the function to, the code ran as expected. (df.apply passes each whole column to the function, whereas Series.apply calls it once per element.)
df['original'].apply(remove_stops)
Thanks for the prompt replies.
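For reference, a self-contained version of the working setup; the sample rows are hypothetical, and turning stop into a set is an optional speed-up I am assuming, not part of the original:
import pandas as pd
from nltk.corpus import stopwords

# import nltk; nltk.download('stopwords')  # needed once before stopwords.words() works
stop = set(stopwords.words('english'))  # set membership tests are O(1)

def remove_stops(row):
    return [w for w in row if w not in stop]

df = pd.DataFrame({'original': [['this', 'is', 'a', 'test'], ['another', 'small', 'example']]})
print(df['original'].apply(remove_stops).tolist())  # [['test'], ['another', 'small', 'example']]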
I have over 8 million rows of text from which I want to remove all stop words and also lemmatize the text, using dask's map_partitions(), but I get the following error:
AttributeError: 'Series' object has no attribute 'split'
Is there any way to apply the function to the dataset?
Thanks for the help.
import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words
cachedStopWords = list(stop_words.STOP_WORDS)
def stopwords_lemmatizing(text):
    return [word for word in text.split() if word not in cachedStopWords]
text = 'any length of text'
data = [{'content': text}]
df = pd.DataFrame(data, index=[0])
ddf = dd.from_pandas(df, npartitions=1)
ddf['content'] = ddf['content'].map_partitions(stopwords_lemmatizing, meta='f8')
map_partitions, as the name suggests, works on each partition of your overall dask dataframe, each of which is a pandas dataframe (http://docs.dask.org/en/latest/dataframe.html#design). Your function works value-by-value on a series, so what you actually wanted is the simple map:
ddf['content'] = ddf['content'].map(stopwords_lemmatizing)
(If you want to provide the meta here, it should be a zero-length Series rather than a dataframe, e.g. meta=pd.Series(dtype='O').)
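A minimal runnable version of the corrected call, assembled from the question's own setup (the sample text is hypothetical):
import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

cachedStopWords = set(stop_words.STOP_WORDS)  # a set makes the membership test fast

def stopwords_lemmatizing(text):
    return [word for word in text.split() if word not in cachedStopWords]

df = pd.DataFrame({'content': ['any length of text with a few stop words in it']})  # hypothetical
ddf = dd.from_pandas(df, npartitions=1)

# map runs element-wise; meta is a zero-length Series describing the output dtype.
ddf['content'] = ddf['content'].map(stopwords_lemmatizing, meta=pd.Series(dtype='O'))
print(ddf.compute())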
I'm attempting to convert a PipelinedRDD in PySpark to a DataFrame. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
When I run the code though, I receive this error:
'list' object has no attribute 'encode'
I've tried multiple other combinations, such as converting it to a Pandas dataframe using:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()
But then I end up receiving this error:
AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'
Any help would be greatly appreciated. Thank you for your time.
rdd.toDF() only works when a SparkSession is active, and toPandas() is a method of a Spark DataFrame, not of an RDD.
To fix your code, try something like this:
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile(...)  # path elided in the original
newRDD = rdd.map(...)
df = newRDD.toDF()         # Spark DataFrame
pandas_df = df.toPandas()  # pandas DataFrame, obtained from the Spark DataFrame
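Here is a self-contained sketch of that flow; tagScripts was not shown in the question, so the version below is a hypothetical stand-in, and the Row rebuilding unpacks the field names and values explicitly:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def tagScripts(row):  # hypothetical stand-in for the question's unshown function
    return 'tagged'

rdd = spark.sparkContext.parallelize([Row(name='a', value=1), Row(name='b', value=2)])
# Row(*names) builds a Row class; calling it with the old values plus the new tag builds the row.
newRDD = rdd.map(lambda row: Row(*(row.__fields__ + ['tag']))(*(tuple(row) + (tagScripts(row),))))
df = newRDD.toDF()         # works because a SparkSession is active
pandas_df = df.toPandas()  # toPandas is called on the Spark DataFrame, not the RDD
print(pandas_df)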