I have a dataframe like the one shown below:
df = pd.DataFrame({'text': ["Hi how are you","I am fine","I love you","I hate you"],
'tokens':[('Hi','how','are','you'),('I','am','fine'),('I','love','you'),('I','hate','you')]})
I would like to get the POS tag of each token:
for tok in df['tokens'].iterrows():
    print(tok, tok.pos_)
Please note that pos_ here means the part-of-speech tag from the NLP domain.
However, I get an error.
Can you help me with how to iterate over each item in the pandas column?
You are getting 'Series' object has no attribute 'iterrows' because df['tokens'] is a Series (one-dimensional), and a Series has no iterrows method.
Using your code you could do:
import pandas as pd
df = pd.DataFrame({'text': ["Hi how are you","I am fine","I love you","I hate you"],
'tokens':[('Hi','how','are','you'),('I','am','fine'),('I','love','you'),('I','hate','you')]})
for index, values in df.iterrows():
    pos = 1
    for x in values[1]:
        print(pos, x)
        pos += 1
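The tokens column itself can also be walked without iterrows: Series.items() yields (index, value) pairs, so each tuple of tokens can be reached directly. A minimal sketch (note that pos_ is a spaCy token attribute, so the plain strings here would still need to go through an NLP pipeline to get actual tags):

```python
import pandas as pd

df = pd.DataFrame({'tokens': [('Hi', 'how', 'are', 'you'), ('I', 'am', 'fine'),
                              ('I', 'love', 'you'), ('I', 'hate', 'you')]})

# Series.items() is the Series counterpart of DataFrame.iterrows()
for idx, toks in df['tokens'].items():
    for tok in toks:
        print(idx, tok)
```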
Alternatively you could use (similar to what @AnuragDabas commented):
df['pos tag'] = df['tokens'].apply(lambda x: list(range(1, len(x) + 1)))
All you need is df.iat[2,1][1] ;)
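To unpack that one-liner: iat indexes by integer position, so df.iat[2, 1] is row 2 of column 1 (tokens), and the trailing [1] picks the second element of that tuple. With the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'text': ["Hi how are you", "I am fine", "I love you", "I hate you"],
                   'tokens': [('Hi', 'how', 'are', 'you'), ('I', 'am', 'fine'),
                              ('I', 'love', 'you'), ('I', 'hate', 'you')]})

print(df.iat[2, 1])     # ('I', 'love', 'you') -- row 2, column 1
print(df.iat[2, 1][1])  # 'love'
```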
Say I have a string column in pandas in which each row is made of a list of strings
Class  Student
One    [Adam, Kanye, Alice Stocks, Joseph Matthew]
Two    [Justin Bieber, Selena Gomez]
I want to get rid of all the names in each class wherever the length of the string is more than 8 characters.
So the resulting table would be:
Class  Student
One    Adam, Kanye
Most of the data would be gone, because only Adam and Kanye satisfy the condition len(StudentName) < 8.
I tried coming up with an apply/filter myself, but it seems the code is running at the character level instead of the word level. Can someone point out where I went wrong?
This is the code:
[[y for y in x if not len(y)>=8] for x in df['Student']]
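The comprehension itself is sound for a list of names; the catch is that each row here is a single string, and iterating a string yields one character at a time, which is why the filter ran at character level. A quick check, assuming rows shaped like the table above:

```python
row = "[Adam, Kanye, Alice Stocks, Joseph Matthew]"

# Iterating the raw string walks it character by character
print([y for y in row][:5])                    # ['[', 'A', 'd', 'a', 'm']

# Splitting into names first gives the intended word-level filter
names = row.strip('[]').split(', ')
print([y for y in names if not len(y) >= 8])   # ['Adam', 'Kanye']
```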
Check the code below. It seems you are not defining what to split on, so the string gets iterated at the character level.
import pandas as pd
df = pd.DataFrame({'Class':['One','Two'],'Student':['[Adam, Kanye, Alice Stocks, Joseph Matthew]', '[Justin Bieber, Selena Gomez]'],
})
df['Filtered_Student'] = df['Student'].str.replace(r"\[|\]", '', regex=True).str.split(',').apply(lambda x: ','.join([i for i in x if len(i) < 8]))
df[df['Filtered_Student'] != '']
Output:
  Class                                      Student Filtered_Student
0   One  [Adam, Kanye, Alice Stocks, Joseph Matthew]      Adam, Kanye
# If they're not actually lists, but strings:
if isinstance(df.Student[0], str):
    df.Student = df.Student.str[1:-1].str.split(', ')
# Apply your filtering logic:
df.Student = df.Student.apply(lambda s: [x for x in s if len(x)<8])
Output:
Class Student
0 One [Adam, Kanye]
1 Two []
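If classes left with no students should be dropped entirely, as in the desired table, one extra mask finishes the job; .str.len() also reports element lengths for list values in an object column. A sketch picking up from the output above:

```python
import pandas as pd

df = pd.DataFrame({'Class': ['One', 'Two'],
                   'Student': [['Adam', 'Kanye'], []]})

# .str.len() works element-wise on lists too, so empty classes can be masked out
df = df[df['Student'].str.len() > 0]
print(df)   # only class One remains
```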
IIUC, this can be done in a one-liner with np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame( {'Class': ['One', 'Two'], 'Student': [['Adam', 'Kanye', 'Alice Stocks', 'Joseph Matthew'], ['Justin Bieber', 'Selena Gomez']]})
df.explode('Student').iloc[np.where(df.explode('Student').Student.str.len() <= 8)].groupby('Class').agg(list).reset_index()
Output:
Class Student
0 One [Adam, Kanye]
I am a complete beginner in programming and trying to learn to code, so please bear with my rough code. I am using pandas to find a string in a column (the combinations column in the code below) and print every row containing that string. Basically, I need to find all the instances where the string occurs and print the entire row. My code is below; I am not able to figure out how to find a particular instance in the column and print its row.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count = count.rename(columns={'index': 'occurrences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result:
       col6  occurrences
0  -1-11-11            2
1        11            1
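Since the original goal was to print the matching rows themselves, a boolean mask may be closer to what was asked. This sketch (reusing the toy frame above) maps each combination to its occurrence count and keeps the rows whose value appears more than once:

```python
import pandas as pd

data = pd.DataFrame({'signaller': ['ciao', 'ciao', 'ciao'],
                     'col6': ['-1-11-11', '11', '-1-11-11']})

# map each value in col6 to how many times it occurs in the column
occurrences = data['col6'].map(data['col6'].value_counts())

# full rows whose combination occurs more than once
print(data[occurrences > 1])
```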
I have a dataframe say df which has 3 columns. Column A and B are some strings. Column C is a numeric variable.
Dataframe
I want to convert this to a feature matrix by passing it to a CountVectorizer.
I define my countVectorizer as:
cv = CountVectorizer(input='content', encoding='iso-8859-1',
decode_error='ignore', analyzer='word',
ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
binary=True)
Next I pass the entire dataframe to cv.fit_transform(df) which doesn't work.
I get this error:
cannot unpack non-iterable int object
Next I convert each row of the dataframe to
sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]
Then I apply
cv_m = sample.apply(lambda row: cv.fit_transform(row))
I still get error:
ValueError: Iterable over raw text documents expected, string object received.
Please let me know where I am going wrong, or whether I need to take some other approach.
Try this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]
pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})
cv = CountVectorizer()
# use pd.DataFrame here to avoid your error and add your column name
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])
vectorized = cv.fit_transform(sample['Output'])
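Incidentally, the original "cannot unpack non-iterable int object" most likely comes from ngram_range=(1) in the question's vectorizer: parentheses around a single value do not make a tuple, so scikit-learn receives the int 1 and cannot unpack it into a (min_n, max_n) pair; ngram_range=(1, 1) is the intended form. A minimal reproduction of the unpacking failure in plain Python:

```python
ngram_range = (1)          # just the int 1 -- parentheses alone don't build a tuple
try:
    min_n, max_n = ngram_range
except TypeError as err:
    print(err)             # cannot unpack non-iterable int object

min_n, max_n = (1, 1)      # a real two-element tuple unpacks fine
```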
With the help of @QuantStats's comment, I applied the cv on each row of the dataframe as follows:
row_input = df['column_name'].tolist()
kwds = []
for i in range(len(row_input)):
    cell_input = [row_input[i]]
    full_set = row_keywords(cell_input, 1, 1)
    candidates = [x for x in full_set if x[1] > 1]  # to extract frequencies more than 1
    kwds.append(candidates)
kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col
("row_keywords" is a function for CountVectorizer.)
I can successfully split a sentence into its individual words and take the average of the polarity scores of those words using this code. It works great.
import statistics as s
from textblob import TextBlob
a = TextBlob("""Thanks, I'll have a read!""")
print(a)
c=[]
for i in a.words:
    c.append(a.sentiment.polarity)
d = s.mean(c)
This gives:
d = 0.25
a.words = WordList(['Thanks', 'I', "'ll", 'have', 'a', 'read'])
How do I transfer the above code to a df that looks like this?:
df
text
1 Thanks, I’ll have a read!
but take the average of every polarity per word?
The closest I have gotten is applying the polarity to every sentence in df:
def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment.polarity
    except:
        return None
df_sentences['sentiment'] = df_sentences['text'].apply(sentiment_calc)
I have the impression the sentiment polarity only works on TextBlob type.
So my idea here is to split the text blob into words (with the split function -- see doc here) and convert them to TextBlob objects.
This is done in the list comprehension:
[TextBlob(x).sentiment.polarity for x in a.split()]
So the whole thing looks like this:
import statistics as s
from textblob import TextBlob
import pandas as pd
a = TextBlob("""Thanks, I'll have a read!""")
def compute_mean(a):
    return s.mean([TextBlob(x).sentiment.polarity for x in a.split()])
print(compute_mean("Thanks, I'll have a read!"))
df = pd.DataFrame({'text': ["Thanks, I'll have a read!",
                            "Second sentence",
                            "a bag of apples"]})
df['score'] = df['text'].map(compute_mean)
print(df)
For a project for my lab, I'm analyzing Twitter data. The tweets we've captured all have the word 'sex' in them, that's the keyword we filtered the TwitterStreamer to capture based on.
I converted the CSV where all of the tweet data (json metatags) is housed into a pandas DB and saved the 'text' column to isolate the tweet text.
import pandas as pd
import csv
df = pd.read_csv('tweets_hiv.csv')
saved_column4 = df.text
print(saved_column4)
Out comes the correct output:
0 Some example tweet text
1 Oh hey look more tweet text #things I hate #stuff
...a bunch more lines
Name: text, Length: 8540, dtype: object
But, when I try this
from textblob import TextBlob
tweetstr = str(saved_column4)
tweets = TextBlob(tweetstr).upper()
print(tweets.words.count('sex', case_sensitive=False))
My output is 22.
There should be AT LEAST as many incidences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not handling the dtype: object column correctly?
I'm not entirely sure this is methodologically correct as far as language processing goes, but using join will give you the count you need.
import pandas as pd
from textblob import TextBlob
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweetstr = " ".join(tweets.tolist())
tweetsb = TextBlob(tweetstr).upper()
print(tweetsb.words.count('sex', case_sensitive=False))
# 1000
If you just need the count without necessarily using TextBlob, then just do:
import pandas as pd
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
sex_tweets = tweets.str.contains('sex', case=False)
print(sex_tweets.sum())
# 1000
You can get a TypeError in the first snippet if one of the elements is not a string. This is more of a join issue. A simple test can be done with the following snippet:
# tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweets = pd.Series(x for x in range(1000))
tweetstr = " ".join(tweets.tolist())
Which gives the following result:
Traceback (most recent call last):
File "F:\test.py", line 6, in <module>
tweetstr = " ".join(tweets.tolist())
TypeError: sequence item 0: expected string, numpy.int64 found
A simple workaround is to convert x in the list comprehension into a string before using join, like so:
tweets = pd.Series(str(x) for x in range(1000))
Or you can be more explicit and create a list first, map the str function to it, and then use join.
tweetlist = tweets.tolist()
tweetstr = map(str, tweetlist)
tweetstr = " ".join(tweetstr)
The CSV conversion is not the problem! When you use str() on a column of a DataFrame (that is, a Series), it makes a "print-friendly" output of the Series, which means cutting out the majority of the data, and just displaying the first few and the last few. Here is a transcript of an IPython session that will probably illustrate the issue better:
In [1]: import pandas as pd
In [2]: blah = pd.Series('tweet %d' % n for n in range(1000))
In [3]: blah
Out[3]:
0 tweet 0
1 tweet 1
... (output continues from 1 to 29)
29 tweet 29
... (OUTPUT SKIPS HERE)
970 tweet 970
... (output continues from 970 to 998)
998 tweet 998
999 tweet 999
dtype: object
In [4]: blahstr = str(blah)
In [5]: blahstr.count('tweet')
Out[5]: 60
So, since the output of the str() operation cuts off my data (and might even have truncated column values, if I had used longer strings), I don't get 1000; I get 60.
If you want to do it your way (combine everything back into a single string and work with it that way), there's no point in using a library like Pandas. Pandas gives you better ways:
Working With a Series of Strings
Pandas has tools for working with a Series that contains strings. Here is a tutorial-like page about it, and here is the full string handling API documentation. In particular, for finding the number of uses of the word "sex", you could do something like this (assuming df is a DataFrame, and text is the column containing the tweets):
import re
counts = df['text'].str.count('sex', re.IGNORECASE)
counts should be a Series containing the number of occurrences of "sex" in each tweet. counts.sum() would give you the total number of usages, which should hopefully be more than 1000.
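A minimal illustration of the counting approach, on made-up tweets:

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['Sex ed saves lives', 'unrelated tweet', 'sex sex sex']})

# per-tweet occurrence counts, case-insensitive
counts = df['text'].str.count('sex', re.IGNORECASE)
print(counts.tolist())   # [1, 0, 3]
print(counts.sum())      # 4
```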