I have a pandas data frame that is structured something like this:
ID TEXT
1 Start of document
1 middle
1 end of document
2 start of document 2
2 middle
2 end of document 2
The raw data I received has repeating IDs; if you concatenate the text for each unique ID, you get the full document. Some of these IDs repeat hundreds of times, resulting in large quantities of text that I would like to boil down to one observation per ID.
I'm not sure how to go about looping through and creating a new document. I'm also not sure whether pandas is the right data structure for storing large quantities of text (these are transcribed call records, some of them 30+ minute conversations). I would appreciate any pointers.
IIUC:
df.groupby('ID').TEXT.apply(' '.join)
ID
1 Start of document middle end of document
2 start of document 2 middle end of document 2
Name: TEXT, dtype: object
Without groupby (note that sum(level=0) is deprecated in recent pandas versions; groupby(level=0).sum() is the modern equivalent):
(df.set_index('ID').TEXT+' ').sum(level=0).str[:-1]
Out[1066]:
ID
1 Start of document middle end of document
2 start of document 2 middle end of document 2
Name: TEXT, dtype: object
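For completeness, here is a minimal, self-contained sketch of the groupby approach; the frame below simply mirrors the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'TEXT': ['Start of document', 'middle', 'end of document',
             'start of document 2', 'middle', 'end of document 2'],
})

# One row per ID, with the pieces joined by a single space
docs = df.groupby('ID')['TEXT'].apply(' '.join).reset_index()
print(docs)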
Hi, I have a DataFrame with two columns which looks like this:
Index Text
0 READ MY NEW OP-ED: IRREVERSIBLE – Many Effects...
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
3 READ MY NEW OP-ED: IRREVERSIBLE – Many Effects...
4 Advanced healthcare at your fingertips\nhttps:...
I am trying to keep only the rows which contain the # symbol, so based on my data frame above my desired output is:
Index Text
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
I have tried several ways to obtain that, unsuccessfully; my latest code attempt was:
for column in twt_text:
    print(twt_text['text'].str.contains('#'))
But the output generated was not at all what I expected:
0 False
1 True
2 True
3 False
4 False
Any idea or insight on how I can obtain the output I want, based on the text containing '#'?
You could build a selection mask and use that to filter the rows:
df[df['Text'].str.contains('#')]
Result
Text
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
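If you want to be explicit about how the match is done, here is a slightly expanded sketch (the sample frame below is hypothetical; the real one in the question is twt_text): regex=False treats '#' as a literal character rather than a pattern, and na=False keeps rows with missing text out of the result.

import pandas as pd

df = pd.DataFrame({'Text': ['no hashtag here', '#COVID19 is linked to more #diabetes diagnoses', None]})

mask = df['Text'].str.contains('#', regex=False, na=False)
print(df[mask])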
I'm using character n-grams to detect a language. Right now I have mapped out all the nouns and put them in a pandas DataFrame.
Right now I've got this:
word1 = df['lemmastring1'][0] * df['count'][0]
How can I iterate this to get it for my whole DataFrame?
[Picture of the dataframe: this is how I want it for all rows]
Given:
string count
0 _jarr_- 500
1 _mens_- 200
Doing:
df['string']*df['count']
Output:
0 _jarr_-_jarr_-_jarr_-_jarr_-_jarr_-_jarr_-_jar...
1 _mens_-_mens_-_mens_-_mens_-_mens_-_mens_-_men...
dtype: object
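As a self-contained sketch (the frame below just reproduces the two sample rows), the multiplication is element-wise: each string is repeated 'count' times.

import pandas as pd

df = pd.DataFrame({'string': ['_jarr_-', '_mens_-'], 'count': [500, 200]})

# Repeat each string by its corresponding count
df['repeated'] = df['string'] * df['count']
print(df['repeated'].str.len())  # 3500 and 1400 characters respectively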
I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality my dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but this gets the job done by iterating over both the 'TEXT' column and the 'INFO' column. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
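For reference, the .map variant is just a drop-in swap, since Series.map also accepts a callable (a sketch, not benchmarked):

dfA['MATCH'] = dfA['TEXT'].map(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))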
You could get the unique values from the column in the first dataframe, convert them to a list, and then use the eval method on the second dataframe with Column.str.contains on that list (note that local variables are referenced with @ inside eval):
unique = df1['TEXT'].unique().tolist()
df2.eval("Match=Text.str.contains('|'.join(@unique))")
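For reference, here is a minimal sketch going in the question's original direction (dataframe1's TEXT checked against dataframe2's INFO values), without eval; re.escape guards against regex metacharacters in the keywords, and the names mirror the question:

import re

pattern = '|'.join(re.escape(word) for word in dataframe2['INFO'])
dataframe1['MATCH'] = dataframe1['TEXT'].str.contains(pattern).astype(int)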
My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.
Let's suppose that I have the following in a DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
and I want to split the text of each id into chunks of a random number of words (varying between two values, e.g. 1 and 5), so I finally want to have something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 looks very
3 very good today
Keep in mind that my dataframe may also have other columns besides these two, which should simply be copied to the new dataframe in the same way as id above.
What is the most efficient way to do this?
Define a function to extract chunks in a random fashion using itertools.islice:
from itertools import islice
import random
lo, hi = 3, 5 # change this to whatever
def extract_chunks(it):
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi+1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks
Call the function through a list comprehension to keep the overhead as low as possible, then stack to get your output:
pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df['id']
).stack()
id
1 0 I am the
1 first document and I
2 am very happy.
2 0 Here is the
1 second document and
2 it likes playing tennis.
3 0 This is the third
1 document and it looks
2 very good today.
You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple split on whitespace, which you can modify.
Note that if you have other columns you don't want to touch, you can do something like a melting operation here.
u = pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']])

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
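Putting it together on the question's sample data, reusing the extract_chunks function defined above (a sketch; the reset_index/rename at the end is just one way to recover the two-column layout shown in the question):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.',
             'This is the third document and it looks very good today.'],
})

out = (pd.DataFrame(
           [extract_chunks(iter(text.split())) for text in df['text']],
           index=df['id'])
       .stack()
       .reset_index(level=1, drop=True)  # drop the inner chunk position
       .rename('text')
       .reset_index())
print(out)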
I have a list of keywords as well as a DataFrame that contains a text column. I am trying to filter out every row where the text field contains one of the keywords. I believe what I am looking for is something like the .isin method, but one that could take a regex argument, since I am searching for substrings within the text, not exact matches.
What I have:
keys = ['key','key2']
A Text
0 5 Sample text one
1 6 Sample text two
2 3 Sample text three key
3 4 Sample text four key2
And I would like to remove any rows that contain a key in the text so I would end up with:
A Text
0 5 Sample text one
1 6 Sample text two
Use str.contains, joining the keys with | to create a regex pattern, and negate the boolean mask with ~ to filter your df:
In [123]:
keys = ['key','key2']
df[~df['Text'].str.contains('|'.join(keys))]
Out[123]:
A Text
0 5 Sample text one
1 6 Sample text two
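If any of the keys could contain regex metacharacters (e.g. '+' or '('), a slightly safer variant of the same idea is to escape them before joining; a minimal sketch:

import re

keys = ['key', 'key2']
pattern = '|'.join(map(re.escape, keys))
df[~df['Text'].str.contains(pattern)]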