Split sentences into substrings containing a varying number of words using pandas - python

My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.
Let's suppose that I have the following in a DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
and I want to split the text of each id into chunks containing a random number of words (varying between two values, e.g. 1 and 5), so I finally want to have something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 it looks very
3 good today
Keep in mind that my dataframe may also have other columns besides these two, which should simply be copied over to the new dataframe in the same way as id above.
What is the most efficient way to do this?

Define a function to extract chunks in a random fashion using itertools.islice:
from itertools import islice
import random

import pandas as pd

lo, hi = 3, 5  # change this to whatever

def extract_chunks(it):
    chunks = []
    while True:
        # take a random-sized slice of the remaining words
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks
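For a single row this produces something like the following (the chunk boundaries will vary between runs because the sizes are random):
extract_chunks(iter("I am the first document and I am very happy.".split()))
# e.g. ['I am the first', 'document and I', 'am very happy.']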
Call the function through a list comprehension to ensure the least possible overhead, then stack to get your output:
pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()
id
1   0                      I am the
    1          first document and I
    2                am very happy.
2   0                   Here is the
    1           second document and
    2      it likes playing tennis.
3   0             This is the third
    1         document and it looks
    2               very good today.
dtype: object
You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple splitting on whitespace which you can modify.
Note that if you have other columns you don't want to touch, you can do something like a melting operation here.
u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
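If you just want the stacked result back as a flat two-column frame like the one in the question, a small sketch (assuming the id and text column names from the example):
out = (
    pd.DataFrame(
        [extract_chunks(iter(text.split())) for text in df['text']],
        index=df['id'],
    )
    .stack()                            # MultiIndex of (id, chunk position)
    .reset_index(level=1, drop=True)    # drop the chunk position level
    .rename('text')
    .reset_index()                      # back to the columns id, text
)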

Related

establish counts of elements of pandas dataframe

I'm currently working to implement some fuzzy matching logic to group together emails with similar patterns, and I need to improve the efficiency of part of the code, but I'm not sure what the best path forward is. I use a package to output a pandas dataframe that looks like this:
I redacted the data, but it's just four columns with an ID #, the email associated with a given ID, a group ID number that identifies the cluster a given email falls into, and then the group rep which is the most mathematically central email of a given cluster.
What I want to do is count the number of occurrences of each distinct element in the group rep column and create a new dataframe that's just two columns with one column having the group rep email and then the second column having the corresponding count of that group rep in the original dataframe. It should look something like this:
As of now, I'm converting my group reps to a list and then using a for-loop to create a list of tuples (I think?), with each tuple containing a centroid email group identifier and the number of times that identifier occurs in the original df (aka the number of emails in the original data that belong to that centroid email's group). The code looks like this:
groups = list(df['group rep'].unique())

# preparing list of tuples with group count
req_groups = []
for g in groups:
    count = (g, df['group rep'].value_counts()[g])
    # print(count)
    req_groups.append(count)

print(req_groups)
Unfortunately, this operation takes far too long. I'm sure there's a better solution, but could definitely use some help finding a path forward. Thanks in advance for your help!
You can use df.groupby('group rep').count().
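A minimal sketch of that idea, using value_counts so all group sizes are computed in one pass instead of once per group as in the loop above (column names assumed from the question):
counts = (
    df['group rep']
    .value_counts()                # one row per distinct group rep, with its count
    .rename_axis('group rep')
    .reset_index(name='count')     # back to a two-column dataframe
)
print(counts)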
Let's consider the following dataframe:
email
0 zucchini#yahoo.fr
1 apple#gmail.com
2 citrus#protonmail.com
3 banana#gmail.com
4 pear#gmail.com
5 apple#gmail.com
6 citrus#protonmail.com
Proposed script
import pandas as pd
import operator
m = {'email': ['zucchini#yahoo.fr', 'apple#gmail.com', 'citrus#protonmail.com', 'banana#gmail.com',
               'pear#gmail.com', 'apple#gmail.com', 'citrus#protonmail.com']}
df = pd.DataFrame(m)
counter = pd.DataFrame.from_dict({c: [operator.countOf(df['email'], c)] for c in df['email'].unique()})
cnt_df = counter.T.rename(columns={0:'count'})
print(cnt_df)
Result
count
zucchini#yahoo.fr 1
apple#gmail.com 2
citrus#protonmail.com 2
banana#gmail.com 1
pear#gmail.com 1

pandas series row-wise comparison (preserve cardinality/indices of larger series)

I have two pandas series, both string dtypes.
reports['corpus'] has 1287 rows
0 point seem peaking effects drug unique compari...
1 mother god seen much difficult withstand spent...
2 getting weird half breakthrough feels like sec...
3 vomited three times bucket suddenly felt much ...
4 reached peak mild walk around without difficul...
labels['uniq_labels'] has 52 rows
0 amplification
1 enhancement
2 psychedelic
3 sensory
4 visual
I want to create a new series object equal to the size of reports['corpus']. In it, each row needs to contain a list of string matches (i.e. searching reports['corpus'] for exact string matches to strings in labels['uniq_labels']).
I have tried looping over the two series to check if a string from labels['uniq_labels'] is in a report from reports['corpus']. I split each report as I iterate and am able to return a list of the strings that match, though I can't seem to preserve constraints such as allocating the string matches for a given report to that report's index position (very important).
Edit (Adding example of the series objects):
reports_series = pd.Series([
    'This is a test first sentence. This is the first row of a pandas series.',
    'Here is the second row. The row that means the most. The row that never goes away.',
    'The third sentence. The third row to the example pandas series.',
    'This is the fourth and only fourth row of the pandas series.',
    'Here is the fifth row. The fifth row that means the most.'])
labels_series = pd.Series(['first', 'sentence', 'second row'])
Convert the uniq_labels column from the labels dataframe to a list, split the corpus column from the reports dataframe on whitespace, and take the values that appear in both lists.
(reports['corpus']
.str.split(' ')
.apply(lambda x:[i for i in labels['uniq_labels'].tolist() if i in x]))
0 []
1 []
2 []
3 []
4 []
Name: corpus, dtype: object
In the sample you have mentioned above, no values actually match, so the output contains only empty lists.
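Note that splitting on whitespace can only ever find single-word labels; a multi-word label such as 'second row' from the example series would never match. A small sketch that instead checks each label as a substring of the full report text, while preserving the report's original index position (using the example series names above):
matches = reports_series.apply(
    lambda text: [label for label in labels_series if label in text])
print(matches)
# 0    [first, sentence]
# 1         [second row]
# 2           [sentence]
# 3                   []
# 4                   []
# dtype: object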

flag strings based on previous values in pandas

I would like to flag sentences located in a pandas dataframe. As you can see in the example, some of the sentences are split across multiple rows (these are subtitles from an srt file that I would like to translate to a different language eventually, but first I need to put each sentence in a single cell). The end of a sentence is marked by the period at the end. I want to create a column like the sentence_number column below, where I number each sentence (it doesn't have to be a string, it could be a number too).
import pandas as pd

values = [
    ['This is an example of subtitle.', 'sentence_1'],
    ['I want to group by sentences, which', 'sentence_2'],
    ['the end is determined by a period.', 'sentence_2'],
    ['row 0 should have sentece_1, rows 1 and 2 ', 'sentence_3'],
    ['should have sentence_2.', 'sentence_2'],
    ['and this last row should have sentence_3.', 'sentence_3'],
]
df = pd.DataFrame(values, columns=['subtitle', 'sentence_number'])
df['presence_of_period'] = df.subtitle.str.contains(r'\.')
df
output:
subtitle sentence_number presence_of_period
0 This is an example of subtitle. sentence_1 True
1 I want to group by sentences, which sentence_2 False
2 the end is determined by a period. sentence_2 True
3 row 0 should have sentece_1, rows 1 and 2 sentence_3 False
4 should have sentence_2. and this sentence_3 True
5 last row should have sentence_3. sentence_4 True
How can I create the sentence_number column, given that it has to read the previous cells in the subtitle column? I was thinking of a window function or shift() but couldn't figure out how to make it work. I added a column to show whether the cell contains a period, signifying the end of the sentence. Also, if possible, I would like to move the "and this" from row 4 to the beginning of row 5, since it is a new sentence (not sure if this one would require a different question).
Any thoughts?
To fix the sentence number, here's an option for you.
import pandas as pd

values = [
    ['This is an example of subtitle.', 'sentence_1'],
    ['I want to group by sentences, which', 'sentence_2'],
    ['the end is determined by a period.', 'sentence_2'],
    ['row 0 should have sentece_1, rows 1 and 2 ', 'sentence_3'],
    ['should have sentence_2.', 'sentence_2'],
    ['and this last row should have sentence_3.', 'sentence_3'],
]
df = pd.DataFrame(values, columns=['subtitle', 'sentence_number'])

# number of periods in each row, and whether the row ends a sentence
df['presence_of_period'] = df.subtitle.str.count(r'\.')
df['end'] = df.subtitle.str.endswith('.').astype(int)

# the running count of completed sentences gives the sentence number
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)

# print(df['subtitle'])
# print(df[['sentence_number', 'presence_of_period', 'end', 'sentence_#']])
df.drop(['presence_of_period', 'end'], axis=1, inplace=True)
print(df[['subtitle', 'sentence_#']])
The output will be as follows:
subtitle sentence_#
0 This is an example of subtitle. sentence_1
1 I want to group by sentences, which sentence_2
2 the end is determined by a period. sentence_2
3 row 0 should have sentece_1, rows 1 and 2 sentence_3
4 should have sentence_2. sentence_3
5 and this last row should have sentence_3. sentence_4
If you need to move the partial sentence to the next row, I need to understand a few more details.
What do you want to do if there are more than two sentences in a row? For example, 'This is first sentence. This second. This is'.
What do you want to do in this case? Split the first one into one row, the second into another row, and concatenate the third with the next row's data?
Once I understand this, we can use df.explode() to solve it.
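As a rough illustration of the explode idea (only the splitting step, assuming each cell is split after every period and any trailing fragment stays as its own row to be merged later):
parts = (
    df['subtitle']
    .str.split(r'(?<=\.)\s+', regex=True)   # split after each period
    .explode()                               # one row per sentence fragment
    .str.strip())
print(parts)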

Match all values of a string column in a dataframe with another dataframe's string column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my real dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but this gets the job done by iterating over both the 'TEXT' column and the 'INFO' column. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
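For reference, a sketch of the .map variant mentioned above (any() also lets the scan stop at the first hit, which can help when dfB is large):
dfA['MATCH'] = dfA['TEXT'].map(lambda x: int(any(y in x for y in dfB['INFO'])))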
You could also get the unique values from the INFO column of the second dataframe, join them into a single regex pattern, and use Series.str.contains on the TEXT column of the first:
unique = df2['INFO'].unique().tolist()
df1['MATCH'] = df1['TEXT'].str.contains('|'.join(unique)).astype(int)
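One caveat when joining raw values into a regex pattern: if the INFO values can contain regex metacharacters (., +, (, and so on), escape them first so the match stays literal; a small sketch:
import re
pattern = '|'.join(map(re.escape, df2['INFO'].unique()))
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)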

making new data frame from combining text pandas

I have a pandas data frame that is structured something like this:
ID TEXT
1 Start of document
1 middle
1 end of document
2 start of document 2
2 middle
2 end of document 2
The raw data I got has repeating IDs; if you concatenate the text for each unique ID, you get the resulting document. Some of these IDs repeat hundreds of times, resulting in large quantities of text which I would like to boil down to one observation.
I'm not sure how to go about looping through and creating a new document. I'm also not sure if pandas is the right data structure to store large quantities of text (these are transcribed call records, some of them 30+ minute conversations). Would appreciate any pointers.
IIUC:
df.groupby('ID').TEXT.apply(' '.join)
ID
1 Start of document middle end of document
2 start of document 2 middle end of document 2
Name: TEXT, dtype: object
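If you want the result back as a regular two-column dataframe rather than an ID-indexed series, resetting the index on the same groupby result does it (a small sketch):
docs = df.groupby('ID')['TEXT'].apply(' '.join).reset_index()
print(docs)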
If you want to do it without groupby:
(df.set_index('ID').TEXT + ' ').sum(level=0).str[:-1]  # on newer pandas, sum(level=0) becomes groupby(level=0).sum()
Out[1066]:
ID
1 Start of document middle end of document
2 start of document 2 middle end of document 2
Name: TEXT, dtype: object
