pandas series row-wise comparison (preserve cardinality/indices of larger series) - python

I have two pandas series, both of string dtype.
reports['corpus'] has 1287 rows
0 point seem peaking effects drug unique compari...
1 mother god seen much difficult withstand spent...
2 getting weird half breakthrough feels like sec...
3 vomited three times bucket suddenly felt much ...
4 reached peak mild walk around without difficul...
labels['uniq_labels'] has 52 rows
0 amplification
1 enhancement
2 psychedelic
3 sensory
4 visual
I want to create a new series object equal to the size of reports['corpus']. In it, each row needs to contain a list of string matches (i.e. searching reports['corpus'] for exact string matches to strings in labels['uniq_labels']).
I have tried looping over the two series to check whether each string from labels['uniq_labels'] appears in a report from reports['corpus']. Splitting each report, I can return a list of the strings that match, but I can't seem to preserve the mapping between a report's matches and that report's index position in reports['corpus'] (which is very important).
Edit (Adding example of the series objects):
reports_series = pd.Series(['This is a test first sentence. \
This is the first row of a pandas series.',
'Here is the second row. The row that means the most. The row that never goes away.',
'The third sentence. The third row to the example pandas series.',
'This is the fourth and only fourth row of the pandas series.',
'Here is the fifth row. The fifth row that means the most.'])
labels_series = pd.Series(['first', 'sentence', 'second row'])

Convert the uniq_labels column from the labels dataframe to a list, split the corpus column of the reports dataframe on whitespace, and keep the values that appear in both lists.
(reports['corpus']
.str.split(' ')
.apply(lambda x:[i for i in labels['uniq_labels'].tolist() if i in x]))
0 []
1 []
2 []
3 []
4 []
Name: corpus, dtype: object
In the sample data shown above, no values actually match, so the output contains only empty lists.
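Note that splitting on whitespace can never match a multi-word label such as 'second row' from the example series. If the labels may contain spaces, a plain substring check against each full report handles them while keeping the original index (and therefore the cardinality) of the reports series. A minimal sketch, using the reports_series and labels_series defined in the edit above:
# For each report, keep every label that occurs as a substring; apply returns
# a series with the same index as reports_series, so row positions are preserved.
matches = reports_series.apply(lambda report: [label for label in labels_series if label in report])
print(matches)
The same pattern works directly on reports['corpus'] and labels['uniq_labels'].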

flag strings based on previous values in pandas

I would like to flag sentences located in a pandas dataframe. As you can see in the example, some of the sentences are split across multiple rows (these are subtitles from an srt file that I would like to translate to a different language eventually, but first I need to put each sentence in a single cell). The end of a sentence is marked by the period at the end. I want to create a column like the sentence_number column in the example, where I number each sentence (it doesn't have to be a string; a number would work too).
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period'] = df.subtitle.str.contains(r'\.')
df
output:
subtitle sentence_number presence_of_period
0 This is an example of subtitle. sentence_1 True
1 I want to group by sentences, which sentence_2 False
2 the end is determined by a period. sentence_2 True
3 row 0 should have sentece_1, rows 1 and 2 sentence_3 False
4 should have sentence_2. and this sentence_3 True
5 last row should have sentence_3. sentence_4 True
How can I create the sentence_number column, given that it has to read the previous cells in the subtitle column? I was thinking of a window function or shift() but couldn't figure out how to make it work. I added a column to show whether the cell contains a period, signifying the end of the sentence. Also, if possible, I would like to move the "and this" from row 4 to the beginning of row 5, since it is a new sentence (not sure if this one would require a different question).
Any thoughts?
To fix the sentence number, here's an option for you.
import pandas as pd
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period'] = df.subtitle.str.count(r'\.')
df['end'] = df.subtitle.str.endswith('.').astype(int)
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
#print (df['subtitle'])
#print (df[['sentence_number','presence_of_period','end','sentence_#']])
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])
The output will be as follows:
subtitle sentence_#
0 This is an example of subtitle. sentence_1
1 I want to group by sentences, which sentence_2
2 the end is determined by a period. sentence_2
3 row 0 should have sentece_1, rows 1 and 2 sentence_3
4 should have sentence_2. sentence_3
5 and this last row should have sentence_3. sentence_4
If you need to move the partial sentence to the next row, I need to understand the details a bit more.
What do you want to do if there are more than two sentences in a row? For example, 'This is first sentence. This second. This is'.
Would you split the first one into its own row, the second into another row, and concatenate the third onto the next row's data?
Once I understand this, we can use df.explode() to solve it.
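As a rough sketch of the follow-up step (putting each numbered sentence into a single cell), you could group on the computed sentence_# column and join the subtitles. Note that this concatenates the rows exactly as they stand, so it does not move the "and this" fragment to the next sentence:
# Assumes df still has the 'sentence_#' column computed above.
sentences = (df.groupby('sentence_#', sort=False)['subtitle']
               .agg(' '.join)
               .reset_index())
print(sentences)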

Creating new dataframes by selecting rows with numbers/digits

I have this small dataframe:
index words
0 home # there is a blank in words
1 zone developer zone
2 zero zero
3 z3 z3
4 ytd2525 ytd2525
... ... ...
3887 18TH 18th
3888 180m 180m deal
3889 16th 16th
3890 150M 150m monthly
3891 10am 10am 20200716
I would like to extract all the entries in index which contain numbers, in order to create a dataframe with only those rows, and another one where the rows contain numbers in both index and words.
To select rows which contain numbers I have considered the following:
m1 = df['index'].apply(lambda x: not any(i.isnumeric() for i in x.split()))
m2 = df['index'].str.isalpha()
m3 = df['index'].apply(lambda x: not any(i.isdigit() for i in x))
m4 = ~df['index'].str.contains(r'[0-9]')
I do not know which one should be preferred (they are somewhat redundant). I would also like to handle the other case, where both index and words contain numbers (digits), in order to select those rows and create the two dataframes.
Your question is not entirely clear; happy to correct this if I got the question wrong.
To get all the index entries that contain numbers into their own dataframe, please try:
df.loc[df['index'].str.contains(r'\d+'), 'index'].to_frame()
and for the rows that contain numbers in both index and words:
df.loc[df['index'].str.contains(r'\d+') & df['words'].str.contains(r'\d+'), :]

Match all values str column dataframe with other dataframe str column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality my dataframe 2 has 5000 rows, so I cannot manually copy-paste all of this. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but this gets the job done by iterating over both the 'TEXT' and 'INFO' columns. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
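For reference, the .map variant mentioned above would look like this (same lambda, same result, since Series.map also accepts a callable):
dfA['MATCH'] = dfA['TEXT'].map(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))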
Updated to take string containment into account instead of an exact match.
You could get the unique values from the INFO column of the second dataframe, join them into a single pattern, and then use the eval method on the first dataframe with Series.str.contains against that pattern:
unique = df2['INFO'].unique().tolist()
pattern = '|'.join(unique)
df1.eval("MATCH = TEXT.str.contains(@pattern)", engine='python')
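One caveat with a joined pattern like this, whichever variant you use: str.contains treats the pattern as a regular expression, so if the INFO values can contain regex metacharacters you should escape them first. A small sketch:
import re
pattern = '|'.join(re.escape(info) for info in df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)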

Merging nearly-duplicate rows of strings on index in Pandas?

I have a dataset that has 2 copies of each record. Each record has an ID, and each copy has the same ID.
15 of the 18 fields are identical in both copies of a record. But in the other 3 fields, the top row contains 2 items and 1 NaN, while the bottom row contains 1 item (where the top row had a NaN) and 2 NaNs (where the top row had items). Sometimes there are stray NaNs that don't follow this pattern.
I need to collapse each record into one, so that I have a single row that contains all 3 non-NaN fields.
I have tried various versions of groupby, but that omits the 3 fields I need (which are all string-based) and doubles the values of certain numeric fields.
If all else fails, I'll turn the letter fields into number codes and use df.groupby(['ID']).agg('sum').
But I figure there's probably a smarter way to do this.
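A minimal sketch of one common approach, assuming the ID column is literally named 'ID': GroupBy.first() takes the first non-null value of each column within a group, which collapses the two copies into a single record without summing the numeric fields or dropping the string fields.
# Collapse each pair of rows into one row; first() skips NaN values per column.
collapsed = df.groupby('ID', as_index=False).first()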

Split sentences into substrings containing varying number of words using pandas

My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.
Let's suppose that I have the following in a DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
and I want to split the text of each id into chunks of a random number of words (varying between two values, e.g. 1 and 5), so that I finally have something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 looks very
3 very good today
Keep in mind that my dataframe may also have other columns except for these two which should be simply copied at the new dataframe in the same way as id above.
What is the most efficient way to do this?
Define a function to extract chunks in a random fashion using itertools.islice:
from itertools import islice
import random

lo, hi = 3, 5  # change this to whatever

def extract_chunks(it):
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi+1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks
Call the function through a list comprehension to ensure least possible overhead, then stack to get your output:
pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()
id
1 0 I am the
1 first document and I
2 am very happy.
2 0 Here is the
1 second document and
2 it likes playing tennis.
3 0 This is the third
1 document and it looks
2 very good today.
You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple splitting on whitespace which you can modify.
Note that if you have other columns you don't want to touch, you can do something like a melting operation here.
u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])
(pd.concat([df.drop('text', axis=1), u], axis=1)
 .melt(df.columns.difference(['text'])))
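If you want the result back in the two-column id / text layout shown in the question, a small sketch of the reshaping (using the same chunked frame built above): drop the inner level of the stacked index, name the values, and reset the index.
out = (pd.DataFrame([extract_chunks(iter(text.split())) for text in df['text']],
                    index=df['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .rename('text')
         .reset_index())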
