Match all values str column dataframe with other dataframe str column - python

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my real dataframe 2 has 5000 rows. So I cannot manually copy paste all of this. But basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!

Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.

Updated to take into account string contains instead of exact match...
You could get the unique values from the column in the first dataframe convert them to list and then use eval method on the second with Column.str.contains on that list.
unique = df1['TEXT'].unique().tolist()
df2.eval("Match=Text.str.contains('|'.join(#unique))")

Related

Multiply string in column by how often it occurs in the corpus

I'm using character n-grams to detect a language, right now i mapped out all nouns and put them in a Pandas Dataframe.
right now i've got this :
word1=df['lemmastring1'][0]*df['count'][0]
How can i Iterrate this to get it for my whole Dataframe.
Picture of the dataframe
This is how i want it for all
Given:
string count
0 _jarr_- 500
1 _mens_- 200
Doing:
df['string']*df['count']
Output:
0 _jarr_-_jarr_-_jarr_-_jarr_-_jarr_-_jarr_-_jar...
1 _mens_-_mens_-_mens_-_mens_-_mens_-_mens_-_men...
dtype: object

pandas series row-wise comparison (preserve cardinality/indices of larger series)

I have two pandas series, both string dtypes.
reports['corpus'] has 1287 rows
0 point seem peaking effects drug unique compari...
1 mother god seen much difficult withstand spent...
2 getting weird half breakthrough feels like sec...
3 vomited three times bucket suddenly felt much ...
4 reached peak mild walk around without difficul...
labels['uniq_labels'] has 52 rows
0 amplification
1 enhancement
2 psychedelic
3 sensory
4 visual
I want to create a new series object equal to the size of reports['corpus']. In it, each row needs to contain a list of string matches (i.e. searching reports['corpus'] for exact string matches to strings in labels['uniq_labels']).
I have tried looping over the two series to check if a string from labels['uniq_labels'] is in a report from reports['corpus']. I split at the report iter and am able to return a list of the strings that match. Though I can't seem to preserve conditions such as: allocating string matches for a given report to the reports' index position (very important).
Edit (Adding example of the series objects):
reports_series = pd.Series(['This is a test first sentence. \
This is the first row of a pandas series.',
'Here is the second row. The row that means the most. The row that never goes away.',
'The third sentence. The third row to the example pandas series.',
'This is the fourth and only fourth row of the pandas series.',
'Here is the fifth row. The fifth row that means the most.'])
labels_series = pd.Series(['first', 'sentence', 'second row'])
Convert uniq_labels column from the labels dataframe to a list, and split the corpus column from reports dataframe on white space, and take the values that are in both the lists.
(reports['corpus']
.str.split(' ')
.apply(lambda x:[i for i in labels['uniq_labels'].tolist() if i in x]))
0 []
1 []
2 []
3 []
4 []
Name: corpus, dtype: object
In the sample you have mentioned above, no values actually match, so the output has empty list only.

flag strings based on previous values in pandas

I would like to flag sentences located in a pandas dataframe. As you can see in the example, some of the sentences are split into multiple rows (these are subtitles from an srt file that I would like to translate to a different language eventually, but first I need to put them in a single cell). The end of the sentence is determined by the period at the end. I want to create a column like the column sentence, where I number each sentence (it doesn't have to be a string, it could be a number too)
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.contains('\.')
df
output:
subtitle sentence_number presence_of_period
0 This is an example of subtitle. sentence_1 True
1 I want to group by sentences, which sentence_2 False
2 the end is determined by a period. sentence_2 True
3 row 0 should have sentece_1, rows 1 and 2 sentence_3 False
4 should have sentence_2. and this sentence_3 True
5 last row should have sentence_3. sentence_4 True
How can I create the sentence_number column since it has to read the previous cells on subtitle column? I was thinking of a window function or the shift() but couldn't figure out how to make it work. I added a column to show if the cell has a period, signifying the end of the sentence. Also, if possible, I would like to move the "and this" from row 4 to the beginning of row 5, since it is a new sentence (not sure if this one would require a different question).
Any thoughts?
To fix the sentence number, here's an option for you.
import pandas as pd
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.count('\.')
df['end'] = df.subtitle.str.endswith('.').astype(int)
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
#print (df['subtitle'])
#print (df[['sentence_number','presence_of_period','end','sentence_#']])
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])
The output will be as follows:
subtitle sentence_#
0 This is an example of subtitle. sentence_1
1 I want to group by sentences, which sentence_2
2 the end is determined by a period. sentence_2
3 row 0 should have sentece_1, rows 1 and 2 sentence_3
4 should have sentence_2. sentence_3
5 and this last row should have sentence_3. sentence_4
If you need to move the partial sentence to the next row, I need to understand a bit more details.
What do you want to do if there are more than two sentences in a row. For example, 'This is first sentence. This second. This is'.
What do you want to do in this case. Split the first one to a row, second to another row, and concatenate the third to the next row data?
Once I understand this, we can use the df.explode() to solve it.

Split sentences into substrings containing varying number of words using pandas

My question is related to this past of question of mine: Split text in cells and create additional rows for the tokens.
Let's suppose that I have the following in a DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
and I want to split the text of each id in tokens of random number of words (varying between two values e.g. 1 and 5) so I finally want to have something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 looks very
3 very good today
Keep in mind that my dataframe may also have other columns except for these two which should be simply copied at the new dataframe in the same way as id above.
What is the most efficient way to do this?
Define a function to extract chunks in a random fashion using itertools.islice:
from itertools import islice
import random
lo, hi = 3, 5 # change this to whatever
def extract_chunks(it):
chunks = []
while True:
chunk = list(islice(it, random.choice(range(lo, hi+1))))
if not chunk:
break
chunks.append(' '.join(chunk))
return chunks
Call the function through a list comprehension to ensure least possible overhead, then stack to get your output:
pd.DataFrame([
extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()
id
1 0 I am the
1 first document and I
2 am very happy.
2 0 Here is the
1 second document and
2 it likes playing tennis.
3 0 This is the third
1 document and it looks
2 very good today.
You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple splitting on whitespace which you can modify.
Note that if you have other columns you don't want to touch, you can do something like a melting operation here.
u = pd.DataFrame([
extract_chunks(iter(text.split())) for text in df['text']])
(pd.concat([df.drop('text', 1), u], axis=1)
.melt(df.columns.difference(['text'])))

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as reference to where the scan starts. I am trying to find the object placed, through out my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out tis :
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-index (I have now 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy=base.sort_values('Mean').tail(1)
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter="\s+")
df = df.set_index("Index")
print(df)

Categories

Resources