Pattern matching in columns in Python

I have two dataframes, df and df1. I want to search for patterns in df based on the values given in df1. The DataFrames are given below:
import pandas as pd
data={"id":["I983","I873","I526","I721","I536","I327","I626","I213","I625","I524"],
"coltext":[ "I could take my comment back, I would do so in a second. I have addressed my teammates and coaches and while many understand my actions were totall", "We’re just trying to see if he can get on the field as a football player, and then we’ll make decision",
"TextNow offers low-cost, international calling to over 230 countries. Stay connected longer with rates starting at less than",
"Wi-Fi can provide you with added coverage in places where cell networks don't always work - like basements and apartments. No roaming fees for Wi-Fi connection",
"Send messages and make calls on your compute",
"even have a free, Wi-Fi only version of TextNow, available for download on you",
"the rest of the players accepted apologies this spring and are welcoming him back",
"was really looking at him and watching how much this really means to him and how much he really missed us",
"I’ll deal with the problem and I’ll remedy the problem",
"The first step was for him to be able to complete what we call our bottom line program which has been completed"]}
df=pd.DataFrame(data=data)
data1={"col1":["addressed teammates coaches","football player decision","watching really missed", "bottom line program","meassges make calls"],
"col2":["international calling over","download on you","rest players accepted","deal problem remedy","understand actions totall"],
"col3":["first step him","Wi-Fi only version","cell network works","accepted apologies","stay connected longer"]}
df1=pd.DataFrame(data=data1)
For example, the first element "addressed teammates coaches" from df1['col1'] appears in the first element of df['coltext']. Likewise, I want to search for every element from every column of df1 in df['coltext']. If a pattern is found, create a third column in df.
Desired Output:
id    coltext                             patternMatch
I983  I could take my comment back,       col1, col2
I873  We’re just trying to see if he can  col1
I526  TextNow offers low-cost,            col3, col2
I721  Wi-Fi can provide you with          col3
I536  Send messages and make calls        col1

There may be more efficient ways, but one approach is the following:
# create a dictionary from data1 with keys and values reversed: pattern -> column name
my_dict = {item: k for k, v in data1.items() for item in v}

# for each row of 'coltext', collect the columns whose pattern words all appear in the text
df['patternMatch'] = df['coltext'].apply(lambda row:
    {v for k, v in my_dict.items()
     if all(word in row for word in k.split())})
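If you prefer the patternMatch column as a comma-separated string (as in the desired output) rather than a set, a small follow-up step could be:

# join the matched column names into a single comma-separated string per row
df['patternMatch'] = df['patternMatch'].apply(lambda cols: ', '.join(sorted(cols)))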


Having trouble creating script to find the name that shows up in each list in Python

Dr. Smith was killed in the studio with a knife by one of his heirs! Create a script to find the murderer! Make sure to show your answer.
The following people are Smith's heirs: Aiden, Tori, Lucas, Isabelle.
The following people were in the studio: Lucas, Natalie, Tori.
The following people own a knife: Isabelle, Tori, Natalie.
My code:
heirs = ["Aiden", "Tori", "Lucas", "Isabelle"]
ppleinstudio = ["Lucas", "Natalie", "Tori"]
knife = ["Isabelle", "Tori", "Natalie"]
# killer is the one who exists in three of the lists
# merge the lists
merged = [*heirs,*ppleinstudio,*knife]
L1=[]
for i in merged:
if i not in L1:
L1.append(i)
else:
print(i,end=' ')
output:
Lucas Tori Isabelle Tori Natalie
What am I missing to get it to look for the repeating name?
I am not sure that the code you implemented is doing what you wanted it to do; maybe you should check the contents of the merged list and see what happens as you iterate through the for loop.
Nevertheless, for the sake of providing a solution to your problem, if you are allowed to use sets you could easily solve this by doing the following:
heirs = ["Aiden", "Tori", "Lucas", "Isabelle"]
ppleinstudio = ["Lucas", "Natalie", "Tori"]
knife = ["Isabelle", "Tori", "Natalie"]
h_set = set(heirs)
s_set = set(ppleinstudio)
k_set = set(knife)
culprit = h_set.intersection(s_set.intersection(k_set)).pop()
print(culprit)
>> Tori
But if this is some kind of homework you should probably try to work your way to a solution on paper/whiteboard first, and figure out why your approach is not working.
You could do something like this: cycle through each entry in the merged list, and break the three requirements into three boolean statements:
heirs = ["Aiden", "Tori", "Lucas", "Isabelle"]
ppleinstudio = ["Lucas", "Natalie", "Tori"]
knife = ["Isabelle", "Tori", "Natalie"]
# killer is the one who exists in three of the lists
# merge the lists
merged = [*heirs,*ppleinstudio,*knife]
for person in merged:
is_heir = person in heirs
is_in_studio = person in ppleinstudio
has_knife = person in knife
if(is_heir and is_in_studio and has_knife):
print(person)
break
Output:
Tori
This is a little inefficient: if you print out the contents of merged, you'll notice there are duplicate names. But since your question doesn't mention anything about efficiency, this will get the job done just fine.
If you are concerned about this inefficiency, you can convert the merged list to a set and iterate over that instead:
merged = set(merged)
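In full, the deduplicated version would look something like this (same loop body as above):

merged = set(merged)  # duplicates removed, so each name is checked only once
for person in merged:
    if person in heirs and person in ppleinstudio and person in knife:
        print(person)
        break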

Is there a way for me to calculate an average similarity of strings, grouped by individuals?

Some context: I'm currently working on fraud detection, and the relevant columns of my data look like the following:
referee     referee_phone_number   referred_by
referee_A   019500600              Person A_fraudster
referee_B   019500601              Person A_fraudster
referee_C   019500602              Person A_fraudster
referee_D   019500603              Person A_fraudster
referee_E   014501928              Person B_non_fraudster
referee_F   016779810              Person B_non_fraudster
Notice that the phone numbers referred by Person A are similar. Fraudsters tend to register phone numbers en masse.
One feature I would like to engineer is string similarity grouped by referred_by, which would look something like the table below (the average similarity values are approximate). The goal is to feed this value into my decision tree for training, since non-fraudsters will have low phone-number similarity:
referred_by   average_phone_number_similarity
Person A      0.90
Person B      0.20
I've tried exploring Levenshtein distance and have the following script using difflib, thinking that I could pit two lists against each other:
string_list_1 = ['02223785428', '02223785390', '02223784947', '0165104490']
string_list_2 = ['02223785428', '02223785390', '02223784947']

from difflib import SequenceMatcher

similar = []
not_similar = []
for item1 in string_list_1:
    # Set the state as false
    found = False
    for item2 in string_list_2:
        if SequenceMatcher(None, a=item1, b=item2).ratio() > 0.8:
            similar.append(item1)
            found = True
            break
    if not found:
        not_similar.append(item1)

# Making sure that there are no duplicates in the lists
print("Similar : ", list(dict.fromkeys(similar)))
print("Not Similar : ", list(dict.fromkeys(not_similar)))
>> Similar : ['02223785428', '02223785390', '02223784947']
Not Similar : ['0165104490']
However, it doesn't solve my issue, especially when it comes to computing the similarity between each pair of phone numbers. I don't think I am getting it quite right; can anyone lend a hand?
I'm not sure I understood you correctly, but why don't you compare the difference between two integers? If you expect the numbers to be registered in a row, you can just store the phone numbers as integers in the database and look for numerically close values within a given period.
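If you do want an averaged string similarity per referrer, here is a minimal sketch, assuming the data is already in a DataFrame named df with the columns shown above (avg_pairwise_similarity and average_similarity are just placeholder names). It averages the difflib SequenceMatcher ratio over all pairs of phone numbers within each referred_by group:

from itertools import combinations
from difflib import SequenceMatcher
import pandas as pd

def avg_pairwise_similarity(numbers):
    # Average the SequenceMatcher ratio over every pair of phone numbers in the group.
    pairs = list(combinations(numbers, 2))
    if not pairs:
        return 0.0  # a group with a single number has no pairs to compare
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

average_similarity = (
    df.groupby('referred_by')['referee_phone_number']
      .apply(avg_pairwise_similarity)
      .rename('average_phone_number_similarity')
      .reset_index()
)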

How can I use Python and Pandas to parse through text and return the strings I want in separate data cells?

So I have compiled a list of NFL game projections from the 2020 season for fantasy relevant players. Each row contains the team names, score, relevant players and their stats like in the text below. The problem is that each of the player names and stats are either different lengths or written out in slightly different ways.
`Bears 24-17 Jaguars
M.Trubisky- 234/2TDs
D.Montgomery- 113 scrim yards/1 rush TD/4 rec
A.Robinson- 9/114/1
C.Kmet- 3/35/0
G.Minshew- 183/1TD/2int
J.Robinson- 77 scrim yards/1 rush TD/4 rec
DJ.Chark- 3/36`
I'm trying to create a data frame that will split the player name, receptions, yards, and touchdowns into separate columns. Then I will be able to compare these numbers to their actual game numbers and see how close the predictions were. Does anyone have an idea for a solution in Python? Even if you could point me in the right direction I'd greatly appreciate it!
You can split the full string using '-' (the dash/minus sign) as the separator, then use indexing to get the different parts.
Using str.split(sep='-')[0] gives you the name. Here, str would be the row, for example M.Trubisky- 234/2TDs.
Similarly, str.split(sep='-')[1] gives you everything but the name.
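For example:

line = "M.Trubisky- 234/2TDs"
print(line.split(sep='-')[0])   # M.Trubisky
print(line.split(sep='-')[1])   # " 234/2TDs" (note the leading space)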
As for splitting anything after the name, there is no way of doing it unless they are in a certain order. If you are able to somehow achieve this, there is a way of splitting into columns.
I am going to assume that the trend here is yards / touchdowns / receptions, in which case, we can again use the str.split() method. I am also assuming that the 'rows' only belong to one team. You might have to run this script once for each team to create a dataframe, and then join all dataframes with a new feature called 'team_name'.
You can define lists and append values to them, and then use the lists to create a dataframe. This snippet should help you.
import re
import pandas as pd

names, scrim_yards, touchdowns, receptions = [], [], [], []

for row in rows:  # rows is the list of player lines, e.g. "M.Trubisky- 234/2TDs"
    # sample name: M.Trubisky
    names.append(row.split(sep='-')[0])
    stats = row.split(sep='-')[1]  # sample stats: " 234/2TDs"
    # Since we only want the 'numbers' from each stat, we can pull them out with a regular expression.
    numerical_stats = re.findall(r'\b\d+\b', stats)  # sample result: ['234', '2']
    # now we use indexing again to get the desired values
    # (note: a line with fewer than three numbers, e.g. "234/2TDs", would raise an IndexError here)
    scrim_yards.append(numerical_stats[0])
    touchdowns.append(numerical_stats[1])
    receptions.append(numerical_stats[2])

# You can then create a pandas dataframe
nfl_player_stats = pd.DataFrame({'names': names, 'scrim_yards': scrim_yards,
                                 'touchdowns': touchdowns, 'receptions': receptions})
As you are pointing out, often times the hardest part of processing a data file like this is handling all the variability and inconsistency in the file itself. There are a lot of things that can vary inside the file, and then sometimes the file also contains silly errors (typos, missing whitespace, and the like). Depending on the size of the data file, you might be better off simply hand-editing it to make it easier to read into Python!
If you tackle this directly with Python code, then it's a very good idea to be very careful to verify the actual data matches your expectations of it. Here are some general concepts on how to handle this:
First off, make sure to strip every line of whitespace and ignore blank lines:
for curr_line in file_lines:
    curr_line = curr_line.strip()
    if len(curr_line) > 0:
        pass  # Process the line...
Once you have your stripped, non-blank line, make sure to handle the "game" line (the matchup between two teams) differently from the lines denoting players:
TEAM_NAMES = ["Cardinals", "Falcons", "Panthers", "Bears", "Cowboys", "Lions",
              "Packers", "Rams", "Vikings"]  # and 23 more; you get the idea

# ...down in the code where we are processing the lines...
if any([tn in curr_line for tn in TEAM_NAMES]):
    pass  # ...handle as a "matchup"
else:
    pass  # ...handle as a "player"
When handling a player and their stats, we can use "- " as a separator. (You must include the space, otherwise players such as Clyde Edwards-Helaire will split the line in a way you did not want.) Here we unpack into exactly two variables, which gives us a nice error check since the code will raise an exception if the line doesn't split into exactly two parts.
p_name, p_stats = curr_line.split("- ")
Handling the stats will be the hardest part. It will all depend on what assumptions you can safely make about your input data. I would recommend being very paranoid about validating that the input data agrees with the assumptions in your code. Here is one notional idea -- an over-engineered solution, but that should help to manage the hassle of finding all the little issues that are probably lurking in that data file:
if "scrim yards" in p_stats:
# This is a running back, so "scrim yards" then "rush TD" then "rec:
rb_stats = p_stats.split("/")
# To get the number, just split by whitespace and grab the first one
scrim_yds = int(rb_stats[0].split()[0])
if len(rb_stats) >= 2:
rush_tds = int(rb_stats[1].split()[0])
if len(rb_stats) >= 3:
rec = int(rb_stats[2].split()[0])
# Always check for unexpected data...
if len(rb_stats) > 3:
raise Exception("Excess data found in rb_stats: {}".format(rb_stats))
elif "TD" in p_stats:
# This is a quarterback, so "yards"/"TD"/"int"
qb_stats = p_stats.split("/")
qb_yards = int(qb_stats[0]) # Or store directly into the DF; you get the idea
# Handle "TD" or "TDs". Personal preference is to avoid regexp's
if len(qb_stats) >= 2:
if qb_stats[1].endswidth("TD"):
qb_td = int(qb_stats[1][:-2])
elif qb_stats[1].endswith("TDs"):
qb_td = int(qb_stats[1][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Handle "int" if it's there
if len(qb_stats) >= 3:
if qb_stats[2].endswidth("int"):
qb_int = int(qb_stats[2][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Always check for unexpected data...
if len(qb_stats) > 3:
raise Exception("Excess data found in qb_stats: {}".format(qb_stats))
else:
# Must be a running back: receptions/yards/TD
rb_rec, rb_yds, rb_td = p_stats.split("/")

How to speed up the sum of presence of keys in the series of documents? - Pandas, nltk

I have a dataframe column with documents like
38909 Hotel is an old style Red Roof and has not bee...
38913 I will never ever stay at this Hotel again. I ...
38914 After being on a bus for -- hours and finally ...
38918 We were excited about our stay at the Blu Aqua...
38922 This hotel has a great location if you want to...
Name: Description, dtype: object
I have a bag of words like keys = ['Hotel','old','finally'], but the actual length of keys is 44312.
Currently I'm using
df.apply(lambda x : sum([i in x for i in keys ]))
Which gives the following output based on sample keys
38909 2
38913 2
38914 3
38918 0
38922 1
Name: Description, dtype: int64
When I apply this to the actual data for just 100 rows, timeit gives
1 loop, best of 3: 5.98 s per loop
and I have 50000 rows. Is there a faster way of doing the same in nltk or pandas?
EDIT:
In case you're looking for the document array:
array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
"After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
"We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)
The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:
keyset = frozenset(keys)
df.apply(lambda x : len(keyset.intersection(x.split())))
Differences/limitations:
In your version a word is counted even if it is contained as a substring in a word in the document. For example, had your keys contained the word tyl, it would be counted due to occurrence of "style" in your first document.
My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the document (or postprocessing the result of the split()) with a function that removes the punctuation.
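For instance, a minimal sketch of that preprocessing step (reusing keyset from above):

import string

# strip punctuation, then split the document into words
def words(doc):
    return doc.translate(str.maketrans('', '', string.punctuation)).split()

df.apply(lambda x: len(keyset.intersection(words(x))))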
It seems you can just use np.char.count -
[np.count_nonzero(np.char.count(i, keys)) for i in arr]
Might be better to feed a boolean array for counting -
[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
If you only need to check whether the values of the list are present:
from numpy.core.defchararray import find
v = df['col'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print (a)
[2 1 1 0 0]
Or:
df = pd.concat([df['col'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print (df)
38909 2
38913 1
38914 1
38918 0
38922 0
dtype: int64

break paragraph into sentences in python and link back to an ID

I have two lists, one with ids and one with corresponding comments for each id.
list_responseid = ['id1', 'id2', 'id3', 'id4']
list_paragraph = [['I like working and helping them reach their goals.'],
['The communication is broken.',
'Information that should have come to me is found out later.'],
['Try to promote from within.'],
['I would relax the required hours to be available outside.',
'We work a late night each week.']]
The ResponseID 'id1' is related to the paragraph ('I like working and helping them reach their goals.') and so on.
I can break the paragraphs into a flat list of sentences using the following:
import itertools
list_sentence = list(itertools.chain(*list_paragraph))
What would be the syntax to get the end result as a data frame (or list) with a separate entry for each sentence and the ID associated with that sentence (which is currently linked to the paragraph)? The final result would look like this (I will convert the list to a pandas data frame at the end):
id1 'I like working with students and helping them reach their goals.'
id2 'The communication from top to bottom is broken.'
id2 'Information that should have come to me is found out later and in some cases students know more about what is going on than we do!'
id3 'Try to promote from within.'
id4 'I would relax the required 10 hours to be available outside of 8 to 5 back to 9 to 5 like it used to be.'
id4 'We work a late night each week and rarely do students take advantage of those extended hours.'
Thanks.
If you do this often, it would be clearer (and probably more efficient, depending on the size of the arrays) to make a dedicated function for it with two regular nested loops, but if you need a quick one-liner, this does just that:
id_sentence_tuples = [(list_responseid[id_list_idx], sentence) for id_list_idx in range(len(list_responseid)) for sentence in list_paragraph[id_list_idx]]
id_sentence_tuples will then be a list of tuples where each element is a pair like (paragraph_id, sentence), just as in the result you expect.
Also, I would advise you to check that both lists have the same length before doing it, so that you get a meaningful error in case they don't:
if len(list_responseid) != len(list_paragraph):
    raise IndexError('Lists must have same cardinality')
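Since the question mentions converting the list to a pandas data frame at the end, a small sketch of that last step (the variable and column names here are just placeholders) could be:

import pandas as pd

# turn the (id, sentence) pairs into the final data frame
df_result = pd.DataFrame(id_sentence_tuples, columns=['ResponseID', 'Sentence'])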
I had a dataframe with an ID and a review (cols = ['ID', 'Reviews']). If you can combine these lists to make a dataframe, then you can use my approach. I split the reviews into sentences using nltk and then linked back the IDs within the loop. The following is the code that you can use.
## Breaking feedback into sentences
import nltk
import pandas as pd

count = 0
df_sentences = pd.DataFrame()
for index, row in df.iterrows():
    feedback = row['Reviews']
    sent_text = nltk.sent_tokenize(feedback)  # this gives us a list of sentences
    for j in range(0, len(sent_text)):
        # print(index, "-", sent_text[j])
        df_sentences = df_sentences.append({'ID': row['ID'], 'Count': int(count),
                                            'Sentence': sent_text[j]}, ignore_index=True)
        count = count + 1
print(df_sentences)
