How can multiple matches be applied in dataframe? - python

I have a performance issue with the following problem:
There is a dataframe of shape (3_000_000, 6), call it A, and another of shape (72_000, 6), call it B. To keep it simple, suppose both of them have only string columns.
In the B dataframe, some fields contain a ? (question mark) somewhere in the value.
For example, in the CITY column: New ?ork instead of New York. The task is to find the matching string in the A dataframe.
So another example:
B Dataframe

CITY      ADDRESS  ADDRESS_TYPE
New ?ork  D?ck     str?et

A Dataframe

CITY      ADDRESS  ADDRESS_TYPE
New York  Duck     street
My plan was a multiprocess approach: iterate over the B dataframe and, for each row, apply a multistep filter, where first I filter with A = A[A.CITY.str.match(city) == True], where city is a regular expression built from the B value (New ?ork => New .ork).
So I pre-filter on the city first, then on the address, and so on...
matched_rows = A[A.CITY.str.match(city) == True]
matched_rows = matched_rows[matched_rows.ADDRESS.str.match(address) == True]
...
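One step the plan glosses over is turning a B value such as New ?ork into the pattern New .ork. A minimal sketch of that conversion (the to_pattern helper is illustrative, not part of the original code):

import re

def to_pattern(value):
    # escape regex metacharacters, then turn the literal '?' placeholders
    # back into single-character wildcards
    return re.escape(value).replace(r'\?', '.')

city = to_pattern('New ?ork')                      # roughly 'New .ork'
matched_rows = A[A.CITY.str.match(city, na=False)]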
Besides that, there is another really important thing: there is a unique ID column (not the index) in the A dataframe that uniquely identifies each row.
ABC123 - New York - Duck - Street
For me, it's really important to find the appropriate A row for every single B row.
The solution works, but its performance is horrible, and I don't really know how else to approach it.
Running it on 5 cores brought it down to ~200 minutes, but I hope there is a better solution to this problem.
So my question is: is there a better, more performant way to apply multiple matches to a dataframe?

As indicated by @dwhswenson, the only strategy that comes to mind is to reduce the size of the dataframes you test, which is not so much a programming problem as a data management problem. This is going to depend on your dataset and what kind of work you want to do, but one naive strategy would be to store the indices of rows whose column values start with 'a', 'b', etc., and then select a dataframe to match over based on the query string. So you'd do something like:
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
# one (column, first letter) key for every combination
keys = list(itertools.chain.from_iterable(([c, l] for c in A.columns) for l in alphabet))

A_indexes = {}
for k in keys:
    begins_with = lambda x: (x[0] == k[1]) or (x[0] == k[1].upper())
    A_indexes[k[0], k[1]] = A[k[0]].loc[A[k[0]].apply(begins_with)].index
Then make a function that takes a column name and a string to match, and returns a view of A that contains only rows whose entries for that column begin with the same letter as the string to match:
def get_view(column, string_to_match):
    # look up the precomputed index by (column name, lower-case first letter)
    return A.loc[A_indexes[column, string_to_match[0].lower()]]
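A quick sketch of how this could be combined with the regex step from the question (names taken from the question; purely illustrative):

# narrow A down to rows whose CITY starts with the same letter,
# then run the regex match only inside that much smaller slice
candidates = get_view('CITY', 'New ?ork')
matched = candidates[candidates['CITY'].str.match('New .ork', na=False)]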
You'd have to double the number of indices of course for the case in which the first letter of the string to match is a wildcard, but this is only an example of the kind of thing you could do to slice the dataset before doing a regex on every row.
If you want to work with the unique ID above, you could make a more complex index lookup dictionary for views into A:
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
number_of_fields_in_ID = 6
separator = ' - '
keys = itertools.product(alphabet, repeat=number_of_fields_in_ID)

def check_ID(id, key):
    # key is a tuple of lower-case letters, one per ID field;
    # compare each field's first letter against it
    id = id.split(separator)
    matches_key = True
    for i, first_letter in enumerate(key):
        if id[i][0].lower() != first_letter:
            matches_key = False
            break
    return matches_key

A_indexes = {}
for k in keys:
    A_indexes[k] = A.loc[A['unique_ID'].apply(check_ID, args=(k,))].index
This will apply the kludgy function check_ID to your 3e6 element series A['unique_ID'] 26**number_of_fields_in_ID times, indexing the dataframe once per iteration (this is more than 4.5e5 iterations if the ID has 4 fields), so there may be significant up-front cost in compute time depending on your dataset. Whether or not this is worth it depends first on what you want to do afterwards (are you just doing a single set of queries to make a second dataset and you're done, or do you need to do a lot of arbitrary queries in the future?) and second on whether the first letters of each ID field are roughly evenly distributed over the alphabet (or decimal digits if you know a field is numeric, or both if it's alphanumeric). If you just have two or three cities, for instance, you wouldn't make indexes for the whole alphabet in that field. But again, doing the lookup this way is naive - if you go this route you'd come up with a lookup method that's tailored to your data.

Related

How to Filter Rows in a DataFrame Based on a Specific Number of Characters and Numbers

New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having troubles filtering rows of a column based off of their Character/Number format.
Here's an example of the DataFrame and Series
df = {'a': [1, 2, 4, 5, 6], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column based on their character/number combos and I am essentially trying to make a dict like below
{'ABC1234': counts_of_format,
 'ABC123': counts_of_format,
 '123ABC': counts_of_format,
 'any_other_format': counts_of_format}
Here's my progress so far:
col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4]'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double checked the dtype and it comes back as string. I've researched the TypeError, and the only solutions I can find involve converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many times do 3 characters followed by 4 numbers occur, and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4]'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substitute every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which does the said transformation.
class classify():
    def __getitem__(self, key):
        # map every letter to 'C' and every digit to 'N';
        # anything else yields ord(None), which raises TypeError
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
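As a quick sanity check, here is what the translation yields on the sample targets from the question (a minimal sketch; the exact ordering of the resulting dict may differ):

import pandas as pd

targets = pd.Series(['ABC1234', 'ABC123', '123ABC', '7KZA23'])
# each value collapses to its letter/digit shape, e.g. 'ABC1234' -> 'CCCNNNN'
print(targets.str.translate(classify()).value_counts().to_dict())
# e.g. {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1}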

Search for specific strings in rows of dataframe and if strings exist then mark in another column in python

I have a dataframe with two columns
Current Dataframe

SE#     Response                                                                COVID Words Mentioned
123456  As a merchant I appreciated your efforts in pricing with Covid-19
456789  you guys suck and didn't handle our relationship during this pandemic
347896  I love your company

Desired Dataframe

SE#     Response                                                                COVID Words Mentioned
123456  As a merchant I appreciated your efforts in pricing with Covid-19      Y
456789  you guys suck and didn't handle our relationship during this pandemic  Y
347896  I love your company                                                    N
terms = ['virus', 'Covid-19','covid19','flu','covid','corona','Corona','COVID-19','co-vid19','Coronavirus','Corona Virus','COVID','purell','pandemic','epidemic','coronaviru','China','Chinese','chinese','crona','korona']
These are the list of strings that need to be checked in each response. The goal is to be able to add or remove elements from the list of terms.
The above are just examples of records. I have a list of strings related to COVID-19 that needs to be searched in each response. If any of the strings appear in a response, mark a "Y" in the 'COVID Words Mentioned' column, and an "N" if none of the words show up.
How do I code this in python?
Much appreciated!
For each search term, set up a result vector:
d = {}
for term in terms:
    d[term] = df['Response'].str.contains(term, na=False)
I pass na=False because otherwise, Pandas will fill NA in cases where the string column is itself NA. We don't want that behaviour. The complexity of the operation increases rapidly with the number of search terms. Also consider changing this function if you want to match whole words, because contains matches sub-strings.
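If you need whole-word matches instead, one option (a sketch, still using the question's terms list and Response column) is to wrap each escaped term in word boundaries:

import re

d = {}
for term in terms:
    # \b ... \b restricts the match to whole words; re.escape stops characters
    # such as '-' in 'Covid-19' from being interpreted as regex syntax
    d[term] = df['Response'].str.contains(rf'\b{re.escape(term)}\b', na=False)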
Regardless, take the results and reduce them with bit-wise or. You need two imports:
from functools import reduce
from operator import or_

df[reduce(or_, d.values())]
The final line selects only the rows that mention at least one of the words. You could alternatively map the output of the reduction from {True, False} to {'Y', 'N'} using np.where.
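A minimal sketch of that last step, using the column name from the question:

import numpy as np
from functools import reduce
from operator import or_

any_match = reduce(or_, d.values())                        # True where any term was found
df['COVID Words Mentioned'] = np.where(any_match, 'Y', 'N')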

How to create new column by manipulating another column? pandas

I am trying to make a new column depending on different criteria. I want to add characters to the string dependent on the starting characters of the column.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change those starting with RH and RL, but not the ones starting with RA. So I want it to look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball
I have attempted to use str split, but it doesn't seem to actually be splitting the string up
(np.where(~df['1'].str.startswith('RH'),
df['1'].str.split('~').str[5],
df['1']))
This is referencing the correct columns but not splitting it where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function to replace element number pos in the arr list:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

Then perform the substitution:

df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the proper elements are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (one list per original string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column name is 0 (an integer) instead of '1' (a string).
If this is not the case, change the column name in the code above.
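For reference, a minimal end-to-end sketch with the sample rows from the question (assuming, as above, a single unnamed column):

import pandas as pd

def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

df = pd.DataFrame(['RH~111~header~120~~~~~~~ball',
                   'RL~111~detailed~12~~~~~hat',
                   'RA~111~account~13~~~~~~~~~car'])
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
# the RH row gets '1' in field 5, the RL row gets 'cancel', and the RA row is left untouched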

How to remove extra quotes from strings in a set?

I am working on a method and trying to return a list in the format List[Tuple[Set[str], Set[str]]] but I can't figure out how to remove extra quotes. self._history[dates[dat]]._ridings is a list of strings.
What this method does is check the voting areas in an election and compare those areas with the next or previous election, to see which areas were in the first but not in the second and which were in the second but not in the first; those are the two elements of the tuple. There could be more than one area which may or may not be present in this election. self._history[dates[dat]]._ridings is a list with certain areas in an election, and I add 1 to dat so it compares with the next election, so every election is compared with the previous one.
I have tried using the split and replace methods, but that doesn't seem to work since it is a set, not a string.
list1 = []
dates = list(self._history)
for dat in range(0, len(dates) - 1):
    a = set(self._history[dates[dat]]._ridings)
    b = set(self._history[dates[dat + 1]]._ridings)
    list1.append(tuple([(a - b), (b - a)]))
return list1
Expected : [({"St. Paul's...St. Paul's"})]
Actual : [({'"St. Paul.... Paul\'s"'})]
Use string slicing to remove the character at index 0 and the character at index (string length - 1).
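A minimal sketch of that in Python (the example values are illustrative; it is worth checking first whether the quotes are really part of the strings or just an artifact of the set's repr):

ridings = {'"St. Paul\'s"', '"Davenport"'}   # example values with embedded quotes
cleaned = {r[1:-1] if r.startswith('"') and r.endswith('"') else r for r in ridings}
# {"St. Paul's", 'Davenport'}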

How can I sort a complicated dictionary key

I have these really complicated data files that I have processed and as each file is processed I have used an orderedDictionary to capture the keys and values. Each orderedDictionary is appended to a list so my final result is a list of dictionaries. Because of the diversity in the data captured in these files, they have many keys in common but there are enough uncommon keys to make exporting the data to Excel more complicated than I was hoping for because I really need to push out the data in a consistent structure.
Each key has the structure like
Q_#_SUB_A_COLUMN_#_NUMB_#
so for example I have
Q_123_SUB_D_COLUMN_C_NUMB_17
We can translate the key as follows
Question 123
SubItem D
Column C
Instance 17
Because there is a SubItem D, Column C and Instance 17, there must also be a SubItem A, a Column B and an Instance 16.
However, one of the source files might be populated with data values (and keys) that range up to the example above, while some other source file might terminate with
Q_123_SUB_D_COLUMN_C_NUMB_13
so when I iterate through the list of dictionaries to pull all of the unique key instances (so I can use them in csv.DictWriter as the column headings), my plan was to sort the resulting list of unique column headings, but I can't seem to make the sort work.
Specifically, I need it to sort so that the results look like:
Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_1
dot
dot
dot
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_17
The big issue is that I do not know before I open any particular set of these files how many questions are answered, how many sub-questions are answered, how many columns are associated with each question or sub-question or how many instances exist of any particular combination of questions, sub-questions or columns, and I don't want to. Using Python I was able to reduce over 1,200 lines of SAS code to 95 but this last little bit before I start writing it out to a CSV file I can't seem to figure out.
Any observations would be appreciated.
My plan is to find all of the unique keys by iterating through the list of dictionaries and then sort these keys correctly so I can then create a CSV file using the keys as column headings. I know that I can find the unique keys, push them out, manually sort them and then read the sorted file back in, but that seems clumsy.
Just supply a sufficiently clever function as the key when sorting.
>>> (lambda x: tuple(y(z) for (y, z)
...                  in zip((int, str, str, int),
...                         x.split('_')[1::2])))('Q_122_SUB_A_COLUMN_C_NUMB_1')
(122, 'A', 'C', 1)
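In practice you would pass that callable as the key= argument when sorting the collected headings, e.g. (unique_keys is a stand-in for your own list, and the field priority here simply follows the order within the key; adjust the tuple if you need a different precedence):

unique_keys = ['Q_123_SUB_A_COLUMN_C_NUMB_17', 'Q_122_SUB_B_COLUMN_C_NUMB_1',
               'Q_123_SUB_A_COLUMN_C_NUMB_1']
sort_key = lambda x: tuple(f(v) for f, v in
                           zip((int, str, str, int), x.split('_')[1::2]))
print(sorted(unique_keys, key=sort_key))
# ['Q_122_SUB_B_COLUMN_C_NUMB_1', 'Q_123_SUB_A_COLUMN_C_NUMB_1', 'Q_123_SUB_A_COLUMN_C_NUMB_17']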
You could use a regular expression to extract the different parts of the key and use those to sort with.
e.g.,
import re

names = '''Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_17'''.split()

def key(name, match=re.compile(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)').match):
    # not sure what the actual order is, adjust the priorities accordingly
    return tuple(f(value) for f, value in zip((str, int, int, str), match(name).group(3, 4, 1, 2)))

for name in names:
    print(name)

names.sort(key=key)
print()

for name in names:
    print(name)
To explain the key-extracting process, we know that the keys have a certain pattern. A regular expression works great here.
r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)'
#      ^           ^             ^         ^
#    digits     letters       letters    digits
#   group 1     group 2       group 3   group 4
In regular expressions, parts of the pattern wrapped in parens are groups. \d represents any decimal digit. + means that there should be one or more of the preceding element. So \d+ means one or more decimal digits. \w matches a word character (a letter, digit or underscore); here it captures the letter fields.
Provided a string matches this pattern, we could get easy access to each grouping in that string using the group method. You could access multiple groups just by including more group numbers too
e.g.,
m = match('Q_122_SUB_B_COLUMN_C_NUMB_1')
# m.group(1) == '122'
# m.group(2) == 'B'
# m.group(3, 4) == ('C', '1')
This is similar to Ignacio's approach, only a lot more strict on the pattern. Once you can wrap your head around this, creating the appropriate key for sorting should be simple.
Assuming the keys are contained in a list, say keyList
list_to_sort = []
for key in keyList:
    sortKeys = key.split('_')
    keyTuple = (sortKeys[1], sortKeys[-1], sortKeys[3], sortKeys[5], key)
    list_to_sort.append(keyTuple)
After this the items in the list are tuples that look like
('123', '17', 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')
from operator import itemgetter
list_to_sort.sort(key=itemgetter(0, 1, 2, 3))
I am not sure exactly what itemgetter does but this works and seems simpler, but less elegant than the other two solutions.
Notice that I arranged the keys in the tuple to sort in an order that is different from the order in which the fields appear in the key itself. That was not necessary; I could have done
for key in keyList:
    sortKeys = key.split('_')
    keyTuple = (sortKeys[1], sortKeys[3], sortKeys[5], sortKeys[7], key)
    list_to_sort.append(keyTuple)
and then done the sort like so:
list_to_sort.sort(key=itemgetter(0, 3, 1, 2))
It was just easier for me to track the first one through.
