Merge on multiple OR conditions in Python

I have two dataframes. One has product ID information and the other is a master dataframe with a zone mapping along with a mapping ID.
import pandas as pd

# this is a dummy dataframe for the example
product_df = pd.DataFrame([['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']], columns=['roll', 'prod_id'])
master_df = pd.DataFrame([['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']], columns=['Zone', 'match_id'])
I want to get the Zone information alongside the product_df records, and the logic for that (in SQL) is:
select product_df.*, master_df.Zone
from product_df
left join master_df
  on '+' || match_id = substr(prod_id, 1, 2)
  or '+' || match_id = substr(prod_id, 1, 3)
  or '+' || match_id = substr(prod_id, 1, 4)
  or '+' || match_id = substr(prod_id, 1, 5)
Basically this will be a merge on condition situation.
I know that, due to this joining logic, multiple zones may be mapped to the same roll, but that is the ask.
I am trying to implement the same in Python. I am using the following code:
master_df_dict = master_df.set_index('match_id')['Zone'].to_dict()
keys_list = ['+' + key for key in master_df_dict.keys()]

def zone(pr_id):
    if pr_id[0:2] in keys_list:
        # get the zone information using the matched key; as the
        # first character is always '+', the dict key starts at index 1
        C = master_df_dict[pr_id[1:2]]
    elif pr_id[0:3] in keys_list:
        C = master_df_dict[pr_id[1:3]]
    elif pr_id[0:4] in keys_list:
        C = master_df_dict[pr_id[1:4]]
    elif pr_id[0:5] in keys_list:
        C = master_df_dict[pr_id[1:5]]
    else:
        C = ''
    return C

product_df['Zone_info'] = product_df['prod_id'].apply(zone)
There are two problems with this approach:
It will only give me the first matched code; even if later conditions would also match, it returns as soon as one condition matches.
Using this approach, it takes around 45 minutes to process 1700 records.
I need help in the following areas:
Why is it taking so long for the above code to run, and how can I make it faster?
Can I get an exact Python implementation of the SQL logic mentioned above, that is, joining based on multiple conditions? As far as I have searched, pandas does not have the functionality to merge on conditions the way SQL does. Is there a workaround for this?
If there is no way to merge on all the OR conditions, is there a way to merge on at least the first matching condition and make it run faster?
Please help!!

Your matching criterion is not really a key. In these cases I form a Cartesian product, then establish a way to find the unique rows required. This is roughly equivalent to the way a relational database evaluates a join, especially when the join condition is an inefficient expression.
Cartesian product idiom:
derive 3 additional columns: a) expr, a regular expression to match match_id against prod_id; b) len, the length of match_id, used in sort_values(); c) join, True if match_id lines up with prod_id on the criteria you specified
sort and take the first interesting record
reset values where no join was found
drop the columns used to get this working
A better solution would be to have a reliable join key...
import re
import pandas as pd

product_df = pd.DataFrame(
    [['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']], columns=['roll', 'prod_id'])
master_df = pd.DataFrame([['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']], columns=['Zone', 'match_id'])

# Cartesian product via a dummy join key
cp = product_df.assign(foo=1).merge(master_df.assign(foo=1)).drop(columns="foo")
cp["len"] = cp.match_id.str.len()
# regex: '+' then the match_id digits, with everything after the first two made optional
cp["expr"] = cp.apply(lambda r: "^[+]" + "".join([f"[{c}]{'' if i < 2 else '?'}" for i, c in enumerate(r.match_id)]), axis=1)
# True where match_id lines up with the start of prod_id
cp["join"] = cp.apply(lambda r: re.search(r.expr, r.prod_id) is not None, axis=1)
# keep the best (matched, longest) record per Zone
cp = cp.sort_values(["Zone", "join", "len"], ascending=[True, False, False]).reset_index() \
       .groupby(["Zone"]).first().reset_index()
# blank out rows where no join was found, then drop the helper columns
cp.loc[~cp["join"], ["roll", "prod_id"]] = ""
cp.drop(["len", "expr", "join", "index"], axis=1)
output
  Zone    roll   prod_id match_id
0   AZ                         32
1   PO  cshdgs  +1352648       13
2   RT     abc     +1235       12
3   SZ  cshdgs  +1352648     1352
4   WW     abc     +1235      123
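If every matching Zone per roll is needed (the question notes that multiple zones may map to the same roll), here is a minimal sketch of an alternative, assuming pandas 1.2+ for how='cross'; it keeps all matches and then left-joins back so unmatched rolls survive, mirroring the SQL:

import pandas as pd

product_df = pd.DataFrame([['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']],
                          columns=['roll', 'prod_id'])
master_df = pd.DataFrame([['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']],
                         columns=['Zone', 'match_id'])

# cross join every product row with every master row (pandas >= 1.2)
cross = product_df.merge(master_df, how='cross')

# '+' || match_id = substr(prod_id, 1, k) for k = 2..5 amounts to prod_id
# starting with '+' + match_id when match_id has 1 to 4 characters
keep = [p.startswith('+' + m) for p, m in zip(cross['prod_id'], cross['match_id'])]
matches = cross[keep]

# left join back so rolls without any matching zone are preserved, like the SQL left join
result = product_df.merge(matches[['roll', 'prod_id', 'Zone']],
                          on=['roll', 'prod_id'], how='left')
print(result)

Unlike the first-match apply, this keeps every qualifying zone, at the cost of materialising the full cross product (trivial at 1700 x 5 rows).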

Related

Dataframes - equivalent to JOIN with LIKE condition, or value in sublist

I have 2 dataframes from 2 different sources.
System A

system_a_id  designation
A10001       Catalog A1234
A10002       Catalog A1235

System B

system_b_id  name          other_ids
B20008       Thing_B20008  Yabbadabbadoo, Bender, Catalog A1234
B20009       Thing_B20009  Snark Snark, Catalog A1235, Leela
I would like to be able to join these together, into one row with all columns, based on 'designation' being found as a substring within 'other_ids'
In SQL this would be written simply as:
SELECT A.*, B.*
FROM A
LEFT JOIN B
ON B.other_ids LIKE CONCAT('%', A.designation, '%')
Now, I imagine that there is either a substring 'contains' search to use here, or I could break B.other_ids into its own list and try to use an apply function of some kind - but I'm struggling with the syntax for either method, let alone performance. (This is going to be a LOT of records joined - half a million.)
Looks like "designation" appears as the last comma-separated part of "other_ids". You could split "other_ids" on the comma, take the last part and assign it as the "designation" column of systemB. Then merge it with systemA on "designation":
out = systemB.assign(designation=systemB['other_ids'].str.split(', ').str[-1]).merge(systemA, on='designation', how='left')
In general, you could use str.contains to identify the indices that match, rearrange then join:
# for each designation in A, find the (first) index of the row in B whose other_ids contains it
idx = A['designation'].apply(lambda x: B.index[B['other_ids'].str.contains(x)][0])
# pull those B rows and join A's columns alongside
out = B.loc[idx].join(A)
Output:
system_b_id name other_ids \
0 B20008 Thing_B20008 Yabbadabbadoo, Bender, Catalog A1234
1 B20009 Thing_B20009 Snark Snark, Leela, Catalog A1235
designation system_a_id
0 Catalog A1234 A10001
1 Catalog A1235 A10002
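A hedged sketch of a further option that keeps the LEFT JOIN semantics (A rows with no match survive) and tolerates multiple matches: build the cross product first and test the literal substring, which is what LIKE '%designation%' does. At half a million rows you would want to chunk or pre-filter before the cross merge; the frame names systemA/systemB are taken from the question.

import pandas as pd

systemA = pd.DataFrame({'system_a_id': ['A10001', 'A10002'],
                        'designation': ['Catalog A1234', 'Catalog A1235']})
systemB = pd.DataFrame({'system_b_id': ['B20008', 'B20009'],
                        'name': ['Thing_B20008', 'Thing_B20009'],
                        'other_ids': ['Yabbadabbadoo, Bender, Catalog A1234',
                                      'Snark Snark, Catalog A1235, Leela']})

# cross join, then keep pairs where designation is a literal substring of other_ids
cross = systemA.merge(systemB, how='cross')
hit = [d in o for d, o in zip(cross['designation'], cross['other_ids'])]
pairs = cross[hit]

# merge back on the A key so A rows without any match are kept, like LEFT JOIN ... LIKE
out = systemA.merge(pairs.drop(columns='designation'), on='system_a_id', how='left')
print(out)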

Python DataFrames: finding "almost" identical rows

I have a DF loaded with orders. Some of them contain negative quantities, and the reason for that is that they are actually cancellations of prior orders.
Problem: there is no unique key that can help me find which order corresponds to which cancellation.
So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):
for i, item in cancelations.iterrows():
    # find a row similar to the cancellation we are currently studying;
    # item is the row (a Series) yielded by iterrows()
    mask1 = (copy['CustomerID'] == item['CustomerID'])
    mask2 = (copy['Quantity'] == item['Quantity'])
    mask3 = (copy['Description'] == item['Description'])
    subset = copy[mask1 & mask2 & mask3]
    if subset.shape[0] > 0:  # if we find one or several corresponding orders
        print('possible corresponding orders:', subset.index.tolist())
        copy = copy.drop(subset.index.tolist()[0])  # remove only the first of them from the copy of the data
So, this works, but:
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you knows something that could help me?
Thank you for your time!
Edit: note that sometimes we can have several orders that could correspond to the cancellation at hand. This is why I didn't use drop_duplicates with only some columns specified: it eliminates all duplicates (or all but one), and I need to drop only one of them.
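A hedged sketch of one vectorized alternative, with toy data standing in for the real frame: number the orders and the cancellations within each (CustomerID, Description, absolute Quantity) group using groupby().cumcount(), then merge on those columns plus the counter so each cancellation pairs with at most one order. This assumes a cancellation carries the same CustomerID and Description as its order and the negated Quantity; if cancellations repeat the signed Quantity instead, drop the sign flip.

import pandas as pd

# toy data: negative Quantity marks a cancellation of an earlier order
df = pd.DataFrame({'CustomerID': [1, 1, 2, 1],
                   'Description': ['mug', 'mug', 'pen', 'mug'],
                   'Quantity': [5, 3, 2, -5]})

orders = df[df['Quantity'] > 0].copy()
cancels = df[df['Quantity'] < 0].copy()

keys = ['CustomerID', 'Description']

# absolute quantity so a -5 cancellation can meet a +5 order
orders['abs_qty'] = orders['Quantity']
cancels['abs_qty'] = -cancels['Quantity']

# a per-group running count acts as a synthetic key: the first cancellation in a
# group pairs with the first matching order, the second with the second, and so on
orders['seq'] = orders.groupby(keys + ['abs_qty']).cumcount()
cancels['seq'] = cancels.groupby(keys + ['abs_qty']).cumcount()

paired = cancels.merge(orders, on=keys + ['abs_qty', 'seq'],
                       suffixes=('_cancel', '_order'), how='left')
print(paired)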

Python Data Analysis from SQL Query

I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using python 2.7.14 Anaconda with cx_Oracle to Query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number; may contain multiple), and Account Flags (flag strings; may contain multiple). (3 columns total)
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts:
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but If someone could just point me in the right direction so I can research or try some tutorials that would be very helpful. Thank you!
First of all, you need to structure this data to make a good analysis; you can do it in your database engine or in Python (I will do it the latter way, using pandas as SNygard suggested).
To start, I create some fake data (based on what you provided):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This returns a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure so that we can apply group-by operations on department, relationships or flags.
We will convert our relationships and flags columns from string type to a python list of strings. So, the flags column will be a python list of flags, and the relationships column will be a python list of relations.
# strip the surrounding parentheses, then split each string into a Python list
df['relationships'] = df['relationships'].str.replace(r'[()]', '', regex=True)
df['relationships'] = df['relationships'].str.split(',')
df['flags'] = df['flags'].str.replace(r'[()]', '', regex=True)
df['flags'] = df['flags'].str.split(',')
df
The result is:
dataframe_1
With our relationships column converted to lists, we can create a new dataframe with as many columns
as the longest relations list has.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)
After that we need to stack our columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this moment we have all of our relations in rows, but we still can't group by the
relation_type feature. So we will split the relations data into two columns, relation_type and department, using ':'.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure is still not very usable. So we will convert the flag lists to columns and after that we will stack them.
flags = pd.DataFrame(df['flags'].values.tolist(), index=df.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is dataframe_4_clear_flags
Voilà! It's all ready to analyze.
So, for example, how many relations of each type we have, and which one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
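As a small follow-up sketch using the stacked flags series built above, the per-flag counts come out of a plain value_counts (the strip removes the leading space left by splitting on ','):

# count how many ids carry each flag, using the stacked `flags` series from above
flag_counts = flags.str.strip().value_counts()
flag_counts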
All code: Github project
If you're willing to consider other packages, take a look at pandas which is built on top of numpy. You can read sql statements directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
#        0                                         1                      2
# 0  12346  (135:2345678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag3)
# 1  12345  (136:2343678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] == 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]

Pandas: query string where column name contains special characters

I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -) then the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns as they correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With recent versions of pandas, you can escape a column name that contains special characters with a backtick (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier() ]
# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r
inv_replacements = {replacements[k] : k for k in replacements.keys()}
df = df.rename(columns=replacements) # Rename the columns
df = df.query(query) # Carry out query
df = df.rename(columns=inv_replacements)
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to @chrisb for the answer that pointed me in the right direction.
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']

Union all type query with python pandas

I am attempting to use pandas to perform data analysis on a flat source of data. Specifically, what I'm attempting to accomplish is the equivalent of a Union All query in SQL.
I am using the read_csv() method to input the data and the output has unique integer indices and approximately 30+ columns.
Of these columns, several contain identifying information, whilst others contain data.
In total, the first 6 columns contain identifying information which uniquely identifies an entry. Following these 6 columns there is a range of columns (A, B, ... etc.) which hold the values. Some of these columns are linked together in sets; for example (A,B,C) belong together, as do (D,E,F).
However, (D,E,F) are also related to (A,B,C) as follows ((A,D),(B,E),(C,F)).
What I am attempting to do is take my data set which has as follows:
(id1,id2,id3,id4,id5,id6,A,B,C,D,E,F)
and return the following
((id1,id2,id3,id4,id5,id6,A,B,C),
(id1,id2,id3,id4,id5,id6,D,E,F))
Here, as A and D are linked they are contained within the same column.
(Note, this is a simplification, there are approximately 12 million unique combinations in the total dataset)
I have been attempting to use the merge, concat and join functions to no avail. I feel like I am missing something crucial as in an SQL database I can simply perform a union all query (which is quite slow admittedly) to solve this issue.
I have no working sample code at this stage.
Another way of writing this problem, based on some of the pandas docs:
left = key, lval
right = key, rval
merge(left, right, on=key) = key, lval, rval
Instead I want:
left = key, lval
right = key, rval
union(left, right) = key, lval
                     key, rval
I'm not sure if a new indexing key value would need to be created for this.
I have been able to accomplish what I initially asked for.
It did require a bit of massaging of column names however.
Solution (using pseudo code):
Set up dataframes with the relevant data. e.g.
left = (id1,id2,id3,id4,id5,id6,A,B,C)
right = (id1,id2,id3,id4,id5,id6,D,E,F)
middle = (id1,id2,id3,id4,id5,id6,G,H,I)
Note here that for my dataset this resulted in non-unique indexing keys for each of the ids. That is, a key is present for each row in left and right.
Rename the column names.
col_names = [id1,id2,id3,id4,id5,id6,val1,val2,val3]
left.columns = col_names
right.columns = col_names
middle.columns = col_names
Concatenate these
pieces = [left, right, middle]
new_df = concat(pieces)
Now, this will create a new dataframe which contains x unique indexing values and 3x entries. This isn't quite ideal but it will do for now; the major shortfall is that you cannot uniquely access a single entry row anymore, since they will come in triples. To access the data you can create a new dataframe based on the unique id values.
e.g.
check_df = new_df[(new_df['id1'] == 'id1') & (new_df['id2'] == 'id2')]  # ... and so on for the other ids
print(check_df)
key, id1, id2, id3, id4, id5, id6, A, B, C
key, id1, id2, id3, id4, id5, id6, D, E, F
key, id1, id2, id3, id4, id5, id6, G, H, I
Now, this isn't quite ideal but it's the format I needed for some of my other analysis. It may not be applicable for all parties.
If anyone has a better solution please do share it, I'm relatively new to using pandas with python.
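For reference, a minimal runnable sketch of the rename-and-concat idea above, with made-up column names (two id columns and two linked value groups) standing in for the real layout:

import pandas as pd

# toy frame: id columns plus the linked value groups (A, B, C) and (D, E, F)
df = pd.DataFrame({'id1': [1, 2], 'id2': ['x', 'y'],
                   'A': [10, 20], 'B': [11, 21], 'C': [12, 22],
                   'D': [30, 40], 'E': [31, 41], 'F': [32, 42]})

ids = ['id1', 'id2']
val_names = ['val1', 'val2', 'val3']

left = df[ids + ['A', 'B', 'C']].copy()
right = df[ids + ['D', 'E', 'F']].copy()
left.columns = ids + val_names
right.columns = ids + val_names

# stack the pieces on top of each other, the pandas equivalent of UNION ALL;
# a 'source' column keeps each block distinguishable, which works around the
# "rows now come in multiples" shortfall mentioned above
union = pd.concat([left.assign(source='ABC'), right.assign(source='DEF')],
                  ignore_index=True)
print(union)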
