Joining Two Tables in Python Based on Condition [duplicate] - python

This question already has answers here:
How to join two dataframes for which column values are within a certain range?
(9 answers)
Closed 4 years ago.
I have two tables in pandas:
df1: Containing User IDs and IP_Addresses for 150K users.
| User_ID | IP_Address  |
|---------|-------------|
| U1      | 732758368.8 |
| U2      | 350311387.9 |
| U3      | 2621473820  |
df2: Containing IP Address range and country it belongs to, 139K records
| Country   | Lower_Bound_IP | Upper_Bound_IP |
|-----------|----------------|----------------|
| Australia | 1023787008     | 1023791103     |
| USA       | 3638734848     | 3638738943     |
| Australia | 3224798976     | 3224799231     |
| Poland    | 1539721728     | 1539721983     |
My objective is to create a Country column in df1 such that each user's IP_Address falls within the Lower_Bound_IP and Upper_Bound_IP range of that country in df2.
| User_ID | IP_Address  | Country   |
|---------|-------------|-----------|
| U1      | 732758368.8 | Indonesia |
| U2      | 350311387.9 | Australia |
| U3      | 2621473820  | Albania   |
My first approach was to do a cross join (cartesian product) of the two tables and then filter to the relevant records. However, a cross join using pandas.merge() is not feasible, since it will create 21 billion records. The code crashes every time. Could you please suggest an alternative solution which is feasible?

I'm not really sure how to do this with pandas.where, but with numpy.where you can do
import numpy

idx = numpy.where((df1.IP_Address.values[:, None] >= df2.Lower_Bound_IP.values[None, :])
                  & (df1.IP_Address.values[:, None] <= df2.Upper_Bound_IP.values[None, :]))[1]
df1["Country"] = df2.Country.values[idx]
numpy.where gives the indices where the given condition is True. & corresponds to 'and', and the [:, None] bit adds a dummy axis where None is located. This makes sure that, for each User_ID, the indices in df2 are found where the IP_Address is within the range. The trailing [1] selects the second axis of the result, i.e. the row indices in df2 where the condition is True. Using .values avoids index alignment problems when assigning back to df1.
This will break down if there's overlap in your ranges in df2, or if some IP_Address doesn't fall inside any range (idx would then have fewer entries than df1 has rows).
This might still cause you to have memory issues, but you could add a loop such that you do this comparison in batches. E.g.
batch_size = 1000
n_batches = df1.shape[0] // batch_size
# Integer division rounds down, so if the number
# of User_IDs is not divisible by the batch_size,
# we need to add 1 to n_batches
if n_batches * batch_size < df1.shape[0]:
    n_batches += 1

ips = df1.IP_Address.values
lower = df2.Lower_Bound_IP.values
upper = df2.Upper_Bound_IP.values

indices = []
for i in range(n_batches):
    batch = ips[i * batch_size:(i + 1) * batch_size]
    # For every IP in this batch, find the row in df2 whose range contains it
    idx = numpy.where((batch[:, None] >= lower[None, :])
                      & (batch[:, None] <= upper[None, :]))[1]
    indices.extend(idx.tolist())
df1["Country"] = df2.Country.values[numpy.asarray(indices)]

Related

How to create a lambda function to sum dataframe values based on criteria and presence in a list

I have one dataframe containing a daily employee list, and another containing a series of sales.
daily_employee_df:
| EE_ID| Date |
| -----| ----------|
| 101| 20220904 |
| 102| 20220904 |
| 106| 20220904 |
| 102| 20220905 |
| 103| 20220905 |
| 104| 20220905 |
all_sales_df:
| Sale_ID | Date | Sale_Amt| EEs_Present |
| ------- | --------|---------|----------------|
| 0001| 20220904| 100.04| [101, 102, 106]|
| 0002| 20220905| 998.06| [102, 103, 104]|
What is an efficient way to sum the Sale_Amt values each employee was present for on each day and add that sum to daily_employee_df? I'm dealing with thousands of sales each day.
I was able to get the number of sales for each employee and day using the following:
daily_employee_df['EE_Sales'] = daily_employee_df.apply(
    lambda x: len(all_sales_df[(all_sales_df['Date'] == x['Date'])
                               & ([str(x['EE_ID']) in c for c in list(all_sales_df['EEs_Present'])])]),
    axis=1)
But I have not been able to sum the sale total in a similar way. I tried wrapping it with sum, but the syntax didn't seem to work.
Thanks for any help!
Very close - you can use sum() and add the column you're summing at the end with ['Sale_Amt']
Count of sales (already done in the question):
daily_employee_df['EE_Sales_Count'] = daily_employee_df.apply(
    lambda x: len(all_sales_df[(all_sales_df['Date'] == x['Date'])
                               & ([str(x['EE_ID']) in c for c in list(all_sales_df['EEs_Present'])])]),
    axis=1)
Sum of sales:
daily_employee_df['EE_Sales_Sum'] = daily_employee_df.apply(
    lambda x: sum(all_sales_df[(all_sales_df['Date'] == x['Date'])
                               & ([str(x['EE_ID']) in c for c in list(all_sales_df['EEs_Present'])])]['Sale_Amt']),
    axis=1)
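If the row-wise apply gets slow with thousands of sales per day, a possible alternative is to explode the sales to one row per employee present and aggregate once. A sketch, assuming EEs_Present holds lists whose entries match the type of EE_ID:
# One row per (sale, employee present), then count and sum per employee and day
exploded = all_sales_df.explode('EEs_Present').rename(columns={'EEs_Present': 'EE_ID'})
per_ee = (exploded.groupby(['EE_ID', 'Date'])['Sale_Amt']
          .agg(EE_Sales_Count='count', EE_Sales_Sum='sum')
          .reset_index())
daily_employee_df = daily_employee_df.merge(per_ee, on=['EE_ID', 'Date'], how='left')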

pyspark create all possible combinations of column values of a dataframe

I want to get all the possible combinations of size 2 of a column in pyspark dataframe.
My pyspark dataframe looks like
| id |
|----|
| 1  |
| 2  |
| 3  |
| 4  |
For above input, I want to get output as
| id1 | id2 |
|-----|-----|
| 1   | 2   |
| 1   | 3   |
| 1   | 4   |
| 2   | 3   |
and so on..
One way would be to collect the values into a Python iterable (list, pandas df) and use itertools.combinations to generate all combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method, and then keep only the rows with id1 < id2.
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
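Roughly the same thing can be written with an explicit join condition instead of crossJoin plus filter; a sketch, assuming df has a single id column:
left = df.withColumnRenamed('id', 'id1')
right = df.withColumnRenamed('id', 'id2')
pairs = left.join(right, left.id1 < right.id2)
Either way the pairs are generated on the Spark workers rather than collected to the driver.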

Split and use values from Excel columns

I'm really new to coding. I have columns in Excel for the ingredients, the ratio (spec) and the quantity.
Like this:
ingredients [methanol/ipa, ethanol/methanol, ethylacetate]
spec [90/10, 70/30, 100]
qty [5, 6, 10]
This data is entered continuously. I want to get the total amount of each ingredient, e.g. from the first row methanol would be 5 x 90 and ipa would be 5 x 10 (the spec being a percentage split of the qty).
I tried to split them based on / and use a for loop to iterate
import pandas as pd

solv = {'EA': 0, 'M': 0, 'AL': 0, 'IPA': 0}
data_xls1 = pd.read_excel(r'C:\Users\IT123\Desktop\Solvent stock.xlsx',
                          sheet_name='PLANT', index_col=None)
sz = range(len(data_xls1.index))
a = data_xls1.Solvent.str.split('/', 0).tolist()
b = data_xls1.Spec.str.split('/', 0).tolist()
print(a)
for i in sz:
    print(b[i][0:1])
    print(b[i][1:2])
I want to split the ingredients and spec columns, multiply them with qty, and store the totals in the solv dictionary.
The error right now is: 'float' object is not subscriptable.
You have already found the key part, namely using the str.split function.
I would suggest that you bring the data to a long format like this:
| | Transaction | ingredients | spec | qty |
|---:|--------------:|:--------------|-------:|------:|
| 0 | 0 | methanol | 90 | 4.5 |
| 1 | 0 | ipa | 10 | 0.5 |
| 2 | 1 | ethanol | 70 | 4.2 |
| 3 | 1 | methanol | 30 | 1.8 |
| 4 | 2 | ethylacetate | 100 | 10 |
The following code produces that result:
import pandas as pd

d = {"ingredients": ["methanol/ipa", "ethanol/methanol", "ethylacetate"],
     "spec": ["90/10", "70/30", "100"],
     "qty": [5, 6, 10]}
df = pd.DataFrame(d)
df.index = df.index.rename("Transaction")  # Add a sensible name to the index
# Each line represents a transaction with one or more ingredients.
# The following lines split them by the delimiter; stack() moves them to long format.
ingredients = df.ingredients.str.split("/", expand=True).stack()
spec = df.spec.str.split("/", expand=True).stack()
Each of them will look like this:
| (Transaction, split) | spec |
|----------------------|-----:|
| (0, 0)               |   90 |
| (0, 1)               |   10 |
| (1, 0)               |   70 |
| (1, 1)               |   30 |
| (2, 0)               |  100 |
Now we just need to put everything together:
df_new = pd.concat([ingredients, spec], axis="columns")
df_new.columns = ["ingredients", "spec"]
# Switch from string to float
df_new.spec = df_new.spec.astype("float")
# Multiply by the quantity;
# broadcasting on the Transaction level lines each split up with its original row
df_new["qty"] = df_new.spec.mul(df.qty, level="Transaction") / 100
# As long as you are not comfortable working with a MultiIndex, just run this line:
df_new = df_new.reset_index(level=0, drop=False).reset_index(drop=True)
The good thing about this format is that you can have multi-way splits for your ingredients, str.split will work without a problem, and summing up is straightforward.
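From there, the per-ingredient totals you were after are a one-liner; a sketch:
# Total quantity of each ingredient across all transactions
totals = df_new.groupby("ingredients")["qty"].sum()
print(totals)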
I should have posted this first, but this is what my input Excel sheet looks like.

Optimal database lookup and update in pandas/python

theoretical database/coding query here - Python / Pandas dataframe related. I'm dealing with up to 50k rows in a table so optimal solutions seem... erm, optimal. And I'm no coding expert either, so, bear with me.
I have a table with unique child code/country pair rows, some with matching parent codes. E.g.:
| Index | Parent | Child | Country | NewValue |
|-------|--------|-------|---------|----------|
| 0     | A      | A-1   | X       | Null     |
| 1     | A      | A-1   | Y       | Null     |
| 2     | A      | A-2   | X       | Null     |
| 3     | B      | B-1   | X       | Null     |
| 4     | B      | B-2   | Y       | Null     |
I need to update every Parent / Country pair with a calculated unique value (NewValue). What's the best approach to finding and updating each pair over every row?
So far I'm generating a separate list of unique Parent / Country pairs (to avoid calculating NewValue for every row needlessly; I just iterate through this list generating NewValue for each pair), e.g.:
| Parent | Country |
|--------|---------|
| A      | X       |
| A      | Y       |
| B      | X       |
| B      | Y       |
Now, is it better to simply do a lookup in the first table for every given parent/country match, get the row index for any matching rows, and then update via the row index?
Or, generate the second table in a way that includes any relevant indexes to start with, and use these to update the first table? Eg:
| Parent | Country | Index(es) |
|--------|---------|-----------|
| A      | X       | 0, 2      |
| A      | Y       | 1         |
| B      | X       | 3         |
| B      | Y       | 4         |
If option 2, how? Because I'm using df.unique() to generate the second table, I only get one index per pair, not all matching indexes (and I'm not sure how they'd show up if I did). And I'm not sure if either way is particularly good, but it's the best I've come up with in a day :o)
Thanks,
Christopher / pepsi_max2k
You might want to look at the merge function.
What you have to do in your case is
df_children.merge(df_parent, on=["Parent","Country"])
where df_children is your table with [Index | Parent | Child | Country] columns and df_parent has [Parent | Country | NewValue]
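Putting it together, one way this could look is the following sketch (compute_new_value here is just a stand-in for whatever calculation produces NewValue):
# Unique Parent/Country pairs, so NewValue is calculated only once per pair
pairs = df_children[["Parent", "Country"]].drop_duplicates()
pairs["NewValue"] = pairs.apply(
    lambda r: compute_new_value(r["Parent"], r["Country"]), axis=1)

# Merge the calculated values back onto every matching row
df_children = (df_children.drop(columns="NewValue")
               .merge(pairs, on=["Parent", "Country"], how="left"))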

Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
|-----------|--------|---|---|---|
| 0         | 1      | 1 | 2 | 3 |
| 0         | 2      | 4 | 5 | 6 |
| 1         | 1      | 1 | 2 | 3 |
| 1         | 2      | 7 | 8 | 9 |
I wish to create a dataframe based on results from this. So a simple example would be my new dataframe should contain a row with the average information from a sample run:
| sample_id | avg_x | avg_y | avg_z |
|-----------|-------|-------|-------|
| 0         | 2.5   | 3.5   | 4.5   |
| 1         | 4     | 5     | 6     |
At the moment I do this with a loop:
pivots = []
for i in samples:
df_sample = df_samples[df_samples['sample_id'] == i]
pivot = df_sample.pivot_table(index=index, columns='run_id', values=[x, y, z], aggfunc='mean')
# Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
pivots.append(pivot)
# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
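For the simple averaging case above, I imagine a single grouped aggregation might be what I'm looking for; an untested sketch of what I mean:
# Average x, y and z over all runs of each sample in one call
result = (df_samples.groupby('sample_id')[['x', 'y', 'z']]
          .mean()
          .add_prefix('avg_')
          .reset_index())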
My second question involves the more complicated case: whether it is possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also increases its dimensions, i.e. it might look like
| s_id | avg_x | avg_y | avg_z | new_feature_1                   | new_feature_2                                     |
|------|-------|-------|-------|---------------------------------|---------------------------------------------------|
| 0    | 2.5   | 3.5   | 4.5   | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1    | 4     | 5     | 6     | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often not efficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I do not know about dataframes and tables etc. A recommendation for a resource would be good. Note that this is just additional helpful information; a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.
