Optimal database lookup and update in pandas/python

Theoretical database/coding query here, Python / pandas dataframe related. I'm dealing with up to 50k rows in a table, so optimal solutions seem... erm, optimal. And I'm no coding expert either, so bear with me.
I have a table with unique child code/country pair rows, some with matching parent codes. E.g.:
Index | Parent | Child | Country | NewValue
0 | A | A-1 | X | Null
1 | A | A-1 | Y | Null
2 | A | A-2 | X | Null
3 | B | B-1 | X | Null
4 | B | B-2 | Y | Null
I need to update every Parent / Country pair with a calculated unique value (NewValue). What's the best approach to finding and updating each pair over every row?
So far I'm generating a separate list of unique Parent / Country pairs (to avoid calculating NewValue for every row needlessly; I just iterate through this list, generating NewValue for each pair), e.g.:
Parent | Country
A | X
A | Y
B | X
B | Y
Now, is it better to simply do a lookup in the first table for every given parent/country match, get the row index for any matching rows, and then update via the row index?
Or, generate the second table in a way that includes any relevant indexes to start with, and use these to update the first table? Eg:
Parent | Country | Index(s)
A | X | 0,2
A | Y | 1
B | X | 3
B | Y | 4
If the second, how? Because I'm using df.unique() to generate the second table, I only get one index per pair, not all matching indexes (and I'm not sure how they'd show up if I did). And I'm not sure if either way is particularly good, but it's the best I've come up with in a day :o)
Thanks,
Christopher / pepsi_max2k

You might want to look at the merge function.
What you have to do in your case is
df_children.merge(df_parent, on=["Parent","Country"])
where df_children is your table with the [Index | Parent | Child | Country] columns and df_parent has [Parent | Country | NewValue].
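To make that concrete, here is a minimal sketch of the merge-based approach for the tables in the question; calc_new_value is a hypothetical placeholder for whatever computes NewValue for a Parent/Country pair:
import pandas as pd

df_children = pd.DataFrame({
    'Parent':  ['A', 'A', 'A', 'B', 'B'],
    'Child':   ['A-1', 'A-1', 'A-2', 'B-1', 'B-2'],
    'Country': ['X', 'Y', 'X', 'X', 'Y'],
})

# unique Parent/Country pairs, so NewValue is calculated once per pair
df_parent = df_children[['Parent', 'Country']].drop_duplicates().reset_index(drop=True)
df_parent['NewValue'] = [calc_new_value(p, c)
                         for p, c in zip(df_parent['Parent'], df_parent['Country'])]

# a single merge broadcasts NewValue back onto every matching child row
result = df_children.merge(df_parent, on=['Parent', 'Country'], how='left')
This also sidesteps the index question: with a merge there is no need to track which row indexes belong to each pair, because the join on Parent and Country does that matching for you.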

Related

pyspark create all possible combinations of column values of a dataframe

I want to get all possible combinations of size 2 of a column in a pyspark dataframe.
My pyspark dataframe looks like
| id |
| 1 |
| 2 |
| 3 |
| 4 |
For the above input, I want to get the output as:
| id1 | id2 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 3 |
and so on..
One way would be to collect the values into a Python iterable (a list, or a pandas df) and use itertools.combinations to generate all combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method, and then cull the rows where id1 >= id2.
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
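As an aside (not part of the original answer), the same pairing can also be written as a self-join with an inequality condition; Spark still has to compare every pair of rows either way, so this is mainly a readability variant:
from pyspark.sql import functions as F

# hypothetical equivalent of the crossJoin + filter above
pairs = df.toDF('id1').join(df.toDF('id2'), F.col('id1') < F.col('id2'))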

How do I avoid getting a NaN throughout a column in pandas?

My end goal: I want to upload a new table file (be it excel, pdf, txt, csv) to a master excel spreadsheet. Then, pick out several columns from that master datasheet and then graph and group them by their state: are these samples falling under categories X or Y? And then plot them looking at each sample.
Datawise, I have samples that fall under X category or Y category. They are either X or Y, and have names associated with the samples as well as sample counts (e.g. sample #abc falls under category X and has 30 counts with different values).
I am using pandas to open the data and manipulate the tables. Here is the section of my code that is giving me issues; I have found workarounds for all the other issues, but not this one. I have tried doing fillna(df_choice.column) and found that it doesn't replace the NaN. Trying reset_index or set_index doesn't really help either. A sample output would have X (if I entered "X" first) or Y (if I entered "y" first) as NaN, yet the other three columns have no issues!
df_choice is a previous dataframe where I selected two columns from a master datasheet. In this case, it'd be columns X and Y.
df_new = {'X': [], 'Y': [], 'Property of X (units)': [], 'Property of Y (units)': []}  # setup dict
df_new = pd.DataFrame.from_dict(df_new)  # dict to df
for df_choice.column in df_choice.columns:  # list columns in previous dataset
    print(df_choice.column)
    state = input('Is this sample considered X or Y? Input X or Y, or quit to exit the loop')  # ask if column in previous dataset falls under X or Y
    if state == 'quit':
        break
    elif state == 'x':
        df_new['Property of X (units)'] = df_choice[df_choice.column]  # takes data from old dataframe into new
        df_new['X'] = 'df_choice.column'  # fills column X with column name from df_choice
    elif state == 'y':
        df_new['Y'] = 'df_choice.column'
        df_new['Property of Y (units)'] = df_choice[df_choice.column]
    else:
        print('Not a valid response')
df_new  # prints new df
What I see (only showing 4 rows, but imagine every row as NaN):
+-----+------------+----------------+----------------+
| X | Y | Property of X | Property of Y |
+-----+------------+----------------+----------------+
| NaN | Sample123 | 4 | 3 |
| NaN | Sample123 | 5 | 4 |
| NaN | Sample123 | 3 | 6 |
| NaN | Sample123 | 4 | 1 |
+-----+------------+----------------+----------------+
What I should get:
+-----------+------------+----------------+----------------+
| X | Y | Property of X | Property of Y |
+-----------+------------+----------------+----------------+
| SampleABC | Sample123 | 4 | 3 |
| SampleABC | Sample123 | 5 | 4 |
| SampleABC | Sample123 | 3 | 6 |
| SampleABC | Sample123 | 4 | 1 |
+-----------+------------+----------------+----------------+
Eventually, I assume I'd want something like df_new.melt() so I can graph them, grouping bars or boxplots by X or Y:
+-----------+-------+-----------+
| Sample | Type | Property |
+-----------+-------+-----------+
| SampleABC | X | 4 |
| SampleABC | X | 5 |
| SampleABC | X | 3 |
| SampleABC | X | 4 |
| Sample123 | Y | 3 |
| Sample123 | Y | 4 |
| Sample123 | Y | 6 |
| Sample123 | Y | 1 |
+-----------+-------+-----------+
I am a month or so into self-taught coding, so I apologize if my code is inefficient or not very clever; I come across issues, look up how other people handle them, and see what works for me. I have no formal training and am a materials scientist by background. I don't know a whole lot, and I figured the best way to learn is to get some fundamentals down and then make something genuinely useful to me.
You should pass the name of your column as a vector, rather than a single string when inserting into an empty dataframe.
Think about it this way: you're creating a column in an empty dataframe by passing a single string to it. But how can Pandas know what length the column should have?
The variable in your for-loop also has a somewhat confusing name: the dot in "df_choice.column" makes it look as if you're accessing a dataframe attribute.
Putting it together:
for colname in df_choice.columns:
    # ... #
    elif state == 'x':
        # takes data from old dataframe into new
        df_new['Property of X (units)'] = df_choice[colname]
        # fills column X with column name from df_choice
        df_new['X'] = np.repeat(colname, df_choice.shape[0])
    elif state == 'y':
        df_new['Y'] = np.repeat(colname, df_choice.shape[0])
        df_new['Property of Y (units)'] = df_choice[colname]
Notice that I replaced the line for your "Y" variable as well, just in case it comes up before "X".
To use np.repeat, import the library:
import numpy as np
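If it helps with the reshaping step mentioned at the end of the question, here is a rough sketch of producing the Sample / Type / Property layout; it uses pd.concat on renamed slices rather than melt, and assumes df_new has been filled as above:
import pandas as pd

x_part = (df_new[['X', 'Property of X (units)']]
          .rename(columns={'X': 'Sample', 'Property of X (units)': 'Property'})
          .assign(Type='X'))
y_part = (df_new[['Y', 'Property of Y (units)']]
          .rename(columns={'Y': 'Sample', 'Property of Y (units)': 'Property'})
          .assign(Type='Y'))
long_df = pd.concat([x_part, y_part], ignore_index=True)[['Sample', 'Type', 'Property']]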

Add column from one dataframe to another WITHOUT JOIN

Referring to here, which recommends a join to append a column from one table to another. I have indeed been using this method, but I am now reaching its limits with a huge number of tables and rows.
Let's say I have a dataframe of M features: id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_slaary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+----+--------+------------+--------------+------------+
| id | salary | stat1_salary | stat2_salary | stat3_salary|
+----+--------+------------+--------------+------------+
| 301 | x1 | x | x | x |
| 302 | null | x | x | x |
| 303 | x3 | x | x | x |
+----+--------+------------+--------------+------------+
| id | age | stat1_age | stat2_age
+----+--------+------------+--------------+
| 301 | null | x | x
| 302 | x2 | x | x
| 303 | x3 | x | x
In the end, I would like to combine them into a final dataframe with all the attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector.
| id | salary | stat1_salary | stat2_salary | stat3_salary| age | stat1_age | stat2_age
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and it is limited by the admin.
JOIN is expensive and resource-limited in pyspark, and I wonder if it's possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows for each feature table, and I hope to need no join or lookup because my set of ids is the same.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the data for storage, and retrieval (if I want to query it back to append) does not guarantee the same order.
There doesn't seem to be a Spark function to append a column from one DF to another directly, other than 'join'.
If you are starting from a single dataframe and generating new features from each of its original columns, I would suggest using 'pandas_udf', where the new features can be appended inside the 'udf' for all the original columns.
This avoids using 'join' at all.
To control memory usage, choose the 'group' column so that each group fits within the executor memory specification.
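A rough sketch of that idea, assuming Spark 3's applyInPandas (the successor to the grouped-map pandas_udf); the 'group' column and the feature calculations here are placeholders, not the asker's actual logic:
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# output schema of the grouped pandas function
schema = StructType([
    StructField('id', LongType()),
    StructField('salary', DoubleType()),
    StructField('stat1_salary', DoubleType()),
    StructField('stat2_salary', DoubleType()),
])

def make_features(pdf):
    # pdf is a pandas DataFrame holding one group; append new feature columns here
    pdf['stat1_salary'] = pdf['salary'].rank()
    pdf['stat2_salary'] = pdf['salary'] - pdf['salary'].mean()
    return pdf[['id', 'salary', 'stat1_salary', 'stat2_salary']]

result = df.groupBy('group').applyInPandas(make_features, schema)
All the per-feature columns are produced inside one pass over the grouped data, so no join between feature tables is needed.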

Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a dataframe based on results from this. As a simple example, my new dataframe should contain one row per sample with the average information across its runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=[x, y, z], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)

# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of having to call it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe will also increase its dimensions, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often not efficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I do not know on dataframes and tables, etc. A recommendation for a resource would be good. Note that this is just additional helpful information; a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.
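For the simple case above, a sketch of doing it in one pass instead of a per-sample loop (this assumes the averages table shown earlier is the intended result):
avg = (df_samples
       .groupby('sample_id')[['x', 'y', 'z']]
       .mean()
       .add_prefix('avg_')
       .reset_index())
Additional per-sample features can be built the same way, for example with .agg(...) or a groupby(...).apply(...) per group, and then joined back on sample_id.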

Joining Two Tables in Python Based on Condition [duplicate]

This question already has answers here:
How to join two dataframes for which column values are within a certain range?
I have two tables in pandas:
df1: Containing User IDs and IP_Addresses for 150K users.
|---------------|---------------|
| User_ID | IP_Address |
|---------------|---------------|
| U1 | 732758368.8 |
| U2 | 350311387.9 |
| U3 | 2621473820 |
|---------------|---------------|
df2: Containing IP address ranges and the country each belongs to, 139K records
|---------------|-----------------|------------------|
| Country | Lower_Bound_IP | Upper_Bound_IP |
|---------------|-----------------|------------------|
| Australia | 1023787008 | 1023791103 |
| USA | 3638734848 | 3638738943 |
| Australia | 3224798976 | 3224799231 |
| Poland | 1539721728 | 1539721983 |
|---------------|-----------------|------------------|
My objective is to create a Country column in df1 such that each IP_Address in df1 lies within the range of Lower_Bound_IP and Upper_Bound_IP for that country in df2.
|---------------|---------------|---------------|
| User_ID | IP_Address | Country |
|---------------|---------------|---------------|
| U1 | 732758368.8 | Indonesia |
| U2 | 350311387.9 | Australia |
| U3 | 2621473820 | Albania |
|---------------|---------------|---------------|
My first approach was to do a cross join (cartesian product) of the two tables and then filter to the relevant records. However, a cross join using pandas.merge() is not feasible, since it will create 21 billion records. The code crashes every time. Could you please suggest an alternative solution which is feasible?
I'm not really sure how to do this with pandas.where, but with numpy.where you can do
import numpy as np

idx = np.where((df1.IP_Address.to_numpy()[:, None] >= df2.Lower_Bound_IP.to_numpy()[None, :])
               & (df1.IP_Address.to_numpy()[:, None] <= df2.Upper_Bound_IP.to_numpy()[None, :]))[1]
df1["Country"] = df2.Country.to_numpy()[idx]
numpy.where gives the indices where the given condition is True. & corresponds to 'and', and the [:, None] bit adds a dummy axis where None is located. This makes sure that, for each User_ID, the indices in df2 are found where the IP_Address is within the range. The [1] gives the indices in df2 where the condition is True.
This will break down if there's overlap in your ranges in df2.
This might still cause you to have memory issues, but you could add a loop such that you do this comparison in batches. E.g.
batch_size = 1000
n_batches = df1.shape[0] // batch_size
# Integer division rounds down, so if the number
# of User_IDs is not divisible by the batch_size,
# we need to add 1 to n_batches
if n_batches * batch_size < df1.shape[0]:
    n_batches += 1

indices = []
for i in range(n_batches):
    batch = df1.IP_Address.to_numpy()[i * batch_size:(i + 1) * batch_size, None]
    idx = np.where((batch >= df2.Lower_Bound_IP.to_numpy()[None, :])
                   & (batch <= df2.Upper_Bound_IP.to_numpy()[None, :]))[1]
    indices.extend(idx.tolist())
df1["Country"] = df2.Country.to_numpy()[np.asarray(indices)]
