Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I am now hitting limitations with a huge number of tables and rows.
Let's say I have a dataframe of M features: id, salary, age, etc.
+----+--------+-----+------+-----+
| id | salary | age | zone | ... |
+----+--------+-----+------+-----+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+
+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe holding all attributes of every feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age | ...
I am hitting a memory limit that causes an OutOfMemory exception. Raising executor and driver memory seems to be only a temporary solution, and it is capped by the admin.
JOIN is expensive and resource-constrained in pyspark, and I wonder whether it is possible to pre-sort each feature table independently, keep that order, and just APPEND the columns next to one another instead of performing an expensive JOIN. I can make sure every feature table keeps the same list of rows. I hope to need no join or lookup at all, because the set of ids is identical in every table.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the data for storage, and retrieval (if I want to query it back to append) does not guarantee the same order.
There doesn't seem to be a Spark function to append a column from one dataframe to another directly, other than 'join'.
If you are starting from a single dataframe and trying to generate new features from each of its original columns, I would suggest using 'pandas_udf', where the new features for all the original columns can be appended inside the UDF. This avoids using 'join' at all. To control memory usage, choose the 'group' column so that each group stays within the executor memory specification.
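For illustration, here is a minimal sketch of that idea using Spark 3's grouped applyInPandas (the grouped flavour of pandas UDFs). The df name, the bucket grouping column, the example stats, and the output schema are placeholder assumptions, not your actual features:

import pandas as pd
import pyspark.sql.functions as F

# df: the single source dataframe with id plus the raw feature columns.
FEATURES = ["salary", "age"]  # placeholder list of original columns

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one whole group as a pandas DataFrame; every derived column
    # is computed here, so no Spark join is needed afterwards.
    for c in FEATURES:
        pdf["stat1_" + c] = pdf[c].rank(pct=True)   # example stat
        pdf["stat2_" + c] = pdf[c] - pdf[c].mean()  # example stat
    return pdf

# Bucket the ids so that each group fits comfortably in executor memory.
df = df.withColumn("bucket", F.abs(F.hash("id")) % 200)

# The schema must list every column the function returns (placeholder types).
out_schema = ("id long, salary double, age double, bucket int, "
              "stat1_salary double, stat2_salary double, "
              "stat1_age double, stat2_age double")

feature_vector = df.groupBy("bucket").applyInPandas(add_features, schema=out_schema)

Because all derived columns for a row are produced inside one pandas function call, the full feature vector comes out in a single pass with no join.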
First off, I'm writing this post on my phone while on the road, so sorry for the lack of info; I'm just trying to get a head start for when I get home.
I have 2 csv files, each with a different number of columns and a different number of records. The first file has about 150k records and the second about 1.2 million. The first column of the first file contains values that are sometimes present in a column of the second file and sometimes not. What I intend to do is check whether the value in column one of the first file is in the first column of the second file. If so, check whether the first file's second column is less than or greater than the value of a particular column in the second file where the first columns match. If so, update the first file's second column to the new value.
Side note: I don't need the fastest or most efficient way, I just need a working solution for now. I will optimize later. The code will be run once a month to update the csv file.
Currently I am attempting to accomplish this using pandas, loading each file into a dataframe, but I am struggling to make it work. If this is the best way, could you help me do it? Once I figure this part out I can figure out the rest; I'm just stuck.
What I thought of before posting this question is that I might build a third dataframe containing the Material and DCost columns where the Item and Material columns match, then loop through that dataframe and, wherever Item and Material match, update the Cost column in the csv file.
Would uploading the csv files to a database and running queries to accomplish this be easier?
Would converting the dataframes to dicts work with this much data?
File 1
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor  | 0     | 100.00 |
| 785342 | 12.54 | 24.76  |
| 620388 | 15.78 | 36.99  |
+--------+-------+--------+
File 2
+----------+--------+-----------+
| Material | DCost  | List Cost |
+----------+--------+-----------+
| 10C0024  | .24    | 1.56      |
| 785342   | 12.54  | 23.76     |
| 620388   | 16.99  | 36.99     |
| 2020101  | 100.76 | 267.78    |
+----------+--------+-----------+
Intended result to export to csv.
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor  | 0     | 100.00 |
| 785342 | 12.54 | 23.76  |
| 620388 | 16.99 | 36.99  |
+--------+-------+--------+
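For what it's worth, a minimal sketch of the merge-based pandas approach described above, assuming the column names shown; the file paths are placeholders, and it only updates Cost from DCost (Price / List Cost could be handled the same way):

import pandas as pd

# Placeholder paths; columns as in the tables above.
file1 = pd.read_csv("file1.csv", dtype={"Item": str})      # Item, Cost, Price
file2 = pd.read_csv("file2.csv", dtype={"Material": str})  # Material, DCost, List Cost

# A left merge keeps every File 1 row and matches rows where Item == Material.
merged = file1.merge(file2, left_on="Item", right_on="Material", how="left")

# Where a match exists and DCost differs from Cost, take DCost.
mask = merged["DCost"].notna() & (merged["DCost"] != merged["Cost"])
merged.loc[mask, "Cost"] = merged.loc[mask, "DCost"]

# Write back only the original File 1 columns.
merged[["Item", "Cost", "Price"]].to_csv("file1_updated.csv", index=False)

At 150k rows merged against 1.2 million, a single vectorised merge like this should be well within pandas' comfort zone, unlike a row-by-row loop.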
I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the best and most efficient method to search the entire dataframe for a given list of values and remove the rows that contain any of those values?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe (df) is:
+---------+--------------+---------------+---------+------------+
| id      | first_name   | last_name     | Salary  | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986    | Rollie       | Lewin         | 1097192 | 42254172   | -Remove Row
| 289743  | Karil        | Sudron        | 2785190 | 3703538    |
| 3864    | Massimiliano | Dallicott     | 1194553 | 23292573   |
| 49074   | Gerry        | Grinnov       | 1506584 | 62291161   |
| 5087654 | Nat          | Leatherborrow | 1781870 | 55183252   |
| 689     | Thaine       | Tipple        | 2150105 | 40583249   |
| 7907    | Myrlene      | Croley        | 2883250 | 70380540   |
| 887     | Nada         | Redier        | 2676139 | 10727550   | -Remove Row
| 96533   | Sonny        | Bosden        | 1050067 | 13110714   |
| 1098754 | Dennie       | McGahy        | 1804487 | 927935     | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use collect() or toLocalIterator(), iterate over the rows, and remove them based on the list values. Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution for now, but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You've got the correct approach: filter the dataframe with filter and the isin function. isin is fine as long as the list is small (in the few thousands, not millions). Also make sure that your dataframe is partitioned into at least 3x the number of CPU cores on the executors; without enough partitions, parallelism will suffer.
I am more comfortable with Scala, so please take the concept from the code below (shown here in PySpark form). You need to build a single Column object by combining the conditions for all columns to be filtered on, and then pass that final column object to dataframe.filter:
from pyspark.sql.functions import col

column_names = df.schema.names
# Initialize the combined condition with the first column.
col_final = col(column_names[0]).isin(list)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(list)
# Apply the negation of the combined condition: keep rows where no column matches.
df = df.filter(~col_final)
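A more compact variant of the same idea, for reference; this is a sketch that assumes the df and the value list from the question:

from functools import reduce
from pyspark.sql.functions import col

values = ['1097192', '10727550', '1098754']

# OR together the per-column membership tests in one expression.
condition = reduce(lambda acc, name: acc | col(name).isin(values),
                   df.schema.names[1:],
                   col(df.schema.names[0]).isin(values))

df_clean = df.filter(~condition)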
A theoretical database/coding query here, Python / Pandas dataframe related. I'm dealing with up to 50k rows in a table, so optimal solutions seem... erm, optimal. And I'm no coding expert either, so bear with me.
I have a table of unique child code/country pair rows, some with matching parent codes, e.g.:
Index | Parent | Child | Country | NewValue
0 | A | A-1 | X | Null
1 | A | A-1 | Y | Null
2 | A | A-2 | X | Null
3 | B | B-1 | X | Null
4 | B | B-2 | Y | Null
I need to update every Parent / Country pair with a calculated unique value (NewValue). What's the best approach to finding and updating each pair over every row?
So far I'm generating a separate list of unique Parent / Country pairs (to avoid calculating NewValue for every row needlessly; I just iterate through this list, generating NewValue for each pair), e.g.:
Parent | Country
A | X
A | Y
B | X
B | Y
Now, is it better to simply do a lookup in the first table for every given parent/country match, get the row index for any matching rows, and then update via the row index?
Or, generate the second table in a way that includes any relevant indexes to start with, and use these to update the first table? Eg:
Parent | Country | Index(s)
A | X | 0,2
A | Y | 1
B | X | 3
B | Y | 4
If the second option, how? Because I'm using df.unique() to generate the second table, I only get one index per pair, not all the matching indexes (and I'm not sure how they'd show up if I did). I'm not sure either way is particularly good, but it's the best I've come up with in a day :o)
Thanks,
Christopher / pepsi_max2k
You might want to look at the merge function.
What you have to do in your case is
df_children.merge(df_parent, on=["Parent","Country"])
where df_children is your table with the [Index | Parent | Child | Country] columns and df_parent has [Parent | Country | NewValue].
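For illustration, a minimal end-to-end sketch of that approach; compute_new_value is a hypothetical stand-in for your per-pair calculation:

import pandas as pd

df_children = pd.DataFrame({
    "Parent":  ["A", "A", "A", "B", "B"],
    "Child":   ["A-1", "A-1", "A-2", "B-1", "B-2"],
    "Country": ["X", "Y", "X", "X", "Y"],
})

def compute_new_value(parent, country):
    return parent + "/" + country + "-value"   # placeholder for the real calculation

# One row per unique Parent / Country pair, so the calculation runs once per pair.
df_parent = df_children[["Parent", "Country"]].drop_duplicates().reset_index(drop=True)
df_parent["NewValue"] = [compute_new_value(p, c)
                         for p, c in zip(df_parent["Parent"], df_parent["Country"])]

# The merge broadcasts each pair's value back onto every matching child row.
result = df_children.merge(df_parent, on=["Parent", "Country"], how="left")

This avoids both the per-row lookup and the manual bookkeeping of row indexes.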
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this one. As a simple example, the new dataframe should contain one row per sample with the averages across its runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed
    # in the initial df_samples dataframe.
    pivots.append(pivot)

# Create the new dataframe.
result = pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If so, is it more performant?
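To make "all at once" concrete, a minimal sketch of the simple averaging case, assuming the column names above (the data just mirrors the example tables):

import pandas as pd

df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id":    [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

# One call over the whole frame instead of one pivot per sample_id.
avg = (df_samples
       .groupby("sample_id")[["x", "y", "z"]]
       .mean()
       .add_prefix("avg_")
       .reset_index())
# avg columns: sample_id, avg_x, avg_y, avg_z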
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also increases its dimensions, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
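Continuing the sketch above, per-sample features from arbitrary functions can be built with a single groupby().apply(); the calculations here are placeholder stand-ins for f and g, not the real ones:

def per_sample_features(grp: pd.DataFrame) -> pd.Series:
    return pd.Series({
        "avg_x": grp["x"].mean(),
        "avg_y": grp["y"].mean(),
        "avg_z": grp["z"].mean(),
        "new_feature_1": grp["x"].max() - grp["x"].min(),      # stand-in for f(...)
        "new_feature_2": grp[["x", "y", "z"]].mean().mean(),   # stand-in for g(...)
    })

features = df_samples.groupby("sample_id").apply(per_sample_features).reset_index()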
Aside: I am looking for a good resource on working with large pandas dataframes, constructing new ones performantly, and performing queries. I am almost always able to get the result I want using pandas, but my implementations are often not efficient and resemble how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe that involves some theory I do not know about dataframes, tables, etc. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses the two use cases above will be accepted with or without a resource recommendation.
I know a MySQL database is not stored in any particular order, but I need it to behave in an ordered way. I have a table, example below, which currently seems to be in the order of column 1 (jobID, auto increment, primary key). What I need to do is change the order, for example by moving the 3rd row up one position, essentially swapping the positions of the 3rd and 2nd rows, but I am unsure how to do this. The reason is that I am accessing this table via a Python app which grabs jobs from a list, and I sometimes need to change the priority order. What would be the best way to do this, please?
+-------+----------+--------+---------+--------+---------------------+
| jobID | location | mode   | process | status | submitTime          |
+-------+----------+--------+---------+--------+---------------------+
| 1     | /let111/ | Verify | 1       | Failed | 2014-06-25 12:24:38 |
| 2     | /let114/ | Verify | 1       | Passed | 2014-06-25 12:37:31 |
| 3     | /let112/ | Verify | 1       | Failed | 2014-06-25 14:48:55 |
| 4     | /let117/ | Verify | 2       | Passed | 2014-06-25 14:49:01 |
| 5     | /let113/ | Verify | 2       | Passed | 2014-06-25 14:49:13 |
+-------+----------+--------+---------+--------+---------------------+
If you want to order by status first and then time, use an order by clause:
select t.*
from table t
order by status, submitTime;
In general, tables in relational databases, and in MySQL in particular, have no inherent order, because a table models an unordered set of tuples. You can get results back in a particular order by using order by.
Note that the above is just one possibility that accomplishes what you are asking for. What you really want might involve other columns.
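For example, if the goal is a priority that the Python app can change at will, one common pattern (a sketch only; the jobs table name, the added priority column, and the connection details are placeholders) is to add an explicit ordering column and order by it:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="jobsdb")
cur = conn.cursor()

# One-time schema change: an explicit ordering column.
cur.execute("ALTER TABLE jobs ADD COLUMN priority INT NOT NULL DEFAULT 0")

# Reordering then becomes a data update, e.g. to run job 3 before job 2.
cur.execute("UPDATE jobs SET priority = 1 WHERE jobID = 3")
cur.execute("UPDATE jobs SET priority = 2 WHERE jobID = 2")
conn.commit()

# The app reads jobs in priority order.
cur.execute("SELECT * FROM jobs ORDER BY priority, submitTime")
for row in cur.fetchall():
    print(row)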