First off, I'm writing this post on my phone while on the road, so apologies for the lack of detail; I'm just trying to get a head start for when I get home.
I have two CSV files, each with a different number of columns and a different number of records. The first file has about 150k records and the second has about 1.2 million. The first column of the first file contains some values that also appear in a column of the second file and some that do not. What I intend to do is check whether the value in column one of the first file is in the first column of the second file. If so, check whether the first file's second column is less than or greater than the value of a particular column in the second file where the first columns match. If it is, update the first file's second column to the new value.
Side note: I don't need the fastest or most efficient way, I just need a working solution for now; I will optimize later. The code will be run once a month to update the CSV file.
Currently I am attempting to accomplish this using pandas, loading each file into a dataframe, but I am struggling to make it work. If this is the best approach, could you help me do it? Once I figure out this part I can work out the rest; I'm just stuck.
What I thought of before I posted this question is that I might try to make a third dataframe containing the Material and DCost columns where the Item and Material columns match, then loop through that dataframe and, wherever the Item and Material values match, update the Cost column in the CSV file.
Would uploading the CSV files to a database and running queries to accomplish this be easier?
Would converting the dataframes to dicts work with this much data?
File 1
+--------+-------+--------+
| Item | Cost | Price |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 24.76 |
| 620388 | 15.78 | 36.99 |
+--------+-------+--------+
File 2
+----------+--------+-----------+
| Material | DCost | List Cost |
+----------+--------+-----------+
| 10C0024 | .24 | 1.56 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
| 2020101 | 100.76 | 267.78 |
+----------+--------+-----------+
Intended result to export to csv.
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
+--------+-------+--------+
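Here is a minimal pandas sketch of the kind of thing I am after, assuming (from the expected output) that both Cost and Price should be taken from File 2's DCost and List Cost whenever Item matches Material; the file names are placeholders:

import pandas as pd

# Placeholder file names; reading the key columns as strings keeps the
# merge keys comparable ("Labor", "10C0024", numeric-looking codes, etc.).
df1 = pd.read_csv("file1.csv", dtype={"Item": str})
df2 = pd.read_csv("file2.csv", dtype={"Material": str})

# Left-join File 2 onto File 1 where Item matches Material.
merged = df1.merge(df2, how="left", left_on="Item", right_on="Material")

# Where a match exists, take the values from File 2.
has_match = merged["Material"].notna()
merged.loc[has_match, "Cost"] = merged.loc[has_match, "DCost"]
merged.loc[has_match, "Price"] = merged.loc[has_match, "List Cost"]

# Keep only the original File 1 columns and write the result back out.
merged[["Item", "Cost", "Price"]].to_csv("file1_updated.csv", index=False)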
Related
I would like to add a dependent calculation to my pandas dataframe. The result of the calculation needs to carry over into the next column so that the next column's calculation can use the new result. A scheme for this flow is:
Beginning Inventory      |   Beginning Inventory
         |               |   (= Ending Inventory from prev. column)
         |               |                |
      do math            |             do math
         |               |                |
Ending Inventory         |   Ending Inventory
The "do math" section contains simple addition/subtraction of inventoried items. How do I calculate the Ending Inventory in such a way that it is carried over to the Beginning Inventory in the next column?
The requested output would look like the following:
                      1/2020 | 2/2020 | 3/2020
Beginning Inventory    10000 |  10250 |  10550
Inventory Received       500 |    500 |    300
Inventory Used          -250 |   -200 |   -400
Ending Inventory       10250 |  10550 |  10450
I have tried using df.shift and apply, but my understanding of apply is incomplete, as it returns errors about the number of elements.
Any help is appreciated.
df["Beginning Inventory"] = df.apply(do math)
It would help if you posted some code; the example below is illustrative.
If you use dates in long format in your dataframe (df), i.e. dates going down a column, then you can iterate over the rows using df.iterrows:
inv_accumulator = []
closing_inv_qty = last_closing_inventory  # closing quantity from the previous period
for index, row in df.iterrows():
    # each period's ending inventory feeds into the next period's calculation
    closing_inv_qty = closing_inv_qty + row["inventory_received"] - row["inventory_used"]
    inv_accumulator.append(closing_inv_qty)
df['ending_inventory'] = inv_accumulator
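The same running balance can also be computed without an explicit loop; a minimal sketch, assuming the same long-format column names as above and the opening balance in last_closing_inventory:

# ending inventory = opening balance + cumulative net movement up to each period
df['ending_inventory'] = (
    last_closing_inventory
    + (df['inventory_received'] - df['inventory_used']).cumsum()
)
# each period's beginning inventory is the previous period's ending inventory
df['beginning_inventory'] = df['ending_inventory'].shift(1, fill_value=last_closing_inventory)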
I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the best and most efficient method to search the entire dataframe for a given list of values and remove any row that contains one of those values?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe (df) is:
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe, I could use the collect() or toLocalIterator() functions, then iterate over the rows and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution, but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    # drop rows where this column's value appears in the list
    df = df.filter(~col(name).isin(list))
You have the correct approach of filtering the dataframe using the filter and isin functions. You can use isin if the list is small (a few thousand values, not millions). Also make sure that your dataframe is partitioned into at least 3 * the number of CPU cores on the executors; having plenty of partitions is a must, because without them parallelism will suffer.
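For example, a rough sketch of the partitioning step (the executor and core counts below are assumptions; substitute your cluster's actual numbers):

# e.g. 10 executors with 4 cores each -> aim for at least 3 * 40 = 120 partitions
df = df.repartition(120)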
I am more comfortable with Scala, but the concept carries over directly to PySpark: you need to build a single Column object that combines the conditions for all the columns to be filtered on, and then pass that final Column object to dataframe.filter.
from pyspark.sql.functions import col

column_names = df.schema.names
# initialize the combined condition with the first column
col_final = col(column_names[0]).isin(list)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(list)
# apply the negation of the combined condition
df = df.filter(~col_final)
Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I am now hitting its limits with a huge number of tables and rows.
Let's say I have a dataframe of M features: id, salary, age, zone, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe with all the attributes of every feature, by joining on the unique id across what is effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary| age | stat1_age | stat2_age
I am hitting a memory limit that causes an OutOfMemory exception. Raising executor and driver memory seems to be only a temporary solution, and it is limited by the admin.
JOIN is expensive and resource-limited in pyspark, and I wonder whether it is possible to pre-sort each feature table independently, keep that order, and simply APPEND the columns next to one another instead of performing expensive joins. I can make sure every feature table has the same list of rows. I hope to avoid any join or lookup, because my set of ids is the same in every table.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the data across partitions for storage, and retrieval (if I query it back in order to append) does not guarantee the same order.
There doesn't seem to be a Spark function to append a column from one DataFrame to another directly, other than join.
If you are starting from a single dataframe and trying to generate new features from each of its original columns, I would suggest using a grouped pandas_udf, where the new features for all the original columns are appended inside the UDF.
This avoids using join at all.
To control memory usage, choose the grouping column so that each group fits within the executor memory specification.
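A minimal sketch of what that grouped-UDF idea could look like in PySpark 3.x, using applyInPandas (the bucket column, output schema, and feature formulas below are placeholders of my own, not the answerer's code):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

# Hypothetical grouping key: bucket the ids so each group fits in executor memory.
df = df.withColumn("bucket", F.col("id") % 100)

out_schema = StructType([
    StructField("id", LongType()),
    StructField("stat1_salary", DoubleType()),
    StructField("stat1_age", DoubleType()),
])

def add_features(pdf):
    # pdf is a pandas DataFrame holding one bucket; all per-column features
    # are computed here, so no Spark-side join is needed afterwards.
    return pdf.assign(
        stat1_salary=pdf["salary"] * 2.0,  # placeholder feature logic
        stat1_age=pdf["age"] + 1.0,        # placeholder feature logic
    )[["id", "stat1_salary", "stat1_age"]]

features = df.groupBy("bucket").applyInPandas(add_features, schema=out_schema)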
I have a problem with one of my projects at school: I am attempting to rearrange my data.
The attached picture contains a sample of the data I am referring to, so you can see how it is currently arranged.
This is the format I am attempting to reach:
Company name | activity description | year | variable 1 | variable 2 |......
company 1 | | 2011 | | |
company 1 | | 2012 | | |
..... (one row for every year, from 2014 to 2015 inclusive)
company 2 | | 2011 | | |
company 2 | | 2012 | | |
..... (one row for every year, from 2014 to 2015 inclusive)
for every single one of the 10 companies. This is a sample of my whole data set, which contains more than 15,000 companies. I attempted to create a dataframe of the size I want, but I have trouble filling it with the data I want in the format I want. I am fairly new to Python. Could anyone help me, please?
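For what it's worth, here is a minimal sketch of the reshape I think is involved, assuming the wide file has one column per variable/year combination named like var1_2011, var1_2012, and so on (the column names and sample values below are my own placeholders):

import pandas as pd

# Placeholder wide-format data: one column per variable/year combination.
wide = pd.DataFrame({
    "company_name": ["company 1", "company 2"],
    "activity_description": ["retail", "mining"],
    "var1_2011": [10, 20], "var1_2012": [11, 21],
    "var2_2011": [100, 200], "var2_2012": [110, 210],
})

# pd.wide_to_long splits "var1_2011" into a stub ("var1") and a year suffix,
# producing one row per company per year.
long_df = pd.wide_to_long(
    wide,
    stubnames=["var1", "var2"],
    i=["company_name", "activity_description"],
    j="year",
    sep="_",
).reset_index()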
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results computed from this one. As a simple example, the new dataframe should contain one row per sample holding the average information across that sample's runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed
    # in the initial df_samples dataframe.
    pivots.append(pivot)
# create the new dataframe
new_df = pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do it all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
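For reference, a minimal sketch of the single-pass version I have in mind for the averages (my own illustration, using the column names from the example above):

# one aggregation over the whole frame instead of a per-sample loop
averages = (
    df_samples.groupby('sample_id')[['x', 'y', 'z']]
    .mean()
    .add_prefix('avg_')
    .reset_index()
)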
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also grows in dimension, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
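A hedged sketch of how those per-sample functions could be folded into a single grouped pass; f and g here stand in for the real feature functions, which are not shown above:

import pandas as pd

def summarise_sample(group):
    # 'group' holds all runs for one sample_id
    return pd.Series({
        'avg_x': group['x'].mean(),
        'avg_y': group['y'].mean(),
        'avg_z': group['z'].mean(),
        # placeholders for the real f(...) and g(...) feature functions
        'new_feature_1': group[['x', 'y', 'z']].to_numpy().sum(),
        'new_feature_2': group[['x', 'y', 'z']].mean().mean(),
    })

result = df_samples.groupby('sample_id').apply(summarise_sample).reset_index()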
Aside: I am looking for a good resource on working with large pandas dataframes and on performantly constructing new ones or running queries against them. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and read like something written in a lower-level language like C++. I would like to improve my working knowledge, and maybe that involves some theory I do not know about dataframes and tables. A recommendation for a resource would be welcome. Note that this is just additional helpful information; a recommendation alone does not answer the question, and any answer that addresses the two use cases above will be accepted with or without a resource recommendation.