I would like to add a dependent calculation to my pandas data frame. The result of the calculation needs to be added to a value in the next column so the calculation can be computed using the new result. A scheme for this flow is:
Beginning Inventory | Beginning Inventory
| | (Ending Inventory from Prev Col)
| | |
do math | do math
| | |
| | |
Ending Inventory | Ending Inventory
The "do math" section contains simple addition/subtraction of inventoried items. How do I calculate the Ending Inventory in such a way that it is carried over to the Beginning Inventory in the next column?
The requested output would look like the following:
                    | 1/2020 | 2/2020 | 3/2020
Beginning Inventory | 10000  | 10250  | 10550
Inventory Received  | 500    | 500    | 300
Inventory Used      | -250   | -200   | -400
Ending Inventory    | 10250  | 10550  | 10450
I have tried using df.shift and apply but my understanding of apply is incomplete as it returns errors about number of elements.
Any help is appreciated.
df["Beginning Inventory"] = df.apply(do math)
It would help if you posted some code. The example below is illustrative.
If your dataframe (df) stores dates in long format, i.e. dates going down a column, you can iterate over the rows using df.iterrows():
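inv_accumulator = []
# last_closing_inventory: ending inventory of the prior period (10000 in the question's table)
closing_inv_qty = last_closing_inventory
for index, row in df.iterrows():
    # Assumes "inventory_used" is stored as a positive quantity
    closing_inv_qty = closing_inv_qty + row["inventory_received"] - row["inventory_used"]
    inv_accumulator.append(closing_inv_qty)
df["ending_inventory"] = inv_accumulator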
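The same carry-over can also be written without a loop, because each ending balance is just the opening balance plus the running total of net changes. This is a minimal sketch using the figures from the question; the column names, the `opening_balance` variable, and the choice to store usage as positive numbers are assumptions, not part of the original post.

```python
import pandas as pd

# Long format: one row per period (sample figures from the question)
df = pd.DataFrame({
    "period": ["1/2020", "2/2020", "3/2020"],
    "inventory_received": [500, 500, 300],
    "inventory_used": [250, 200, 400],  # assumed stored as positive quantities
})
opening_balance = 10000  # Beginning Inventory of the first period

# Ending inventory = opening balance + cumulative net change
net_change = df["inventory_received"] - df["inventory_used"]
df["ending_inventory"] = opening_balance + net_change.cumsum()

# Each period begins where the previous one ended
df["beginning_inventory"] = df["ending_inventory"].shift(1, fill_value=opening_balance)
```

If the table is needed in the wide layout from the question (periods as columns), `df.set_index("period").T` would transpose it after the calculation.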
I'm trying to create a new column that puts the rows of my pyspark dataframe into groups based on observed rank values. For example, I'd like the first 100,000 ranks to be group 1, the next 100,000 to be group 2, and so on, up to an arbitrary number of ranks (it needs this flexibility because the size of my data, and the overall number of ranks, is likely to change).
Does anyone know how to achieve this? This is what my intended output looks like:
---------------------------
| id  | rank    | segment |
---------------------------
| 100 | 1       | 1       |
| 200 | 100,002 | 2       |
| 300 | 900,007 | 9       |
---------------------------
The only help I can find from browsing is for splitting the ranks into some kind of quantile, but I need guarantees that my segments are of size 100,000.
Does anyone have any tips as to how to achieve this outcome?
Some sample code here if it helps
import pandas as pd
# Assumes an active SparkSession named `spark`
df = spark.createDataFrame(pd.DataFrame({
    "id": [100, 200, 300], "rank": [1, 100002, 900007]
}))
You can use the ceil function.
import pyspark.sql.functions as F
......
step = 100000
df = df.withColumn('segment', F.expr(f'ceil(rank / {step})'))
df.show(truncate=False)
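The bucketing arithmetic behind the ceil expression can be checked outside Spark. The sketch below uses plain Python with integer division; `STEP` and `segment` are illustrative names, not part of the answer above.

```python
STEP = 100_000  # fixed segment size

def segment(rank: int, step: int = STEP) -> int:
    """1-based bucket: ranks 1..step -> 1, step+1..2*step -> 2, and so on."""
    # Integer form of ceil(rank / step), avoiding floating point
    return (rank + step - 1) // step

segments = {rank: segment(rank) for rank in [1, 100_000, 100_002, 900_007]}
```

Under this rule rank 900,007 falls in segment 10 (ranks 900,001 to 1,000,000), whereas the sample table in the question lists 9 for that row.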
First off, I'm writing this post on my phone while on the road. Sorry for the lack of info; I'm just trying to get a head start for when I get home.
I have two CSV files with different numbers of columns and records. The first file has about 150k records and the second about 1.2 million. The first column of the first file contains some values that also appear in a column of the second file and some that do not. What I intend to do is check whether the value in column one of the first file is in the first column of the second file. If it is, check whether the first file's second column is less than or greater than the value of a different column in the second file on the row where the first columns match. If so, update the first file's second column to the new value.
Side note: I don't need the fastest or most efficient way, just a working solution for now. I will optimize later. The code will be run once a month to update the CSV file.
Currently I am attempting to accomplish this using pandas and loading each file into a dataframe. I am struggling to make this work. If this is the best way could you help me do this? Once I figure out how to do this I can figure out the rest I'm just stuck.
What I thought of before posting this question was to build a third dataframe holding the Material and DCost values for rows where the Item and Material columns match, then loop through that dataframe and, wherever Item and Material match, update the Cost column in the CSV file.
I didn't know if uploading the csv files to a database and running queries to accomplish this would be easier?
Would converting the dataframes to dicts work with this much data?
File 1
+--------+-------+--------+
| Item | Cost | Price |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 24.76 |
| 620388 | 15.78 | 36.99 |
+--------+-------+--------+
File 2
+----------+--------+-----------+
| Material | DCost | List Cost |
+----------+--------+-----------+
| 10C0024 | .24 | 1.56 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
| 2020101 | 100.76 | 267.78 |
+----------+--------+-----------+
Intended result to export to csv.
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor | 0 | 100.00 |
| 785342 | 12.54 | 23.76 |
| 620388 | 16.99 | 36.99 |
+--------+-------+--------+
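One way to approach this in pandas is a left merge followed by a conditional update, with no explicit loop. The sketch below assumes that Cost is replaced by DCost and Price by List Cost whenever Item matches Material (which is what the intended result above shows), and that Material values are unique in File 2. The names `file1`, `file2`, `merged` and `result` are hypothetical, and the sample rows stand in for data that would normally come from `pd.read_csv`.

```python
import pandas as pd

file1 = pd.DataFrame({
    "Item": ["Labor", "785342", "620388"],
    "Cost": [0.0, 12.54, 15.78],
    "Price": [100.00, 24.76, 36.99],
})
file2 = pd.DataFrame({
    "Material": ["10C0024", "785342", "620388", "2020101"],
    "DCost": [0.24, 12.54, 16.99, 100.76],
    "List Cost": [1.56, 23.76, 36.99, 267.78],
})

# Left merge keeps every row of File 1, in its original order
merged = file1.merge(file2, how="left", left_on="Item", right_on="Material")

# Only rows that found a matching Material are updated
matched = merged["Material"].notna()
merged.loc[matched, "Cost"] = merged.loc[matched, "DCost"]
merged.loc[matched, "Price"] = merged.loc[matched, "List Cost"]

result = merged[["Item", "Cost", "Price"]]
```

Calling `result.to_csv(...)` with `index=False` would then write the updated file. Reading both key columns as strings helps here, since identifiers such as "Labor" and "10C0024" are not numeric.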
I am trying to annotate a queryset with an aggregate of a subset of previous rows. Take the following example table of a player's scores, where the column last_2_average_score is the rolling average of that player's scores in the previous two games.
+----------+-----------+---------+-------------------------+
| date | player | score | last_2_average_score |
+----------+-----------+---------+-------------------------+
| 12/01/19 | 1 | 10 | None |
| 12/02/19 | 1 | 9 | None |
| 12/03/19 | 1 | 8 | 9.5 |
| 12/04/19 | 1 | 7 | 8.5 |
| 12/05/19 | 1 | 6 | 7.5 |
+----------+-----------+---------+-------------------------+
In order to accomplish this, I wrote the following query, trying to annotate each "row" with the corresponding two-game average score:
from django.db.models import Avg, FloatField, OuterRef, Subquery

ScoreModel.objects.annotate(
    last_two_average_score=Subquery(
        ScoreModel.objects.filter(
            player=OuterRef("player"), date__lt=OuterRef("date")
        )
        .order_by("-date")[:2]
        .annotate(Avg("score"))
        .values("score__avg")[:1],
        output_field=FloatField(),
    )
)
This query, however, does not output the correct result. In fact, every record is simply annotated with
{'last_two_average_score': None}
I have tried a variety of different combinations of the query, and cannot find the correct combination. Any advice that you can give would be much appreciated!
Instead of trying to address the problem from the ORM first, I ended up circling back and implementing the query in raw SQL first. This immediately led me to the concept of WINDOW functions, which I then found very quickly in Django's ORM.
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
For those interested, the resulting query looks something like this, which is much simpler than what I was trying to accomplish with Subquery:
from django.db.models import Avg, F, RowRange, Window

ScoreModel.objects.annotate(
    last_two_average=Window(
        expression=Avg("score"),
        partition_by=[F("player")],
        order_by=[F("date").desc()],
        # ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        frame=RowRange(start=-2, end=0),
    )
)
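To check what a window query returns against the table in the question, the expected column can be computed in plain Python. This is only a reference sketch: `previous_n_average` is a hypothetical helper, and it assumes the scores belong to one player and are already sorted by date in ascending order.

```python
from typing import List, Optional

def previous_n_average(scores: List[float], n: int = 2) -> List[Optional[float]]:
    """For each game, average the n games before it; None until n games exist."""
    result: List[Optional[float]] = []
    for i in range(len(scores)):
        if i >= n:
            window = scores[i - n:i]  # the n previous games, current game excluded
            result.append(sum(window) / n)
        else:
            result.append(None)
    return result

expected = previous_n_average([10, 9, 8, 7, 6])
```

The current game is deliberately excluded from each window, so the first two rows have no value, as in the question's table.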
Following the answer here, which recommends a join to append a column from one table to another, I have been using that method, but I am now hitting its limits with a very large number of tables and rows.
Let's say I have a dataframe of M features id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+
+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe containing all attributes of every feature, by joining hundreds to thousands of tables (one per feature) on the unique id. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age |
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and it is limited by the admin.
JOIN is expensive and resource-limited in pyspark, so I wonder whether it is possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can keep the same list of rows for each feature table. I hope to avoid any join or lookup because my set of ids is the same.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the tables for storage, and retrieval (if I query them back to append) is not guaranteed to return the same order.
There doesn't seem to be a spark function to append a column from one DF to another directly except 'join'.
If you are starting from a single dataframe and generating new features from each of its original columns, I would suggest using 'pandas_udf', where the new features for all the original columns can be appended inside the 'udf'.
This avoids 'join' entirely.
To control memory usage, choose the 'group' column so that each group fits within the executor memory specification.
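The per-group function at the heart of this approach can be written and checked with pandas alone before it is handed to Spark. The sketch below is illustrative only: `add_features` is a hypothetical name, the two statistics are placeholders for the real per-feature operations, and "zone" is assumed to be the grouping column.

```python
import pandas as pd

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    """Append derived columns for every original feature in one pass, with no join."""
    out = pdf.copy()
    for col in ["salary", "age"]:
        # Placeholder statistics standing in for the real per-feature operations
        out[f"stat1_{col}"] = out[col] - out[col].mean()
        out[f"stat2_{col}"] = out[col].rank()
    return out

# Local check with pandas only; each group mimics what one Spark task would receive
pdf = pd.DataFrame({
    "id": [301, 302, 303, 304],
    "salary": [10.0, None, 30.0, 50.0],
    "age": [None, 40.0, 20.0, 60.0],
    "zone": ["a", "a", "b", "b"],
})
features = pd.concat([add_features(group) for _, group in pdf.groupby("zone")])
```

On Spark 3.x a function of this shape could be passed to `df.groupBy("zone").applyInPandas(add_features, schema)`, where `schema` describes the output columns. Since every feature is derived inside the same call, all new columns stay on the row they were computed from.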
Theoretical database/coding query here, Python / Pandas dataframe related. I'm dealing with up to 50k rows in a table so optimal solutions seem... erm, optimal. And I'm no coding expert either, so bear with me.
I have a table with unique child code/country pair rows, some with matching parent codes. E.g.:
Index | Parent | Child | Country | NewValue
0 | A | A-1 | X | Null
1 | A | A-1 | Y | Null
2 | A | A-2 | X | Null
3 | B | B-1 | X | Null
4 | B | B-2 | Y | Null
I need to update every Parent / Country pair with a calculated unique value (NewValue). What's the best approach to finding and updating each pair over every row?
So far I'm generating a separate list of unique Parent / Country pairs (to avoid calculating NewValue for every row needlessly; I just iterate through this list generating NewValue for each pair), e.g.:
Parent | Country
A | X
A | Y
B | X
B | Y
Now, is it better to simply do a lookup in the first table for every given parent/country match, get the row index for any matching rows, and then update via the row index?
Or, generate the second table in a way that includes any relevant indexes to start with, and use these to update the first table? Eg:
Parent | Country | Index(s)
A | X | 0,2
A | Y | 1
B | X | 3
B | Y | 4
If 2, how? Because I'm using df.unique() to generate the second table, I only get one index per pair, not all matching indexes (and I'm not sure how they'd show up if I did). And I'm not sure if either way is particularly good, but it's the best I've come up with in a day :o)
Thanks,
Christopher / pepsi_max2k
You might want to look at the merge function.
What you have to do in your case is
df_children.merge(df_parent, on=["Parent","Country"])
where df_children is your table with [Index | Parent | Child | Country] columns and df_parent has [Parent | Country | NewValue]
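A minimal sketch of that merge on the sample rows from the question. The NewValue calculation is not given in the question, so a placeholder string is used here, and `updated` is an illustrative name.

```python
import pandas as pd

df_children = pd.DataFrame({
    "Parent": ["A", "A", "A", "B", "B"],
    "Child": ["A-1", "A-1", "A-2", "B-1", "B-2"],
    "Country": ["X", "Y", "X", "X", "Y"],
})

# One row per unique Parent/Country pair
df_parent = df_children[["Parent", "Country"]].drop_duplicates().reset_index(drop=True)

# Placeholder calculation: the real NewValue logic is not shown in the question
df_parent["NewValue"] = [
    f"{parent}-{country}"
    for parent, country in zip(df_parent["Parent"], df_parent["Country"])
]

# A left merge copies each pair's NewValue onto every matching child row
updated = df_children.merge(df_parent, on=["Parent", "Country"], how="left")
```

With the merge there is no need to track row indexes per pair: every child row picks up the value of its Parent/Country pair, and `how="left"` keeps the original row order.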