Let's say we have several dataframes containing relevant information that needs to be compiled into one single dataframe. There are several conditions involved in choosing which pieces of data can be brought over to the results dataframe.
Here are 3 dataframes (columns only) that we need to pull and compile data from:
df1 = ["Date","Order#","Line#","ProductID","Quantity","Sale Amount"]
df2 = ["Date","PurchaseOrderID","ProductID","Quantity","Cost"]
df3 = ["ProductID","Quantity","Location","Cost"]
df3 is the only table in this set that actually contains a unique, non-repeating key, "ProductID". The other two dataframes have keys, but they can repeat; the only way to establish uniqueness is to refer to the date together with the other foreign keys.
Now, we'd like the desired result to show all products, grouped by product, where df1.Date is after some date x, df2.Quantity < 5, and df3.Quantity > 0. Ideally the results would show df3.Quantity and df3.Cost (summing both within the grouping), the most recent purchase date from df2.Date, and the total number of sales per part (a count from df1), where all of the above criteria are met.
This is the quickest example I could come up with to illustrate the issue. I'm able to accomplish this in VBA with only one problem... it's EXCRUCIATINGLY slow. I understand that list comprehensions and perhaps other means of completing this task would be faster than VBA (maybe?), but it would still take a while with all of the logic and decision-making that happens behind the scenes.
This example doesn't exactly show the complexities but any advice or direction you have to offer may help me and others understand how to treat these kinds of problems in Python. Any expert opinion, advice, direction is very much appreciated.
If I understand correctly:
You simply need to apply the conditions as filters on each dataframe, then group by ProductID and put it together.
df1 = df1[df1.Date > x].groupby('ProductID').agg({'Quantity':'sum','Sale Amount':'sum'})
df2 = df2.groupby('ProductID').agg({'Date':'max','Quantity':'sum','Cost':'sum'})
df2 = df2[df2.Quantity < 5].copy()  # the stated condition is df2.Quantity < 5
df3 = df3[df3.Quantity > 0].set_index('ProductID')  # index on the unique key so the index lookups below align
Once you have all of those, probably something like:
g = df3.index.intersection(df2.index).intersection(df1.index)  # ProductIDs present in all three
df = df3.loc[g]  # use df3 as the frame, with only the needed indexes
I am not sure what you want to pull from df1 and df2 - but it will look something like:
df = df.join(df2['col_needed'])
You may need to rename columns to avoid overlap.
This avoids inefficient looping and should be orders of magnitude faster than a loop in VBA.
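To make this concrete, here is a minimal end-to-end sketch under the assumptions above; x is a made-up cutoff date, and the output column names are invented for illustration:
import pandas as pd

x = pd.Timestamp('2018-01-01')  # hypothetical cutoff date

# One row per ProductID from each source, filters applied first where needed
sales = (df1[df1['Date'] > x]
         .groupby('ProductID')
         .agg(sale_count=('Order#', 'count'), sale_amount=('Sale Amount', 'sum')))

purchases = (df2.groupby('ProductID')
                .agg(last_purchase=('Date', 'max'),
                     po_qty=('Quantity', 'sum'),
                     po_cost=('Cost', 'sum')))
purchases = purchases[purchases['po_qty'] < 5]

stock = df3[df3['Quantity'] > 0].set_index('ProductID')[['Quantity', 'Cost']]

# Inner joins keep only the ProductIDs that pass all three filters
result = stock.join(purchases, how='inner').join(sales, how='inner')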
Related
I was asked to do a merge between two dataframes. The first contains sales orders (1m rows) and the second contains discounts applied to those sales orders, defined according to a date range.
Both are joined by an ID (A975 ID), but I can't find an efficient way to do that merge without running out of memory.
My idea so far is to do an "outer" merge and, once that's done, filter the "Sales Order Date" column to the values that fall within the date range "DATAB" (start) - "DATBI" (end).
I was able to do it in PowerQuery with "SelectRows", but it takes too long; if I can replicate the same procedure in Python and shave off a few minutes of processing, it would be more than enough.
I know that an outer join generates tons of duplicated rows with only the date changed, but I don't know what else to do.
The solution was a merge followed by DataFrame.query:
df = pd.merge(all_together, A904, how="outer", on='ID A904') \
       .query('DATAB <= ERDAT <= DATBI')
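One observation, offered as an assumption to verify against your data: the query discards every row whose dates don't satisfy the range, and comparisons against the NaNs that an outer join introduces are always False, so an inner merge should produce the same surviving rows while materializing far fewer intermediate ones:
df = pd.merge(all_together, A904, how="inner", on='ID A904') \
       .query('DATAB <= ERDAT <= DATBI')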
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading back those first 100 games the second time around and comparing them against df1 (the new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop-duplicates step. It gives me an 'unhashable type: list' error; I would assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull number 100 again, to drop that one and just add 101-150, and then add it all to my csv file. Then if I run it again, to pull 150-200 but drop 150 if it's a duplicate, etc.
Based on your explanation, you can use this one-liner to find the rows of df1 that are not in df2:
df_diff = df1[~df1.apply(tuple, axis=1)
                 .isin(df2.apply(tuple, axis=1))]
This code checks whether each row exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, the row axis).
This solution is slow because apply(tuple, axis=1) runs a Python-level loop over every row of both dataframes in order to compare each row of df1 against the rows of df2.
If you want a more optimised version, try the pandas built-in compare method (note that DataFrame.compare requires the two frames to have identical shape and labels):
df1.compare(df2)
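Another common pattern, sketched here under the assumption that both frames share the same columns and all values are hashable, is an anti-join via merge's indicator flag, which relies on hashed joins rather than row-wise tuple building:
df_diff = (df1.merge(df2.drop_duplicates(), how='left', indicator=True)  # drop_duplicates guards against df2 duplicates multiplying rows
              .query('_merge == "left_only"')
              .drop(columns='_merge'))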
I'm a newbie to Python and have done my absolute best to exhaust all resources before posting here looking for assistance. I have spent all weekend and all of today trying to code what I feel ought to be a straightforward scenario using two dataframes, but, for the life of me, I am spinning my wheels and not making any significant progress.
The situation is there is one dataframe with Sales Data:
CUSTOMER ORDER SALES_DATE SALES_ITEM_NUMBER UNIT_PRICE SALES_QTY
001871 225404 01/31/2018 03266465555 1 200
001871 225643 02/02/2018 03266465555 2 600
001871 225655 02/02/2018 03266465555 3 1000
001956 228901 05/29/2018 03266461234 2.2658 20
and a second dataframe with Purchasing Data:
PO_DATE PO_ITEM_NUMBER PO_QTY PO_PRICE
01/15/2017 03266465555 1000 1.55
01/25/2017 03266465555 500 5.55
02/01/2017 03266461234 700 4.44
02/01/2017 03266461234 700 2.22
All I'm trying to do is to figure out what the maximum PO_PRICE could be for each of the lines on the Sales Order dataframe, because I'm trying to maximize the difference between what I bought it for, and what I sold it for.
When I first looked at this, I figured a straightforward nested for loop would do the trick, and increment the counters. The issue though is that I'm not well versed enough in dataframes, and so I keep getting hung up trying to access the elements within them. The thing to keep in mind as well is that I've sold 1800 of the first item, but, only bought 1500 of them. So, as I iterate through this:
For the first Sales Order row, I sold 200. The Max_PO_PRICE = $5.55 [for 500 of them]. So, I need to deduct 200 from the PO_QTY dataframe, because I've now accounted for them.
For the second Sales Order row, I sold 600. There are still 300 I can claim that I bought for $5.55, but then I've exhausted all 500 of those, and so the best I can now do is dip into the other row, which has Max_PO_PRICE = $1.55 (for 1,000 of them). So for this one, I'd be able to claim 300 at $5.55 and the other 300 at $1.55. I can't claim more than I bought.
Here's the code I've come up with. I think I may have gone about this all wrong, but some guidance and advice would be incredibly appreciated and helpful.
I'm not asking anyone to write my code for me, but, simply to advise what approach you would have taken, and if there is a better way. I figure there has to be....
Thanks in advance for your feedback and assistance.
-Clare
for index1, row1 in sales.iterrows():
    SalesQty = sales.loc[index1]["SALES_QTY"]
    for index2, row2 in purchases.iterrows():
        if (row1['SALES_ITEM_NUMBER'] == row2['PO_ITEM_NUMBER']) and (row2['PO_QTY'] > 0):
            # Find the Maximum PO Price in the result set
            max_PO_Price = abc["PO_PRICE"].max()
            xyz = purchases.loc[index2]
            abc = abc.append(xyz)
            if SalesQty <= Purchase_Qty:
                print("Before decrement, PO_QTY = ", ???????)  # <==== this is where I'm struggle busing
                print()
        +index2
    # Drop the data from the xyz DataFrame
    xyz = xyz.iloc[0:0]
    # Drop the data from the abc DataFrame
    abc = abc.iloc[0:0]
    +index1
This looks like something SQL would elegantly handle through analytical functions. Fortunately Pandas comes with most (but not all) of this functionality and it's a lot faster than doing nested iterrows. I'm not a Pandas expert by any means but I'll give it a whizz. Apologies if I've misinterpreted the question.
It makes sense to group the SALES_QTY; we'll use this to track how much QTY we have:
sales_grouped = sales.groupby(["SALES_ITEM_NUMBER"], as_index = False).agg({"SALES_QTY":"sum"})
Let's merge the tables into one so we can iterate over a single table instead of two. We can JOIN on the common columns "PO_ITEM_NUMBER" and "SALES_ITEM_NUMBER", or what Pandas calls "merge". While we're at it, let's sort the table by "PO_ITEM_NUMBER" with the most expensive "PO_PRICE" on top; this and the next code block are the equivalent of SQL's FN() OVER (PARTITION BY ... ORDER BY ...) analytic function.
sorted_table = (purchases.merge(sales_grouped,
                                how="left",
                                left_on="PO_ITEM_NUMBER",
                                right_on="SALES_ITEM_NUMBER")
                         .sort_values(by=["PO_ITEM_NUMBER", "PO_PRICE"],
                                      ascending=False))
Let's create a column CUM_PO_QTY with the cumulative sum of the PO_QTY (partitioned/grouped by PO_ITEM_NUMBER). We'll use this to mark when we go over the max SALES_QTY.
sorted_table["CUM_PO_QTY"] = sorted_table.groupby(["PO_ITEM_NUMBER"], as_index = False)["PO_QTY"].cumsum()
This is where the custom part comes in: we can apply custom functions row-by-row (or even column-by-column) along the dataframe using apply(). We're creating two columns: TRACKED_QTY, which is simply SALES_QTY minus CUM_PO_QTY so we know when we have run into the negative, and PRICE_SUM, which will eventually hold the maximum value gained or spent. For now: if TRACKED_QTY is at least 0 we multiply PO_PRICE by PO_QTY, otherwise by SALES_QTY, so we never claim more quantity than we bought.
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
x["PO_QTY"] * x["PO_PRICE"]
if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
else x["SALES_QTY"] * x["PO_PRICE"]]), axis = 1)
To handle the trailing negative TRACKED_QTY rows, we can filter the positives using a conditional mask, and group the negatives, keeping only the maximum PRICE_SUM value per item.
Then simply concatenate these two tables and sum them.
evaluated_table = sorted_table[sorted_table["TRACKED_QTY"] >= 0]
evaluated_table = pd.concat([evaluated_table,
                             sorted_table[sorted_table["TRACKED_QTY"] < 0]
                                 .groupby(["PO_ITEM_NUMBER"], as_index=False).max()])  # DataFrame.append was removed in pandas 2.0
evaluated_table = evaluated_table.groupby(["PO_ITEM_NUMBER"], as_index = False).agg({"PRICE_SUM":"sum"})
Hope this works for you.
I have the following code:
new_df = pd.DataFrame(columns=df.columns)
for i in list:
temp = df[df["customer id"]==i]
new_df = new_df.append(temp)
where list is a list of customer ids for the customers that meet a criterion chosen beforehand. I use the temp dataframe because there are multiple rows for the same customer.
I consider that I know how to code, but I have never learnt to code for big-data efficiency. In this case, the df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and was still not done, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?
list is a type in Python. You should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a list of 100,000 customer ids, and it took only a few seconds using isin.
new_df = df[ df['customer id'].isin(customer_list) ]
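For reference, a minimal sketch of such a simulation; the sizes mirror the question and every name here is made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': np.random.randint(0, 500_000, size=3_000_000),
                   'value': np.random.rand(3_000_000)})
customer_list = list(np.random.choice(500_000, size=100_000, replace=False))

# isin builds a hash table once, so this is a single vectorized pass, not a Python loop
new_df = df[df['customer id'].isin(customer_list)]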
You can try this code below, which should make things faster.
new_df = df.loc[df['customer id'].isin(list)]
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last among the 80336 rows, so the sorting worked.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
pink = new_data.loc[(new_data['country'] == country)]
df[country] = pink.trust
The data does not get included for the rest of the columns after the first. I believe this is because the number of rows of 'trust' data for each country varies: while the first column has 1000 rows, some countries have as many as 2500 data points, and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this same exact data structure for the template data, so that is why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always the union of the indexes of all added columns.
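An alternative sketch, assuming the goal is simply to place unequal-length columns side by side: drop each Series' original row labels so they all start from 0, then let pd.concat pad the shorter ones with NaN:
import pandas as pd

columns = {country: new_data.loc[new_data['country'] == country, 'trust']
                            .reset_index(drop=True)
           for country in new_country_set}
df = pd.concat(columns, axis=1)  # unequal lengths are padded with NaN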
I think in this case pd.pivot(columns = 'var', values = 'val') will work for you, especially since you already have the data in a dataframe. This function transfers values from a particular column into the column names. You can see the documentation for additional info. I hope that helps.
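For instance, a minimal sketch on the data above (column names taken from the question; the 'row' counter column is made up). pivot needs a unique index per column value, so a common trick is a per-country counter from groupby().cumcount(), which also compacts each country's values to the top of its column:
new_data['row'] = new_data.groupby('country').cumcount()  # 0, 1, 2, ... within each country
df = new_data.pivot(index='row', columns='country', values='trust')  # shorter countries padded with NaN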