data structure: pandas dataframe or relational database for my model? - python

I want to build a model that calculates the value of a parameter for the receiving process, in order to optimize a warehouse's capacity.
The parameter decides, during receiving, where a SKU is going to be stored: as a pallet in the high-bay racking (more expensive) or as a carton in the automatic carton shelf (less expensive).
The parameter is set based on this data:
Concerning the sum of all SKUs:
the capacity of the high-bay racking
the capacity of the carton shelf.
The capacity of the racking and the shelf depends on the current inventory level of all SKUs and on the volume leaving storage (because the SKUs are sold).
Concerning the individual values for each SKU and each day (20,000 SKUs and 365 days):
number of products of this specific SKU received per day
number of products of this specific SKU sold per day
predicted number of products of this specific SKU to be sold in the x upcoming days
volume already stored in the automatic carton shelf of this specific SKU
Now I wonder which data structure I should use to import and work with the data in Python, given that the data comprises four values for each of 20,000 SKUs over 365 days.
I thought that I should use a pandas DataFrame because it is very powerful for building models and visualization. But since the tabular form is essentially 2D, I assumed I would not be able to model the data for 20,000 SKUs across all 365 days, because that feels more like 3D.
Therefore, I wonder whether I have to use a relational database, where each of the above-mentioned data sets (received volume per SKU, sold volume per SKU, predicted number to be sold per SKU, volume in the carton shelf per SKU) would make up a table.
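For illustration, here is a minimal sketch of how such "3D" data can be held in a single 2D DataFrame by using a (SKU, day) MultiIndex with one column per measure; all names and sizes below are made up:

import numpy as np
import pandas as pd

# Hypothetical layout: one row per (SKU, day), one column per measure.
skus = [f"SKU{i:05d}" for i in range(3)]            # would be 20,000 in practice
days = pd.date_range("2023-01-01", periods=5)       # would be 365 in practice
idx = pd.MultiIndex.from_product([skus, days], names=["sku", "day"])

df = pd.DataFrame(
    {
        "received": np.random.randint(0, 100, len(idx)),
        "sold": np.random.randint(0, 100, len(idx)),
        "predicted_sales": np.random.randint(0, 100, len(idx)),
        "carton_shelf_volume": np.random.randint(0, 100, len(idx)),
    },
    index=idx,
)

# Totals over all SKUs per day, e.g. for the capacity check:
daily_totals = df.groupby(level="day").sum()

A long ("tidy") table with explicit sku and day columns would work just as well and maps directly onto a relational table.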
I found the following set of questions in an answer to another question here, and I feel they are important for answering mine. Here are my answers:
1) Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
number of rows: 20,000 SKUs
number of columns: with separate tables for each data set, 365 columns (= days); with one table, 365 * 4 (365 days * received volume per SKU, sold volume per SKU, predicted number to be sold per SKU, volume in carton shelf per SKU)
types of columns: floats, booleans
As I understand it, I am not appending data; I use the data to calculate values for each SKU and then aggregate from the bottom (the detailed data per SKU) up to the top (the sum over all SKUs = capacity, inventory level).
2) What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
sum, subtraction, multiplication, division, bigger than, smaller than, equal to ...
3) Giving a toy example could enable us to offer more specific recommendations.
Example:
SKU 123456:
has 200 liters of inventory in carton shelf
1000 liters are received today
300 liters will be sold today
predicted sales for x days is 250 liters (should be in carton shelf)
the parameter is set to 600 liters (if the volume to be stored is higher, it goes into the pallet racking, otherwise into the carton shelf)
so you need to store the following volume:
200 liters in inventory + 1,000 liters received = 1,200 liters of inventory
1,200 liters - 300 liters sold = 900 liters of inventory
900 liters - 250 liters needed in the carton shelf = 650 liters left over
since 650 > 600, the 250 liters are stored in the carton shelf and the other 650 liters go into the high-bay racking
Overall sum:
the inventory of the high-bay racking after receiving this SKU increases by 650 liters
the inventory of the carton shelf increases by 50 liters
If the high-bay racking is already at capacity and +650 liters is not possible, the parameter has to be recalculated so that the day's total fits in.
-> the calculation is repeated for each of the next 364 days …
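To make the arithmetic above concrete, here is a minimal sketch of the allocation rule for a single SKU and day; the function and variable names are made up, and the else branch (everything fits in the carton shelf) is an assumption:

def allocate(carton_inventory, received, sold_today, predicted_sales, parameter):
    """Split one SKU's volume for the day between carton shelf and high-bay racking."""
    inventory = carton_inventory + received - sold_today   # 200 + 1000 - 300 = 900
    leftover = inventory - predicted_sales                  # 900 - 250 = 650
    if leftover > parameter:
        # keep only the predicted sales volume in the carton shelf
        return {"carton_shelf": predicted_sales, "high_bay": leftover}
    # assumption: otherwise everything stays in the carton shelf
    return {"carton_shelf": inventory, "high_bay": 0}

print(allocate(200, 1000, 300, 250, 600))
# {'carton_shelf': 250, 'high_bay': 650}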
4) After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
repeatable, as it needs to be done for every day
5) Input flat files: how many, rough total size in Gb. How are these organized e.g. by records? Does each one contains different fields, or do they have some records per file with all of the fields in each file?
I guess they need to be organized by SKUs and days
6) Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
yes -> it always checks whether the capacity is met; whether there is a need to put some volume into the carton shelf, …
7) Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?
I guess, mostly, there are calculations made on the data, so it is not just keeping the data around…
Thank you so much upfront!

Related

Is there a way in pandas (python) to calculate the days of inventory grouped by material number

With the given data frame, the calculated measure would be the DOI (i.e. how many days into the future the inventory will last, based on the demand). Note: the DOI figures need to be calculated programmatically and grouped by material.
Calculation of DOI: Let us take the first row belonging to material A1. The dates are on weekly basis.
Inventory = 1000
Days into the future till when the inventory would last: 300 + 400 + part of 500. This means the DOI is 7 + 7 + (1000 - 300 - 400)/500 = 14.6 [i.e. 26.01.2023 - 19.01.2023; 09.02.2023 - 02.02.2023]
An important point to note is that the demand figure of the concerned row is NOT taken into account while calculating DOI.
I have tried to calculate the cumulative demand without taking the first row for each material (here A1 and B1).
inv_network['cum_demand'] = 0
for i in range(inv_network.shape[0] - 1):
    if inv_network.loc[i + 1, 'Period'] > inv_network.loc[0, 'Period']:
        inv_network.loc[i + 1, 'cum_demand'] = inv_network.loc[i, 'cum_demand'] + inv_network.loc[i + 1, 'Primary Demand']
print(inv_network)
However, this piece of code takes a lot of time as the number of records grows.
As part of the next step, when I try to calculate the DOI, I run into issues getting the right value.
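A vectorized way to get the same cumulative demand without the Python loop (a sketch; it assumes a Material column and rows sorted by Period within each material, so any column name not in the snippet above is an assumption):

import pandas as pd

# Cumulative 'Primary Demand' per material, excluding each material's own first row.
inv_network = inv_network.sort_values(['Material', 'Period'])
grp = inv_network.groupby('Material')['Primary Demand']
inv_network['cum_demand'] = grp.cumsum() - grp.transform('first')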

Simulating the number of contracts to maintain a constant value

I have a set of, say, 10,000 contracts (loan account numbers). The contracts run for specific durations, say 24, 48 or 84 months, and my pool consists of a mix of contracts with different durations. Assume that at the start of this month I have 100 contracts amounting to 10,000 USD. After 2 months a few accounts/contracts are closed prematurely (early pay-off) and a few are extended. I need to simulate the data to maintain a constant value (an amount of 10,000 USD). That means I need to know how many new contracts I have to add, say, 2 months from now so that the value of my portfolio remains at 10,000 USD. Can someone help me with a technique to simulate this? Preferably in R, Python or SAS.
Add a payoff date element to each contract object. Then, do:
from datetime import date, timedelta

need = 0
for c in contracts:
    # count contracts that are already paid off or will be within ~2 months
    need += int(c.paid_off or c.payoff_date < date.today() + timedelta(days=61))
print(10000 - len(contracts) + need)

How to get statistics of one column of a dataframe using data from a second column?

I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three values: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the unix timestamp of that particular trade. I'm having trouble getting an accurate statistical analysis of the data. For example, if I use .median(), the program only looks at the list of prices as given, ignoring the fact that each price may have been traded hundreds of times according to the volume column.
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691
...
To better understand the issue, I've also grouped it by price and summed the other columns with groupby('p').sum(). I realize this makes the timestamp meaningless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171
...
Is there any way to use the number from the trade volume column to perform a statistical analysis of the price column? I've considered creating a new dataframe where each price is listed the number of times that it is traded, but am not sure how to do this.
Thanks in advance for any help!
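One way to make the statistics volume-weighted is to repeat each price by its traded volume, or to compute a weighted mean directly. A sketch, assuming the dataframe is called df with columns p (price) and v (volume), as the groupby('p') above suggests:

import numpy as np

# Repeat each price v times so that plain statistics become volume-weighted.
expanded = df['p'].repeat(df['v'])
print(expanded.median(), expanded.mean())

# Volume-weighted mean without expanding the data (cheaper for large volumes):
print(np.average(df['p'], weights=df['v']))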

Dataset with lots of zero values as missing values. What should I do?

I am currently working on IMDB 5000 movie dataset for a class project. The budget variable has a lot of zero values.
They are missing entries. I cannot drop them because they are 22% of my entire data.
What should I do in Python? Some have suggested binning - could you provide more details?
Well, there are a few options.
Take the average of the non-zero values and fill all the zeros with it. This yields 'tacky' results and is not best practice; a few outliers can throw off the whole average.
Use the median of the non-zero values; also not a great option, but less likely to be thrown off by outliers.
Binning would mean splitting the movies into a certain number of groups, say budgets over or under a million: take the average budget, divide it by the number of groups you want to create the intervals, and then label each movie by the interval it falls into (a zero if it falls in group 0, a one if in group 1, etc.).
I think finding the actual budgets for the movies and replacing the bad entries with the real budget would be the best option, depending on the analysis you are doing. Alternatively, you could take the median or average of each budget-related feature column, express it as a percentage of the budget, and fill the zeros with that percentage of the movie's budget. For example, if for the non-zero actor_pay column the median ratio actor_pay/budget is 60%, fill a zeroed actor_pay value with 60 percent of that movie's budget.
The hard option: create a function that takes the non-zero values of a movie's budget fields and attempts to interpolate the movie's budget from the other movies' data in the table. This option is more like its own project, and the options above should really be tried first.
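For the simple mean/median fill options, a minimal pandas sketch, assuming the dataframe is called movies and the column is called budget:

import numpy as np
import pandas as pd

# Treat zero budgets as missing, then fill them with the median of the non-zero values.
movies['budget'] = movies['budget'].replace(0, np.nan)
median_budget = movies['budget'].median()      # NaNs are skipped by default
movies['budget'] = movies['budget'].fillna(median_budget)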

Merging large data in Python in local machine

I have 140 csv files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for the ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total I have 138,000 * 138,000 (= 19+ billion) rows - one for every possible bilateral combination of IDs - divided across these 140 files.)
Research Question: Given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in pandas) where
the researcher inputs a distance
the program looks over all the files and filters out the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance
outputs a single dataframe (DISTANCE can be dropped at this point)
merges the dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point)
collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables)
(In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for the ID and 20 columns for the different treatment types. So, for example, I will be able to answer the question: "Within 2000 meters, how many neighbors of ID '500' are in the 'treatment_media' category?")
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of the observations. But sometimes he/she will check longer distances too, so I have to keep all the observations available. Otherwise, I could simply have filtered out DISTANCEs above 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted by ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows by DISTANCE and attach some kind of dictionary/index to them, the program would not need to go over all 140 files. Most of the time the researcher will be looking at only the bottom 2 percent of the DISTANCE range, so it seems like a colossal waste of time to loop over all 140 files. But this is a secondary thought; please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata; it takes 11+ hours to complete the task, which is not acceptable since the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
TABLE Distances {
    Integer PrimaryKey,
    String IdFrom,
    String IdTo,
    Integer Distance
}
INDEX ON Distances(IdFrom, Distance);

TABLE TreatmentData {
    Integer PrimaryKey,
    String Id,
    String TreatmentType
}
INDEX ON TreatmentData(Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = '500'
  AND d.Distance <= 2000
  AND td.TreatmentType = 'treatment_media'
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
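If it has to stay in pandas, a rough sketch of the same pipeline, reading each file in chunks, could look like the following. The file paths, the chunk size and the treatment_data column names are assumptions; the distance-file column names follow the question:

import glob
import pandas as pd

treatment = pd.read_csv('treatment_data.csv')        # assumed path; columns: ID + 20 treatment dummies
max_distance = 2000                                   # the researcher's input

parts = []
for path in glob.glob('distance_files/*.csv'):        # assumed location of the 140 files
    for chunk in pd.read_csv(path, chunksize=5_000_000):
        close = chunk.loc[chunk['DISTANCE'] <= max_distance, ['ID_FROM', 'ID_TO']]
        merged = close.merge(treatment, left_on='ID_TO', right_on='ID')
        merged = merged.drop(columns=['ID_TO', 'ID'])
        parts.append(merged.groupby('ID_FROM').sum())

# Combine the partial sums from all chunks into the final 138,000 x 21 table.
result = pd.concat(parts).groupby(level=0).sum().reset_index()

This still reads all 140 files once per query; partitioning the data by DISTANCE ahead of time would let you skip most of them, as the side note suggests.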
