What better way to implement this idea on a Pandas dataframe?

What better way to implement this idea on a Pandas dataframe? - python

I have a pandas dataframe called Df which contains 26,000 rows. This dataframe includes 10 columns called "first price", "second price" and .... "tenth price".
I want to create a new column called "y" to this dataframe like this
For example, the 26th row of the "y" column indicates the name of the column whose value in row 26 of that column is closer to the number of the first column (the column whose name is the first price) of the 27th row(26+1) than the elements of the other columns in the 26th row.
I implemented this code with an algorithm, but this algorithm works very slowly for a sample of 1000, let alone 26,000!
y=[]
for i in range(1000):
y.append((abs(df[df.index==(i)]-df["first price"][i+1])).idxmin(axis=1)[i])
for i in range(1000,len(df)):
y.append(0)
df["y"]=y
Do you know a better way?

You want to reshape the data to make it tidy. It's not good to have a bunch of columns all with the same value type (first price, second price, etc.). Better to have the type in its own column and the price beside it. Since you are comparing everything to the first price, you can leave it in its own index column and melt the rest of the columns into pairs of 'price number' and 'price' before finding the minimum of each 'item' (or what you had as rows in your example):
# example data:
np.random.seed(11)
df = (pd.DataFrame(np.random.choice(range(100), (6,4)),
columns=['first', 'second', 'third', 'fourth'])
.rename_axis('item_id')
.reset_index())
# reshape data to make it easier to work with
df = df.melt(id_vars=['item_id', 'first'], var_name='price_number', value_name='price')
# calculate price differences
df['price_diff'] = (df.price - df['first']).abs()
# find the minimum price difference for each item
df_closest = df.loc[df.groupby('item_id')['price_diff'].idxmin()]

Related

Pick the smallest or highest number in accordance with another column

One column in a data frame has names, each name is repeated at least 10 times so there are many names.
In another column, I have numbers.
I want to add two new columns, one displaying the lowest number for a specific name (that appears in the NUMBERS column), and the second displaying the highest number.
This is a dummy data that is similar to my real data, just to make my question more clear :

Sample data:
df = pd.DataFrame({'name':['a','b','a','b','b','c','a','c'],
'val':[1,2,3,4,5,6,7,8]})
Using groupby, transform and apply:
df['min'] = df.groupby('name')[['val']].transform(lambda g: g.min())
df['max'] = df.groupby('name')[['val']].transform(lambda g: g.max())

IIUC, you can try
out = df.merge(df.groupby('Names')['NUMBERS']
.agg(**{'Lowest Number': 'min', 'Highest Number': 'max'}).reset_index())

How to change DataFrame column values so that mean is modified accordingly?

I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge where there is a sanity check where the mean of this column needed to be 4.64. How can I modify the values of this column so that the mean of this column becomes 4.64? Is there any code solution for this, or do we have to do it manually?

I can see two solutions:
Substract 0.01 (4.65 - 4.64) to every value of that column like:
df['AvgWS'] -= 0.01
2 If you dont want to alter all rows: find wich rows you can remove to give you the desired mean (if there are any):
current_mean = 4.65
desired_mean = 4.64
n_rows = len(df['AvgWS'])
df['can_remove'] = df['AvgWS'].map(lambda x: (current_mean*n_rows - x)/(n_rows-1) == 4.64)
This will create a new boolean column in your dataframe with True in the rows that, if removed, make the rest of the column's mean = 4.64. If there are more than one you can analyse them to choose which one seems less important and then remove that one.

How can I obtain the top n groups in pandas?

I have a pandas dataframe. The final column in the dataframe is the max value of the RelAb column for each unique group (in this case, a species assignment) in the dataframe as obtained by:
df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')
As you can see, the max value is represented in all rows of the group. Each group contains a large number of rows. I have the df sorted by max values, for which there are about 100 rows per max value. My goal is to obtain the top 20 groups based on the max value (i.e. a df with 100 X 20 rows - 2000 rows). I do not want to drop individual rows from groups in the dataframe, rather entire groups.
I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:
My feeling is that I need to convert the max so that the one value represents the entire group and then sort based on that column, perhaps as such?
For context, the reason I am doing this is because I am planning to make a stacked barchart with the most abundant species in the table for each sample. Right now, there are just way too many species, so it makes the stacked bar chart uninformative.

One way to do it:
aux = (df_melted.groupby('Species')['RelAb']
.max()
.nlargest(20, keep='all')
.to_list())
top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()

Create multiple columns within groupby for calculation?

I have a dataset like this:
SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01 0:00:00,85,126,252
2010,2017-02-01 0:00:00,382,143,252
2010,2017-03-01 0:00:00,414,139,216
2010,2017-04-01 0:00:00,468,120,216
7770,2017-01-01 0:00:00,7,45,108
7770,2017-02-01 0:00:00,234,64,216
7770,2017-03-01 0:00:00,160,69,36
7770,2017-04-01 0:00:00,150,50,72
7870,2017-01-01 0:00:00,41,29,36
7870,2017-02-01 0:00:00,95,18,36
7870,2017-03-01 0:00:00,112,16,36
7870,2017-04-01 0:00:00,88,19,0
Inventory Quantity is the "actual" recorded quantity, which may differ from the hypothetical remaining quantity, which is what I am trying to calculate.
Sales Quantity actually extends much longer into the future. In those rows, the other two columns will have NA.
I want to create the following:
Take only the first Inventory value of each SKU
Use the first value to calculate the hypothetical remaining quantity by using a recursive formula [Earliest inventory] - [Sales for that month] - [Incoming qty for that month] (Note: Earliest inventory is a fixed quantity for each SKU). Store the output in a column called "End of Month Part 1".
Create another column called "Buy Quantity" with the following criteria: If remaining quantity is less than 50, then create a new column that indicates the buy amount (let's say it's 30 for all 3 SKUs) (i.e. increase the quantity by 30). If the remaining quantity is more than 50, then the buy amount is zero.
Create another column called "End of Month Part 2" that adds "End of Month Part 1" with "Buy Quantity"
I am able to obtain the first quantity of each SKU using the following code, and merge it into a column called first_qty into the dataset
first_qty_series = dataset.groupby(by=['SKU']).nth(0)['Inventory']
first_qty_series = pd.DataFrame(dataset).reset_index().rename(columns={'Inventory': 'Earliest inventory'})
dataset = pd.merge(dataset, pd.DataFrame(first_qty_series), on='SKU' )
As for the remainder quantity, I thought of using cumsum() on the two columns dataset['Sales'] and dataset['Incoming'] but I think it won't work because the cumsum() will sum across ALL SKUs.
That's why I think I need to perform the calculation in groupby. But I don't know what else to do.
(Edit:) Expected output is:
Thank you guys!

Here is a way to do the 4 columns you want.
1 - Another method, using loc and drop_duplicates to fill the first row for each 'SKU' with the value from 'Inventory', and then use ffill to fill the following rows, but your method is good.
dataset.loc[dataset.drop_duplicates(['SKU']).index,'Earliest inventory'] = dataset['Inventory']
dataset['Earliest inventory'] = dataset['Earliest inventory'].ffill().astype(int)
2 - Indeed you need cumsum and groupby to create the column 'End of Month Part 1', not on the column 'Earliest inventory' as the value is the same on every row for a same 'SKU'. Note: according to your result (and logic), I change the - with + before the column 'Incoming', and if I misunderstood the problem, just change the sign.
dataset['End of Month Part 1'] = (dataset['Earliest inventory']
- dataset.groupby('SKU')['Sales'].cumsum()
+ dataset.groupby('SKU')['Incoming'].cumsum())
3 - The column 'Buy Quantity' can be implemented using loc again meeting the condition on value less than 50 in column 'End of Month Part 1', then fillna with 0
dataset.loc[dataset['End of Month Part 1'] <= 50, 'Buy Quantity'] = 30
dataset['Buy Quantity'] = dataset['Buy Quantity'].fillna(0).astype(int)
4 - Finally the last column is just adding the two lasts created
dataset['End of Month Part 2'] = dataset['End of Month Part 1'] + dataset['Buy Quantity']
If I understood well the 4 points, you should get the dataset with the new columns

What is pandas syntax for lookup based on existing columns + row values?

I'm trying to recreate a bit of a convoluted scenario, but I will do my best to explain it:
Create a pandas df1 with two columns: 'Date' and 'Price' - done
I add two new columns: 'rollmax' and 'rollmin', where the 'rollmax' is an 8 days rolling maximum and 'rollmin' is a
rolling minimum. - done
Now I need to create another column 'rollmax_date' that would get
populated through a look up rule:
for the row n, go to the column 'Price' and parse through the values
for the last 8 days and find the maximum, then get the value of the
corresponding column 'Price' and put this value in the column 'rollingmax_date'.
the same logic for the 'rollingmin_date', but instead of rolling maximum date, we look for the rolling minimum date.
Now I need to find the previous 8 days max and min for the same rolling window of 8 days that I have already found.
I did the first two and tried the third one, but I'm getting wrong results.
The code below gives me only dates where on the same row df["Price"] is the same as df['rollmax'], but it doesn't bring all the corresponding dates from 'Date' to 'rollmax_date'
df['rollmax_date'] = df.loc[(df["Price"] == df.rollmax), 'Date']
This is an image with steps for recreating the lookup

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

What better way to implement this idea on a Pandas dataframe? - python

Related

Pick the smallest or highest number in accordance with another column

How to change DataFrame column values so that mean is modified accordingly?

How can I obtain the top n groups in pandas?

Create multiple columns within groupby for calculation?

What is pandas syntax for lookup based on existing columns + row values?

Categories

Resources