Sorting by two columns in pandas series - python

This is a slight variation on a question I asked previously. I managed to sort the values of my pandas Series by a particular column. However, sorting purely by time doesn't account for the different dates on which each time occurred. I could hard-code the order and apply it with .loc, but I wanted to find out whether there is a simpler way to sort primarily by week (earliest week first) and then by time (hours 0-23 within each week).
Here again is a sample of the data I have:
weeknum time_hour
16-22Jun 0.0 5
2-8Jun 0.0 3
23-29Jun 0.0 11
9-15Jun 0.0 3
16-22Jun 1.0 3
2-8Jun 1.0 6
23-29Jun 1.0 3
9-15Jun 1.0 8
16-22Jun 2.0 3
2-8Jun 2.0 6
23-29Jun 2.0 3
16-22Jun 3.0 4
2-8Jun 3.0 2
23-29Jun 3.0 3
9-15Jun 3.0 4
16-22Jun 4.0 2
2-8Jun 4.0 7
23-29Jun 4.0 1
9-15Jun 4.0 7
16-22Jun 5.0 2
2-8Jun 5.0 9
23-29Jun 5.0 9
9-15Jun 5.0 12
16-22Jun 6.0 5
2-8Jun 6.0 12
23-29Jun 6.0 6
9-15Jun 6.0 14
16-22Jun 7.0 12
2-8Jun 7.0 17
23-29Jun 7.0 19
This is my code:
merged_clean.groupby('weeknum')['time_hour'].value_counts().sort_index(level=['time_hour'])

Sort the MultiIndex with the built-in sorted function, using a key that converts the number before the first - in weeknum to an integer and then compares time_hour. Then apply the new order with Series.reindex:
s = merged_clean.groupby('weeknum')['time_hour'].value_counts()
idx = sorted(s.index, key = lambda x: (int(x[0].split('-')[0]), x[1]))
s = s.reindex(idx)
print (s)
weeknum   time_hour
2-8Jun    0.0           3
          1.0           6
          2.0           6
          3.0           2
          4.0           7
          5.0           9
          6.0          12
          7.0          17
9-15Jun   0.0           3
          1.0           8
          3.0           4
          4.0           7
          5.0          12
          6.0          14
16-22Jun  0.0           5
          1.0           3
          2.0           3
          3.0           4
          4.0           2
          5.0           2
          6.0           5
          7.0          12
23-29Jun  0.0          11
          1.0           3
          2.0           3
          3.0           3
          4.0           1
          5.0           9
          6.0           6
          7.0          19
Name: a, dtype: int64
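As a self-contained illustration of the approach above, here is a minimal sketch on a hypothetical six-row sample (the data and the helper name week_then_hour are made up for the example). It assumes all weeks fall in the same month, since the key only looks at the number before the first -; weeks spanning several months would need a real date parse instead.

```python
import pandas as pd

# Hypothetical sample mirroring the question's (weeknum, time_hour) counts.
s = pd.Series(
    [5, 3, 11, 3, 3, 6],
    index=pd.MultiIndex.from_tuples(
        [
            ("16-22Jun", 0.0),
            ("2-8Jun", 0.0),
            ("23-29Jun", 0.0),
            ("9-15Jun", 0.0),
            ("16-22Jun", 1.0),
            ("2-8Jun", 1.0),
        ],
        names=["weeknum", "time_hour"],
    ),
)


def week_then_hour(key):
    """Sort key: first day of the week as an int, then the hour."""
    week, hour = key
    return int(week.split("-")[0]), hour


# sorted() orders the index tuples; reindex() applies that order to the Series.
ordered = s.reindex(sorted(s.index, key=week_then_hour))
print(ordered)
```

Converting the day to int matters: as plain strings, "16-22Jun" would sort before "2-8Jun".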

Related

Pandas: Fill gaps in a series with mean

Given df
df = pd.DataFrame({'distance': [0,1,2,np.nan,3,4,5,np.nan,np.nan,6]})
distance
0 0.0
1 1.0
2 2.0
3 NaN
4 3.0
5 4.0
6 5.0
7 NaN
8 NaN
9 6.0
I want to replace each NaN with the mean of the nearest non-NaN values before and after it.
Expected output:
distance
0 0.0
1 1.0
2 2.0
3 2.5
4 3.0
5 4.0
6 5.0
7 5.5
8 5.5
9 6.0
I have seen this answer, but it's for a grouped case, which isn't mine, and I couldn't find anything else.
If you don't want df.interpolate, you can compute the mean of the surrounding values manually with df.ffill and df.bfill:
(df.ffill() + df.bfill()) / 2
Out:
distance
0 0.0
1 1.0
2 2.0
3 2.5
4 3.0
5 4.0
6 5.0
7 5.5
8 5.5
9 6.0
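How about using linear interpolation? Note that for a run of consecutive NaNs it gives evenly spaced values (5.333 and 5.667 below) rather than the single in-between mean (5.5) from the expected output: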
print(df.distance.interpolate())
0 0.000000
1 1.000000
2 2.000000
3 2.500000
4 3.000000
5 4.000000
6 5.000000
7 5.333333
8 5.666667
9 6.000000
Name: distance, dtype: float64
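To make the difference between the two answers concrete, here is a small sketch that runs both on the question's data side by side. The variable names gap_mean and linear are just illustrative. One caveat, assuming your data might start or end with NaN: the ffill/bfill average leaves those edge values as NaN, because one of the two neighbours is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"distance": [0, 1, 2, np.nan, 3, 4, 5, np.nan, np.nan, 6]})

# Mean of the nearest valid neighbours: every NaN in a gap gets the same value.
gap_mean = (df["distance"].ffill() + df["distance"].bfill()) / 2

# Linear interpolation: NaNs in a gap are spaced evenly between the neighbours.
linear = df["distance"].interpolate()

print(pd.DataFrame({"gap_mean": gap_mean, "linear": linear}))
```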

Subtraction between columns of two dataframes

I have two datasets: total product data and selling data. I need to find the remaining quantity of each product by comparing the product data with the selling data. I have done some general preprocessing and have both dataframes ready to use, but I can't figure out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item. The output should look like DataFrame 2, but with the Quantity column holding the values after subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
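You can do this by left-merging the two dataframes on the item name, filling the missing quantities with 0, and subtracting. Note that HOCOX25(CTYPE) in DataFrame 1 does not match HOCOX25CTYPE in DataFrame 2, so that row stays at 3.0 unless the names are normalized first:
# Left merge keeps every row of df_2; items absent from df_1 get Qty = 0
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new['Quantity'] = df_new['Quantity'] - df_new['Qty']
df_new = df_new.drop(['Item', 'Qty'], axis=1)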
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
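If the item names are not spelled identically in both tables, one possible variation is to normalize them before matching and use Series.map instead of a merge. The sketch below uses a small hypothetical subset of the data; the normalize helper and its cleanup rule (upper-case, letters and digits only) are assumptions for illustration, so adapt them to how your names actually differ.

```python
import pandas as pd

# Small hypothetical subset of the two tables from the question.
sold = pd.DataFrame(
    {"Item": ["BUDS2", "HOCOX25(CTYPE)", "T2C"], "Qty": [1.0, 1.0, 1.0]}
)
stock = pd.DataFrame(
    {
        "Item Name": ["BUDS2", "C101CABLE", "HOCOX25CTYPE", "T2C"],
        "Quantity": [2.0, 1.0, 3.0, 1.0],
    }
)


def normalize(names):
    """Assumed cleanup rule: upper-case and keep only letters and digits."""
    return names.str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)


# Total sold quantity per normalized item name.
sold_by_item = sold.groupby(normalize(sold["Item"]))["Qty"].sum()

# Look up the sold quantity for each stock row; unsold items count as 0.
remaining = stock.copy()
sold_qty = normalize(remaining["Item Name"]).map(sold_by_item).fillna(0)
remaining["Quantity"] = remaining["Quantity"] - sold_qty
print(remaining)
```

With the names normalized, HOCOX25CTYPE drops to 2.0, as in the expected output of the question.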

How to calculate totals in dataframe by pandas

I have got this dataframe:
Date Trader1 Trader2 Trader3
01/04/2020 4 6 8
02/04/2020 4 6 8
03/04/2020 4 7 8
04/04/2020 4 7 8
05/04/2020 3 5 7
06/04/2020 2 4 7
07/04/2020 2 3 6
08/04/2020 3 3 6
09/04/2020 3 5 7
10/04/2020 3 5 7
11/04/2020 3 5 6
I would like to get the total of each column using pandas. When I apply a.loc['Total'] = pd.Series(a.sum()) I get a total for each column, but it also adds together the values of the Date column. How can I calculate totals only for the columns I need?
You can select only numeric columns by DataFrame.select_dtypes:
a.loc['Total'] = a.select_dtypes(np.number).sum()
You can remove column Date by DataFrame.drop:
a.loc['Total'] = a.drop('Date', axis=1).sum()
Or select all columns except the first by position with DataFrame.iloc:
a.loc['Total'] = a.iloc[:, 1:].sum()
print (a)
Date Trader1 Trader2 Trader3
0 01/04/2020 4.0 6.0 8.0
1 02/04/2020 4.0 6.0 8.0
2 03/04/2020 4.0 7.0 8.0
3 04/04/2020 4.0 7.0 8.0
4 05/04/2020 3.0 5.0 7.0
5 06/04/2020 2.0 4.0 7.0
6 07/04/2020 2.0 3.0 6.0
7 08/04/2020 3.0 3.0 6.0
8 09/04/2020 3.0 5.0 7.0
9 10/04/2020 3.0 5.0 7.0
10 11/04/2020 3.0 5.0 6.0
Total NaN 35.0 56.0 78.0
Alternatively, sum just the trader columns by name:
data[['Trader1','Trader2','Trader3']].sum()
I just saw your comment. There may be better ways, but I think this should work:
data[data.columns[1:]].sum()
Adjust the column range in the last line to match your data.
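For reference, here is a minimal runnable sketch of the totals-row idea on a three-row hypothetical excerpt of the table. It passes the string "number" to select_dtypes, which avoids importing NumPy; the name totals is just illustrative.

```python
import pandas as pd

a = pd.DataFrame(
    {
        "Date": ["01/04/2020", "02/04/2020", "03/04/2020"],
        "Trader1": [4, 4, 4],
        "Trader2": [6, 6, 7],
        "Trader3": [8, 8, 8],
    }
)

# Sum only the numeric columns; Date is left out of the result.
totals = a.select_dtypes("number").sum()

# Assigning a Series as a new row aligns on column names,
# so the Date cell of the Total row becomes NaN.
a.loc["Total"] = totals
print(a)
```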

How to add the sum of some column at the end of dataframe

I have a pandas dataframe with 11 columns. I want to add the sums of all values in column 9 and column 10 as a row at the end of the table. So far I have tried two methods:
Assigning the value to the cell with dataframe.iloc[rownumber, 8]. This results in an out-of-bounds error.
Creating a list padded with empty strings ('') and appending it, using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not what I wanted: the totals were added as a vertical vector at the end rather than as a horizontal row. How can I solve this?
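You need to use pandas.DataFrame.append with ignore_index=True. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions; pd.concat is its replacement.
On an older version, use: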
dataframe=dataframe.append(dataframe[['Column 9','Column 10']].sum(),ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
  col1  col2  col3 col4
0    1   2.0   5.0    5
1    2   3.0   6.0    6
2    3   4.0   7.0    7
3    4   5.0   8.0    8
4       14.0  26.0
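On pandas 2.0 and later, where DataFrame.append no longer exists, roughly the same result can be obtained with pd.concat. This is a sketch of one way to do it, not the original answer's code; total_row and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": [2, 3, 4, 5],
        "col3": [5, 6, 7, 8],
        "col4": [5, 6, 7, 8],
    }
)

# Column sums as a one-row DataFrame (Series -> one-column frame -> transpose).
total_row = df[["col2", "col3"]].sum().to_frame().T

# pd.concat stacks the total row under the data; columns missing from
# total_row (col1, col4) are filled with NaN in the new last row.
result = pd.concat([df, total_row], ignore_index=True)
print(result)
```

As in the answer above, .fillna('') can be chained onto the result if you prefer blank cells to NaN for display.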
Create a new DataFrame holding the sums. In this example the DataFrame has columns 'a' and 'b'; df1 is the DataFrame to be summed, and df3 is a one-row DataFrame containing only the sums:
data = [[df1.a.sum(),df1.b.sum()]]
df3 = pd.DataFrame(data,columns=['a','b'])
Then append it to the end (DataFrame.append requires a pandas version older than 2.0):
df1.append(df3)
Simply try this (replace test with your dataframe name).
Row-wise sum (which is what you asked for):
test['Total'] = test[['col9','col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9','col10']].sum()
test.fillna('',inplace=True)
print(test)
IIUC, this is what you need (change the positions 8 and 9 to suit your needs):
df['total'] = df.iloc[:, [8, 9]].sum(axis=1)  # row-wise (horizontal) sum of columns 8 and 9
df['total1'] = df.iloc[:, [8, 9]].sum().sum()  # grand total of both columns, repeated in every row
df.loc['total2'] = df.iloc[:, [8, 9]].sum()  # column-wise (vertical) sums as a new row
Example
a=np.arange(0, 11, 1)
b=np.random.randint(10, size=(5,11))
df=pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN
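To see how the three totals relate without modifying the dataframe, here is a small sketch that computes them as separate objects. The seed and the names cols, row_totals, col_totals and grand_total are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
df = pd.DataFrame(rng.integers(0, 10, size=(5, 11)))

cols = [8, 9]  # positions of the two columns to total

row_totals = df.iloc[:, cols].sum(axis=1)  # one value per row
col_totals = df.iloc[:, cols].sum()        # one value per selected column
grand_total = col_totals.sum()             # single number

print(row_totals, col_totals, grand_total, sep="\n\n")
```

The grand total is the same whether you add up the row totals or the column totals, which is a quick sanity check on the axis argument.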

Pandas: Group sums of n consecutive elements in a dataframe column

Let df be a pandas dataframe of the following form:
n days
1 9.0
2 4.0
3 5.0
4 1.0
5 4.0
6 1.0
7 7.0
8 3.0
For a given N and every row i >= N, I want to sum the values in df.days.iloc[i-N+1:i+1] and write the result into a new column, in row i.
The result should look like this (e.g., for N = 3):
n days loc_sum
1 9.0 NaN
2 4.0 NaN
3 5.0 18.0
4 1.0 10.0
5 4.0 10.0
6 1.0 6.0
7 7.0 12.0
8 3.0 11.0
Of course, I could simply loop through all i, and insert df.days.iloc[i-N+1:i+1].sum() for every i.
My question is: Is there a more elegant way, using pandas functionality? Especially for large datasets, looping through the rows seems to be a very slow option.
Use rolling with a window of 3 and the sum function:
df['loc_sum'] = df['days'].rolling(3).sum()
Output:
n days loc_sum
0 1 9.0 NaN
1 2 4.0 NaN
2 3 5.0 18.0
3 4 1.0 10.0
4 5 4.0 10.0
5 6 1.0 6.0
6 7 7.0 12.0
7 8 3.0 11.0
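For a general N, the same idea can be written with the window size as a variable. The sketch below also rebuilds the explicit loop from the question purely as a cross-check; loop_sum is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {"n": range(1, 9), "days": [9.0, 4.0, 5.0, 1.0, 4.0, 1.0, 7.0, 3.0]}
)
N = 3  # window length

# Rolling sum over the current row and the N-1 rows before it;
# the first N-1 rows have an incomplete window and become NaN.
df["loc_sum"] = df["days"].rolling(N).sum()

# Explicit loop from the question, used here only as a cross-check.
loop_sum = [
    df["days"].iloc[i - N + 1 : i + 1].sum() if i >= N - 1 else float("nan")
    for i in range(len(df))
]
print(df)
```

If you would rather get partial sums than NaN in the first rows, rolling also accepts min_periods=1.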
