I want to combine every five rows of my dataset into a single row. I have 700 rows, so I want to merge each consecutive group of five.
A B C D E F G
1 10,11,12,13,14,15,16
2 17,18,19,20,21,22,23
3 24,25,26,27,28,29,30
4 31,32,33,34,35,36,37
5 38,39,40,41,42,43,44
.
.
.
.
.
700
After combining the first five rows, my first row should look like this:
A B C D E F G A B C D E F G A B C D E F G A B C D E F G A B C D E F G
1 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44
If you can guarantee that the total number of rows you have is a multiple of 5, dipping into numpy will be the most efficient way to solve this problem:
import numpy as np
import pandas as pd
data = np.arange(70).reshape(-1, 7)
df = pd.DataFrame(data, columns=[*'ABCDEFG'])
print(df)
A B C D E F G
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
5 35 36 37 38 39 40 41
6 42 43 44 45 46 47 48
7 49 50 51 52 53 54 55
8 56 57 58 59 60 61 62
9 63 64 65 66 67 68 69
out = pd.DataFrame(
df.to_numpy().reshape(-1, df.shape[1] * 5),
columns=[*df.columns] * 5
)
print(out)
A B C D E F G A B C D E F ... B C D E F G A B C D E F G
0 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 22 23 24 25 26 27 28 29 30 31 32 33 34
1 35 36 37 38 39 40 41 42 43 44 45 46 47 ... 57 58 59 60 61 62 63 64 65 66 67 68 69
[2 rows x 35 columns]
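If the row count is not a multiple of 5, one option (a sketch, not part of the answer above) is to pad the array with NaN rows up to the next multiple before reshaping:

```python
import numpy as np
import pandas as pd

# Stand-in frame with 8 rows (not a multiple of 5).
df = pd.DataFrame(np.arange(56).reshape(-1, 7), columns=[*'ABCDEFG'])

n = 5
pad = -len(df) % n  # number of filler rows needed to reach the next multiple of 5

# Cast to float so the padding NaNs fit, stack the filler rows on, then reshape.
padded = np.vstack([df.to_numpy(float), np.full((pad, df.shape[1]), np.nan)])
out = pd.DataFrame(padded.reshape(-1, df.shape[1] * n), columns=[*df.columns] * n)
print(out.shape)  # (2, 35); the tail of the last row is NaN
```

The trailing NaNs mark the positions that had no source row, so the incomplete last group is still preserved.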
You can do:
cols = df.columns.tolist() * 5
dfs = [df[i:i+5].reset_index(drop=True) for i in range(0, len(df), 5)]
df2 = pd.concat([pd.DataFrame(chunk.stack()).T for chunk in dfs])
df2.columns = cols
df2.reset_index(drop=True, inplace=True)
See if this helps answer your question.
unstack turns the columns into rows; once the data are in a single column, we just need them transposed. reset_index turns the resulting Series into a DataFrame, with the original column names in a column that we then set as the index, so after transposing the columns come out exactly as you stated.
df.unstack().reset_index().set_index('level_0')[[0]].T
level_0 A A A A A B B B B B ... F F F F F G G G G G
0 10 17 24 31 38 11 18 25 32 39 ... 15 22 29 36 43 16 23 30 37 44
The easiest way is to convert your dataframe to a NumPy array, reshape it, then cast it back to a new dataframe.
Edit:
import numpy as np
import pandas as pd

data = ...  # your dataframe
new_dataframe = pd.DataFrame(
    data.to_numpy().reshape(len(data) // 5, -1),
    columns=np.tile(data.columns, 5),
)
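To make that snippet concrete, here is a small runnable version with a stand-in frame (10 rows of 7 columns, assuming the row count is a multiple of 5):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 700-row frame.
data = pd.DataFrame(np.arange(70).reshape(10, 7), columns=[*'ABCDEFG'])

# Every 5 source rows become one wide row; np.tile repeats the column labels.
new_dataframe = pd.DataFrame(
    data.to_numpy().reshape(len(data) // 5, -1),
    columns=np.tile(data.columns, 5),
)
print(new_dataframe.shape)  # (2, 35)
```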
Stacking and unstacking data in pandas
Data in tables can be presented in multiple ways. Long form ("tidy data") refers to data stacked into a couple of columns, where one column holds categorical indicators for the values. In contrast, wide form gives each category its own column.
Your example presents the data in wide form. melt, groupby, pivot, stack, unstack, and reset_index are the pandas tools that convert between these forms; below we melt the frame into long form and then pivot it back.
Start with your original dataframe:
df = pd.DataFrame({
'A' : [10, 17, 24, 31, 38],
'B' : [11, 18, 25, 32, 39],
'C' : [12, 19, 26, 33, 40],
'D' : [13, 20, 27, 34, 41],
'E' : [14, 21, 28, 35, 42],
'F' : [15, 22, 29, 36, 43],
'G' : [16, 23, 30, 37, 44]})
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
Use pandas.melt to convert it to long form, then sort to get the order you requested. Passing ignore_index=False keeps the original index, which helps us get back to wide form later.
melted_df = df.melt(ignore_index=False).sort_values(by='value')
variable value
0 A 10
0 B 11
0 C 12
0 D 13
0 E 14
0 F 15
0 G 16
1 A 17
1 B 18
...
Use groupby, unstack, and reset_index to convert it back to wide form. This direction is often harder: group by the index and the stacked variable, aggregate, then unstack and reset the index.
(melted_df
.reset_index() # puts the index values into a column called 'index'
.groupby(['index','variable']) #groups by the index and the variable
.value #selects the value column in each of the groupby objects
.mean() #since there is only one item per group, it only aggregates one item
.unstack() #this sets the first item of the multi-index to columns
.reset_index() #fix the index
.set_index('index') #set index
)
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
This stuff can be quite difficult and requires trial and error. I would recommend making a smaller version of your problem and experimenting with it; that way you can figure out how the functions work.
Try this: use arange() with floor division to assign a group label to every 5 rows, then build a new dataframe from the groups. This works even if the length of your df is not divisible by 5.
l = 5
(df.groupby(np.arange(len(df.index))//l)
.apply(lambda x: pd.DataFrame([x.to_numpy().ravel()]))
.set_axis(df.columns.tolist() * l,axis=1)
.reset_index(drop=True))
or
(df.groupby(np.arange(len(df.index))//5)
.apply(lambda x: x.reset_index(drop=True).stack())
.unstack(level=[1,2])
.droplevel(0,axis=1))
Output:
A B C D E F G A B C ... E F G A B C D E F G
0 9 0 3 2 6 2 9 1 7 5 ... 2 5 9 5 4 9 7 3 8 9
1 9 5 0 8 1 5 8 7 7 7 ... 6 3 5 5 2 3 9 7 5 6
I have a pandas df as displayed. I would like to calculate an "Avg Rate by DC by Brand" column, similar to AVERAGEIF in Excel.
I have tried methods like groupby().mean(), but that does not give the results I need.
Your question is not entirely clear, but you may be looking for:
df.groupby(['DC','Brand'])['Rate'].mean()
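A minimal sketch with made-up data (the DC/Brand/Rate column names are assumed from the question):

```python
import pandas as pd

# Hypothetical data with the columns the question mentions.
df = pd.DataFrame({
    'DC':    ['X', 'X', 'Y', 'Y'],
    'Brand': ['A', 'B', 'A', 'B'],
    'Rate':  [10, 20, 30, 40],
})

# One mean per (DC, Brand) pair.
avg = df.groupby(['DC', 'Brand'])['Rate'].mean()
print(avg)
```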
AVERAGEIF in Excel, when filled down, yields a value for every row of your original data. So I think you're looking for pandas' groupby().transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
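For reference, the sample frame above can be rebuilt like this so the snippet below runs as-is:

```python
import pandas as pd

# Rebuild the sample frame shown above.
df = pd.DataFrame({
    'Brand': list('ABCABCABCDDDEE'),
    'Rate':  [45, 100, 28, 92, 2, 79, 48, 97, 72, 14, 16, 64, 85, 22],
})
```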
Result:
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
I have 2 dataframes, df1 and df2, and df2 holds the min and max values for the corresponding columns.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
I would like to iterate through df1 and replace each cell value with the corresponding value from df2 whenever the df1 cell is below the column's min or above its max.
First, don't loop/iterate in pandas if a better, vectorized solution exists, as it does here.
Use numpy.select with broadcasting to set the values by condition:
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
print (df1)
A B C D E
0 45 2 28 34 38
1 17 19 42 22 33
2 32 49 47 9 32
3 46 32 47 25 19
4 14 36 32 16 4
5 49 3 2 20 39
6 2 20 47 48 7
7 41 35 28 38 33
8 21 30 27 34 33
print (df2)
A B C D E
0 5 3 4 7 2
1 30 20 30 40 50
#for pandas below 0.24 change .to_numpy() to .values
min1 = df2.loc[0].to_numpy()
max1 = df2.loc[1].to_numpy()
arr = df1.to_numpy()
df = pd.DataFrame(np.select([arr < min1, arr > max1], [min1, max1], arr),
index=df1.index,
columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
An even better solution uses numpy.clip:
df = pd.DataFrame(np.clip(arr, min1, max1), index=df1.index, columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
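If you'd rather stay in pandas, DataFrame.clip accepts Series bounds aligned on column labels via axis=1, which is equivalent to the numpy.clip version:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0, 50, size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5, 3, 4, 7, 2], [30, 20, 30, 40, 50]]),
                   columns=list('ABCDE'))

# DataFrame.clip aligns the Series bounds with df1's columns when axis=1.
df_clipped = df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)
```

This keeps the index and dtypes without an explicit round trip through NumPy.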
import pandas as pd
df= pd.DataFrame({'date':[1,2,3,4,5,1,2,3,4,5,1,2,3,4,5],
'name':list('aaaaabbbbbccccc'),
'v1':[10,20,30,40,50,10,20,30,40,50,10,20,30,40,50],
'v2':[10,20,30,40,50,10,20,30,40,50,10,20,30,40,50],
'v3':[10,20,30,40,50,10,20,30,40,50,10,20,30,40,50]})
a= list(set(list(df.name)))
plus=[]
for i in a:
sep=df[df.name==i]
sep2=sep[(sep.v1>=10)&(sep.v2>=20)&(sep.v3<=40)]
plus.append(sep2)
result=pd.concat(plus)
print(result)
I know this is not a good example, but I would like to handle each name separately. It takes too long on big data.
How can I extract the data using groupby? Even better if a function is used (def ... apply ...).
df.groupby(['name'])(df['v1']>20)...???? This doesn't work.
Looking at your desired data set, I don't think you need to groupby your df; you can simply filter it:
In [112]: df.query('v1 >= 10 and v2 >= 20 and v3 <= 40')
Out[112]:
date name v1 v2 v3
1 2 a 20 20 20
2 3 a 30 30 30
3 4 a 40 40 40
6 2 b 20 20 20
7 3 b 30 30 30
8 4 b 40 40 40
11 2 c 20 20 20
12 3 c 30 30 30
13 4 c 40 40 40
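The query call is equivalent to plain boolean masking, which avoids the string parsing; a self-contained sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5] * 3,
                   'name': list('aaaaabbbbbccccc'),
                   'v1': [10, 20, 30, 40, 50] * 3,
                   'v2': [10, 20, 30, 40, 50] * 3,
                   'v3': [10, 20, 30, 40, 50] * 3})

# Same conditions as the query, combined with & (note the parentheses).
mask = (df.v1 >= 10) & (df.v2 >= 20) & (df.v3 <= 40)
result = df[mask]
print(len(result))  # 9
```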