Python: divide a DataFrame into the same intervals as another DataFrame

I divided the following dataframe into 4 intervals according to the 'ages' column.
Let's say that I want another dataframe to have the same exact intervals, is there a quick way to do so?
In other words, the following lines
df1['age_groups'] = pd.cut(df1.ages, 4)
print(df1['age_groups'])
divide the dataframe into the following intervals:
(1.944, 16.0] 5
(16.0, 30.0] 3
(30.0, 44.0] 2
(44.0, 58.0] 2
but if I have a different dataframe with slightly different numbers in a column with the same name, the same code will produce different intervals.
How do I make sure I can subdivide other dataframes into the same intervals?
import pandas as pd

ages = [35.0, 2.0, 27.0, 14.0, 4.0, 58.0,
        20.0, 39.0, 14.0, 55.0, 2.0, 29.699118]
values = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
df1 = pd.DataFrame()
df1['ages'] = ages
df1['values'] = values
df1['age_groups'] = pd.cut(df1.ages, 4)

Save the bins from the first DataFrame using the retbins keyword, then use them as the bins argument for the second DataFrame:
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
Working example:
import numpy as np
import pandas as pd
np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
>>> df1.head()
   ages       age_groups
0    18  (11.935, 28.25]
1    34    (28.25, 44.5]
2    77    (60.75, 77.0]
3    58    (44.5, 60.75]
4    20  (11.935, 28.25]
>>> df2.head()
   ages       age_groups
0    11              NaN
1    23  (11.935, 28.25]
2    14  (11.935, 28.25]
3    69    (60.75, 77.0]
4    77    (60.75, 77.0]
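Note that ages in df2 falling outside the bin edges learned from df1 (like 11 above) come back as NaN. If that is undesirable, one option is to widen the outermost edges before reusing the bins; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# assumed setup mirroring the answer: bins learned on df1, reused on df2
df1 = pd.DataFrame({"ages": [12, 25, 40, 70]})
df2 = pd.DataFrame({"ages": [5, 30, 90]})  # 5 and 90 fall outside df1's range

_, bins = pd.cut(df1["ages"], 2, retbins=True)
# widen the outermost edges so out-of-range ages are not labelled NaN
bins[0], bins[-1] = -np.inf, np.inf
df2["age_groups"] = pd.cut(df2["ages"], bins=bins)
print(df2["age_groups"].isna().sum())  # -> 0
```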

Related

Filling a DataFrame based on conditions for both columns and rows

I have a dataframe (df_1) which contains coordinates and value data with no order that looks like this:
   x_grid  y_grid  n_value
0   204.0    32.0       45
1   204.0    33.0       32
2   204.0    34.0       94
3   204.0    35.0       92
4   204.0    36.0       84
I wanted to shape it into another dataframe (df_2) to be able to create a heatmap. So I created an empty dataframe whose column indexes are the x_grid values and whose row indexes are the y_grid values.
Then, in a for loop, I checked whether the row index equals the x_grid value and, if so, tried to set the column with the index of the y_grid value to the n_value.
Here is my code:
for i, row in enumerate(df_2.iterrows()):
    row_ind = index_list[i]
    for j, item in enumerate(df_1.iterrows()):
        x_ind = item[1].x_grid
        if row_ind == x_ind:
            col_ind = item[1].y_grid
            row[1].col_ind = item[1].n_value
When I run this loop I see new values filling the dataframe, but it does not seem right. The coordinates and values in the second dataframe do not match those in the first one.
Second dataframe (df_2) partially looks something like this:
       0  25  26  27
0      0   0  27   0
195    0   0  32  36
196    0  65   0   0
197    0   0   0  24
198    0  73  58   0
Is there a better way to perform this? I would also appreciate any other methods for turning the initial dataframe into a heatmap.
IIUC:
df_2 = df_1.pivot(index='x_grid', columns='y_grid', values='n_value') \
           .reindex(index=pd.RangeIndex(0, df_1['x_grid'].max() + 1),
                    columns=pd.RangeIndex(0, df_1['y_grid'].max() + 1),
                    fill_value=0)
If you have duplicated values for the same (x, y), use pivot_table:
df_2 = df_1.pivot_table(values='n_value', index='x_grid', columns='y_grid',
                        aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max() + 1),
                    columns=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max() + 1))
Example:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(2022)
df_1 = pd.DataFrame(np.random.randint(0, 20, (1000, 3)),
                    columns=['x_grid', 'y_grid', 'n_value'])
df_2 = df_1.pivot_table(values='n_value', index='x_grid', columns='y_grid',
                        aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max() + 1),
                    columns=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max() + 1))
sns.heatmap(df_2, vmin=0, vmax=df_1['n_value'].max())
plt.show()
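The pivot vs pivot_table distinction above can be seen on a tiny frame with a duplicated (x, y) pair; a minimal sketch with made-up values:

```python
import pandas as pd

# (x_grid=0, y_grid=5) appears twice; pivot() would raise ValueError here,
# while pivot_table() aggregates the duplicates (mean by default)
df_1 = pd.DataFrame({'x_grid': [0, 0, 1],
                     'y_grid': [5, 5, 6],
                     'n_value': [10, 20, 30]})
df_2 = df_1.pivot_table(values='n_value', index='x_grid', columns='y_grid',
                        aggfunc='mean')
print(df_2.loc[0, 5])  # -> 15.0, the mean of 10 and 20
```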

How to use a for-range loop 20 times on samples of a DataFrame

I need to get 20 samples of a DataFrame.
Here is my code to get 1 sample of 10 rows:
df = pd.read_csv(filename)
df1 = df.iloc[:, -8:]
sample1 = df1.sample(10,replace=True,random_state=0) # this is for 1 sample of 10 rows
I need to use a for-range loop 20 times and then return the mean of each column.
I am not entirely sure why you'd want to do that, but here is a way:
pd.concat([
    df1.sample(10, replace=True).mean()
    for _ in range(20)
], axis=1).mean(axis=1)
Note: with random_state=0, you force all draws to be the same, and the mean is the same as that of a single draw.
Example after synthetic setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 100, size=(100, 4)), columns=list('ABCD'))
df1 = df.iloc[:, -8:]  # no-op in this case, since there are fewer than 8 columns
Example result of the code above:
A 54.476303
B 41.859940
C 50.512408
D 45.886166
dtype: float64
If instead you want to see the mean of each column for each draw:
out = pd.concat([
    df1.sample(10, replace=True).mean()
    for _ in range(20)
], axis=1).T
>>> out
A B C D
0 47.465985 50.129386 58.124864 56.518534
1 58.923649 48.446715 46.776693 53.650037
2 60.992973 56.601188 48.049008 44.983743
3 61.546340 45.256996 50.442885 55.271372
4 46.988532 37.723527 64.090468 49.795228
5 55.474868 40.939143 48.870670 61.436648
6 51.768746 43.840800 43.764986 48.645581
7 40.390841 59.571081 51.644671 47.765832
8 48.935542 48.042567 38.030456 47.531566
9 65.405356 56.511895 51.500633 54.639754
10 55.374030 57.356247 47.312623 48.489651
11 51.058319 60.779529 40.204563 66.387166
12 54.733305 60.229638 70.569112 51.509640
13 66.992088 64.504027 42.853642 46.030091
14 50.050447 60.265275 44.487474 44.355356
15 68.018903 70.280004 35.764564 51.583207
16 41.462822 50.420280 32.341020 62.575607
17 38.148091 54.204553 40.006434 52.940808
18 58.230119 54.001817 59.826057 37.026755
19 67.777483 51.038580 39.947926 40.842169
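If reproducibility is why random_state was used in the first place, a different seed per iteration keeps the draws distinct while still being repeatable across runs; a sketch under the same synthetic setup:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.uniform(0, 100, size=(100, 4)), columns=list('ABCD'))

# one random_state per draw: samples differ from each other,
# but the whole result is identical on every rerun
out = pd.concat([
    df1.sample(10, replace=True, random_state=i).mean()
    for i in range(20)
], axis=1).T
print(out.shape)  # (20, 4)
```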

Pandas: create new rows from each existing row

I have a short dataframe and want to create new rows from the existing ones.
What the code does now: each cell in each row is multiplied by a single random number between 3 and 5:
import pandas as pd
import random

data = {'Price': [59, 98, 79],
        'Stock': [53, 60, 60],
        'Delivery': [11, 7, 6]}
df = pd.DataFrame(data)

for row in range(df.shape[0]):
    new_row = round(df.loc[row] * random.randint(3, 5))
    new_row.name = 'new row'
    df = df.append([new_row])
print(df)
         Price  Stock  Delivery
0           59     53        11
1           98     60         7
2           79     60         6
new row    295    265        55
new row    294    180        21
new row    316    240        24
Is it possible to multiply each row by different random numbers? For example:
the 1st row's 3 cells multiplied by (random) [3, 4, 5]
the 2nd row's 3 cells multiplied by (random) [4, 4, 3], etc.?
Thank you.
Change the random.randint call to NumPy's random.choice in your for loop; note the upper bound must be 6 so that 5 is included:
np.random.choice(range(3, 6), 3)
Use np.random.randint(3, 6, size=3). Actually, you can do it all at once:
df * np.random.randint(3,6, size=df.shape)
You may also generate the multiplication coefficients with the same shape as df independently, and then concat the element-wise product df * mul with the original df.
N.B. This method avoids the notoriously slow .append(). Benchmark: 10,000 rows finished almost instantly with this method, while .append() took 40 seconds!
import numpy as np
np.random.seed(111) # reproducibility
mul = np.random.randint(3, 6, df.shape) # 6 not inclusive
df_new = pd.concat([df, df * mul], axis=0).reset_index(drop=True)
Output:
print(df_new)
Price Stock Delivery
0 59 53 11
1 98 60 7
2 79 60 6
3 177 159 33
4 294 300 28
5 395 300 30
print(mul)  # check the coefficients
array([[3, 3, 3],
       [3, 5, 4],
       [5, 5, 5]])

Pandas: replace all values of a column with values that increment by n, starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS  DENSITY  STATS
 1
 2
 3
 1
 2
 3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
 TS  DENSITY  STATS
  0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data, you just need to do
df['TS'] = m
(as long as m has the same number of elements as the dataframe has rows).
example :
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
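The loop that builds m is not needed at all; np.arange (or a plain range) sized to the frame produces the same sequence in one call. A sketch with hypothetical stand-in data for the *.ene file:

```python
import numpy as np
import pandas as pd

# hypothetical frame standing in for the file contents; only TS matters here
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3],
                   'DENSITY': [0.9] * 6,
                   'STATS': [1] * 6})

# one value per existing row, stepping by 125 from 0
df['TS'] = np.arange(0, len(df) * 125, 125)
print(df['TS'].tolist())  # [0, 125, 250, 375, 500, 625]
```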

Finding the indexes of the N maximum values across an axis in Pandas

I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort:
s = pd.Series(np.random.permutation(30))
sorted_indices = s.argsort()
top_10 = sorted_indices[sorted_indices < 10]
print(top_10)
Output:
3 9
4 1
6 0
8 7
13 4
14 2
15 3
19 8
20 5
24 6
dtype: int64
IIUC, say, if you want to get the index of the top 10 largest numbers of column col:
data[col].nlargest(10).index
Give this a try. This will take the 10 largest values in each row and put them into a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
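Since np.sort returns the values rather than their positions, np.argsort can be used the same way when the column indices of each row's 10 largest entries are what is wanted; a minimal sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random_sample((50, 40)))

# column positions of each row's 10 largest values, in ascending value order
idx = np.argsort(df.values, axis=1)[:, -10:]
print(idx.shape)  # (50, 10)
```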
