I have the following df:
Column 1
1
2435
3345
104
505
6005
10000
80000
100000
4000000
4440
520
...
This structure is not the best for plotting a histogram, which is the main purpose. Bins don't really solve the problem either, at least from what I've tested so far. That's why I'd like to create my own bins in a new column:
I basically want to assign every value within a certain range in Column 1 a bucket in Column 2, so that it looks like this:
Column 1 Column2
1 < 10000
2435 < 10000
3345 < 10000
104 < 10000
505 < 10000
6005 < 10000
10000 < 50000
80000 < 150000
100000 < 150000
4000000 < 250000
4440 < 10000
520 < 10000
...
Once I get there, creating a plot will be much easier.
Thanks!
There is a pandas function for exactly this, cut; there is a section describing it in the docs. cut returns the half-open interval (open on the left, closed on the right) each value falls into:
In [29]:
df['bin'] = pd.cut(df['Column 1'], bins=[0, 10000, 50000, 150000, 25000000])
df
Out[29]:
Column 1 bin
0 1 (0, 10000]
1 2435 (0, 10000]
2 3345 (0, 10000]
3 104 (0, 10000]
4 505 (0, 10000]
5 6005 (0, 10000]
6 10000 (0, 10000]
7 80000 (50000, 150000]
8 100000 (50000, 150000]
9 4000000 (150000, 25000000]
10 4440 (0, 10000]
11 520 (0, 10000]
The dtype of the column is a Category and can be used for filtering, counting, plotting etc.
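For example, counting and plotting per bin could look like this (a minimal sketch that is my addition, not part of the original answer; it assumes matplotlib is available and reuses the bin edges from above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Column 1': [1, 2435, 3345, 104, 505, 6005, 10000,
                                80000, 100000, 4000000, 4440, 520]})
df['bin'] = pd.cut(df['Column 1'], bins=[0, 10000, 50000, 150000, 25000000])

# count the values in each bin (categorical order is preserved) and plot them
counts = df['bin'].value_counts(sort=False)
print(counts)
counts.plot(kind='bar')
plt.show()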
numpy.histogram takes a bins parameter, which can be an array of bin edges, and returns an array of the counts within those bins. So, if you run
import numpy as np
counts, _ = np.histogram(df['Column 1'].values, [10000, 50000, 150000, 250000])
You will have the bins you want. From here, you can do whatever you want, including plotting the number of counts within each bin:
import matplotlib.pyplot as plt
plt.plot(counts)
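If you want the bin ranges as labels on the axis, a bar-plot variant might look like this (again a sketch of my own; note that I added a 0 edge so the small values are counted, and that 4000000 still falls outside the last edge of 250000):
import numpy as np
import matplotlib.pyplot as plt

values = np.array([1, 2435, 3345, 104, 505, 6005, 10000,
                   80000, 100000, 4000000, 4440, 520])
edges = [0, 10000, 50000, 150000, 250000]
counts, _ = np.histogram(values, bins=edges)

# label each bar with its bin range
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
plt.bar(labels, counts)
plt.ylabel("count")
plt.show()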
Related
I am generating data with a timestamp (counting up). I then want to separate the array based on the timestamp and calculate the mean of the data in each window. My new array then has a new "timestamp" and the calculated mean data.
My code works as intended, but I believe there is a more numpy-like way. I think the while loop can be removed and replaced with np.where checking the whole array, since it is already sorted as well.
Thanks for your help.
import numpy as np

# generating test data: first row timestamps (always counting up), second row random data
data = np.array([np.cumsum(np.random.randint(100, size=20)),
                 np.random.randint(1, 5, size=20)])
print(data)

window_size = 200
overlap = 100
i, l_lim, u_lim = 0, 0, window_size
timestamps = []
window_mean = []
while u_lim < data[0, -1]:
    window_mean.append(np.mean(data[1, np.where((data[0, :] > l_lim) & (data[0, :] <= u_lim))]))
    timestamps.append(i)
    l_lim = u_lim - overlap
    u_lim = l_lim + window_size
    i += 1
print(np.array([timestamps, window_mean]))
While I may have reduced the number of lines of code, I do not think I have really improved it much. The main difference is the method of iteration and its use to define the selection boundaries; otherwise, I could not see any way to improve on your code. Here is my attempt, for what it is worth:
Code:
import numpy as np
np.random.seed(5)
data = np.array([np.cumsum(np.random.randint(100, size=20)), np.random.randint(1, 5, size=20)])
print("Data:", data)
window_size = 200
overlap = 100
for i in range((max(data[0]) // (window_size - overlap)) + 1):
    result = np.mean(data[1, np.where((data[0] > i*(window_size-overlap)) & (data[0] <= (i*(window_size-overlap)) + window_size))])
    print(f"{i}: {result:.2f}")
Output:
Data: [[ 99 177 238 254 327 335 397 424 454 534 541 617 632 685 765 792 836 913 988 1053]
[ 4 3 1 3 2 3 3 2 2 3 2 2 3 2 3 4 1 3 2 3]]
0: 3.50
1: 2.33
2: 2.40
3: 2.40
4: 2.25
5: 2.40
6: 2.80
7: 2.67
8: 2.00
9: 2.67
10: 3.00
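For completeness, a fully loop-free sketch is possible if you lean on the fact that the timestamps are sorted (they come from cumsum). This is my own addition using np.searchsorted and a cumulative sum, with the same window definition as the loop above; windows that contain no samples come out as NaN:
import numpy as np

np.random.seed(5)
data = np.array([np.cumsum(np.random.randint(100, size=20)),
                 np.random.randint(1, 5, size=20)])

window_size, overlap = 200, 100
step = window_size - overlap

# window i covers (i*step, i*step + window_size], as in the loop above
l_lims = np.arange(max(data[0]) // step + 1) * step
u_lims = l_lims + window_size

# data[0] is sorted, so searchsorted gives each window's slice bounds
starts = np.searchsorted(data[0], l_lims, side='right')
stops = np.searchsorted(data[0], u_lims, side='right')

# a running sum of the values lets all window means be computed at once
csum = np.concatenate(([0], np.cumsum(data[1])))
with np.errstate(invalid='ignore'):
    means = (csum[stops] - csum[starts]) / (stops - starts)
print(np.round(means, 2))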
I have a problem where data must be processed across multiple cores. Let df be a pandas DataFrameGroupBy (size()) object. Each value represents the computational "cost" each group has for the cores. How can I divide df into n bins of unequal sizes but with (approximately) the same computational cost?
import pandas as pd
import numpy as np
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
df
one two data
0 0 0 75
1 0 0 75
2 0 0 49
3 0 0 94
4 0 0 66
...
45 9 4 12
46 9 4 97
47 9 4 12
48 9 4 32
49 10 5 45
One typically splits the dataset into n bins, as in the code below. However, splitting the dataset into n equal parts is undesirable because the cores receive very unbalanced workloads, e.g. 205 vs 788.
n = 4
bins = np.array_split(groups, n) # undesired
[b.sum() for b in bins] #undesired
[data 788
dtype: int64, data 558
dtype: int64, data 768
dtype: int64, data 205
dtype: int64]
The desired solution splits the data into bins of unequal sizes but with approximately equal summed values. I.e. the spread abs(743 - 548) = 195 is smaller than with the previous method, abs(205 - 788) = 583. The difference should be as small as possible. A simple list example of how it should look:
# only an example to demonstrate desired functionality
example = [[[10, 5], 45], [[2, 1], 187], [[3, 1], 249], [[6, 3], 262]], [[[9, 4], 153], [[4, 2], 248], [[1, 0], 264]], [[[8, 4], 245], [[7, 3], 326]], [[[5, 2], 189], [[0, 0], 359]]
[sum([size for (group, size) in test]) for test in example]  # [743, 665, 571, 548]
Is there a more efficient method to split the dataset into bins as described above in pandas or numpy?
It is important to split/bin the GroupBy object, accessing the data in a similar way as returned by np.array_split().
I think a good approach has been found. Credits to a colleague.
The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume n = 3 (number of bins) and the following data:
groups
data
0 359
1 326
2 264
3 262
4 249
5 248
6 245
7 189
8 187
9 153
10 45
The idea is to put one group at a time into the bins, moving back and forth between them in a "backward S" pattern: first element in bin 0, second element in bin 1, etc. Then go backwards when reaching the last bin: fourth element in bin 2, fifth element in bin 1, etc. See below how the elements are put into bins, with the group number in parentheses. The values are the group sizes.
Bins: | 0 | 1 | 2 |
| 359 (0)| 326 (1)| 264 (2)|
| 248 (5)| 249 (4)| 262 (3)|
| 245 (6)| 189 (7)| 187 (8)|
| | 45(10)| 153 (9)|
The bins will have approximately the same summed size and, thus, approximately the same computational "cost". The bin sizes are [852, 809, 866], for anyone interested. I have tried this on a real-world dataset and the bins come out at similar sizes. It is not guaranteed that the bins will be of similar size for every dataset.
The code can be made more efficient, but this is sufficient to get the idea out:
n = 3
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

bins = [[] for i in range(n)]
backward = False
i = 0
for group in groups.iterrows():
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    if i == n:
        backward = True
        i -= 1
    if i == -1 and backward:
        backward = False
        i += 1

[sum([size[0] for (group, size) in bin]) for bin in bins]
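If you need each bin as a regular DataFrame again (so the data can be accessed much like the output of np.array_split), one way is to rebuild them from the (index, row) tuples that iterrows produced above; this is a small sketch of my own on top of the code above:
# rebuild one DataFrame per bin from the (index, row) tuples collected above
bin_frames = [pd.DataFrame([row for _, row in b]) for b in bins]
for frame in bin_frames:
    print(frame["data"].sum())  # reproduces the bin sizes, e.g. 852, 809, 866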
I'm supposed to write a code to represent heat dispersion using the finite difference formula given below.
u(t)[i, j] = (u(t-1)[i+1, j] + u(t-1)[i-1, j] + u(t-1)[i, j+1] + u(t-1)[i, j-1]) / 4
The formula is supposed to produce the result only for a time step of 1. So, if an array like this was given:
100 100 100 100 100
100 0 0 0 100
100 0 0 0 100
100 0 0 0 100
100 100 100 100 100
The resulting array at time step 1 would be:
100 100 100 100 100
100 50 25 50 100
100 25 0 25 100
100 50 25 50 100
100 100 100 100 100
I know the representation using for loops would be as follows, where the array would have a minimum of 2 rows and 2 columns as a precondition:
h = np.copy(u)
for i in range(1, h.shape[0]-1):
    for j in range(1, h.shape[1]-1):
        num = u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1]
        h[i][j] = num/4
But I cannot figure out how to vectorize the code to represent heat dispersion. I am supposed to use numpy arrays and vectorization, am not allowed to use for loops of any kind, and think I am supposed to rely on slicing, but I cannot figure out how to write it. I have started with:
r, c = h.shape
if c == 2 or r == 2:
    return h
I'm sure that if rows == 2 or columns == 2 then the array is returned as is, but correct me if I'm wrong. Any help would be greatly appreciated. Thank you!
Try:
h[1:-1,1:-1] = (h[2:,1:-1] + h[:-2,1:-1] + h[1:-1,2:] + h[1:-1,:-2]) / 4
This solution uses slicing, where:
1:-1 stands for indices 1, 2, ..., LAST - 1
2: stands for 2, 3, ..., LAST
:-2 stands for 0, 1, ..., LAST - 2
At each time step only the inner elements (indices 1..LAST-1) are updated; the boundary rows and columns stay as they are.
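To see it in action on the 5x5 example from the question (a self-contained sketch; building the starting grid and printing the result are my additions):
import numpy as np

u = np.zeros((5, 5))
u[0, :] = u[-1, :] = u[:, 0] = u[:, -1] = 100  # hot boundary, cold interior

h = np.copy(u)
# the right-hand side is evaluated in full before the assignment, so reading h here is safe
h[1:-1, 1:-1] = (h[2:, 1:-1] + h[:-2, 1:-1] + h[1:-1, 2:] + h[1:-1, :-2]) / 4
print(h)  # interior becomes 50 25 50 / 25 0 25 / 50 25 50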
I have a dataframe with 2 variables that have values from 0 to 1000. How can I efficiently split it into groups? I want to split x into 10 groups (0 to 100, 100 to 200, ..., 900 to 1000) and then do the same for y (0 to 100, 100 to 200, ..., 900 to 1000). Then I want to combine them and have every single combination. One group having 0
Example data
x y
102 602
224 340
368 756
487 305
568 310
510 911
10 50
340 519
9 282
10 150
Well, if you want to split them into 10 groups, remember that from 0 to 1000 there are 1001 numbers, so the groups would be something like 0-100, 101-200, 201-300, ..., 901-1000: you would have 9 groups of 100 elements and 1 group of 101 elements. If you build the groups the way you describe (0 to 100, 100 to 200, ...), you get ten groups of 101 elements each, because you are repeating some numbers at the boundaries. Since I don't know how you want to distribute the groups, I am going to work with 0 to 999, giving 10 groups of 100 elements each. The possible combinations of the groups are 10 x 10 = 100, but if you want to combine the elements as well it becomes a huge number: 100 x 100 = 10,000 element pairs per pair of groups, times 100 group pairs = 1,000,000. So first you need to explain whether you want the combinations of the groups only, or whether you want to combine the elements as well. I don't know if you are using pandas, so I'm going to give you the logic with pure Python.
To split x and y into ten groups covering the numbers from 0 to 999, you can do something like this:
x_group_list = [[i for i in range(k * 100,(k + 1)*100)] for k in range(0,10)]
y_group_list = [[i for i in range(k * 100,(k + 1)*100)] for k in range(0,10)]
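If you are in fact using pandas, a sketch of the same idea with pd.cut could look like the following (the column names x and y, the right-closed bin edges, and the observed=False flag are my assumptions):
import pandas as pd

df = pd.DataFrame({"x": [102, 224, 368, 487, 568, 510, 10, 340, 9, 10],
                   "y": [602, 340, 756, 305, 310, 911, 50, 519, 282, 150]})

edges = list(range(0, 1100, 100))  # 0, 100, ..., 1000
df["x_bin"] = pd.cut(df["x"], bins=edges)
df["y_bin"] = pd.cut(df["y"], bins=edges)

# every (x_bin, y_bin) combination with its row count, including empty ones
combos = df.groupby(["x_bin", "y_bin"], observed=False).size()
print(combos)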
I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can overlap (the count should be added to both) and that some counts may not fall in any master bin (the count should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide whether each individual record is in the band, and finally groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of count: keep it only if the point falls inside the band
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin', 'End', 'Marker', 'Count']].groupby(['Begin', 'End', 'Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker', 'Begin', 'End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston']  # rename the count column to boston
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
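To get this back into the shape of the desired frame (boston as a plain column next to Begin/End/Marker), one further step could be a merge back onto the master. This is my addition and assumes the zero-filled boston column from the question has not already been added (otherwise drop it first):
result = df_inventory_master.merge(df_agged_resorted.reset_index(),
                                   on=['Begin', 'End', 'Marker'], how='left')
print(result)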