Bin pandas dataframe by integer values - python

I have a pandas dataframe and I want to bin the data by the values of a single column, e.g. 0-0.1, 0.1-0.2, etc., starting at 0 and ending at 1 in intervals of 0.1, taking the mean of each column within each bin.
I'm attempting to accomplish this using the .groupby functionality of pandas. See my code below:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})
bins = np.linspace(0, 1, 0.1)
groups = my_df.groupby(np.digitize(my_df.a, bins))
binned_data = groups.mean()
print(binned_data)
The print line then gives a single row with index "1", even though the values in column "a" should be spread across the specified bins.
I think it's a problem with how "bins" is created, but I can't work out what.
I want 10 rows, binned from 0 to 1 in intervals of 0.1. How can I accomplish this?
Many thanks.
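A possible explanation, offered as a sketch rather than a confirmed diagnosis: the third argument of np.linspace is the number of samples, not the step size, so np.linspace(0, 1, 0.1) cannot produce edges that are 0.1 apart. Assuming eleven edges from 0 to 1 are what is wanted, np.linspace(0, 1, 11) gives ten bins. The seed and the name bin_ids below are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
my_df = pd.DataFrame({"a": rng.random(100),
                      "b": rng.random(100),
                      "id": np.arange(100)})

# 11 edges -> 10 bins of width 0.1 (the third argument is a count, not a step)
bins = np.linspace(0, 1, 11)

# np.digitize returns 1 for [0, 0.1), 2 for [0.1, 0.2), ..., 10 for [0.9, 1.0)
bin_ids = np.digitize(my_df["a"], bins)
binned_data = my_df.groupby(bin_ids).mean()
print(binned_data)
```

pd.cut(my_df["a"], bins) could be used in place of np.digitize if interval labels are preferred over integer bin numbers.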

Related

How to make the order of legend according to the average of data line

I have 8 data lines plotted using matplotlib. The legend entries are currently ordered by filename, as my code suggests.
However, I want to order the legend by the average value of each data line, from highest average to lowest. The legend names still have to be the filenames; only the order should change.
How do I achieve it?
my original code:
df.plot(x='INDEX', y=range(1, 9, 1))
a = {}
for z in range(0, len(files_in_dir), 1):
    a_string = (files_in_dir[z])
    a[z] = a_string
plt.legend(a.values(), loc = 1)
Although you could modify the legend object itself, I think a simpler workaround is to rearrange the columns in your DataFrame in descending order by their average value.
I created a DataFrame similar to yours with a column called 'INDEX' which I assume is similar to the format of your DataFrame (but including a sample of your DataFrame in the question would help), and other columns with average values intentionally out of order. Then we can sort columns whose values will be the y-values on the plot (I assume all of the columns except for 'INDEX' in your DataFrame) by their average value in descending order (credit goes to Andy Hayden's answer), and apply the same df.plot method.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
## reproduce some data where the columns have different averages and are out of order
np.random.seed(42)
df = pd.DataFrame({
    'INDEX':list(range(1,65)),
    'consumer aged 25C':np.random.normal(0, 0.1, 64),
    'consumer aged 70C':np.random.normal(2, 0.1, 64),
    'consumer fresh 25C':np.random.normal(1, 0.1, 64),
    'consumer fresh 70C':np.random.normal(5, 0.1, 64)
})
# df.plot(x='INDEX', y=range(1,5))
## sort the non-INDEX columns of the DataFrame by their average value
df_sorted = df.reindex(
    pd.Index(['INDEX'])
    .append(df.loc[:, df.columns != 'INDEX']
            .mean()
            .sort_values(ascending=False).index), axis=1
)
df_sorted.plot(x='INDEX', y=range(1,5))
plt.show()
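For comparison, the other route mentioned above (modifying the legend itself) might look roughly like the sketch below. It leaves the DataFrame untouched and only rebuilds the legend in order of descending mean. The column names file_a, file_b, file_c and the Agg backend are assumptions made so the example is self-contained; they are not from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: three lines with clearly different averages
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "INDEX": np.arange(1, 11),
    "file_a": rng.normal(0, 0.1, 10),
    "file_b": rng.normal(5, 0.1, 10),
    "file_c": rng.normal(2, 0.1, 10),
})

ax = df.plot(x="INDEX")

# Column names ordered from highest to lowest mean
order = df.drop(columns="INDEX").mean().sort_values(ascending=False).index.tolist()

# Map each label to its line handle, then rebuild the legend in the new order
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend([by_label[name] for name in order], order, loc=1)
```

The lines keep their original colours here, since only the legend order changes, whereas reordering the columns also changes which colour each line gets.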

Pandas: removing everything in a column after first value above threshold

I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted which of several realisations of that process are above 0.9 when I observe them at the end of 14 rounds.
This image was created with this code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin>0.9).astype(float)
ax=fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would now like to plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times).
I can find the first occurrence of a value above 0.9 with idxmax(), but I'm stumped about how to remove everything in the dataframe after that in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
# index of the first value above 0.9 in each column
maxes = (df > 0.9).idxmax()
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use
df = df[df.index < maxes[0]]
IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
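To see why the cumprod mask works, here is a small worked sketch on made-up numbers (the column names r0 and r1 are hypothetical). The cumulative product of the boolean matrix stays 1 until the first value at or above the threshold and is 0 from then on, so df.where blanks out the crossing and everything after it.

```python
import pandas as pd

# Two hypothetical realisations (columns) observed at four times (rows)
df = pd.DataFrame({"r0": [0.2, 0.95, 0.1, 0.99],
                   "r1": [0.5, 0.6, 0.7, 0.8]})

below = df < 0.9                      # True while under the threshold
keep = below.cumprod().astype(bool)   # turns False at the first crossing and stays False
trimmed = df.where(keep)              # the crossing and everything after it become NaN

# Shifting the mask down one row keeps the crossing value itself
with_crossing = df.where(keep.shift(1, fill_value=True))

print(trimmed)
print(with_crossing)
```

Here r0 keeps only 0.2 in trimmed and keeps 0.2 and 0.95 in with_crossing, while r1 never crosses and is left unchanged in both.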

Plot multiple values as ranges - matplotlib

I'm trying to determine the most efficient way to produce a group of line plots displayed as a range. I'm hoping to produce something like:
I'll try to explain as much as possible. Sorry if I miss any information. I'm envisaging the x-axis as a range of hourly timestamps (8am, 9am, 10am, etc.). The total range would be between 8:00:00 and 27:00:00. The y-axis is a count of values occurring at any point in time. The range in the plot would represent the max, min, and average values occurring.
An example df is listed below:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
    })
df = pd.DataFrame(data = d)
So this df represents 3 different sets of data. The times, the values occurring, and even the number of entries can vary.
I'm unsure whether I need to rethink my approach. Would a rolling calculation work here, something that assesses the max, min, and average number of values occurring for each hour in a df (e.g. 8:00:00-9:00:00)?
Below is my full initial attempt:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
    })
df = pd.DataFrame(data = d)
fig, ax = plt.subplots(figsize = (10,6))
ax.plot(df['Time1'], df['Occurring1'])
ax.plot(df['Time2'], df['Occurring2'])
ax.plot(df['Time3'], df['Occurring3'])
plt.show()
To get the desired result, you'd need to jump through a few hoops. First you need to create a regular time grid, onto which you interpolate the y-data (the occurrences). Then, you can get the min, max, and mean of the interpolated data. The code below demonstrates how to do this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata
# Example data
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
    })
# Create dataframe, explicitly define dtypes
df = pd.DataFrame(data=d)
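# np.int was removed from recent NumPy releases, so the built-in int is used;
# the datetime dtype is spelled with an explicit unit
df = df.astype({
    "Time1": "datetime64[ns]",
    "Occurring1": int,
    "Time2": "datetime64[ns]",
    "Occurring2": int,
    "Time3": "datetime64[ns]",
    "Occurring3": int,
    })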
# Create 1D vectors of time data
all_times = df[["Time1", "Time2", "Time3"]].values
# Representation of 1 minute in time
t_min = np.timedelta64(int(60*1e9), "ns")
# Create a regular time grid with 10 minute spacing
time_grid = np.arange(all_times.min(), all_times.max(), 10*t_min, dtype="datetime64")
# Storage buffer for interpolated occurring data
occurrences_grid = np.zeros((3, len(time_grid)))
# Loop over all occurrence data and interpolate to regular grid
for i in range(3):
    occurrences_grid[i] = griddata(
        points=df["Time%i" % (i+1)].values.astype("float"),
        values=df["Occurring%i" % (i+1)],
        xi=time_grid.astype("float"),
        method="linear"
    )
# Get min, max, and mean values of interpolated data
occ_min = np.min(occurrences_grid, axis=0)
occ_max = np.max(occurrences_grid, axis=0)
occ_mean = np.mean(occurrences_grid, axis=0)
# Plot interpolated data
plt.fill_between(time_grid, occ_min, occ_max, color="slategray")
plt.plot(time_grid, occ_mean, c="white")
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
Result (x-labels not formatted properly):
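The interpolate-then-aggregate idea can also be shown in isolation with plain NumPy. This is only a sketch under assumptions: it swaps np.interp in for griddata, represents times as minutes since midnight, and uses shortened made-up arrays rather than the data above.

```python
import numpy as np

# Hypothetical sample data: times in minutes since midnight and counts for three series
times = [np.array([480, 570, 580, 625]),
         np.array([490, 574, 588, 640]),
         np.array([540, 574, 598, 645])]
counts = [np.array([1, 2, 3, 4]),
          np.array([1, 2, 2, 3]),
          np.array([1, 2, 3, 4])]

# Regular grid with 10-minute spacing over the range covered by all three series
start = max(t[0] for t in times)
stop = min(t[-1] for t in times)
grid = np.arange(start, stop + 1, 10)

# Linearly interpolate each series onto the grid (np.interp needs increasing x values)
interpolated = np.vstack([np.interp(grid, t, c) for t, c in zip(times, counts)])

# Envelope and mean across the three series at each grid point
occ_min = interpolated.min(axis=0)
occ_max = interpolated.max(axis=0)
occ_mean = interpolated.mean(axis=0)
```

Restricting the grid to the overlapping range avoids extrapolating any series beyond its first or last sample. The resulting arrays can be passed to plt.fill_between and plt.plot in the same way as in the code above.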

plotting multiple columns value in x-axis in python

I have a dataframe of size (100, 3) that is filled with random float values.
Here is a sample of what the data frame looks like:
A B C
4.394966 0.580573 2.293824
3.136197 2.227557 1.306508
4.010782 0.062342 3.629226
2.687100 1.050942 3.143727
1.280550 3.328417 2.247764
4.417837 3.236766 2.970697
1.036879 1.477697 4.029579
2.759076 4.753388 3.222587
1.989020 4.161404 1.073335
1.054660 1.427896 2.066219
0.301078 2.763342 4.166691
2.323838 0.791260 0.050898
3.544557 3.715050 4.196454
0.128322 3.803740 2.117179
0.549832 1.597547 4.288621
This is how I created it
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
Note: pd is pandas
I want to plot a bar chart with three segments on the x-axis, where each segment has 2 bars: one showing the number of values less than 2 and the other the number greater than or equal to 2.
So on the x-axis there would be two adjacent bars for column A, one with the total number of values less than 2 and one with the number greater than or equal to 2, and the same for B and C.
Can anyone suggest anything?
I was thinking of using seaborn and setting the hue value to differentiate the two classes (less than 2, and greater than or equal to 2), but the hue attribute only works for categorical values and I can only set one column as the x-axis attribute.
Any tips would be appreciated.
You need to filter the values, count them, and then use plot(kind='bar'):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
# thresholds follow the question: less than 2 vs. greater than or equal to 2
dfout = pd.DataFrame({'minor' : df[df < 2].count(),
                      'major' : df[df >= 2].count() })
dfout.plot(kind='bar')
plt.show()
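A slightly shorter variant of the same counting step, as a sketch: summing a boolean DataFrame counts its True values per column, so no filtering is needed. The seed and the column labels "< 2" and ">= 2" are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame(rng.uniform(0, 5, size=(100, 3)), columns=list("ABC"))

# Each comparison gives a boolean frame; .sum() counts the True values per column
counts = pd.DataFrame({"< 2": (df < 2).sum(),
                       ">= 2": (df >= 2).sum()})
print(counts)

# counts.plot(kind="bar") would then draw two bars for each of A, B and C
```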

Separating out pandas series for pyplot

I currently have a set of series in pandas, and each series is composed of two data sets. I need to separate the two data sets into lists while retaining the series information, i.e. the time and intensity data for 58V.
My current code looks like:
import numpy as numpy
import pandas as pd
xl = pd.ExcelFile("TEST_ATD.xlsx")
df = xl.parse("Sheet1")
series = xl.parse("Sheet1")
voltages = []
for item in df:
    if "V" in item:
        voltages.append(item)
data_list = []
for value in voltages:
    print(df[value])
How do I select a particular data set from the series and extract it into a list? Calling print(df[value]) returns my data sets, an example of which looks like:
Name: 58V, dtype: int64
0.000 0
0.180 1
0.360 1.2
0.540 1.5
0.720 1.2
..
35.277 0
35.457 0
35.637 0
NaN 0
Ultimately I plan to plot these data sets into a line graph with pyplot.
~~~ UPDATE ~~~
using
for value in voltages:
    intensity=[]
    for row in series[value].tolist():
        intensity.append(row)
    time=range(0,len(intensity))
    pc_intensity = []
    for item in intensity:
        pc_intensity.append((100/max(intensity)*item))
    plt.plot(time, pc_intensity)
    axes = plt.gca()
    axes.set_ylim([0,100])
    plt.title(value)
    plt.ylabel('Intensity')
    plt.xlabel('Time')
    plt.savefig(value +'.png')
    plt.clf()
    print(value)
I am able to get the plots of the first 8 data series (using an arbitrary x-axis); however, for anything past the 8th series my plots are empty. I have experimented and found this to be due to some of the series being different lengths. I'm confused as to why this would affect the plots, as the x-axis is directly related to the length of the data set it is being plotted against.
I am not sure what you are trying to achieve, but I'll take a guess:
df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10), 'D': range(1, 10), 'E': [1,1,1,2,2,2,2,3,4]})
for col in df.columns:
    print(df[col].values.tolist())
This prints every column of your dataframe as a list.
If you are just trying to plot something, why not just use
df.plot()
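The thread does not establish why the later plots come out empty. One thing that might be worth trying, assuming the shorter series are padded with NaN when the sheet is read (as the last printed row in the question suggests), is to drop the NaN values and normalise with Series.max(), which skips NaN by default. The frame below is a made-up stand-in for the spreadsheet; "60V" is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: "60V" is shorter and padded with NaN
df = pd.DataFrame({"58V": [0.0, 1.0, 1.2, 1.5, 1.2, 0.0],
                   "60V": [0.0, 2.0, 4.0, 1.0, np.nan, np.nan]})

percent = {}
for name in df.columns:
    intensity = df[name].dropna()  # drop the NaN padding
    # scale so that the peak of each series is 100
    percent[name] = (100 * intensity / intensity.max()).tolist()

print(percent["60V"])  # [0.0, 50.0, 100.0, 25.0]
```

Each list in percent then has exactly as many points as the series has real values, so range(len(...)) can be used as the x-axis as in the question.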
